Re: KVM PMU virtualization
On Thu, 2010-03-04 at 09:00 +0800, Zhang, Yanmin wrote:
> On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> > > -#ifndef perf_misc_flags
> > > -#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \
> > > -		PERF_RECORD_MISC_KERNEL)
> > > -#define perf_instruction_pointer(regs) instruction_pointer(regs)
> > > -#endif
> >
> > Ah, that #ifndef is for powerpc, which I think you just broke.
>
> Thanks for the reminder. I deleted the powerpc code when building the
> cscope lib.
>
> It seems the perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I
> plan to change them to a callback function struct; kvm registers its
> version with perf.
>
> Such as:
>
> struct perf_guest_info_callbacks {
> 	int (*is_in_guest)();
> 	u64 (*get_guest_ip)();
> 	int (*copy_guest_stack)();
> 	int (*reset_in_guest)();
> 	...
> };
> int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
> int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);
>
> It's more scalable and neater.

In case you guys might lose patience, I worked out a new patch against
2.6.34-rc1. It works with:

#perf kvm --guest --guestkallsyms /guest/os/kernel/proc/kallsyms --guestmodules /guest/os/proc/modules top

It also supports collecting both host-side and guest-side data at the
same time:

#perf kvm --host --guest --guestkallsyms /guest/os/kernel/proc/kallsyms --guestmodules /guest/os/proc/modules top

The first output line of top shows the guest kernel/user space
percentage.

Or just the host side:

#perf kvm --host

As the perf tool sources have lots of changes, I am still working on
perf kvm record and report.
---
diff -Nraup linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h
--- linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h	2010-03-09 13:04:20.730596079 +0800
+++ linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h	2010-03-10 17:06:34.228953260 +0800
@@ -167,6 +167,15 @@ static inline int user_mode(struct pt_re
 #endif
 }
 
+static inline int user_mode_cs(u16 cs)
+{
+#ifdef CONFIG_X86_32
+	return (cs & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+	return !!(cs & 3);
+#endif
+}
+
 static inline int user_mode_vm(struct pt_regs *regs)
 {
 #ifdef CONFIG_X86_32
diff -Nraup linux-2.6.34-rc1/arch/x86/kvm/vmx.c linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c
--- linux-2.6.34-rc1/arch/x86/kvm/vmx.c	2010-03-09 13:04:20.758593132 +0800
+++ linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c	2010-03-10 17:11:49.709019136 +0800
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include
 
 #include "kvm_cache_regs.h"
 #include "x86.h"
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct
 	vmcs_write32(TPR_THRESHOLD, irr);
 }
 
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+	percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+	return percpu_read(kvm_in_guest);
+}
+
+static int kvm_is_user_mode(void)
+{
+	int user_mode;
+
+	user_mode = user_mode_cs(vmcs_read16(GUEST_CS_SELECTOR));
+	return user_mode;
+}
+
+static u64 kvm_get_guest_ip(void)
+{
+	return vmcs_readl(GUEST_RIP);
+}
+
+static void kvm_reset_in_guest(void)
+{
+	if (percpu_read(kvm_in_guest))
+		percpu_write(kvm_in_guest, 0);
+}
+
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+	.is_in_guest	= kvm_is_in_guest,
+	.is_user_mode	= kvm_is_user_mode,
+	.get_guest_ip	= kvm_get_guest_ip,
+	.reset_in_guest	= kvm_reset_in_guest
+};
+
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
 	u32 exit_intr_info;
@@ -3653,8 +3691,11 @@ static void vmx_complete_interrupts(stru
 	/* We need to handle NMIs before interrupts are enabled */
 	if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
-			(exit_intr_info & INTR_INFO_VALID_MASK))
+			(exit_intr_info & INTR_INFO_VALID_MASK)) {
+		kvm_set_in_guest();
 		asm("int $2");
+		kvm_reset_in_guest();
+	}
 
 	idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
@@ -4251,6 +4292,8 @@ static int __init vmx_init(void)
 	if (bypass_guest_pf)
 		kvm_mmu_set_nonpresent_ptes(~0xffeull, 0ull);
 
+	perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
 	return 0;
 
 out3:
@@ -4266,6 +4309,8 @@ out:
 
 static void __exit vmx_exit(void)
 {
+	perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
 	free_page((unsigned long)vmx_msr_bitmap_legacy);
 	free_page((unsigned long)vmx_msr_bitmap_longmode);
 	free_page((unsigned long)vmx_io_bitmap_b);
diff -Nraup linux-2.6.34-rc1/include/linux/perf_event.h linux-2.6.34-rc1_work/include/linux/perf_event.h
--- linux-2.6.34-rc1/include/li
Re: KVM PMU virtualization
On 02/26/2010 04:42 PM, Peter Zijlstra wrote:
> Also, intel debugstore things requires a host linear address,

It requires a linear address, not a host linear address. Of course, it
might not like the linear address mappings changing under its feet. If
it has a private tlb, then this won't work.

> again, not something a vcpu can easily provide (although that might be
> worked around with an msr trap, but that still limits you to 1 page
> data sizes, not a limitation all software will respect).

If you're willing to pin pages, you can map the guest's buffer. That
won't work if BTS can happen in parallel with a #VMEXIT, or if there are
interactions with npt/ept. Will have to ask the vendors.

> All that said, what we really want is for Intel+AMD to come up with
> proper hw PMU virtualization support that makes it easy to rotate the
> full PMU in and out for a guest. Then this whole discussion will become
> a non-issue.
>
> As it stands there simply are a number of PMU features that defy being
> virtualized, simply because the virt stuff doesn't do system topology.
> So even if they were to support a virtualized pmu, it would likely be a
> different beast than the native hardware is, and it will be several
> hardware models in the future. Coming up with a paravirt interface and
> getting !linux hosts to adapt and !linux guests to use it is probably
> just as 'easy'.

!linux hosts are someone else's problem, but how would we get !linux
guests to use a soft pmu? The only way I see that happening is if a soft
pmu is standardized across hypervisors, which is unfortunately unlikely.

-- 
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 03/01/2010 07:17 PM, Peter Zijlstra wrote:
> > 2. For every emulated performance counter the guest activates kvm
> >    allocates a perf_event and configures it for the guest (we may
> >    allow kvm to specify the counter index, the guest would be able to
> >    use rdpmc unintercepted then). Event filtering is also done in this
> >    step.
>
> rdpmc can never be used unintercepted, for perf might be multiplexing
> the actual hw.

How often is rdpmc used? If it is invoked on high-frequency
software-only events (like context switches), then this may be a
performance issue. If it is only issued on perf interrupts, we may be
able to live with it (since we already took an exit for the interrupt).
Re: KVM PMU virtualization
On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> > -#ifndef perf_misc_flags
> > -#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \
> > -		PERF_RECORD_MISC_KERNEL)
> > -#define perf_instruction_pointer(regs) instruction_pointer(regs)
> > -#endif
>
> Ah, that #ifndef is for powerpc, which I think you just broke.

Thanks for the reminder. I deleted the powerpc code when building the
cscope lib.

It seems the perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I
plan to change them to a callback function struct; kvm registers its
version with perf.

Such as:

struct perf_guest_info_callbacks {
	int (*is_in_guest)();
	u64 (*get_guest_ip)();
	int (*copy_guest_stack)();
	int (*reset_in_guest)();
	...
};
int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);

It's more scalable and neater.
Re: KVM PMU virtualization
On Wed, 2010-03-03 at 11:13 +0100, Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> > +static inline u64 perf_instruction_pointer(struct pt_regs *regs)
> > +{
> > +	u64 ip;
> > +
> > +	ip = percpu_read(perf_virt_ip.ip);
> > +	if (!ip)
> > +		ip = instruction_pointer(regs);
> > +	else
> > +		perf_reset_virt_ip();
> > +	return ip;
> > +}
> > +
> > +static inline unsigned int perf_misc_flags(struct pt_regs *regs)
> > +{
> > +	if (percpu_read(perf_virt_ip.ip)) {
> > +		return percpu_read(perf_virt_ip.user_mode) ?
> > +			PERF_RECORD_MISC_GUEST_USER :
> > +			PERF_RECORD_MISC_GUEST_KERNEL;
> > +	} else
> > +		return user_mode(regs) ? PERF_RECORD_MISC_USER :
> > +			PERF_RECORD_MISC_KERNEL;
> > +}
>
> This encodes the assumption that perf_misc_flags() must only be called
> before perf_instruction_pointer(), which is currently true, but you
> might want to put a comment nearby to remind us of this.

I will change the logic to do a clear reset operation in the caller.
Re: KVM PMU virtualization
On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> -#ifndef perf_misc_flags
> -#define perf_misc_flags(regs) (user_mode(regs) ? PERF_RECORD_MISC_USER : \
> -		PERF_RECORD_MISC_KERNEL)
> -#define perf_instruction_pointer(regs) instruction_pointer(regs)
> -#endif

Ah, that #ifndef is for powerpc, which I think you just broke.
Re: KVM PMU virtualization
On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> +static inline u64 perf_instruction_pointer(struct pt_regs *regs)
> +{
> +	u64 ip;
> +
> +	ip = percpu_read(perf_virt_ip.ip);
> +	if (!ip)
> +		ip = instruction_pointer(regs);
> +	else
> +		perf_reset_virt_ip();
> +	return ip;
> +}
> +
> +static inline unsigned int perf_misc_flags(struct pt_regs *regs)
> +{
> +	if (percpu_read(perf_virt_ip.ip)) {
> +		return percpu_read(perf_virt_ip.user_mode) ?
> +			PERF_RECORD_MISC_GUEST_USER :
> +			PERF_RECORD_MISC_GUEST_KERNEL;
> +	} else
> +		return user_mode(regs) ? PERF_RECORD_MISC_USER :
> +			PERF_RECORD_MISC_KERNEL;
> +}

This encodes the assumption that perf_misc_flags() must only be called
before perf_instruction_pointer(), which is currently true, but you
might want to put a comment nearby to remind us of this.
Re: KVM PMU virtualization
On Wed, 2010-03-03 at 11:32 +0800, Zhang, Yanmin wrote:
> On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote:
> > * Zhang, Yanmin wrote:
> >
> > > On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> > > > My suggestion, as always, would be to start very simple and very
> > > > minimal:
> > > >
> > > > Enable 'perf kvm top' to show guest overhead. Use the exact same
> > > > kernel image both as a host and as guest (for testing), to not
> > > > have to deal with the symbol space transport problem initially.
> > > > Enable 'perf kvm record' to only record guest events by default.
> > > > Etc.
> > > >
> > > > This alone will be a quite useful result already - and gives a
> > > > basis for further work. No need to spend months to do the big
> > > > grand design straight away, all of this can be done gradually and
> > > > in the order of usefulness - and you'll always have something
> > > > that actually works (and helps your other KVM projects) along the
> > > > way.
> > >
> > > It took me a couple of hours to read the emails on the topic. Based
> > > on the above idea, I worked out a prototype which is ugly, but does
> > > work with top/record when both guest side and host side use the
> > > same kernel image, while compiling most needed modules into the
> > > kernel directly.
> > >
> > > The commands are:
> > > perf kvm top
> > > perf kvm record
> > > perf kvm report
> > >
> > > They just collect guest kernel hot functions.
> >
> > Fantastic, and there's some really interesting KVM guest/host
> > comparison profiles you've done with this prototype!
> >
> > > With my patch, I collected dbench data on a Nehalem machine (2*4*2
> > > logical cpus).
> > >
> > > 1) Vanilla host kernel (6G memory):
> > >
> > >    PerfTop: 15491 irqs/sec  kernel:93.6%  [1000Hz cycles],  (all, 16 CPUs)
> > >
> > >     samples  pcnt  function                         DSO
> > >    ________  ____  _______________________________  ________________________________________
> > >
> > >    99376.00  40.5% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >    41239.00  16.8% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     7019.00   2.9% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     5350.00   2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     5208.00   2.1% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     4484.00   1.8% journal_dirty_metadata           /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     4078.00   1.7% ext3_free_blocks_sb              /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     3856.00   1.6% ext3_new_blocks                  /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     3485.00   1.4% journal_get_undo_access          /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     2803.00   1.1% ext3_try_to_allocate             /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     2241.00   0.9% __find_get_block                 /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     1957.00   0.8% find_revoke_record               /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >
> > > 2) guest os: start one guest os with 4GB memory.
> > >
> > >    PerfTop: 827 irqs/sec  kernel: 0.0%  [1000Hz cycles],  (all, 16 CPUs)
> > >
> > >     samples  pcnt  function                         DSO
> > >    ________  ____  _______________________________  ________________________________________
> > >
> > >    41701.00  28.1% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >    33843.00  22.8% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >    16862.00  11.4% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     3278.00   2.2% native_flush_tlb_others          /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     3200.00   2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     3009.00   2.0% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >     2834.00   1.9% journal_d
Re: KVM PMU virtualization
On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin wrote:
>
> > On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> > > My suggestion, as always, would be to start very simple and very
> > > minimal:
> > >
> > > Enable 'perf kvm top' to show guest overhead. Use the exact same
> > > kernel image both as a host and as guest (for testing), to not have
> > > to deal with the symbol space transport problem initially. Enable
> > > 'perf kvm record' to only record guest events by default. Etc.
> > >
> > > This alone will be a quite useful result already - and gives a basis
> > > for further work. No need to spend months to do the big grand design
> > > straight away, all of this can be done gradually and in the order of
> > > usefulness - and you'll always have something that actually works
> > > (and helps your other KVM projects) along the way.
> >
> > It took me a couple of hours to read the emails on the topic. Based on
> > the above idea, I worked out a prototype which is ugly, but does work
> > with top/record when both guest side and host side use the same kernel
> > image, while compiling most needed modules into the kernel directly.
> >
> > The commands are:
> > perf kvm top
> > perf kvm record
> > perf kvm report
> >
> > They just collect guest kernel hot functions.
>
> Fantastic, and there's some really interesting KVM guest/host comparison
> profiles you've done with this prototype!
>
> > With my patch, I collected dbench data on a Nehalem machine (2*4*2
> > logical cpus).
> >
> > 1) Vanilla host kernel (6G memory):
> >
> >    PerfTop: 15491 irqs/sec  kernel:93.6%  [1000Hz cycles],  (all, 16 CPUs)
> >
> >     samples  pcnt  function                         DSO
> >    ________  ____  _______________________________  ________________________________________
> >
> >    99376.00  40.5% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
> >    41239.00  16.8% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     7019.00   2.9% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     5350.00   2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     5208.00   2.1% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     4484.00   1.8% journal_dirty_metadata           /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     4078.00   1.7% ext3_free_blocks_sb              /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     3856.00   1.6% ext3_new_blocks                  /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     3485.00   1.4% journal_get_undo_access          /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     2803.00   1.1% ext3_try_to_allocate             /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     2241.00   0.9% __find_get_block                 /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     1957.00   0.8% find_revoke_record               /lib/modules/2.6.33-kvmymz/build/vmlinux
> >
> > 2) guest os: start one guest os with 4GB memory.
> >
> >    PerfTop: 827 irqs/sec  kernel: 0.0%  [1000Hz cycles],  (all, 16 CPUs)
> >
> >     samples  pcnt  function                         DSO
> >    ________  ____  _______________________________  ________________________________________
> >
> >    41701.00  28.1% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
> >    33843.00  22.8% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
> >    16862.00  11.4% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     3278.00   2.2% native_flush_tlb_others          /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     3200.00   2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     3009.00   2.0% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     2834.00   1.9% journal_dirty_metadata           /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     1965.00   1.3% journal_get_undo_access          /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     1907.00   1.3% ext3_new_blocks                  /lib/modules/2.6.33-kvmymz/build/vmlinux
> >     1790.00   1.2%
Re: KVM PMU virtualization
On Tue, 2010-03-02 at 15:09 +0800, Zhang, Yanmin wrote:
> With the vanilla host kernel, perf top data is stable and spinlocks
> don't take too much cpu time.
>
> With the guest os, __ticket_spin_lock consumes 28% cpu time, and
> sometimes it fluctuates between 9%~28%.
>
> Another interesting finding is aim7. If I start aim7 on tmpfs testing
> in a guest os with 1GB memory, the login hangs and the cpu is busy.
> With the new patch, I could check what happens in the guest os, where
> the spinlock is busy and the kernel is shrinking memory, mostly from
> slab.

Hehe, you've just discovered the reason for paravirt spinlocks ;-)

But neat stuff, although I don't think you need PERF_SAMPLE_KVM; it
should simply always report the guest sample if it came from the guest.
You can extend PERF_RECORD_MISC_CPUMODE_MASK to add guest states.
Re: KVM PMU virtualization
* Zhang, Yanmin wrote:

> On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> > My suggestion, as always, would be to start very simple and very
> > minimal:
> >
> > Enable 'perf kvm top' to show guest overhead. Use the exact same
> > kernel image both as a host and as guest (for testing), to not have to
> > deal with the symbol space transport problem initially. Enable 'perf
> > kvm record' to only record guest events by default. Etc.
> >
> > This alone will be a quite useful result already - and gives a basis
> > for further work. No need to spend months to do the big grand design
> > straight away, all of this can be done gradually and in the order of
> > usefulness - and you'll always have something that actually works (and
> > helps your other KVM projects) along the way.
>
> It took me a couple of hours to read the emails on the topic. Based on
> the above idea, I worked out a prototype which is ugly, but does work
> with top/record when both guest side and host side use the same kernel
> image, while compiling most needed modules into the kernel directly.
>
> The commands are:
> perf kvm top
> perf kvm record
> perf kvm report
>
> They just collect guest kernel hot functions.

Fantastic, and there's some really interesting KVM guest/host comparison
profiles you've done with this prototype!

> With my patch, I collected dbench data on a Nehalem machine (2*4*2
> logical cpus).
>
> 1) Vanilla host kernel (6G memory):
>
>    PerfTop: 15491 irqs/sec  kernel:93.6%  [1000Hz cycles],  (all, 16 CPUs)
>
>     samples  pcnt  function                         DSO
>    ________  ____  _______________________________  ________________________________________
>
>    99376.00  40.5% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
>    41239.00  16.8% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
>     7019.00   2.9% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
>     5350.00   2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
>     5208.00   2.1% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
>     4484.00   1.8% journal_dirty_metadata           /lib/modules/2.6.33-kvmymz/build/vmlinux
>     4078.00   1.7% ext3_free_blocks_sb              /lib/modules/2.6.33-kvmymz/build/vmlinux
>     3856.00   1.6% ext3_new_blocks                  /lib/modules/2.6.33-kvmymz/build/vmlinux
>     3485.00   1.4% journal_get_undo_access          /lib/modules/2.6.33-kvmymz/build/vmlinux
>     2803.00   1.1% ext3_try_to_allocate             /lib/modules/2.6.33-kvmymz/build/vmlinux
>     2241.00   0.9% __find_get_block                 /lib/modules/2.6.33-kvmymz/build/vmlinux
>     1957.00   0.8% find_revoke_record               /lib/modules/2.6.33-kvmymz/build/vmlinux
>
> 2) guest os: start one guest os with 4GB memory.
>
>    PerfTop: 827 irqs/sec  kernel: 0.0%  [1000Hz cycles],  (all, 16 CPUs)
>
>     samples  pcnt  function                         DSO
>    ________  ____  _______________________________  ________________________________________
>
>    41701.00  28.1% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
>    33843.00  22.8% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
>    16862.00  11.4% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
>     3278.00   2.2% native_flush_tlb_others          /lib/modules/2.6.33-kvmymz/build/vmlinux
>     3200.00   2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
>     3009.00   2.0% do_get_write_access              /lib/modules/2.6.33-kvmymz/build/vmlinux
>     2834.00   1.9% journal_dirty_metadata           /lib/modules/2.6.33-kvmymz/build/vmlinux
>     1965.00   1.3% journal_get_undo_access          /lib/modules/2.6.33-kvmymz/build/vmlinux
>     1907.00   1.3% ext3_new_blocks                  /lib/modules/2.6.33-kvmymz/build/vmlinux
>     1790.00   1.2% ext3_free_blocks_sb              /lib/modules/2.6.33-kvmymz/build/vmlinux
>     1741.00   1.2% find_revoke_record               /lib/modules/2.6.33-kvmymz/build/vmlinux
>
> With vanilla host kernel, perf top data is stable and spinlock d
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> * Joerg Roedel wrote:
>
> > On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> > > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> > > > >
> > > > > 1) Add support to perf to allow it to monitor a KVM guest from
> > > > >    the host.
> > > >
> > > > This shouldn't be a big problem. The PMU of AMD Fam10 processors
> > > > can be configured to count only when in guest mode. Perf needs to
> > > > be aware of that and fetch the rip from a different place when
> > > > monitoring a guest.
> > >
> > > The idea is we want to measure both host and guest at the same time,
> > > and compare all the hot functions fairly.
> >
> > So you want to measure while the guest vcpu is running and the vmexit
> > path of that vcpu (including the qemu userspace part) together? The
> > challenge here is to find out if a performance event originated in
> > guest mode or in host mode.
>
> But we can check for that in the nmi-protected part of the vmexit path.
>
> As far as instrumentation goes, virtualization is simply another 'PID
> dimension' of measurement.
>
> Today we can isolate system performance measurements/events to the
> following domains:
>
>  - per system
>  - per cpu
>  - per task
>
> ( Note that PowerPC already supports certain sorts of
>   'hypervisor/kernel/user' domain separation, and we have some ABI
>   details for all that, but it's by no means complete. Anton is using
>   the PowerPC bits AFAIK, so it already works to a certain degree. )
>
> When extending measurements to KVM, we want two things:
>
>  - user friendliness: instead of having to check 'ps' and figure out
>    which Qemu thread is the KVM thread we want to profile, just give a
>    convenience namespace to access guest profiling info. -G ought to
>    map to the first currently running KVM guest it can find (which
>    would match like 90% of the cases) - etc. No ifs and whens. If 'perf
>    kvm top' doesnt show something useful by default the whole effort is
>    for naught.
>
>  - Extend core facilities and enable the following measurement
>    dimensions:
>
>      host-kernel-space
>      host-user-space
>      guest-kernel-space
>      guest-user-space
>
>    on a per guest basis. We want to be able to measure just what the
>    guest does, and we want to be able to measure just what the host
>    does.
>
>    Some of this the hardware helps us with (say only measuring host
>    kernel events is possible), some has to be done by fiddling with
>    event enable/disable at vm-exit / vm-entry time.
>
> My suggestion, as always, would be to start very simple and very
> minimal:
>
> Enable 'perf kvm top' to show guest overhead. Use the exact same kernel
> image both as a host and as guest (for testing), to not have to deal
> with the symbol space transport problem initially. Enable 'perf kvm
> record' to only record guest events by default. Etc.
>
> This alone will be a quite useful result already - and gives a basis
> for further work. No need to spend months to do the big grand design
> straight away, all of this can be done gradually and in the order of
> usefulness - and you'll always have something that actually works (and
> helps your other KVM projects) along the way.

It took me a couple of hours to read the emails on the topic. Based on
the above idea, I worked out a prototype which is ugly, but does work
with top/record when both guest side and host side use the same kernel
image, while compiling most needed modules into the kernel directly.

The commands are:
perf kvm top
perf kvm record
perf kvm report

They just collect guest kernel hot functions.

> [ And, as so often, once you walk that path, that grand scheme you are
>   thinking about right now might easily become last year's really bad
>   idea ;-) ]
>
> So please start walking the path and experience the challenges
> first-hand.

With my patch, I collected dbench data on a Nehalem machine (2*4*2
logical cpus).

1) Vanilla host kernel (6G memory):

   PerfTop: 15491 irqs/sec  kernel:93.6%  [1000Hz cycles],  (all, 16 CPUs)

    samples  pcnt  function                         DSO
   ________  ____  _______________________________  ________________________________________

   99376.00  40.5% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
   41239.00  16.8% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
    7019.00   2.9% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
    5350.00   2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
    520
Re: KVM PMU virtualization
On 02/26/2010 05:55 AM, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
> > On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
> > > That's 7 more than what we support now, and 7 more than what we can
> > > guarantee without it.
> > >
> > > Again, what windows software uses only those 7? Does it pay to only
> > > have access to those 7, or does it limit the usability to exactly
> > > the same subset a paravirt interface would?
> >
> > Good question. Would be interesting to try out VTune with the
> > non-arch pmu masked out. Also, the ANY bit is part of the intel arch
> > pmu, but you still have to mask it out.
> >
> > BTW, just wondering, why would a developer be running VTune in a
> > guest anyway? I'd think that a developer that windows-oriented would
> > simply run windows on his desktop and VTune there.
>
> What if you want to run on 10 different variations of Windows 32 / 64 /
> server / desktop configurations? Do you maintain 10 installed pieces of
> hardware? A virtual machine is a better solution.

And you might want to performance tune all 10 of those configurations as
well. Be nice if it were possible.
Re: KVM PMU virtualization
On 02/26/2010 03:37 AM, Jes Sorensen wrote:
> On 02/26/10 14:31, Ingo Molnar wrote:
> > You are missing two big things wrt. compatibility here:
> >
> > 1) The first upgrade overhead is a one-time overhead only.
> >
> > 2) Once a Linux guest has upgraded, it will work in the future, with
> >    _any_ future CPU - _without_ having to upgrade the guest! Dont you
> >    see the advantage of that? You can instrument an old system on new
> >    hardware, without having to upgrade that guest for the new CPU
> >    support.
>
> That would only work if you are guaranteed to be able to emulate old
> hardware on new hardware. Not going to be feasible, so then we are in a
> real mess.
>
> > With the 'steal the PMU' messy approach the guest OS has to be
> > upgraded to the new CPU type all the time. Ad infinitum.
>
> The way the Perfmon architecture is specified by Intel, that is what we
> are stuck with. It's not going to be possible via software emulation to
> count cache misses, unless you run it in a micro architecture emulator.

Sure you can count cache misses.

Step 1. Declare KVM to possess a virtual cache hereto unseen to guest
        VCPUs.
Step 2. Use micro architecture rules to add to cache misses in an
        undefined micro-architecture specific way.
Step 3.
Step 4. PROFIT!

The point being, there are no rules required to follow for
architecturally unspecified events. Instructions issued is well defined
architecturally, one of very few such counters, while things like cache
strides and organization are deliberately left to the implementation. So
returning zero is a perfectly valid choice for emulating cache misses.

Zach
Re: KVM PMU virtualization
On Mon, Mar 01, 2010 at 06:17:40PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-03-01 at 12:11 +0100, Joerg Roedel wrote:
> > 1. Enhance perf to count pmu events only when the cpu is in guest
> >    mode.
>
> No enhancements needed, only hardware support, for Intel doesn't
> provide this iirc.

At least the guest-bit for AMD perfctl registers is not supported yet ;-)
Implementing this eliminates the requirement to write the perfctl msrs
on every vmrun and every vmexit. But that's a minor change.

> > 4. Some additional magic to reinject pmu events into the guest
>
> Right, that is needed, and might be 'interesting' since we get them
> from NMI context.

I imagine some kind of callback which sets a flag in the kvm vcpu
structure. Since the NMI already triggered a vmexit, the kvm code checks
for this bit on its path to re-entry.

	Joerg
Re: KVM PMU virtualization
On Mon, 2010-03-01 at 12:11 +0100, Joerg Roedel wrote:
> 1. Enhance perf to count pmu events only when cpu is in guest mode.

No enhancements needed, only hardware support, for Intel doesn't provide this, iirc.

> 2. For every emulated performance counter the guest activates kvm allocates a perf_event and configures it for the guest (we may allow kvm to specify the counter index, the guest would be able to use rdpmc unintercepted then). Event filtering is also done in this step.

rdpmc can never be used unintercepted, for perf might be multiplexing the actual hw.

> 3. Before vmrun the guest activates all its counters,

Right, this could be used to approximate guest-only counting. I'm not sure how the OS and USR bits interact with guest stuff - if the PMU isn't aware of the virtualized priv levels then those will not work as expected.

> this can fail if the host uses them or the requested pmc index is not available for some reason.

perf doesn't know about pmc indexes at the interface level, nor is that needed, I think.

> 4. Some additional magic to reinject pmu events into the guest

Right, that is needed, and might be 'interesting' since we get them from NMI context.
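[Editor's illustration] The OS/USR bits Peter mentions live in the architectural PERFEVTSEL MSR format (USR = bit 16, OS = bit 17, INT = bit 20, ANY = bit 21 on Intel, EN = bit 22). The sketch below shows that layout plus a hypothetical host-side filter of the kind the thread keeps circling around; `sanitize_guest_evtsel` is an invented helper, not existing kvm code:

```c
#include <assert.h>
#include <stdint.h>

/* Architectural x86 event-select bit layout. */
#define EVTSEL_EVENT(e) ((uint64_t)(e) & 0xff)
#define EVTSEL_UMASK(u) (((uint64_t)(u) & 0xff) << 8)
#define EVTSEL_USR      (1ULL << 16)   /* count at CPL > 0 */
#define EVTSEL_OS       (1ULL << 17)   /* count at CPL 0 */
#define EVTSEL_INT      (1ULL << 20)   /* raise an interrupt on overflow */
#define EVTSEL_ANY      (1ULL << 21)   /* count both hw threads (Intel) */
#define EVTSEL_EN       (1ULL << 22)

/* Bits a host might refuse to pass through from a guest-written
 * selector: ANY leaks sibling-thread activity, and INT would fire in
 * host context unless it is remapped to a virtual PMI. */
static inline uint64_t sanitize_guest_evtsel(uint64_t sel)
{
    return sel & ~(EVTSEL_ANY | EVTSEL_INT);
}
```

The OS/USR distinction is exactly the problem: without hardware awareness of guest mode, OS counts the *host's* CPL 0 too.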
Re: KVM PMU virtualization
On 02/26/2010 01:42 AM, Ingo Molnar wrote:
> * Jes Sorensen wrote:
>> On 02/26/10 11:44, Ingo Molnar wrote:
>>> Direct access to counters is not something that is a big issue. [ Given that I sometimes can see KVM redraw the screen of a guest OS real-time, I doubt this is the biggest of performance challenges right now ;-) ]
>>>
>>> By far the biggest instrumentation issue is:
>>>
>>> - availability
>>> - usability
>>> - flexibility
>>>
>>> Exposing the raw hw is a step backwards in many regards. The same way we don't want to expose chipsets to the guest to allow them to do RAS. The same way we don't want to expose most raw PCI devices to guests in general, but have all these virt driver abstractions.
>>
>> I have to say I disagree on that. When you run perfmon on a system, it is normally to measure a specific application. You want to see accurate numbers for cache misses, mul instructions or whatever else is selected.
>
> You can still get those. You can even enable RDPMC access and avoid VM exits. What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
>
>> Emulating the PMU rather than using the real one makes the numbers far less useful. The most useful way to provide PMU support in a guest is to expose the real PMU and let the guest OS program it.
>
> Firstly, an emulated PMU was only the second-tier option I suggested. By far the best approach is a native API to the host regarding performance events and good guest-side integration.
>
> Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host.)
>
>> We can do this in a reasonable way today, if we allow to take the PMU away from the host, and only let guests access it when it's in use. [...]
>
> You get my sure-fire NAK for that kind of crap though. Interfering with the host PMU and stealing it is not a technical approach that has acceptable quality. You need to integrate it properly so that host PMU functionality still works fine. (Within hardware constraints.)

I have to agree strongly with Ingo here. If you can't reset, restore or offset the perf counters in hardware, then you can't expose them to the guest. There is too much rich information about host state that can be derived and considered an information leak or covert channel, and you can't allow the guest to trample host PMU state.

On some architectures, bank-switching these perf counters is possible, since you can read and write the full-size counter MSRs. However, it is a cumbersome task that must be done at every preemption point. There are many ways to do so as lazily as possible, so that overhead only happens in a guest which actively uses the PMU. With careful bookkeeping, you can even compound the guest PMU counters back into the host counters if the host is using the PMU.

Sorting out the details about whom to deliver the PMU exception to, the host or the guest, when an overflow occurs is a nasty, ugly dilemma, as is properly programming the counters so that overflow happens in a controlled fashion when both the host and the guest are attempting to use this feature. So supporting "step ahead 13 instructions and then give me an interrupt so I can signal my debugger" simultaneously and correctly in both the host and guest is a very hard task, perhaps untenable.

Zach
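[Editor's illustration] Zach's lazy bank-switching idea can be modeled in userspace. This is a sketch under invented names (`hw_ctr` stands in for the counter MSRs); the point is only the laziness: a context that never touched the PMU costs nothing at a preemption point:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_CTRS 4

/* "Hardware" counters; in reality these are full-width counter MSRs. */
static uint64_t hw_ctr[NR_CTRS];

struct pmu_bank {
    uint64_t ctr[NR_CTRS];
    int in_use;              /* has this context programmed the PMU? */
};

/* Called at a preemption point. The lazy part: when neither the
 * outgoing nor the incoming context uses the PMU, no MSR traffic
 * happens at all. */
static void pmu_bank_switch(struct pmu_bank *out, struct pmu_bank *in)
{
    if (!out->in_use && !in->in_use)
        return;
    memcpy(out->ctr, hw_ctr, sizeof(hw_ctr));  /* bank out current counts */
    memcpy(hw_ctr, in->ctr, sizeof(hw_ctr));   /* bank in the other side */
}
```

The hard part Zach names — deciding whether an overflow NMI belongs to the banked-out or banked-in owner — is exactly what this sketch does not solve.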
Re: KVM PMU virtualization
On Mon, Mar 01, 2010 at 09:44:50AM +0100, Ingo Molnar wrote:
> There's a world of a difference between "will not use in certain usecases" and "cannot use at all because we've designed it so". By doing the latter we guarantee that sane shared usage of the PMU will never occur - which is bad.

I think we can emulate a real hardware pmu for guests using the perf infrastructure. The emulation will not be complete but powerful enough for most use cases. Some steps towards this might be:

1. Enhance perf to count pmu events only when the cpu is in guest mode.
2. For every emulated performance counter the guest activates, kvm allocates a perf_event and configures it for the guest (we may allow kvm to specify the counter index; the guest would be able to use rdpmc unintercepted then). Event filtering is also done in this step.
3. Before vmrun the guest activates all its counters; this can fail if the host uses them or the requested pmc index is not available for some reason.
4. Some additional magic to reinject pmu events into the guest.

> Think about it: this whole Linux thing is about 'sharing' resources. That concept really works and permeates everything we do in the kernel. Yes, it's somewhat hard for the PMU but we've done it on the host side via perf events and we really don't want to look back ...

As I learnt at the university, this whole operating system thing is about managing resource sharing and hardware abstraction ;-)

> My experience is that once the right profiling/tracing tools are there, people will use them in every which way. The bigger a box is, the more likely shared usage will occur - just statistically. Which coincides with KVM's "the bigger the box, the better for virtualization" general mantra.

With the above approach the only point of conflict occurs when the host wants to monitor the qemu processes executing the vcpus, which want to do performance monitoring of their own or cpu-wide counting.
Joerg
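[Editor's illustration] Steps 2-3 of Joerg's list can be modeled as a toy allocator: each counter the guest enables gets bound to a backing host counter (in the real proposal, a perf_event), and activation before vmrun fails when the host already owns the hardware. All names here are invented:

```c
#include <assert.h>
#include <stdint.h>

#define NR_HW_CTRS 2

/* Ownership of each hardware counter: 0 = free, 1 = host, 2 = guest. */
static int hw_ctr_owner[NR_HW_CTRS];

struct emulated_ctr {
    uint64_t config;   /* what the guest wrote to its eventsel MSR */
    int      hw_idx;   /* backing hardware counter, -1 if none */
};

/* Step 3: bind the emulated counter to real hardware before vmrun. */
static int guest_activate_ctr(struct emulated_ctr *c, uint64_t config)
{
    int i;

    c->config = config;
    for (i = 0; i < NR_HW_CTRS; i++) {
        if (hw_ctr_owner[i] == 0) {
            hw_ctr_owner[i] = 2;
            c->hw_idx = i;   /* with a stable index, unintercepted rdpmc
                              * of this counter becomes conceivable */
            return 0;
        }
    }
    c->hw_idx = -1;
    return -1;               /* host holds everything: activation fails */
}
```

Peter's objection applies here: since perf may multiplex the hardware underneath, a stable `hw_idx` cannot actually be promised, which is why rdpmc stays intercepted.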
Re: KVM PMU virtualization
* Joerg Roedel wrote:
> On Mon, Mar 01, 2010 at 09:39:04AM +0100, Ingo Molnar wrote:
>>> What do you mean by software events?
>>
>> Things like:
>>
>> aldebaran:~> perf stat -a sleep 1
>>
>>  Performance counter stats for 'sleep 1':
>>
>>    15995.719133  task-clock-msecs   #     15.981 CPUs
>>            5787  context-switches   #      0.000 M/sec
>>             210  CPU-migrations     #      0.000 M/sec
>>          193909  page-faults        #      0.012 M/sec
>>     28704833507  cycles             #   1794.532 M/sec  (scaled from 78.69%)
>>     14387445668  instructions       #      0.501 IPC    (scaled from 90.71%)
>>       736644616  branches           #     46.053 M/sec  (scaled from 90.52%)
>>       695884659  branch-misses      #     94.467 %      (scaled from 90.70%)
>>       727070678  cache-references   #     45.454 M/sec  (scaled from 88.11%)
>>      1305560420  cache-misses       #     81.619 M/sec  (scaled from 52.00%)
>>
>>     1.000942399  seconds time elapsed
>>
>> These lines:
>>
>>    15995.719133  task-clock-msecs   #     15.981 CPUs
>>            5787  context-switches   #      0.000 M/sec
>>             210  CPU-migrations     #      0.000 M/sec
>>          193909  page-faults        #      0.012 M/sec
>>
>> Are software events of the host - a subset of which could be transparently exposed to the guest. Same for tracepoints, probes, etc. Those are not exposed by the hardware PMU. So by doing a 'soft' PMU (or even better: a paravirt channel to perf events) we gain a lot more than just raw PMU functionality.
>>
>> 'performance events' are about a lot more than just the PMU, it's a coherent system health / system events / structured logging framework.
>
> Yeah I know. But these events should be available in the guest already, no? [...]

How would an old Linux or Windows guest know about them? Also, even for new-Linux guests, they'd only have access to their own internal events - not to any host events.

My suggestion (admittedly not explained in any detail) was to allow guest access to certain _host_ events. I.e. a guest could profile its own impact on the host (such as VM exits, IO done on the host side, scheduling, etc.), without it having any (other) privileged access to the host. This would be a powerful concept: you could profile your guest for host efficiency, _without_ having access to the host - beyond those events themselves. (Which would be set up in a carefully filtered-to-guest manner.)

> [...] They don't need any kind of hardware support from the pmu. A paravirt perf channel from the guest to the host would definitely be a win. It would be a powerful tool for kvm/linux-guest analysis (e.g. trace host-kvm and guest-events together on the host).

Yeah.

Ingo
Re: KVM PMU virtualization
On Mon, Mar 01, 2010 at 09:39:04AM +0100, Ingo Molnar wrote:
>> What do you mean by software events?
>
> Things like:
>
> aldebaran:~> perf stat -a sleep 1
>
>  Performance counter stats for 'sleep 1':
>
>    15995.719133  task-clock-msecs   #     15.981 CPUs
>            5787  context-switches   #      0.000 M/sec
>             210  CPU-migrations     #      0.000 M/sec
>          193909  page-faults        #      0.012 M/sec
>     28704833507  cycles             #   1794.532 M/sec  (scaled from 78.69%)
>     14387445668  instructions       #      0.501 IPC    (scaled from 90.71%)
>       736644616  branches           #     46.053 M/sec  (scaled from 90.52%)
>       695884659  branch-misses      #     94.467 %      (scaled from 90.70%)
>       727070678  cache-references   #     45.454 M/sec  (scaled from 88.11%)
>      1305560420  cache-misses       #     81.619 M/sec  (scaled from 52.00%)
>
>     1.000942399  seconds time elapsed
>
> These lines:
>
>    15995.719133  task-clock-msecs   #     15.981 CPUs
>            5787  context-switches   #      0.000 M/sec
>             210  CPU-migrations     #      0.000 M/sec
>          193909  page-faults        #      0.012 M/sec
>
> Are software events of the host - a subset of which could be transparently exposed to the guest. Same for tracepoints, probes, etc. Those are not exposed by the hardware PMU. So by doing a 'soft' PMU (or even better: a paravirt channel to perf events) we gain a lot more than just raw PMU functionality.
>
> 'performance events' are about a lot more than just the PMU, it's a coherent system health / system events / structured logging framework.

Yeah I know. But these events should be available in the guest already, no? They don't need any kind of hardware support from the pmu. A paravirt perf channel from the guest to the host would definitely be a win. It would be a powerful tool for kvm/linux-guest analysis (e.g. trace host-kvm and guest-events together on the host).

Joerg
Re: KVM PMU virtualization
* Joerg Roedel wrote:
> On Fri, Feb 26, 2010 at 02:44:00PM +0100, Ingo Molnar wrote:
>> - A paravirt event driver is more compatible and more transparent in the long run: it allows hardware upgrade and upgraded PMU functionality (for Linux) without having to upgrade the guest OS. Via that a guest OS could even be live-migrated to a different PMU, without noticing anything about it.
>>
>>   In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state cannot be live-migrated. (save/restore doesn't help)
>
> I agree with your arguments, having this soft-pmu for the guest has some advantages over raw pmu access. It has a lot of advantages if the guest is migrated between hosts with different hardware.
>
> But I still think we should have both, a soft-pmu and a pmu-emulation (which is a more accurate term than 'raw guest pmu access') that looks to the guest like a real hardware pmu would. On a linux host that is dedicated to executing virtual kvm machines there is little point in sharing the pmu between guest and host, because the host will probably never use it.

There's a world of a difference between "will not use in certain usecases" and "cannot use at all because we've designed it so". By doing the latter we guarantee that sane shared usage of the PMU will never occur - which is bad.

Really, similar arguments have been made in the past about different domains of system usage: "one profiling session per system is more than enough, who needs transparent, per-user profilers", etc. Such restrictions have been broken through again and again.

Think about it: this whole Linux thing is about 'sharing' resources. That concept really works and permeates everything we do in the kernel. Yes, it's somewhat hard for the PMU but we've done it on the host side via perf events and we really don't want to look back ...
My experience is that once the right profiling/tracing tools are there, people will use them in every which way. The bigger a box is, the more likely shared usage will occur - just statistically. Which coincides with KVM's "the bigger the box, the better for virtualization" general mantra.

Ingo
Re: KVM PMU virtualization
* Joerg Roedel wrote:
>> - It's more secure: the host can have a finegrained policy about what kinds of events it exposes to the guest. It might choose to only expose software events for example.
>
> What do you mean by software events?

Things like:

aldebaran:~> perf stat -a sleep 1

 Performance counter stats for 'sleep 1':

   15995.719133  task-clock-msecs   #     15.981 CPUs
           5787  context-switches   #      0.000 M/sec
            210  CPU-migrations     #      0.000 M/sec
         193909  page-faults        #      0.012 M/sec
    28704833507  cycles             #   1794.532 M/sec  (scaled from 78.69%)
    14387445668  instructions       #      0.501 IPC    (scaled from 90.71%)
      736644616  branches           #     46.053 M/sec  (scaled from 90.52%)
      695884659  branch-misses      #     94.467 %      (scaled from 90.70%)
      727070678  cache-references   #     45.454 M/sec  (scaled from 88.11%)
     1305560420  cache-misses       #     81.619 M/sec  (scaled from 52.00%)

    1.000942399  seconds time elapsed

These lines:

   15995.719133  task-clock-msecs   #     15.981 CPUs
           5787  context-switches   #      0.000 M/sec
            210  CPU-migrations     #      0.000 M/sec
         193909  page-faults        #      0.012 M/sec

Are software events of the host - a subset of which could be transparently exposed to the guest. Same for tracepoints, probes, etc. Those are not exposed by the hardware PMU. So by doing a 'soft' PMU (or even better: a paravirt channel to perf events) we gain a lot more than just raw PMU functionality.

'performance events' are about a lot more than just the PMU, it's a coherent system health / system events / structured logging framework.

Ingo
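[Editor's illustration] The counters Ingo singles out are software events: they come from the kernel, not the PMU, so a host could count them on a guest's behalf without touching hardware at all. A minimal Linux-only sketch of the attr such a counter is opened with (`sw_event` is an invented helper name):

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Build a perf_event_attr for a kernel software event; no PMU
 * hardware sits behind PERF_TYPE_SOFTWARE counters. */
static struct perf_event_attr sw_event(unsigned long long config)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type     = PERF_TYPE_SOFTWARE;
    attr.size     = sizeof(attr);
    attr.config   = config;   /* e.g. PERF_COUNT_SW_CONTEXT_SWITCHES */
    attr.disabled = 1;        /* caller enables it explicitly */
    return attr;
}
```

The resulting attr would normally be handed to perf_event_open(2); the point here is only that none of these events require giving the guest any PMU register at all.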
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 04:14:08PM +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 16:53 +0200, Avi Kivity wrote:
>>> If you give a full PMU to a guest it's a whole different dimension and quality of information. Literally hundreds of different events about all sorts of aspects of the CPU and the hardware in general.
>>
>> Well, we filter out the bad events then.
>
> Which requires trapping the MSR access, at which point a soft-PMU is almost there, right?

The perfctl msrs need to be trapped anyway. Otherwise the guest could generate NMIs in host context. But access to the perfctr registers could be given to the guest.

Joerg
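[Editor's illustration] The split Joerg describes can be sketched with the AMD K7/K8 MSR layout for concreteness: the four event-select (perfctl) MSRs start at 0xc0010000 and the four counter (perfctr) MSRs at 0xc0010004. Event selects must always trap, because a guest could set the INT bit and raise NMIs in host context; the raw counters carry no interrupt machinery. The helper is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define MSR_K7_EVNTSEL0 0xc0010000u   /* ..0xc0010003: event selects */
#define MSR_K7_PERFCTR0 0xc0010004u   /* ..0xc0010007: raw counters */

static int msr_needs_intercept(uint32_t msr)
{
    if (msr >= MSR_K7_EVNTSEL0 && msr < MSR_K7_EVNTSEL0 + 4)
        return 1;   /* control register: filter before touching hw */
    if (msr >= MSR_K7_PERFCTR0 && msr < MSR_K7_PERFCTR0 + 4)
        return 0;   /* plain count, could be passed through */
    return 1;       /* default-deny anything unknown */
}
```

As Peter notes, once the control MSRs trap anyway, most of a soft-PMU already exists.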
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 03:12:29PM +0100, Ingo Molnar wrote:
> You mean Windows?
>
> For heaven's sake, why don't you think like Linus thought 20 years ago. To the hell with Windows suckiness and let's make sure our stuff works well. Then the users will come, developers will come, and people will profile Linux under Linux and maybe the tools will be so good that they'll profile under Linux using Wine just to be able to use those good tools...

That's not a good comparison. Linux is nothing completely new; it was, and still is, a new implementation of an existing operating system concept and thus at least mostly source-compatible with other operating systems implementing this concept. Linux would never have had this success if it were not posix compliant and could not run applications like X or gcc which were written for other operating systems.

Joerg
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 02:44:00PM +0100, Ingo Molnar wrote:
> - A paravirt event driver is more compatible and more transparent in the long run: it allows hardware upgrade and upgraded PMU functionality (for Linux) without having to upgrade the guest OS. Via that a guest OS could even be live-migrated to a different PMU, without noticing anything about it.
>
>   In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state cannot be live-migrated. (save/restore doesn't help)

I agree with your arguments, having this soft-pmu for the guest has some advantages over raw pmu access. It has a lot of advantages if the guest is migrated between hosts with different hardware.

But I still think we should have both, a soft-pmu and a pmu-emulation (which is a more accurate term than 'raw guest pmu access') that looks to the guest like a real hardware pmu would. On a linux host that is dedicated to executing virtual kvm machines there is little point in sharing the pmu between guest and host, because the host will probably never use it.

This pmu-emulation will still use the perf infrastructure for scheduling the pmu registers, programming the pmu registers and things like that. This could be used, for example, to emulate 48-bit counters for the guest even if the host only supports 32-bit counters. We even need the perf infrastructure when we need to reinject pmu events into the guest.

> - It's more secure: the host can have a finegrained policy about what kinds of events it exposes to the guest. It might choose to only expose software events for example.

What do you mean by software events?

Joerg
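[Editor's illustration] The 48-bit-on-32-bit-hardware idea above amounts to accumulating deltas, wrap-arounds included, into a wider software value. A userspace model with invented names:

```c
#include <assert.h>
#include <stdint.h>

#define HW_BITS 32
#define HW_MASK ((1ULL << HW_BITS) - 1)

struct wide_ctr {
    uint64_t total;    /* wide count presented to the guest */
    uint64_t last_hw;  /* hardware value at the previous sync */
};

/* Called from the overflow interrupt or a periodic read-out; the
 * masked subtraction makes a single wrap-around come out right. */
static void wide_ctr_sync(struct wide_ctr *c, uint64_t hw_now)
{
    uint64_t delta = (hw_now - c->last_hw) & HW_MASK;

    c->total  += delta;
    c->last_hw = hw_now & HW_MASK;
}

/* What a read of the emulated 48-bit counter would return. */
static uint64_t wide_ctr_read(const struct wide_ctr *c)
{
    return c->total & ((1ULL << 48) - 1);
}
```

The only requirement is that a sync happens at least once per hardware wrap period, which is what the overflow interrupt guarantees.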
Re: KVM PMU virtualization
On 02/26/2010 06:03 PM, Avi Kivity wrote:

Note, I'll be away for a week, so will not be responsive for a while.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
On 02/26/2010 05:55 PM, Peter Zijlstra wrote:
> BTW, just wondering, why would a developer be running VTune in a guest anyway? I'd think that a developer that's Windows-oriented would simply run windows on his desktop and VTune there.

Cloud. You have an app running somewhere on a cloud, internally or externally (you may not even know). It's running a production workload and it isn't doing well. You can't reproduce it on your desktop ("works for me, now go away"). So you rdesktop to your guest and monitor it.

You can't run anything on the host - you don't have access to it, you don't know who admins it (it's a program anyway), "the host" doesn't even exist, the guest moves around whenever the cloud feels like it.
Re: KVM PMU virtualization
On 02/26/2010 04:37 PM, Ingo Molnar wrote:
> * Avi Kivity wrote:
>> Certainly guests that we don't port won't be able to use this. I doubt we'll be able to make Windows work with this - the only performance tool I'm familiar with on Windows is Intel's VTune, and that's proprietary.
>
> Don't you see the extreme irony of your wish to limit Linux kernel design decisions and features based on ... Windows and other proprietary software?

Not at all. Virtualization is a hardware compatibility game. To see what happens if you don't play it, see Xen. Eventually they too implemented hardware support, even though the pv approach is so wonderful.

> That's not quite equivalent though. KVM used to be the clean, integrate-code-with-Linux virtualization approach, designed specifically for CPUs that can be virtualized properly. (VMX support first, then SVM, etc.) KVM virtualized ages-old concepts with relatively straightforward hardware ABIs: x86 execution, IRQ abstractions, device abstractions, etc.
>
> Now you are in essence turning that all around:
>
> - the PMU is by no means properly virtualized nor really virtualizable by direct access. There's no virtual PMU that ticks independently of the host PMU.

There are no guest debug registers that can be programmed independently of the host debug registers, but we manage somehow. It's not perfect, but better than nothing. For the common case of host-only or guest-only monitoring, things will work, perhaps without socket-wide counters in security-conscious environments. When both are used at the same time, something will have to give.

> - the PMU hardware itself is not a well standardized piece of hardware. It's very vendor dependent and very limiting.

That's life. If we force standardization by having a soft pmu, we'll be very limited as well. If we don't, we reduce hardware independence, which is a strong point of virtualization. Clearly we need to make a trade-off here. In favour of hardware dependence is that tools and users are already used to it. There is also the architectural pmu that can provide a limited form of hardware independence. Going pv trades off hardware dependence for software dependence. Suddenly only guests that you have control over can use the pmu.

> So to some degree you are playing the role of Xen in this specific affair. You are pushing for something that shouldn't be done in that form. You want to interfere with the host PMU by going via the fast & easy short-term hack to just let the guest OS have the PMU, without any regard to how this impacts long-term feasible solutions.

Maybe. And maybe the vendors will improve virtualization support for the pmu, rendering the pv approach obsolete on new hardware.

> I.e. you are a bit like the guy who would have told Linus in 1994:
>
> "Dude, why don't you use the Windows APIs? It's far more compatible and that's the only way you could run any serious apps. Besides, it requires no upgrade. Admittedly it's a bit messy and 16-bit but hey, that's our installed base after all."

Hey, maybe we'd have significant desktop market share if he'd done this (though a replay of the Wine history is much more likely). But what are you suggesting? That we make Windows a second-class guest? Most users run a mix of workloads; that will not go down well with them. The choice is between first-class Windows support vs becoming a hobby hypervisor.

Let's make a kernel/user analogy again. Would you be in favour of GPL-only-ing new syscalls, to give open source applications an edge over proprietary apps (technically known as "crap" among some)?
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
> On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
>>> That's 7 more than what we support now, and 7 more than what we can guarantee without it.
>>
>> Again, what windows software uses only those 7? Does it pay to only have access to those 7 or does it limit the usability to exactly the same subset a paravirt interface would?
>
> Good question. Would be interesting to try out VTune with the non-arch pmu masked out.

Also, the ANY bit is part of the intel arch pmu, but you still have to mask it out.

BTW, just wondering, why would a developer be running VTune in a guest anyway? I'd think that a developer that's Windows-oriented would simply run windows on his desktop and VTune there.
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
> On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
>>> That's 7 more than what we support now, and 7 more than what we can guarantee without it.
>>
>> Again, what windows software uses only those 7? Does it pay to only have access to those 7 or does it limit the usability to exactly the same subset a paravirt interface would?
>
> Good question. Would be interesting to try out VTune with the non-arch pmu masked out.

From what I understood, VTune uses PEBS+LBR, although I suppose they have simple PMU modes too; never actually seen the software.
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 16:53 +0200, Avi Kivity wrote:
>> If you give a full PMU to a guest it's a whole different dimension and quality of information. Literally hundreds of different events about all sorts of aspects of the CPU and the hardware in general.
>
> Well, we filter out the bad events then.

Which requires trapping the MSR access, at which point a soft-PMU is almost there, right?
Re: KVM PMU virtualization
On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
>> That's 7 more than what we support now, and 7 more than what we can guarantee without it.
>
> Again, what windows software uses only those 7? Does it pay to only have access to those 7 or does it limit the usability to exactly the same subset a paravirt interface would?

Good question. Would be interesting to try out VTune with the non-arch pmu masked out.
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 16:54 +0200, Avi Kivity wrote:
> On 02/26/2010 04:27 PM, Peter Zijlstra wrote:
>> On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
>>> That actually works on the Intel-only architectural pmu. I'm beginning to like it more and more.
>>
>> Only for the arch defined events, all _7_ of them.
>
> That's 7 more than what we support now, and 7 more than what we can guarantee without it.

Again, what windows software uses only those 7? Does it pay to only have access to those 7 or does it limit the usability to exactly the same subset a paravirt interface would?
Re: KVM PMU virtualization
On 02/26/2010 04:27 PM, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
>> That actually works on the Intel-only architectural pmu. I'm beginning to like it more and more.
>
> Only for the arch defined events, all _7_ of them.

That's 7 more than what we support now, and 7 more than what we can guarantee without it.
Re: KVM PMU virtualization
On 02/26/2010 04:12 PM, Ingo Molnar wrote:
> Again you are making an incorrect assumption: that information leakage via the PMU only occurs while the host is running on that CPU. It does not - the PMU can leak general system details _while the guest is running_.

You mean like bus transactions on a multicore? Well, we're already exposed to cache timing attacks.

> If you give a full PMU to a guest it's a whole different dimension and quality of information. Literally hundreds of different events about all sorts of aspects of the CPU and the hardware in general.

Well, we filter out the bad events then.

> So for this and for the many other reasons we don't want to give a raw PMU to guests:
>
> - A paravirt event driver is more compatible and more transparent in the long run: it allows hardware upgrade and upgraded PMU functionality (for Linux) without having to upgrade the guest OS. Via that a guest OS could even be live-migrated to a different PMU, without noticing anything about it.
>
>> What about Windows?
>
> What is your question? Why should I limit Linux kernel design decisions based on any aspect of Windows? You might want to support it, but _please_ don't let the design be dictated by it ...

In our case the quality of implementation is judged by how well we support workloads that users run, and that means we have to support Windows well. And that more or less means we can't have a pv-only pmu. Which part of this do you disagree with?

> In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state cannot be live-migrated. (save/restore doesn't help)
>
>> Why not? So long as the source and destination are compatible?
>
> 'As long as it works' is certainly a good enough filter for quality ;-)

We already have this. If you expose sse4.2 to the guest, you can't migrate to a host which doesn't support it. If you expose a Nehalem pmu to the guest, you can't migrate to a host which doesn't support it. Users and tools already understand this. It's true that the pmu case is more difficult since you can't migrate forwards as well as backwards, but that's life.

>> No, we can hide insecure events with a full pmu. Trap the control register and don't pass it on to the hardware.
>
> So you basically concede partial emulation ...

Yes. Still appears to follow the spec to the guest, though. And with the option of full emulation for those who need it and sign on the dotted line.

> - There's proper event scheduling and event allocation. Time-slicing, etc.
>
> The thing is, we made quite similar arguments in the past, during the perfmon vs. perfcounters discussions. There's really a big advantage to proper abstractions, both on the host and on the guest side.

We only control half of the equation. That's very different compared to tools/perf.

>> You mean Windows?
>
> For heaven's sake, why don't you think like Linus thought 20 years ago. To the hell with Windows suckiness and let's make sure our stuff works well.

In our case, making our stuff work well means making sure guests of the user's choice run well. Not ours. Currently users mostly choose Windows and Linux, so we have to make them both work. (Btw, the analogy would be 'To hell with Unix suckiness, let's make sure our stuff works well'; where Linux reimplemented the Unix APIs, ensuring source compatibility with applications, kvm reimplements the hardware interface, ensuring binary compatibility with guests.)

> Then the users will come, developers will come, and people will profile Linux under Linux and maybe the tools will be so good that they'll profile under Linux using Wine just to be able to use those good tools...

If we don't support Windows well, users will walk away, followed by starving developers.

> If you gut Linux capabilities like that to accommodate for the suckiness of Windows, without giving a technological edge to Linux, then we are bound to fail in the long run ...

I'm all for abusing the tight relationship between Linux-as-a-host and Linux-as-a-guest to gain an advantage for both. One fruitful area would be asynchronous page faults, which have the potential to increase memory overcommit, for example. But first of all we need to make sure that there is a baseline of support for all commonly used guests.

I think of it this way: once kvm deployment becomes widespread, Linux-as-a-guest gains an advantage. But in order for kvm deployment to become widespread, it needs excellent support for all guests users actually use.
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote: > > Scheduling at event granularity would be a good thing. However we need > to be able to handle the guest using the full pmu. Does the full PMU include things like LBR, PEBS and uncore? In that case, there is no way you're going to get that properly and securely virtualized by using raw access.
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote: > > Even if there were no security considerations, if the guest can observe > host data in the pmu, it means the pmu is inaccurate. We should expose > guest data only in the guest pmu. That's not difficult to do, you stop > the pmu on exit and swap the counters on context switches. That's not enough, memory node wide counters are impossible to isolate like that, the same for core wide (ANY flag) counters.
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 14:51 +0100, Jes Sorensen wrote: > > > Furthermore, when KVM doesn't virtualize the physical system topology, > > some PMU features cannot even be sanely used from a vcpu. > > That is definitely an issue, and there is nothing we can really do about > that. Having two guests running in parallel under KVM means that they > are going to see more cache misses than they would if they ran barebone > on the hardware. > > However even with all of this, we have to keep in mind who is going to > use the performance monitoring in a guest. It is going to be application > writers, mostly people writing analytical/scientific applications. They > rarely have control over the OS they are running on, but are given > systems and told to work on what they are given. Driver upgrades and > things like that don't come quickly. However they also tend to > understand limitations like these and will be able to still benefit from > perf on a system like that. What I meant was things like memory controller bound counters, intel uncore and amd northbridge, without knowing what node the vcpu got scheduled to there is no way they can program the raw hardware in a meaningful way, amd nb in particular is interesting in that you could choose not to offer the intel uncore msrs, but the amd nb are shadowed over the generic pmcs, so you have no way to filter those out. Same goes for stuff like the intel ANY flag, LBR filter control and similar muck, a vcpu can't make use of those things in a meaningful manner. Also, intel debugstore things requires a host linear address, again, not something a vcpu can easily provide (although that might be worked around with an msr trap, but that still limits you to 1 page data sizes, not a limitation all software will respect). > All that said, what we really want is for Intel+AMD to come up with > proper hw PMU virtualization support that makes it easy to rotate the > full PMU in and out for a guest. 
Then this whole discussion will become > a non issue. As it stands there simply are a number of PMU features that defy being virtualized, simply because the virt stuff doesn't do system topology. So even if they were to support a virtualized pmu, it would likely be a different beast than the native hardware is, and it will be several hardware models in the future, coming up with a paravirt interface and getting !linux hosts to adapt and !linux guests to use is probably as 'easy'.
Re: KVM PMU virtualization
* Avi Kivity wrote: > >> Certainly guests that we don't port won't be able to use this. I doubt > >> we'll be able to make Windows work with this - the only performance tool > >> I'm > >> familiar with on Windows is Intel's VTune, and that's proprietary. > > > > Dont you see the extreme irony of your wish to limit Linux kernel design > > decisions and features based on ... Windows and other proprietary > > software? > > Not at all. Virtualization is a hardware compatibility game. To see what > happens if you don't play it, see Xen. Eventually they to implemented > hardware support even though the pv approach is so wonderful. That's not quite equivalent though. KVM used to be the clean, integrate-code-with-Linux virtualization approach, designed specifically for CPUs that can be virtualized properly. (VMX support first, then SVM, etc.) KVM virtualized ages-old concepts with relatively straightforward hardware ABIs: x86 execution, IRQ abstractions, device abstractions, etc. Now you are in essence turning that all around: - the PMU is by no means properly virtualized nor really virtualizable by direct access. There's no virtual PMU that ticks independently of the host PMU. - the PMU hardware itself is not a well standardized piece of hardware. It's very vendor dependent and very limiting. So to some degree you are playing the role of Xen in this specific affair. You are pushing for something that shouldnt be done in that form. You want to interfere with the host PMU by going via the fast & easy short-term hack to just let the guest OS have the PMU, without any regard to how this impacts long-term feasible solutions. I.e. you are a bit like the guy who would have told Linus in 1994: " Dude, why dont you use the Windows APIs? It's far more compatible and that's the only way you could run any serious apps. Besides, it requires no upgrade. Admittedly it's a bit messy and 16-bit but hey, that's our installed base after all. 
" Ingo
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote: > > That actually works on the Intel-only architectural pmu. I'm beginning > to like it more and more. Only for the arch defined events, all _7_ of them.
Re: KVM PMU virtualization
On 02/26/2010 04:01 PM, Ingo Molnar wrote: * Avi Kivity wrote: On 02/26/2010 03:31 PM, Ingo Molnar wrote: * Avi Kivity wrote: Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem. You are missing two big things wrt. compatibility here: 1) The first upgrade overhead is a one time overhead only. May be one too many, for certain guests. Of course it may be argued that if the guest wants performance monitoring that much, they will upgrade. Yes, that can certainly be argued. Note another logical inconsistency: you are assuming reluctance to upgrade for a set of users who are doing _performance analysis_. In fact those types of users are amongst the most upgrade-happy. Often they'll run modern hardware and modern software. Most of the time they are developers themselves who try to make sure their stuff works on the latest & greatest hardware _and_ software. I wouldn't go as far, but I agree there is less resistance to change here. A Windows user certainly ought to be willing to install a new VTune release, and a RHEL user can be convinced to upgrade from (say) 5.4 to 5.6 with new backported paravirt pmu support. I wouldn't like to force them to upgrade to 2.6.3x though. Many of those users will be developers of in-house applications who are trying to understand their applications under production loads. Certainly guests that we don't port won't be able to use this. I doubt we'll be able to make Windows work with this - the only performance tool I'm familiar with on Windows is Intel's VTune, and that's proprietary. Dont you see the extreme irony of your wish to limit Linux kernel design decisions and features based on ... Windows and other proprietary software? Not at all. Virtualization is a hardware compatibility game. To see what happens if you don't play it, see Xen. 
Eventually they too implemented hardware support even though the pv approach is so wonderful. If we go the pv route, we'll limit the usefulness of Linux in this scenario to a subset of guests. Users will simply walk away and choose a hypervisor whose authors have less interest in irony and more in providing the features they want. A pv approach can come after we have a baseline that is useful to all users. 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Dont you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support. That also works for the architectural pmu, of course that's Intel only. And there you don't need to upgrade the guest even once. Besides being Intel only, it only exposes a limited sub-set of hw events. (far fewer than the generic ones offered by perf events) Things aren't mutually exclusive. Offer the arch pmu for maximum future compatibility (Intel only, alas), the full pmu for maximum features, and the pv pmu for flexibility.
Re: KVM PMU virtualization
* Avi Kivity wrote: > On 02/26/2010 03:44 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >>On 02/26/2010 03:06 PM, Ingo Molnar wrote: > >Firstly, an emulated PMU was only the second-tier option i suggested. By > >far > >the best approach is native API to the host regarding performance events > >and > >good guest side integration. > > > >Secondly, the PMU cannot be 'given' to the guest in the general case. > >Those > >are privileged registers. They can expose sensitive host execution > >details, > >etc. etc. So if you emulate a PMU you have to exit out of most PMU > >accesses > >anyway for a secure solution. (RDPMC can still be supported, but in close > >cooperation with the host) > There is nothing secret in the host PMU, and it's easy to clear out the > counters before passing them off to the guest. > >>>That's wrong. On some CPUs the host PMU can be used to say sample aspects > >>>of > >>>another CPU, allowing statistical attacks to recover crypto keys. It can be > >>>used to sample memory access patterns of another node. > >>> > >>>There's a good reason PMU configuration registers are privileged and > >>>there's > >>>good value in only giving a certain sub-set to less privileged entities by > >>>default. > >>Even if there were no security considerations, if the guest can observe host > >>data in the pmu, it means the pmu is inaccurate. We should expose guest > >>data only in the guest pmu. That's not difficult to do, you stop the pmu on > >>exit and swap the counters on context switches. > >Again you are making an incorrect assumption: that information leakage via > >the > >PMU only occurs while the host is running on that CPU. It does not - the PMU > >can leak general system details _while the guest is running_. > > You mean like bus transactions on a multicore? Well, we're already > exposed to cache timing attacks. If you give a full PMU to a guest it's a whole different dimension and quality of information. 
Literally hundreds of different events about all sorts of aspects of the CPU and the hardware in general. > >So for this and for the many other reasons we dont want to give a raw PMU to > >guests: > > > > - A paravirt event driver is more compatible and more transparent in the > > long > >run: it allows hardware upgrade and upgraded PMU functionality (for > > Linux) > >without having to upgrade the guest OS. Via that a guest OS could even be > >live-migrated to a different PMU, without noticing anything about it. > > What about Windows? What is your question? Why should i limit Linux kernel design decisions based on any aspect of Windows? You might want to support it, but _please_ dont let the design be dictated by it ... > >In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS > >always assumes the guest OS is upgraded to the host. Also, 'raw' PMU > > state > >cannot be live-migrated. (save/restore doesnt help) > > Why not? So long as the source and destination are compatible? 'As long as it works' is certainly a good enough filter for quality ;-) > > - It's far cleaner on the host side as well: more granular, per event usage > >is possible. The guest can use portion of the PMU (managed by the host), > >and the host can use a portion too. > > > >In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS > >precludes the host OS from running some different piece of > > instrumentation > >at the same time. > > Right, time slicing is something we want. > > > - It's more secure: the host can have a finegrained policy about what > > kinds of > >events it exposes to the guest. It might chose to only expose software > >events for example. > > > >In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is > >an all-or-nothing policy affair: either you fully allow the guest (and > > live > >with whatever consequences the piece of hardware that takes up a fair > > chunk > >on the CPU die causes), or you allow none of it. 
> > No, we can hide insecure events with a full pmu. Trap the control register > and don't pass it on to the hardware. So you basically concede partial emulation ... > > - A proper paravirt event driver gives more features as well: it can > > exposes > >host software events and tracepoints, probes - not restricting itself to > >the 'hardware PMU' abstraction. > > But it is limited to whatever the host stack supports. At least > that's our control, but things like PEBS will take a ton of work. PEBS support is being implemented for perf, as a transparent feature. So once it's available, PEBS support will magically improve the quality of guest OS samples, if a paravirt driver approach is used and if sys_perf_event_open() is taught about that driver. Without any other change needed on the guest side. > > - There's proper event scheduling and event allocation. Time-slicing, etc.
Re: KVM PMU virtualization
On 02/26/2010 04:07 PM, Jes Sorensen wrote: On 02/26/10 14:27, Ingo Molnar wrote: * Jes Sorensen wrote: You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2, whereas Nehalem and Atom are v3 if I remember correctly. [...] Of course you can emulate a good portion of it, as long as there's perf support on the host side for P4. Actually P4 is pretty uninteresting in this discussion due to the lack of VMX support, it's the same issue for Nehalem vs Core2. The problem is the same though, we cannot tell the guest that yes P4 has this event, but no, we are going to feed you bogus data. The Pentium D which is a P4 derivative has vmx support. However it is so slow I'm fine with ignoring it for this feature.
Re: KVM PMU virtualization
On 02/26/10 14:27, Ingo Molnar wrote: * Jes Sorensen wrote: You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2, whereas Nehalem and Atom are v3 if I remember correctly. [...] Of course you can emulate a good portion of it, as long as there's perf support on the host side for P4. Actually P4 is pretty uninteresting in this discussion due to the lack of VMX support, it's the same issue for Nehalem vs Core2. The problem is the same though, we cannot tell the guest that yes P4 has this event, but no, we are going to feed you bogus data. If the guest programs a cachemiss event, you program a cachemiss perf event on the host and feed its values to the emulated MSR state. You _dont_ program the raw PMU on the host side - just use the API i outlined to get struct perf_event. The emulation wont be perfect: not all events will count and not all events will be available in a P4 (and some Core2 events might not even make sense in a P4), but that is reality as well: often documented events dont count, and often non-documented events count. What matters to 99.9% of people who actually use this stuff is a few core sets of events - which are available in P4s and in Core2 as well. Cycles, instructions, branches, maybe cache-misses. Sometimes FPU stuff. I really do not like to make guesses about how people use this stuff. The things you and I look for as kernel hackers are often very different from what application authors look for and use. That is one thing I learned from being exposed to strange Fortran programmers at SGI. It makes me very uncomfortable telling a guest OS that we offer features X, Y, Z and then start lying, feeding back numbers that do not match what was requested, and there is no way to tell the guest that. For Linux<->Linux the sanest, tier-1 approach would be to map sys_perf_open() on the guest side over to the host, transparently, via a paravirt driver. Paravirt is a nice optimization, but is and will always be an optimization. 
Fact of the matter is that the bulk of usage of virtualization is for running distributions with slow kernel upgrade rates, like SLES and RHEL, and other proprietary operating systems which we have no control over. Para-virt will do little good for either of these groups. Cheers, Jes
Re: KVM PMU virtualization
* Avi Kivity wrote: > On 02/26/2010 03:31 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >>Or do you mean to define a new, kvm-specific pmu model and feed it off the > >>host pmu? In this case all the guests will need to be taught about it, > >>which raises the compatibility problem. > >You are missing two big things wrt. compatibility here: > > > > 1) The first upgrade overhead a one time overhead only. > > May be one too many, for certain guests. Of course it may be argued > that if the guest wants performance monitoring that much, they will > upgrade. Yes, that can certainly be argued. Note another logical inconsistency: you are assuming reluctance to upgrade for a set of users who are doing _performance analysis_. In fact those types of users are amongst the most upgrade-happy. Often they'll run modern hardware and modern software. Most of the time they are developers themselves who try to make sure their stuff works on the latest & greatest hardware _and_ software. So people running P4's trying to tune their stuff under Red Hat Linux 9 and trying to use the PMU under KVM is not really a concern rooted overly deeply in reality. > Certainly guests that we don't port won't be able to use this. I doubt > we'll be able to make Windows work with this - the only performance tool I'm > familiar with on Windows is Intel's VTune, and that's proprietary. Dont you see the extreme irony of your wish to limit Linux kernel design decisions and features based on ... Windows and other proprietary software? > > 2) Once a Linux guest has upgraded, it will work in the future, with _any_ > > future CPU - _without_ having to upgrade the guest! > > > >Dont you see the advantage of that? You can instrument an old system on new > >hardware, without having to upgrade that guest for the new CPU support. > > That also works for the architectural pmu, of course that's Intel > only. And there you don't need to upgrade the guest even once. 
Besides being Intel only, it only exposes a limited sub-set of hw events. (far fewer than the generic ones offered by perf events) Ingo
Re: KVM PMU virtualization
On 02/26/2010 03:37 PM, Jes Sorensen wrote: On 02/26/10 14:31, Ingo Molnar wrote: You are missing two big things wrt. compatibility here: 1) The first upgrade overhead is a one time overhead only. 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Dont you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support. That would only work if you are guaranteed to be able to emulate old hardware on new hardware. Not going to be feasible, so then we are in a real mess. That actually works on the Intel-only architectural pmu. I'm beginning to like it more and more.
Re: KVM PMU virtualization
On 02/26/2010 03:44 PM, Ingo Molnar wrote: * Avi Kivity wrote: On 02/26/2010 03:06 PM, Ingo Molnar wrote: Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest. That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default. Even if there were no security considerations, if the guest can observe host data in the pmu, it means the pmu is inaccurate. We should expose guest data only in the guest pmu. That's not difficult to do, you stop the pmu on exit and swap the counters on context switches. Again you are making an incorrect assumption: that information leakage via the PMU only occurs while the host is running on that CPU. It does not - the PMU can leak general system details _while the guest is running_. You mean like bus transactions on a multicore? Well, we're already exposed to cache timing attacks. So for this and for the many other reasons we dont want to give a raw PMU to guests: - A paravirt event driver is more compatible and more transparent in the long run: it allows hardware upgrade and upgraded PMU functionality (for Linux) without having to upgrade the guest OS. 
Via that a guest OS could even be live-migrated to a different PMU, without noticing anything about it. What about Windows? In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state cannot be live-migrated. (save/restore doesnt help) Why not? So long as the source and destination are compatible? - It's far cleaner on the host side as well: more granular, per event usage is possible. The guest can use a portion of the PMU (managed by the host), and the host can use a portion too. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS precludes the host OS from running some different piece of instrumentation at the same time. Right, time slicing is something we want. - It's more secure: the host can have a finegrained policy about what kinds of events it exposes to the guest. It might choose to only expose software events for example. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is an all-or-nothing policy affair: either you fully allow the guest (and live with whatever consequences the piece of hardware that takes up a fair chunk on the CPU die causes), or you allow none of it. No, we can hide insecure events with a full pmu. Trap the control register and don't pass it on to the hardware. - A proper paravirt event driver gives more features as well: it can expose host software events and tracepoints, probes - not restricting itself to the 'hardware PMU' abstraction. But it is limited to whatever the host stack supports. At least that's our control, but things like PEBS will take a ton of work. - There's proper event scheduling and event allocation. Time-slicing, etc. The thing is, we made quite similar arguments in the past, during the perfmon vs. perfcounters discussions. There's really a big advantage to proper abstractions, both on the host and on the guest side. We only control half of the equation. 
That's very different compared to tools/perf.
Re: KVM PMU virtualization
On 02/26/10 14:28, Peter Zijlstra wrote: On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote: It would be the other way round - the host would steal the pmu from the guest. Later we can try to time-slice and extrapolate, though that's not going to be easy. Right, so perf already does the time slicing and interpolating thing, so a soft-pmu gets that for free. What I don't like here is that without rewriting the guest OS, there will be two layers of time-slicing and extrapolation. That is going to make the reported numbers close to useless. Anyway, this discussion seems somewhat in a stale-mate position. The KVM folks basically demand a full PMU MSR shadow with PMI passthrough so that their $legacy shit works without modification. My question with that is how $legacy muck can ever know how the current PMU works, you can't even properly emulate a core2 pmu on a nehalem because intel keeps messing with the event codes for every new model. So basically for this to work means the guest can't run legacy stuff anyway, but needs to run very up-to-date software, so we might as well create a soft-pmu/paravirt interface now and have all up-to-date software support that for the next generation. That is the problem. Today there is a large install base out there of core2 users who wish to measure their stuff on the hardware they have. The same will be true for Nehalem based stuff, when whatever replaces Nehalem comes out and makes that incompatible. Since we are unable to emulate Core2 on Nehalem, and almost certainly will be unable to emulate Nehalem on its successor, we are stuck with this. A para-virt interface is a nice idea, but since we cannot emulate an old CPU properly it still means there isn't much we can do as we're stuck with the same limitations. I simply don't see the value of introducing a para-virt interface for this. Furthermore, when KVM doesn't virtualize the physical system topology, some PMU features cannot even be sanely used from a vcpu. 
That is definitely an issue, and there is nothing we can really do about that. Having two guests running in parallel under KVM means that they are going to see more cache misses than they would if they ran barebone on the hardware. However even with all of this, we have to keep in mind who is going to use the performance monitoring in a guest. It is going to be application writers, mostly people writing analytical/scientific applications. They rarely have control over the OS they are running on, but are given systems and told to work on what they are given. Driver upgrades and things like that don't come quickly. However they also tend to understand limitations like these and will be able to still benefit from perf on a system like that. So while currently a root user can already tie up all of the pmu using perf, simply using that to hand the full pmu off to the guest still leaves lots of issues. Well isn't that the case with the current setup anyway? If enough user apps start requesting PMU resources, the hw is going to run out of counters very quickly anyway. The real issue here IMHO is whether or not it is possible to use a PMU to count anything on a different CPU? If that is really possible, sharing the PMU is not an option :( All that said, what we really want is for Intel+AMD to come up with proper hw PMU virtualization support that makes it easy to rotate the full PMU in and out for a guest. Then this whole discussion will become a non issue. Cheers, Jes
Re: KVM PMU virtualization
On 02/26/2010 03:28 PM, Peter Zijlstra wrote: On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote: It would be the other way round - the host would steal the pmu from the guest. Later we can try to time-slice and extrapolate, though that's not going to be easy. Right, so perf already does the time slicing and interpolating thing, so a soft-pmu gets that for free. True. Anyway, this discussion seems somewhat in a stale-mate position. The KVM folks basically demand a full PMU MSR shadow with PMI passthrough so that their $legacy shit works without modification. My question with that is how $legacy muck can ever know how the current PMU works, you can't even properly emulate a core2 pmu on a nehalem because intel keeps messing with the event codes for every new model. Right, this is pretty bad. For Windows it's probably acceptable to upgrade your performance tools (since that's separate from the OS). In Linux it is integrated into the kernel, and it's fairly unacceptable to demand a kernel upgrade when your host is upgraded underneath you. So basically for this to work means the guest can't run legacy stuff anyway, but needs to run very up-to-date software, so we might as well create a soft-pmu/paravirt interface now and have all up-to-date software support that for the next generation. Still that leaves us with no Windows / non-Linux solution.
Re: KVM PMU virtualization
* Avi Kivity wrote: > On 02/26/2010 03:06 PM, Ingo Molnar wrote: > > > >>>Firstly, an emulated PMU was only the second-tier option i suggested. By > >>>far > >>>the best approach is native API to the host regarding performance events > >>>and > >>>good guest side integration. > >>> > >>>Secondly, the PMU cannot be 'given' to the guest in the general case. Those > >>>are privileged registers. They can expose sensitive host execution details, > >>>etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses > >>>anyway for a secure solution. (RDPMC can still be supported, but in close > >>>cooperation with the host) > >>There is nothing secret in the host PMU, and it's easy to clear out the > >>counters before passing them off to the guest. > >That's wrong. On some CPUs the host PMU can be used to say sample aspects of > >another CPU, allowing statistical attacks to recover crypto keys. It can be > >used to sample memory access patterns of another node. > > > >There's a good reason PMU configuration registers are privileged and there's > >good value in only giving a certain sub-set to less privileged entities by > >default. > > Even if there were no security considerations, if the guest can observe host > data in the pmu, it means the pmu is inaccurate. We should expose guest > data only in the guest pmu. That's not difficult to do, you stop the pmu on > exit and swap the counters on context switches. Again you are making an incorrect assumption: that information leakage via the PMU only occurs while the host is running on that CPU. It does not - the PMU can leak general system details _while the guest is running_. So for this and for the many other reasons we dont want to give a raw PMU to guests: - A paravirt event driver is more compatible and more transparent in the long run: it allows hardware upgrade and upgraded PMU functionality (for Linux) without having to upgrade the guest OS. 
Via that a guest OS could even be live-migrated to a different PMU, without noticing anything about it. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state cannot be live-migrated. (save/restore doesn't help)

- It's far cleaner on the host side as well: more granular, per event usage is possible. The guest can use a portion of the PMU (managed by the host), and the host can use a portion too. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS precludes the host OS from running some different piece of instrumentation at the same time.

- It's more secure: the host can have a fine-grained policy about what kinds of events it exposes to the guest. It might choose to only expose software events, for example. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is an all-or-nothing policy affair: either you fully allow the guest (and live with whatever consequences the piece of hardware that takes up a fair chunk of the CPU die causes), or you allow none of it.

- A proper paravirt event driver gives more features as well: it can expose host software events, tracepoints and probes - not restricting itself to the 'hardware PMU' abstraction.

- There's proper event scheduling and event allocation. Time-slicing, etc.

The thing is, we made quite similar arguments in the past, during the perfmon vs. perfcounters discussions. There's really a big advantage to proper abstractions, both on the host and on the guest side.

Ingo
Re: KVM PMU virtualization
On 02/26/2010 03:31 PM, Ingo Molnar wrote:
> * Avi Kivity wrote:
> > Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem.
> You are missing two big things wrt. compatibility here:
>
> 1) The first upgrade overhead is a one-time overhead only.

May be one too many, for certain guests. Of course it may be argued that if the guest wants performance monitoring that much, they will upgrade. Certainly guests that we don't port won't be able to use this. I doubt we'll be able to make Windows work with this - the only performance tool I'm familiar with on Windows is Intel's VTune, and that's proprietary.

> 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Don't you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support.

That also works for the architectural pmu, of course that's Intel only. And there you don't need to upgrade the guest even once. The arch pmu seems nicely done - there's a bit for every counter that can be enabled and disabled at will, and the number of counters is also determined from cpuid.
Re: KVM PMU virtualization
On 02/26/10 14:31, Ingo Molnar wrote:
> You are missing two big things wrt. compatibility here:
>
> 1) The first upgrade overhead is a one-time overhead only.
>
> 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Don't you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support.

That would only work if you are guaranteed to be able to emulate old hardware on new hardware. Not going to be feasible, so then we are in a real mess.

> With the 'steal the PMU' messy approach the guest OS has to be upgraded to the new CPU type all the time. Ad infinitum.

The way the Perfmon architecture is specified by Intel, that is what we are stuck with. It's not going to be possible via software emulation to count cache misses, unless you run it in a micro architecture emulator.

Jes
Re: KVM PMU virtualization
On 02/26/10 14:18, Ingo Molnar wrote:
> * Avi Kivity wrote:
> > Can you emulate the Core 2 pmu on, say, a P4? [...] How about the Pentium? Or the i486?
> As long as there's perf events support, the CPU can be supported in a soft PMU. You can even cross-map exotic hw events if need be - but most of the tooling (in just about any OS) uses just a handful of core events ...

This is only possible if all future CPU perfmon events are guaranteed to be a superset of previous versions. Otherwise you end up emulating events and providing randomly generated numbers back. The perfmon revision and size we present to a guest have to match the current host.

Jes
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote: > It would be the other way round - the host would steal the pmu from the > guest. Later we can try to time-slice and extrapolate, though that's > not going to be easy. Right, so perf already does the time slicing and interpolating thing, so a soft-pmu gets that for free. Anyway, this discussion seems somewhat in a stale-mate position. The KVM folks basically demand a full PMU MSR shadow with PMI passthrough so that their $legacy shit works without modification. My question with that is how $legacy muck can ever know how the current PMU works, you can't even properly emulate a core2 pmu on a nehalem because intel keeps messing with the event codes for every new model. So basically for this to work means the guest can't run legacy stuff anyway, but needs to run very up-to-date software, so we might as well create a soft-pmu/paravirt interface now and have all up-to-date software support that for the next generation. Furthermore, when KVM doesn't virtualize the physical system topology, some PMU features cannot even be sanely used from a vcpu. So while currently a root user can already tie up all of the pmu using perf, simply using that to hand the full pmu off to the guest still leaves lots of issues. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 14:30, Avi Kivity wrote:
> On 02/26/2010 03:06 PM, Ingo Molnar wrote:
> > That's precisely my point: the guest should obviously not get raw access to the PMU. (except where it might matter to performance, such as RDPMC)
> That's doable if all counters are steerable. IIRC some counters are fixed function, but I'm not certain about that.

I am not an expert, but from what I learned from Peter, there are constraints on some of the counters. I.e. certain types of events can only be counted on certain counters, which limits the already very limited number of counters even further.

Cheers, Jes
Re: KVM PMU virtualization
On 02/26/2010 03:27 PM, Ingo Molnar wrote:
> For Linux<->Linux the sanest, tier-1 approach would be to map sys_perf_open() on the guest side over to the host, transparently, via a paravirt driver.

Let us for the purpose of this discussion assume that we are also interested in supporting Windows and older Linux. Paravirt optimizations can be added after we have the basic functionality, if they prove necessary.
Re: KVM PMU virtualization
* Avi Kivity wrote:

> Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem.

You are missing two big things wrt. compatibility here:

1) The first upgrade overhead is a one-time overhead only.

2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Don't you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support.

With the 'steal the PMU' messy approach the guest OS has to be upgraded to the new CPU type all the time. Ad infinitum.

Ingo
Re: KVM PMU virtualization
On 02/26/10 14:06, Ingo Molnar wrote:
> * Jes Sorensen wrote:
> > Well you cannot steal the PMU without collaborating with perf_event.c, but that's quite feasible. Sharing the PMU between the guest and the host is very costly and guarantees incorrect results in the host. Unless you completely emulate the PMU by faking it and then allocating PMU counters one by one at the host level. However that means trapping a lot of MSR access.
> It's not that many MSR accesses.

Well it's more than enough to double the number of MSRs KVM has to track on switches.

> > There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest.
> That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default.

If a PMU can really count stuff on another CPU, then we shouldn't allow PMU access to any application at all. It's more than just a KVM guest vs a KVM guest issue then, but also a thread to thread issue. My idea was obviously not to expose host timings to a guest. Save the counters when a guest exits, and reload them when it's restarted. Not just when switching to another task, but also when entering KVM, to avoid the guest seeing overhead spent within KVM.

> > Having an allocation scheme and sharing it with the host, is a perfectly legitimate and very clean way to do it. Once it's given to the guest, the host knows not to touch it until it's been released again.
> 'Full PMU' is not the granularity i find acceptable though: please do what i suggested, event granularity allocation and scheduling.

As I wrote earlier, at that level we have to do it all emulated. In this case, providing any of this to a guest seems to be a waste of time since the interface will cost way too much in trapping back and forth, and you have contention with the very limited resources in the PMU with just 5 counters to pick from on Core2. The guest PMU will think it's running on top of real hardware, and will be scaling/estimating numbers like the perf_event.c code does today, except that it will be using already scaled and estimated numbers for its calculations. Application users will have little use for this.

> > Well with the hardware currently available, there is no such thing as clean sharing between the host and the guest. It cannot be done without messing up the host measurements, which effectively renders measuring at the host side useless while a guest is allowed access to the PMU.
> That's precisely my point: the guest should obviously not get raw access to the PMU. (except where it might matter to performance, such as RDPMC)

Well either you allow access to the PMU or you don't. If you allow direct access to the PMU counters, but not the control registers, you have to specify the counter sizes to match that of the host, making it impossible to really emulate core2 on a non core2 architecture etc.

Jes
Re: KVM PMU virtualization
On 02/26/2010 03:06 PM, Ingo Molnar wrote:
> > Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration.
> >
> > Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host)
> > There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest.
> That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default.

Even if there were no security considerations, if the guest can observe host data in the pmu, it means the pmu is inaccurate. We should expose guest data only in the guest pmu. That's not difficult to do, you stop the pmu on exit and swap the counters on context switches.

> > Having an allocation scheme and sharing it with the host, is a perfectly legitimate and very clean way to do it. Once it's given to the guest, the host knows not to touch it until it's been released again.
> 'Full PMU' is not the granularity i find acceptable though: please do what i suggested, event granularity allocation and scheduling. We are rehashing the whole 'perfmon versus perf events/counters' design arguments again here really.

Scheduling at event granularity would be a good thing. However we need to be able to handle the guest using the full pmu. Note that scheduling is only needed if both the guest and host want the pmu at the same time - and that should be a rare case and not the one to optimize for.

> > > You need to integrate it properly so that host PMU functionality still works fine. (Within hardware constraints)
> > Well with the hardware currently available, there is no such thing as clean sharing between the host and the guest. It cannot be done without messing up the host measurements, which effectively renders measuring at the host side useless while a guest is allowed access to the PMU.
> That's precisely my point: the guest should obviously not get raw access to the PMU. (except where it might matter to performance, such as RDPMC)

That's doable if all counters are steerable. IIRC some counters are fixed function, but I'm not certain about that.
Re: KVM PMU virtualization
* Jes Sorensen wrote:

> > Agree about favouring modern processors.
> >
> > You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2, whereas Nehalem and Atom are v3 if I remember correctly. [...]

Of course you can emulate a good portion of it, as long as there's perf support on the host side for P4. If the guest programs a cache-miss event, you program a cache-miss perf event on the host and feed its values to the emulated MSR state. You _don't_ program the raw PMU on the host side - just use the API i outlined to get struct perf_event.

The emulation won't be perfect: not all events will count and not all events will be available in a P4 (and some Core2 events might not even make sense in a P4), but that is reality as well: often documented events don't count, and often non-documented events count. What matters to 99.9% of people who actually use this stuff is a few core sets of events - which are available in P4s and in Core2 as well. Cycles, instructions, branches, maybe cache-misses. Sometimes FPU stuff.

For Linux<->Linux the sanest, tier-1 approach would be to map sys_perf_open() on the guest side over to the host, transparently, via a paravirt driver.

Ingo
Re: KVM PMU virtualization
* Avi Kivity wrote:

> Can you emulate the Core 2 pmu on, say, a P4? [...] How about the Pentium? Or the i486?

As long as there's perf events support, the CPU can be supported in a soft PMU. You can even cross-map exotic hw events if need be - but most of the tooling (in just about any OS) uses just a handful of core events ...

Ingo
Re: KVM PMU virtualization
On 02/26/10 14:04, Avi Kivity wrote:
> On 02/26/2010 02:38 PM, Ingo Molnar wrote:
> > Yes, something like Core2 with 2 generic events. That would leave 2 extra generic events on Nehalem and better. (which is really the target CPU type for any new feature we are talking about right now. Plus performance analysis tends to skew towards more modern CPU types as well.)
> Can you emulate the Core 2 pmu on, say, a P4? Those P4s have very different instruction caches so I imagine the events are very different as well. Agree about favouring modern processors.

You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2, whereas Nehalem and Atom are v3 if I remember correctly. I am not even 100% sure a v3 is capable of emulating a v2; I expect v3 to have bigger counters than v2, but I don't think that is guaranteed. I can only handle so many hours of reading Intel manuals per day before I end up in a padded cell, so I could be wrong on some of this.

> > Plus the emulation can be smart about it and only use up a given number. Most guest OSs don't use the full PMU - they use a single counter.
> But you have to expose all of the counters, no? Unless you go with a kvm-specific pmu as described below.

You have to, at least all the fixed ones (3 on Core2) and the two arch ones. That's the minimum, and any guest being told it's running on a Core2 will expect to find those.

Cheers, Jes
Re: KVM PMU virtualization
* Jes Sorensen wrote: > On 02/26/10 12:42, Ingo Molnar wrote: > > > >* Jes Sorensen wrote: > >> > >> I have to say I disagree on that. When you run perfmon on a system, it is > >> normally to measure a specific application. You want to see accurate > >> numbers for cache misses, mul instructions or whatever else is selected. > > > > You can still get those. You can even enable RDPMC access and avoid VM > > exits. > > > > What you _cannot_ do is to 'steal' the PMU and just give it to the guest. > > Well you cannot steal the PMU without collaborating with perf_event.c, but > thats quite feasible. Sharing the PMU between the guest and the host is very > costly and guarantees incorrect results in the host. Unless you completely > emulate the PMU by faking it and then allocating PMU counters one by one at > the host level. However that means trapping a lot of MSR access. It's not that many MSR accesses. > >Firstly, an emulated PMU was only the second-tier option i suggested. By far > >the best approach is native API to the host regarding performance events and > >good guest side integration. > > > >Secondly, the PMU cannot be 'given' to the guest in the general case. Those > >are privileged registers. They can expose sensitive host execution details, > >etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses > >anyway for a secure solution. (RDPMC can still be supported, but in close > >cooperation with the host) > > There is nothing secret in the host PMU, and it's easy to clear out the > counters before passing them off to the guest. That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default. 
> >>We can do this in a reasonable way today, if we allow to take the PMU away > >>from the host, and only let guests access it when it's in use. [...] > > > >You get my sure-fire NAK for that kind of crap though. Interfering with the > >host PMU and stealing it, is not a technical approach that has acceptable > >quality. > > Having an allocation scheme and sharing it with the host, is a perfectly > legitimate and very clean way to do it. Once it's given to the guest, the > host knows not to touch it until it's been released again. 'Full PMU' is not the granularity i find acceptable though: please do what i suggested, event granularity allocation and scheduling. We are rehashing the whole 'perfmon versus perf events/counters' design arguments again here really. > > You need to integrate it properly so that host PMU functionality still > > works fine. (Within hardware constraints) > > Well with the hardware currently available, there is no such thing as clean > sharing between the host and the guest. It cannot be done without messing up > the host measurements, which effectively renders measuring at the host side > useless while a guest is allowed access to the PMU. That's precisely my point: the guest should obviously not get raw access to the PMU. (except where it might matter to performance, such as RDPMC) Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 02:38 PM, Ingo Molnar wrote:
> * Avi Kivity wrote:
> > > > A native API to the host will lock out 100% of the install base now, and a large section of any future install base.
> > > ... which is why i suggested the soft-PMU approach.
> > Not sure I understand it completely. Do you mean to take the model specific host pmu events, and expose them to the guest via trap'n'emulate? In that case we may as well assign the host pmu to the guest if the host isn't using it, and avoid the traps.
> You are making the incorrect assumption that the emulated PMU uses up all host PMU resources ...

Well, in the general case, it may? If it doesn't, the host may use them. We do a similar thing with debug breakpoints. Sharing the pmu will mean trapping control msr writes at least, though.

> > Do you mean to choose some older pmu and emulate it using whatever pmu model the host has? I haven't checked, but aren't there mutually exclusive events in every model pair? The closest thing would be the architectural pmu thing.
> Yes, something like Core2 with 2 generic events. That would leave 2 extra generic events on Nehalem and better. (which is really the target CPU type for any new feature we are talking about right now. Plus performance analysis tends to skew towards more modern CPU types as well.)

Can you emulate the Core 2 pmu on, say, a P4? Those P4s have very different instruction caches so I imagine the events are very different as well. Agree about favouring modern processors.

> Plus the emulation can be smart about it and only use up a given number. Most guest OSs don't use the full PMU - they use a single counter.

But you have to expose all of the counters, no? Unless you go with a kvm-specific pmu as described below.

> Ideally for Linux<->Linux there would be a PMU paravirt driver that allocates events on an as-needed basis.

Or we could watch the control register and see how the guest programs it, provided it doesn't do that a lot.

> > Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem.
> > > And note that _any_ solution we offer locks out 100% of the installed base right now, as no solution is in the kernel yet. The only question is what kind of upgrade effort is needed for users to make use of the feature.
> > I meant the guest installed base. Hosts can be upgraded transparently to the guests (not even a shutdown/reboot).
> The irony: this time guest-transparent solutions that need no configuration are good? ;-) The very same argument holds for the file server thing: a guest transparent solution is easier wrt. the upgrade path.

If we add pmu support, guests can begin to use it immediately. If we add the file server support, guests need to install drivers before they can use it, while guest admins have no motivation to do so (it helps the host, not the guest). Is something wrong with just using sshfs? Seems a lot less hassle to me.
Re: KVM PMU virtualization
On 02/26/10 13:20, Avi Kivity wrote:
> On 02/26/2010 02:07 PM, Ingo Molnar wrote:
> > ... which is why i suggested the soft-PMU approach.
> Not sure I understand it completely.
>
> Do you mean to take the model specific host pmu events, and expose them to the guest via trap'n'emulate? In that case we may as well assign the host pmu to the guest if the host isn't using it, and avoid the traps.
>
> Do you mean to choose some older pmu and emulate it using whatever pmu model the host has? I haven't checked, but aren't there mutually exclusive events in every model pair? The closest thing would be the architectural pmu thing.

You cannot do this; as you say, there is no guarantee that there are no overlaps, and the current host may have different counter sizes too, which makes emulating it even more costly. The cpuid bits basically tell you which version of the counters is available, how many counters there are, the word size of the counters, and I believe there are bits also stating which optional features are available to be counted.

> Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem.

Cannot be done in a reasonable manner due to the above. The key to all of this is that guest OSes, including that other OS, should be able to use the performance counters without needing special para virt drivers or other OS modifications. If we start requiring that kind of stuff, the whole point of having the feature goes down the toilet.

Cheers, Jes
Re: KVM PMU virtualization
On 02/26/10 12:42, Ingo Molnar wrote:
> * Jes Sorensen wrote:
> > I have to say I disagree on that. When you run perfmon on a system, it is normally to measure a specific application. You want to see accurate numbers for cache misses, mul instructions or whatever else is selected.
> You can still get those. You can even enable RDPMC access and avoid VM exits. What you _cannot_ do is to 'steal' the PMU and just give it to the guest.

Well you cannot steal the PMU without collaborating with perf_event.c, but that's quite feasible. Sharing the PMU between the guest and the host is very costly and guarantees incorrect results in the host. Unless you completely emulate the PMU by faking it and then allocating PMU counters one by one at the host level. However that means trapping a lot of MSR access.

> Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration.
>
> Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host)

There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest.

> > We can do this in a reasonable way today, if we allow to take the PMU away from the host, and only let guests access it when it's in use. [...]
> You get my sure-fire NAK for that kind of crap though. Interfering with the host PMU and stealing it, is not a technical approach that has acceptable quality.

Having an allocation scheme and sharing it with the host, is a perfectly legitimate and very clean way to do it. Once it's given to the guest, the host knows not to touch it until it's been released again.

> You need to integrate it properly so that host PMU functionality still works fine. (Within hardware constraints)

Well with the hardware currently available, there is no such thing as clean sharing between the host and the guest. It cannot be done without messing up the host measurements, which effectively renders measuring at the host side useless while a guest is allowed access to the PMU.

Jes
Re: KVM PMU virtualization
* Avi Kivity wrote: > On 02/26/2010 02:07 PM, Ingo Molnar wrote: > >* Avi Kivity wrote: > > > >>A native API to the host will lock out 100% of the install base now, and a > >>large section of any future install base. > >... which is why i suggested the soft-PMU approach. > > Not sure I understand it completely. > > Do you mean to take the model specific host pmu events, and expose them to > the guest via trap'n'emulate? In that case we may as well assign the host > pmu to the guest if the host isn't using it, and avoid the traps. You are making the incorrect assumption that the emulated PMU uses up all host PMU resources ... > Do you mean to choose some older pmu and emulate it using whatever pmu model > the host has? I haven't checked, but aren't there mutually exclusive events > in every model pair? The closest thing would be the architectural pmu > thing. Yes, something like Core2 with 2 generic events. That would leave 2 extra generic events on Nehalem and better. (which is really the target CPU type for any new feature we are talking about right now. Plus performance analysis tends to skew towards more modern CPU types as well.) Plus the emulation can be smart about it and only use up a given number. Most guest OSs dont use the full PMU - they use a single counter. Ideally for Linux<->Linux there would be a PMU paravirt driver that allocates events on an as-needed basis. > Or do you mean to define a new, kvm-specific pmu model and feed it off the > host pmu? In this case all the guests will need to be taught about it, > which raises the compatibility problem. > > > And note that _any_ solution we offer locks out 100% of the installed base > > right now, as no solution is in the kernel yet. The only question is what > > kind of upgrade effort is needed for users to make use of the feature. > > I meant the guest installed base. Hosts can be upgraded transparently to > the guests (not even a shutdown/reboot). 
The irony: this time guest-transparent solutions that need no configuration are good? ;-) The very same argument holds for the file server thing: a guest transparent solution is easier wrt. the upgrade path.

Ingo
Re: KVM PMU virtualization
On 02/26/2010 02:07 PM, Ingo Molnar wrote:
> * Avi Kivity wrote:
>> A native API to the host will lock out 100% of the install base now, and a
>> large section of any future install base.
>
> ... which is why i suggested the soft-PMU approach.

Not sure I understand it completely.

Do you mean to take the model specific host pmu events, and expose them to the guest via trap'n'emulate? In that case we may as well assign the host pmu to the guest if the host isn't using it, and avoid the traps.

Do you mean to choose some older pmu and emulate it using whatever pmu model the host has? I haven't checked, but aren't there mutually exclusive events in every model pair? The closest thing would be the architectural pmu thing.

Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem.

> And note that _any_ solution we offer locks out 100% of the installed base
> right now, as no solution is in the kernel yet. The only question is what
> kind of upgrade effort is needed for users to make use of the feature.

I meant the guest installed base. Hosts can be upgraded transparently to the guests (not even a shutdown/reboot).

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
* Avi Kivity wrote:

> A native API to the host will lock out 100% of the install base now, and a
> large section of any future install base.

... which is why i suggested the soft-PMU approach.

And note that _any_ solution we offer locks out 100% of the installed base right now, as no solution is in the kernel yet. The only question is what kind of upgrade effort is needed for users to make use of the feature.

Ingo
Re: KVM PMU virtualization
On 02/26/2010 01:42 PM, Ingo Molnar wrote:
> * Jes Sorensen wrote:
>> On 02/26/10 11:44, Ingo Molnar wrote:
>>> Direct access to counters is not something that is a big issue. [ Given
>>> that i sometimes can see KVM redraw the screen of a guest OS real-time i
>>> doubt this is the biggest of performance challenges right now ;-) ]
>>>
>>> By far the biggest instrumentation issue is:
>>>
>>>  - availability
>>>  - usability
>>>  - flexibility
>>>
>>> Exposing the raw hw is a step backwards in many regards. The same way we
>>> dont want to expose chipsets to the guest to allow them to do RAS. The
>>> same way we dont want to expose most raw PCI devices to guest in
>>> general, but have all these virt driver abstractions.
>>
>> I have to say I disagree on that. When you run perfmon on a system, it is
>> normally to measure a specific application. You want to see accurate
>> numbers for cache misses, mul instructions or whatever else is selected.
>
> You can still get those. You can even enable RDPMC access and avoid VM
> exits. What you _cannot_ do is to 'steal' the PMU and just give it to the
> guest.

Agreed - if both the host and guest want the pmu, the host wins. This is what we do with debug registers - if both the host and guest contend for them, the host wins.

>> Emulating the PMU rather than using the real one, makes the numbers far
>> less useful. The most useful way to provide PMU support in a guest is to
>> expose the real PMU and let the guest OS program it.
>
> Firstly, an emulated PMU was only the second-tier option i suggested. By
> far the best approach is native API to the host regarding performance
> events and good guest side integration.

A native API to the host will lock out 100% of the install base now, and a large section of any future install base.

> Secondly, the PMU cannot be 'given' to the guest in the general case.
> Those are privileged registers. They can expose sensitive host execution
> details, etc. etc. So if you emulate a PMU you have to exit out of most
> PMU accesses anyway for a secure solution.
>
> (RDPMC can still be supported, but in close cooperation with the host)

No, stop and restart the counters on every exit/entry, so the guest doesn't observe any host data.

>> We can do this in a reasonable way today, if we allow to take the PMU
>> away from the host, and only let guests access it when it's in use. [...]
>
> You get my sure-fire NAK for that kind of crap though. Interfering with
> the host PMU and stealing it, is not a technical approach that has
> acceptable quality.

It would be the other way round - the host would steal the pmu from the guest. Later we can try to time-slice and extrapolate, though that's not going to be easy.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
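Avi's "stop and restart the counters on every exit/entry" point can be sketched as a small model (names and structure are hypothetical, not code from the thread): the live hardware count is banked into whichever context — host or guest — was running at each transition, so neither side ever observes the other's events.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one counter being context-switched at VM
 * entry/exit.  The free-running hardware count is banked into the
 * context that was live, and the switch point is remembered, so each
 * side only accumulates events that occurred while it ran. */
struct pmc_ctx {
	uint64_t banked;	/* counts charged to this context */
};

struct pmc_switch {
	struct pmc_ctx host, guest;
	struct pmc_ctx *live;	/* context currently being charged */
	uint64_t hw_count;	/* simulated free-running hw counter */
	uint64_t hw_mark;	/* hw_count at the last switch */
};

static void pmc_bank(struct pmc_switch *s)
{
	s->live->banked += s->hw_count - s->hw_mark;
	s->hw_mark = s->hw_count;
}

static void vm_entry(struct pmc_switch *s)
{
	pmc_bank(s);		/* charge the tail to the host */
	s->live = &s->guest;
}

static void vm_exit(struct pmc_switch *s)
{
	pmc_bank(s);		/* charge the tail to the guest */
	s->live = &s->host;
}
```

With this split, a guest reading its counter (or using RDPMC) only ever sees its own banked count, which is the property being argued for above.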
Re: KVM PMU virtualization
On 02/26/2010 01:26 PM, Ingo Molnar wrote:
>>> By far the biggest instrumentation issue is:
>>>
>>>  - availability
>>>  - usability
>>>  - flexibility
>>>
>>> Exposing the raw hw is a step backwards in many regards.
>>
>> In a way, virtualization as a whole is a step backwards. We take the nice
>> filesystem/timer/network/scheduler APIs, and expose them as raw hardware.
>> The pmu isn't any different.
>
> Uhm, it's obviously very different. A fake NE2000 will work on both Intel
> and AMD CPUs. Same for a fake PIT.
>
> PMU drivers are fundamentally per CPU vendor though. So there's no
> "generic hardware" to emulate.

That's true, and it reduces the usability of the feature (you have to restrict your migration pools or not expose the pmu), but the general points still stand.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
* Jes Sorensen wrote: > On 02/26/10 11:44, Ingo Molnar wrote: > >Direct access to counters is not something that is a big issue. [ Given that > >i > >sometimes can see KVM redraw the screen of a guest OS real-time i doubt this > >is the biggest of performance challenges right now ;-) ] > > > >By far the biggest instrumentation issue is: > > > > - availability > > - usability > > - flexibility > > > >Exposing the raw hw is a step backwards in many regards. The same way we dont > >want to expose chipsets to the guest to allow them to do RAS. The same way we > >dont want to expose most raw PCI devices to guest in general, but have all > >these virt driver abstractions. > > I have to say I disagree on that. When you run perfmon on a system, it is > normally to measure a specific application. You want to see accurate numbers > for cache misses, mul instructions or whatever else is selected. You can still get those. You can even enable RDPMC access and avoid VM exits. What you _cannot_ do is to 'steal' the PMU and just give it to the guest. > Emulating the PMU rather than using the real one, makes the numbers far less > useful. The most useful way to provide PMU support in a guest is to expose > the real PMU and let the guest OS program it. Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) > We can do this in a reasonable way today, if we allow to take the PMU away > from the host, and only let guests access it when it's in use. [...] You get my sure-fire NAK for that kind of crap though. 
Interfering with the host PMU and stealing it, is not a technical approach that has acceptable quality. You need to integrate it properly so that host PMU functionality still works fine. (Within hardware constraints)

Ingo
Re: KVM PMU virtualization
* Avi Kivity wrote: > On 02/26/2010 12:44 PM, Ingo Molnar wrote: > >>>Far cleaner would be to expose it via hypercalls to guest OSs that are > >>>interested in instrumentation. > >>It's also slower - you can give the guest direct access to the various > >>counters so no exits are taken when reading the counters (though perhaps > >>many tools are only interested in the interrupts, not the counter values). > >Direct access to counters is not something that is a big issue. [ Given that > >i > >sometimes can see KVM redraw the screen of a guest OS real-time i doubt this > >is the biggest of performance challenges right now ;-) ] > > Outside 4-bit vga mode, this shouldn't happen. Can you describe > your scenario? > > >By far the biggest instrumentation issue is: > > > > - availability > > - usability > > - flexibility > > > >Exposing the raw hw is a step backwards in many regards. > > In a way, virtualization as a whole is a step backwards. We take the nice > firesystem/timer/network/scheduler APIs, and expose them as raw hardware. > The pmu isn't any different. Uhm, it's obviously very different. A fake NE2000 will work on both Intel and AMD CPUs. Same for a fake PIT. PMU drivers are fundamentally per CPU vendor though. So there's no "generic hardware" to emulate. Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 12:24, Ingo Molnar wrote:
> There is a way to query the CPU for 'architectural perfmon' though, via
> CPUID alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut.
> The logic is:
>
>	if (c->cpuid_level > 9) {
>		unsigned eax = cpuid_eax(10);
>
>		/* Check for version and the number of counters */
>		if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
>			set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
>	}
>
> But emulating that doesnt solve the problem: as OSs generally dont key
> their PMU drivers off the relatively new 'architectural perfmon' CPUID
> detail, but based on much higher level CPUID attributes. (like Intel/AMD)

Right, there is far more to it than just the arch-perfmon feature. They still need to query cpuid 0x0a for counter size, number of counters and stuff like that.

Cheers,
Jes
Re: KVM PMU virtualization
* Jes Sorensen wrote:

> On 02/26/10 12:06, Joerg Roedel wrote:
>> Isn't there a cpuid bit indicating the availability of architectural
>> perfmon?
>
> Nope, the perfmon flag is a fake Linux flag, set based on the contents on
> cpuid 0x0a

There is a way to query the CPU for 'architectural perfmon' though, via CPUID alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut. The logic is:

	if (c->cpuid_level > 9) {
		unsigned eax = cpuid_eax(10);

		/* Check for version and the number of counters */
		if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
			set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
	}

But emulating that doesnt solve the problem: as OSs generally dont key their PMU drivers off the relatively new 'architectural perfmon' CPUID detail, but based on much higher level CPUID attributes. (like Intel/AMD)

Ingo
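For reference, the EAX fields that check inspects can be pulled apart like this — an illustrative sketch of the CPUID leaf 0xA layout the messages above refer to; the struct and function names are made up, and the test value is synthetic rather than read from real hardware:

```c
#include <assert.h>
#include <stdint.h>

/* CPUID leaf 0xA, EAX: bits 7:0 are the architectural perfmon
 * version, bits 15:8 the number of general-purpose counters,
 * bits 23:16 the counter bit width. */
struct arch_perfmon {
	unsigned version;
	unsigned num_counters;
	unsigned counter_width;
};

static struct arch_perfmon decode_cpuid_0xa(uint32_t eax)
{
	struct arch_perfmon p = {
		.version       = eax & 0xff,
		.num_counters  = (eax >> 8) & 0xff,
		.counter_width = (eax >> 16) & 0xff,
	};
	return p;
}

/* The kernel's shortcut test quoted above: a non-zero version
 * and more than one general-purpose counter. */
static int arch_perfmon_usable(uint32_t eax)
{
	return (eax & 0xff) && (((eax >> 8) & 0xff) > 1);
}
```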
Re: KVM PMU virtualization
On 02/26/10 11:44, Ingo Molnar wrote:
> Direct access to counters is not something that is a big issue. [ Given
> that i sometimes can see KVM redraw the screen of a guest OS real-time i
> doubt this is the biggest of performance challenges right now ;-) ]
>
> By far the biggest instrumentation issue is:
>
>  - availability
>  - usability
>  - flexibility
>
> Exposing the raw hw is a step backwards in many regards. The same way we
> dont want to expose chipsets to the guest to allow them to do RAS. The
> same way we dont want to expose most raw PCI devices to guest in general,
> but have all these virt driver abstractions.

I have to say I disagree on that. When you run perfmon on a system, it is normally to measure a specific application. You want to see accurate numbers for cache misses, mul instructions or whatever else is selected.

Emulating the PMU rather than using the real one, makes the numbers far less useful. The most useful way to provide PMU support in a guest is to expose the real PMU and let the guest OS program it.

We can do this in a reasonable way today, if we allow to take the PMU away from the host, and only let guests access it when it's in use.

Hopefully Intel and AMD will come up with proper hw PMU virtualization support that allows us to do it 100% guest and host at some point.

Cheers,
Jes
Re: KVM PMU virtualization
* Joerg Roedel wrote: > On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote: > > > > * Joerg Roedel wrote: > > > > > On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote: > > > > On 02/26/2010 10:42 AM, Ingo Molnar wrote: > > > >> Note that the 'soft PMU' still sucks from a design POV as there's no > > > >> generic > > > >> hw interface to the PMU. So there would have to be a 'soft AMD' and a > > > >> 'soft > > > >> Intel' PMU driver at minimum. > > > >> > > > > > > > > Right, this will severely limit migration domains to hosts of the same > > > > vendor and processor generation. There is a middle ground, though, > > > > Intel has recently moved to define an "architectural pmu" which is not > > > > model specific. I don't know if AMD adopted it. We could offer both > > > > options - native host capabilities, with a loss of compatibility, and > > > > the architectural pmu, with loss of model specific counters. > > > > > > I only had a quick look yet on the architectural pmu from intel but it > > > looks > > > like it can be emulated for a guest on amd using existing features. > > > > AMD CPUs dont have enough events for that, they cannot do the 3 fixed > > events > > in addition to the 2 generic ones. > > Good point. Maybe we can emulate that with some counter round-robin > usage if the guest really uses all 5 counters. > > > Nor do you really want to standardize on KVM guests on returning > > 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel > > PMU > > drivers, right? > > Isn't there a cpuid bit indicating the availability of architectural > perfmon? there is, but can you rely on all guest OSs keying off their PMU drivers based purely on the CPUID bit and not on any other CPUID aspects? Guest OSs like ... 
Linux v2.6.33:

	void __init init_hw_perf_events(void)
	{
		int err;

		pr_info("Performance Events: ");

		switch (boot_cpu_data.x86_vendor) {
		case X86_VENDOR_INTEL:
			err = intel_pmu_init();
			break;
		case X86_VENDOR_AMD:
			err = amd_pmu_init();
			break;
		default:

Really, if you want to emulate a single Intel PMU driver model you need to pretend that you are an Intel CPU, throughout. This cannot be had both ways.

Ingo
Re: KVM PMU virtualization
On 02/26/10 12:06, Joerg Roedel wrote:
> Isn't there a cpuid bit indicating the availability of architectural
> perfmon?

Nope, the perfmon flag is a fake Linux flag, set based on the contents of cpuid 0x0a

Jes
Re: KVM PMU virtualization
On 02/26/2010 12:44 PM, Ingo Molnar wrote:
>>> Far cleaner would be to expose it via hypercalls to guest OSs that are
>>> interested in instrumentation.
>>
>> It's also slower - you can give the guest direct access to the various
>> counters so no exits are taken when reading the counters (though perhaps
>> many tools are only interested in the interrupts, not the counter
>> values).
>
> Direct access to counters is not something that is a big issue. [ Given
> that i sometimes can see KVM redraw the screen of a guest OS real-time i
> doubt this is the biggest of performance challenges right now ;-) ]

Outside 4-bit vga mode, this shouldn't happen. Can you describe your scenario?

> By far the biggest instrumentation issue is:
>
>  - availability
>  - usability
>  - flexibility
>
> Exposing the raw hw is a step backwards in many regards.

In a way, virtualization as a whole is a step backwards. We take the nice filesystem/timer/network/scheduler APIs, and expose them as raw hardware. The pmu isn't any different.

> The same way we dont want to expose chipsets to the guest to allow them to
> do RAS. The same way we dont want to expose most raw PCI devices to guest
> in general, but have all these virt driver abstractions.

Whenever we have a choice, we expose raw hardware (usually emulated, but in some cases real). Raw hardware has the huge advantage of being already supported. Write a software abstraction, and you get to (a) write and maintain the spec (b) write drivers for all guests (c) mumble something to users of OSes to which you haven't ported your driver (d) explain to users that they need to install those drivers.

For networking and block, it is simply impossible to obtain good performance without introducing a new interface, but for other stuff, that may not be the case.

>>> That way it could also transparently integrate with tracing, probes,
>>> etc. It would also be wiser to first concentrate on improving
>>> Linux<->Linux guest/host combos before gutting the design just to fit
>>> Windows into the picture ...
>>
>> "gutting the design"?
>
> Yes, gutting the design of a sane instrumentation API and moving it back
> 10-20 years by squeezing it through non-standardized and incompatible PMU
> drivers.

Any new interface will be incompatible to all the existing guests out there; and unlike networking, you can't retrofit a pmu interface to an existing guest.

> When it comes to design my main interest is the Linux<->Linux combo.

My main interest is the OSes that users actually install, and those are Windows and non-bleeding-edge Linux. Look at guests as you do at userspace: you don't want to inflict changes upon them.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote: > > * Joerg Roedel wrote: > > > On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote: > > > On 02/26/2010 10:42 AM, Ingo Molnar wrote: > > >> Note that the 'soft PMU' still sucks from a design POV as there's no > > >> generic > > >> hw interface to the PMU. So there would have to be a 'soft AMD' and a > > >> 'soft > > >> Intel' PMU driver at minimum. > > >> > > > > > > Right, this will severely limit migration domains to hosts of the same > > > vendor and processor generation. There is a middle ground, though, > > > Intel has recently moved to define an "architectural pmu" which is not > > > model specific. I don't know if AMD adopted it. We could offer both > > > options - native host capabilities, with a loss of compatibility, and > > > the architectural pmu, with loss of model specific counters. > > > > I only had a quick look yet on the architectural pmu from intel but it > > looks > > like it can be emulated for a guest on amd using existing features. > > AMD CPUs dont have enough events for that, they cannot do the 3 fixed events > in addition to the 2 generic ones. Good point. Maybe we can emulate that with some counter round-robin usage if the guest really uses all 5 counters. > Nor do you really want to standardize on KVM guests on returning > 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU > drivers, right? Isn't there a cpuid bit indicating the availability of architectural perfmon? Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
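The counter round-robin idea Joerg floats above could look roughly like this — a toy model, not an implementation: five guest-programmed counters time-sliced over an assumed two free host counters, with live time tracked per virtual counter so the emulation can scale counts back up (count * total_ticks / ticks_live), the way perf itself handles event multiplexing:

```c
#include <assert.h>

#define NVIRT 5	/* counters the guest programmed (3 fixed + 2 generic) */
#define NPHYS 2	/* host counters assumed to be available */

/* Each scheduling tick, a different window of NPHYS virtual counters
 * is backed by real hardware; the rest are paused.  ticks_live records
 * how long each virtual counter actually counted. */
struct vcounter {
	unsigned ticks_live;
};

static void rr_tick(struct vcounter v[NVIRT], unsigned *head)
{
	for (unsigned i = 0; i < NPHYS; i++)
		v[(*head + i) % NVIRT].ticks_live++;
	*head = (*head + NPHYS) % NVIRT;
}
```

Over any multiple of NVIRT ticks the rotation is fair: every virtual counter gets the same amount of hardware time, so the scaled counts stay comparable.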
Re: KVM PMU virtualization
On 02/25/10 18:34, Joerg Roedel wrote:
> The biggest problem I see here is teaching the guest about the available
> events. The available event sets are dependent on the processor family (at
> least on AMD). A simple approach would be shadowing the perf msrs which is
> a simple thing to do. More problematic is the reinjection of performance
> interrupts and performance nmis.

IMHO the only real solution here is to map it to the host CPU, and require -cpu host for PMU support. There is no point in trying to emulate PMU features which we don't have in the hardware. Ie. you cannot count cache misses if the hardware doesn't support it.

Cheers,
Jes
Re: KVM PMU virtualization
On 02/25/10 17:26, Ingo Molnar wrote:
> Given that perf can apply the PMU to individual host tasks, I don't see
> fundamental problems multiplexing it between individual guests (which can
> then internally multiplex it again).
>
> In terms of how to expose it to guests, a 'soft PMU' might be a usable
> approach. Although to Linux guests you could expose much more
> functionality and an non-PMU-limited number of instrumentation events, via
> a more intelligent interface.
>
> But note that in terms of handling it on the host side the PMU approach is
> not acceptable: instead it should map to proper perf_events, not try to
> muck with the PMU itself.

I am not keen on emulating the PMU, if we do that we end up having to emulate a large number of MSR accesses, which is really costly. It makes a lot more sense to give the guest direct access to the PMU. The problem here is how to manage it without too much overhead.

Cheers,
Jes
Re: KVM PMU virtualization
* Joerg Roedel wrote: > On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote: > > My suggestion, as always, would be to start very simple and very minimal: > > > > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel > > image > > both as a host and as guest (for testing), to not have to deal with the > > symbol > > space transport problem initially. Enable 'perf kvm record' to only record > > guest events by default. Etc. > > > > This alone will be a quite useful result already - and gives a basis for > > further work. No need to spend months to do the big grand design straight > > away, all of this can be done gradually and in the order of usefulness - > > and > > you'll always have something that actually works (and helps your other KVM > > projects) along the way. > > > > [ And, as so often, once you walk that path, that grand scheme you are > > thinking about right now might easily become last year's really bad idea > > ;-) ] > > > > So please start walking the path and experience the challenges first-hand. > > That sounds like a good approach for the 'measure-guest-from-host' > problem. It is also not very hard to implement. Where does perf fetch > the rip of the nmi from, stack only or is this configurable? The host semantics are that it takes the stack from the regs, and with call-graph recording (perf record -g) it will walk down the exception stack, irq stack, kernel stack, and user-space stack as well. (up to the point the pages are present - it stops on a non-present page. An app that is being profiled has its stack present so it's not an issue in practice.) I'd suggest to leave out call graph sampling initially, and just get 'perf kvm top' to work with guest RIPs, simply sampled from the VM exit state. 
See arch/x86/kernel/cpu/perf_event.c:

	static void
	perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
	{
		callchain_store(entry, PERF_CONTEXT_KERNEL);
		callchain_store(entry, regs->ip);

		dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
	}

If you have easy access to the VM state from NMI context right there then just hack in the guest RIP and you should have some prototype that samples the guest. (assuming you use the same kernel image for both the host and the guest) This would be the easiest way to prototype it all.

Ingo
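The "hack in the guest RIP" step amounts to a one-line decision at PMI time. A hypothetical sketch — names invented here, but matching the shape of the per-cpu in-guest flag plus saved guest RIP used by the patch at the top of the thread:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the sampling decision: if the NMI landed while a guest
 * was running, report the guest RIP captured around VM exit (e.g.
 * vmcs_readl(GUEST_RIP) on VMX) instead of the host regs->ip. */
struct sample_src {
	int in_guest;		/* set around vcpu_run, cleared on exit */
	uint64_t guest_rip;	/* guest RIP saved at/around VM exit */
};

static uint64_t sample_ip(const struct sample_src *s, uint64_t host_regs_ip)
{
	return s->in_guest ? s->guest_rip : host_regs_ip;
}
```

With the same kernel image on host and guest, the returned RIP resolves against the host's own symbol table, which is exactly why Ingo suggests that setup for the first prototype.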
Re: KVM PMU virtualization
On 02/26/2010 12:46 PM, Ingo Molnar wrote:
>>> Right, this will severely limit migration domains to hosts of the same
>>> vendor and processor generation. There is a middle ground, though, Intel
>>> has recently moved to define an "architectural pmu" which is not model
>>> specific. I don't know if AMD adopted it. We could offer both options -
>>> native host capabilities, with a loss of compatibility, and the
>>> architectural pmu, with loss of model specific counters.
>>
>> I only had a quick look yet on the architectural pmu from intel but it
>> looks like it can be emulated for a guest on amd using existing features.
>
> AMD CPUs dont have enough events for that, they cannot do the 3 fixed
> events in addition to the 2 generic ones.
>
> Nor do you really want to standardize on KVM guests on returning
> 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel
> PMU drivers, right?

No - that would only work if AMD also adopted the architectural pmu.

Note virtualization clusters are typically split into 'migration pools' consisting of hosts with similar processor features, so that you can expose those features and yet live migrate guests at will. It's likely that all hosts have the same pmu anyway, so the only downside is that we now have to expose the host's processor family and model.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
* Joerg Roedel wrote: > On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote: > > On 02/26/2010 10:42 AM, Ingo Molnar wrote: > >> Note that the 'soft PMU' still sucks from a design POV as there's no > >> generic > >> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft > >> Intel' PMU driver at minimum. > >> > > > > Right, this will severely limit migration domains to hosts of the same > > vendor and processor generation. There is a middle ground, though, > > Intel has recently moved to define an "architectural pmu" which is not > > model specific. I don't know if AMD adopted it. We could offer both > > options - native host capabilities, with a loss of compatibility, and > > the architectural pmu, with loss of model specific counters. > > I only had a quick look yet on the architectural pmu from intel but it looks > like it can be emulated for a guest on amd using existing features. AMD CPUs dont have enough events for that, they cannot do the 3 fixed events in addition to the 2 generic ones. Nor do you really want to standardize on KVM guests on returning 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU drivers, right? Ingo -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Avi Kivity wrote: > Right, this will severely limit migration domains to hosts of the same > vendor and processor generation. There is a middle ground, though, Intel > has recently moved to define an "architectural pmu" which is not model > specific. I don't know if AMD adopted it. [...] Nope. It's "architectural" the following way: Intel wont change it with future CPU models, outside of the definitions of the hw-ABI. PMUs were model specific prior that time. I'd say there's near zero chance the MSR spaces will unify. All the 'advanced' PMU features are wildly incompatible, and the gap is increasing not decreasing. > > Far cleaner would be to expose it via hypercalls to guest OSs that are > > interested in instrumentation. > > It's also slower - you can give the guest direct access to the various > counters so no exits are taken when reading the counters (though perhaps > many tools are only interested in the interrupts, not the counter values). Direct access to counters is not something that is a big issue. [ Given that i sometimes can see KVM redraw the screen of a guest OS real-time i doubt this is the biggest of performance challenges right now ;-) ] By far the biggest instrumentation issue is: - availability - usability - flexibility Exposing the raw hw is a step backwards in many regards. The same way we dont want to expose chipsets to the guest to allow them to do RAS. The same way we dont want to expose most raw PCI devices to guest in general, but have all these virt driver abstractions. > > That way it could also transparently integrate with tracing, probes, etc. > > It would also be wiser to first concentrate on improving Linux<->Linux > > guest/host combos before gutting the design just to fit Windows into the > > picture ... > > "gutting the design"? Yes, gutting the design of a sane instrumentation API and moving it back 10-20 years by squeezing it through non-standardized and incompatible PMU drivers. 
When it comes to design my main interest is the Linux<->Linux combo.

Ingo
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote: > My suggestion, as always, would be to start very simple and very minimal: > > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image > both as a host and as guest (for testing), to not have to deal with the > symbol > space transport problem initially. Enable 'perf kvm record' to only record > guest events by default. Etc. > > This alone will be a quite useful result already - and gives a basis for > further work. No need to spend months to do the big grand design straight > away, all of this can be done gradually and in the order of usefulness - and > you'll always have something that actually works (and helps your other KVM > projects) along the way. > > [ And, as so often, once you walk that path, that grand scheme you are > thinking about right now might easily become last year's really bad idea > ;-) ] > > So please start walking the path and experience the challenges first-hand. That sounds like a good approach for the 'measure-guest-from-host' problem. It is also not very hard to implement. Where does perf fetch the rip of the nmi from, stack only or is this configurable? Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote: > On 02/26/2010 10:42 AM, Ingo Molnar wrote: >> Note that the 'soft PMU' still sucks from a design POV as there's no generic >> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft >> Intel' PMU driver at minimum. >> > > Right, this will severely limit migration domains to hosts of the same > vendor and processor generation. There is a middle ground, though, > Intel has recently moved to define an "architectural pmu" which is not > model specific. I don't know if AMD adopted it. We could offer both > options - native host capabilities, with a loss of compatibility, and > the architectural pmu, with loss of model specific counters. I have only had a quick look at the architectural PMU from Intel so far, but it looks like it can be emulated for a guest on AMD using existing features. Joerg
Re: KVM PMU virtualization
On 02/26/2010 10:42 AM, Ingo Molnar wrote: > * Joerg Roedel wrote: > > I personally don't like a self-defined event-set as the only solution > > because that would probably only work with Linux and perf. [...] > The 'soft-PMU' I suggested is transparent on the guest side - if you want to > enable non-Linux and legacy-Linux. It's basically a PMU interface provided to > the guest by catching the right MSR accesses, implemented via > perf_event_create_kernel_counter()/etc. on the host side. That only works if the software interface is 100% lossless - we can recreate every single hardware configuration through the API. Is this the case? > Note that the 'soft PMU' still sucks from a design POV as there's no generic > hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft > Intel' PMU driver at minimum. Right, this will severely limit migration domains to hosts of the same vendor and processor generation. There is a middle ground, though, Intel has recently moved to define an "architectural pmu" which is not model specific. I don't know if AMD adopted it. We could offer both options - native host capabilities, with a loss of compatibility, and the architectural pmu, with loss of model specific counters. > Far cleaner would be to expose it via hypercalls to guest OSs that are > interested in instrumentation. It's also slower - you can give the guest direct access to the various counters so no exits are taken when reading the counters (though perhaps many tools are only interested in the interrupts, not the counter values). > That way it could also transparently integrate with tracing, probes, etc. > It would also be wiser to first concentrate on improving Linux<->Linux > guest/host combos before gutting the design just to fit Windows into the > picture ... "gutting the design"? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. 
Re: KVM PMU virtualization
* Joerg Roedel wrote: > On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote: > > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote: > > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote: > > > > > > > 1) Add support to perf to allow it to monitor a KVM guest from the > > > >host. > > > > > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be > > > configured to count only when in guest mode. Perf needs to be aware of > > > that and fetch the rip from a different place when monitoring a guest. > > The idea is we want to measure both host and guest at the same time, and > > compare all the hot functions fairly. > > So you want to measure while the guest vcpu is running and the vmexit > path of that vcpu (including qemu userspace part) together? The > challenge here is to find out if a performance event originated in guest > mode or in host mode. > But we can check for that in the nmi-protected part of the vmexit path. As far as instrumentation goes, virtualization is simply another 'PID dimension' of measurement. Today we can isolate system performance measurements/events to the following domains: - per system - per cpu - per task ( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user' domain separation, and we have some ABI details for all that but it's by no means complete. Anton is using the PowerPC bits AFAIK, so it already works to a certain degree. ) When extending measurements to KVM, we want two things: - user friendliness: instead of having to check 'ps' and figure out which Qemu thread is the KVM thread we want to profile, just give a convenience namespace to access guest profiling info. -G ought to map to the first currently running KVM guest it can find. (which would match like 90% of the cases) - etc. No ifs and whens. If 'perf kvm top' doesn't show something useful by default the whole effort is for naught. 
- Extend core facilities and enable the following measurement dimensions: host-kernel-space host-user-space guest-kernel-space guest-user-space on a per guest basis. We want to be able to measure just what the guest does, and we want to be able to measure just what the host does. Some of this the hardware helps us with (say only measuring host kernel events is possible), some has to be done by fiddling with event enable/disable at vm-exit / vm-entry time. My suggestion, as always, would be to start very simple and very minimal: Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image both as a host and as guest (for testing), to not have to deal with the symbol space transport problem initially. Enable 'perf kvm record' to only record guest events by default. Etc. This alone will be a quite useful result already - and gives a basis for further work. No need to spend months to do the big grand design straight away, all of this can be done gradually and in the order of usefulness - and you'll always have something that actually works (and helps your other KVM projects) along the way. [ And, as so often, once you walk that path, that grand scheme you are thinking about right now might easily become last year's really bad idea ;-) ] So please start walking the path and experience the challenges first-hand. Thanks, Ingo
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote: > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote: > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote: > > > > > 1) Add support to perf to allow it to monitor a KVM guest from the > > >host. > > > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be > > configured to count only when in guest mode. Perf needs to be aware of > > that and fetch the rip from a different place when monitoring a guest. > The idea is we want to measure both host and guest at the same time, and > compare all the hot functions fairly. So you want to measure while the guest vcpu is running and the vmexit path of that vcpu (including the qemu userspace part) together? The challenge here is to find out if a performance event originated in guest mode or in host mode. But we can check for that in the NMI-protected part of the vmexit path. Joerg
Re: KVM PMU virtualization
* Zhang, Yanmin wrote: > On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote: > > * Jan Kiszka wrote: > > > > > Jes Sorensen wrote: > > > > Hi, > > > > > > > > It looks like several of us have been looking at how to use the PMU > > > > for virtualization. Rather than continuing to have discussions in > > > > smaller groups, I think it is a good idea we move it to the mailing > > > > lists to see what we can share and avoid duplicate efforts. > > > > > > > > There are really two separate things to handle: > > > > > > > > 1) Add support to perf to allow it to monitor a KVM guest from the > > > >host. > > > > > > > > 2) Allow guests access to the PMU (or an emulated PMU), making it > > > >possible to run perf on applications running within the guest. > > > > > > > > I know some of you have been looking at 1) and I am currently working > > > > on 2). I have been looking at various approaches, including whether it > > > > is feasible to share the PMU between the host and multiple guests. For > > > > now I am going to focus on allowing one guest to take control of the > > > > PMU, then later hopefully adding support for multiplexing it between > > > > multiple guests. > > > > > > Given that perf can apply the PMU to individual host tasks, I don't see > > > fundamental problems multiplexing it between individual guests (which can > > > then internally multiplex it again). > > > > In terms of how to expose it to guests, a 'soft PMU' might be a usable > > approach. Although to Linux guests you could expose much more > > functionality and a non-PMU-limited number of instrumentation events, via > > a more intelligent interface. > > > > But note that in terms of handling it on the host side the PMU approach is > > not acceptable: instead it should map to proper perf_events, not try to > > muck with the PMU itself. 
> > > > That, besides integrating properly with perf usage on the host, will also > > allow interesting 'PMU' features on guests: you could set up the host side > > to trace block IO requests (or VM exits) for example, and expose that as > > 'PMC > > #0' on the guest side. > > So virtualization becomes non-transparent to the guest OS? I know virtio is an > optimization on the guest side. The 'soft PMU' is transparent. The 'count IO events' kind of feature could be transparent too: you could re-configure (on the host) a given 'hardware' event to really count some software event. That would make it compatible with whatever guest side tooling (without having to change that tooling) - while still allowing interesting new things to be measured. Thanks, Ingo
Re: KVM PMU virtualization
* Joerg Roedel wrote: > I personally don't like a self-defined event-set as the only solution > because that would probably only work with Linux and perf. [...] The 'soft-PMU' I suggested is transparent on the guest side - if you want to enable non-Linux and legacy-Linux. It's basically a PMU interface provided to the guest by catching the right MSR accesses, implemented via perf_event_create_kernel_counter()/etc. on the host side. Note that the 'soft PMU' still sucks from a design POV as there's no generic hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft Intel' PMU driver at minimum. Far cleaner would be to expose it via hypercalls to guest OSs that are interested in instrumentation. That way it could also transparently integrate with tracing, probes, etc. It would also be wiser to first concentrate on improving Linux<->Linux guest/host combos before gutting the design just to fit Windows into the picture ... Ingo
Re: KVM PMU virtualization
On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote: > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote: > > > 1) Add support to perf to allow it to monitor a KVM guest from the > >host. > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be > configured to count only when in guest mode. Perf needs to be aware of > that and fetch the rip from a different place when monitoring a guest. The idea is we want to measure both host and guest at the same time, and compare all the hot functions fairly. > > > 2) Allow guests access to the PMU (or an emulated PMU), making it > >possible to run perf on applications running within the guest. > > The biggest problem I see here is teaching the guest about the available > events. The available event sets are dependent on the processor family > (at least on AMD). > A simple approach would be shadowing the perf MSRs, which is a simple > thing to do. More problematic is the reinjection of performance > interrupts and performance NMIs. > > I personally don't like a self-defined event-set as the only solution > because that would probably only work with Linux and perf. I think we > should have a way (additionally to a soft-event interface) which allows > to expose the host PMU events to the guest. > > Joerg >
Re: KVM PMU virtualization
On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote: > * Jan Kiszka wrote: > > > Jes Sorensen wrote: > > > Hi, > > > > > > It looks like several of us have been looking at how to use the PMU > > > for virtualization. Rather than continuing to have discussions in > > > smaller groups, I think it is a good idea we move it to the mailing > > > lists to see what we can share and avoid duplicate efforts. > > > > > > There are really two separate things to handle: > > > > > > 1) Add support to perf to allow it to monitor a KVM guest from the > > >host. > > > > > > 2) Allow guests access to the PMU (or an emulated PMU), making it > > >possible to run perf on applications running within the guest. > > > > > > I know some of you have been looking at 1) and I am currently working > > > on 2). I have been looking at various approaches, including whether it > > > is feasible to share the PMU between the host and multiple guests. For > > > now I am going to focus on allowing one guest to take control of the > > > PMU, then later hopefully adding support for multiplexing it between > > > multiple guests. > > > > Given that perf can apply the PMU to individual host tasks, I don't see > > fundamental problems multiplexing it between individual guests (which can > > then internally multiplex it again). > > In terms of how to expose it to guests, a 'soft PMU' might be a usable > approach. Although to Linux guests you could expose much more functionality > and a non-PMU-limited number of instrumentation events, via a more > intelligent interface. > > But note that in terms of handling it on the host side the PMU approach is > not > acceptable: instead it should map to proper perf_events, not try to muck with > the PMU itself. > > That, besides integrating properly with perf usage on the host, will also > allow interesting 'PMU' features on guests: you could set up the host side to > trace block IO requests (or VM exits) for example, and expose that as 'PMC > #0' on the guest side. 
So virtualization becomes non-transparent to the guest OS? I know virtio is an optimization on the guest side.
Re: KVM PMU virtualization
* Jan Kiszka wrote: > Jes Sorensen wrote: > > Hi, > > > > It looks like several of us have been looking at how to use the PMU > > for virtualization. Rather than continuing to have discussions in > > smaller groups, I think it is a good idea we move it to the mailing > > lists to see what we can share and avoid duplicate efforts. > > > > There are really two separate things to handle: > > > > 1) Add support to perf to allow it to monitor a KVM guest from the > >host. > > > > 2) Allow guests access to the PMU (or an emulated PMU), making it > >possible to run perf on applications running within the guest. > > > > I know some of you have been looking at 1) and I am currently working > > on 2). I have been looking at various approaches, including whether it > > is feasible to share the PMU between the host and multiple guests. For > > now I am going to focus on allowing one guest to take control of the > > PMU, then later hopefully adding support for multiplexing it between > > multiple guests. > > Given that perf can apply the PMU to individual host tasks, I don't see > fundamental problems multiplexing it between individual guests (which can > then internally multiplex it again). In terms of how to expose it to guests, a 'soft PMU' might be a usable approach. Although to Linux guests you could expose much more functionality and a non-PMU-limited number of instrumentation events, via a more intelligent interface. But note that in terms of handling it on the host side the PMU approach is not acceptable: instead it should map to proper perf_events, not try to muck with the PMU itself. That, besides integrating properly with perf usage on the host, will also allow interesting 'PMU' features on guests: you could set up the host side to trace block IO requests (or VM exits) for example, and expose that as 'PMC #0' on the guest side. 
That's a neat feature: the guest profiling tools would immediately (and transparently) be able to measure VM exits or IO heaviness, on a per guest basis, as seen on the host side. More would be possible too. Thanks, Ingo
Re: KVM PMU virtualization
Jes Sorensen wrote: > Hi, > > It looks like several of us have been looking at how to use the PMU > for virtualization. Rather than continuing to have discussions in > smaller groups, I think it is a good idea we move it to the mailing > lists to see what we can share and avoid duplicate efforts. > > There are really two separate things to handle: > > 1) Add support to perf to allow it to monitor a KVM guest from the >host. > > 2) Allow guests access to the PMU (or an emulated PMU), making it >possible to run perf on applications running within the guest. > > I know some of you have been looking at 1) and I am currently working > on 2). I have been looking at various approaches, including whether it > is feasible to share the PMU between the host and multiple guests. For > now I am going to focus on allowing one guest to take control of the > PMU, then later hopefully adding support for multiplexing it between > multiple guests. Given that perf can apply the PMU to individual host tasks, I don't see fundamental problems multiplexing it between individual guests (which can then internally multiplex it again). Then the next challenge might be how to handle the case of both host and guest trying to use PMU resources at the same time. For the sparse debug register resources, I simply disable the effect of guest-injected breakpoints once the host wants to use them. The guest still sees its programmed values, though. One could try to schedule free registers between both, but given how rare such use cases are, I decided to go for a simple approach. Probably the situation is not that different for the PMU. > > Eventually we will see proper hardware PMU virtualization from Intel and > AMD (admittedly I have only looked at the Intel specs so far), and by > then be able to allow the host as well as the guests to share the PMU. > > If anybody else is working on this, I'd love to hear about it so we can > coordinate our efforts. 
> The main purpose with this mail was really to bring the discussion to the > mailing list to avoid duplicated efforts. I think I've seen quite some code for PMU virtualization in Xen's HVM code. It might be worth studying what they do already and adopting/extending it for KVM. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
KVM PMU virtualization
Hi, It looks like several of us have been looking at how to use the PMU for virtualization. Rather than continuing to have discussions in smaller groups, I think it is a good idea we move it to the mailing lists to see what we can share and avoid duplicate efforts. There are really two separate things to handle: 1) Add support to perf to allow it to monitor a KVM guest from the host. 2) Allow guests access to the PMU (or an emulated PMU), making it possible to run perf on applications running within the guest. I know some of you have been looking at 1) and I am currently working on 2). I have been looking at various approaches, including whether it is feasible to share the PMU between the host and multiple guests. For now I am going to focus on allowing one guest to take control of the PMU, then later hopefully adding support for multiplexing it between multiple guests. Eventually we will see proper hardware PMU virtualization from Intel and AMD (admittedly I have only looked at the Intel specs so far), and by then we will be able to allow the host as well as the guests to share the PMU. If anybody else is working on this, I'd love to hear about it so we can coordinate our efforts. The main purpose of this mail was really to bring the discussion to the mailing list to avoid duplicated efforts. Cheers, Jes PS: I'll be AFK all of next week, so it may take a few days for me to reply to follow-up discussions.