Re: KVM PMU virtualization

2010-03-10 Thread Zhang, Yanmin
On Thu, 2010-03-04 at 09:00 +0800, Zhang, Yanmin wrote:
> On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> > > -#ifndef perf_misc_flags
> > > -#define perf_misc_flags(regs)  (user_mode(regs) ? PERF_RECORD_MISC_USER : \
> > > -PERF_RECORD_MISC_KERNEL)
> > > -#define perf_instruction_pointer(regs) instruction_pointer(regs)
> > > -#endif 
> > 
> > Ah, that #ifndef is for powerpc, which I think you just broke.
> Thanks for the reminder. I deleted the powerpc code when building the cscope
> library.
> 
> It seems the perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I plan to
> change them to a struct of callback functions, which kvm registers with
> perf.
> 
> Something like:
> struct perf_guest_info_callbacks {
>   int (*is_in_guest)();
>   u64 (*get_guest_ip)();
>   int (*copy_guest_stack)();
>   int (*reset_in_guest)();
>   ...
> };
> int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
> int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);
> 
> It's more scalable and neater.
In case you guys are losing patience, I worked out a new patch against
2.6.34-rc1.

It works with:
#perf kvm --guest --guestkallsyms /guest/os/kernel/proc/kallsyms --guestmodules /guest/os/proc/modules top
It also supports collecting both host-side and guest-side data at the same time:
#perf kvm --host --guest --guestkallsyms /guest/os/kernel/proc/kallsyms --guestmodules /guest/os/proc/modules top

The first output line of top shows the guest kernel/user space percentages.

Or just host side:
#perf kvm --host

As the perf tool sources have changed a lot, I am still working on perf kvm record
and report.
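
For orientation, here is a minimal sketch of how the perf core could consume the
registered callbacks when it takes a sample. The static perf_guest_cbs pointer and
the exact placement of these helpers are illustrative only, not copied from the
patch below:

static struct perf_guest_info_callbacks *perf_guest_cbs;

int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
	perf_guest_cbs = cbs;	/* a real version would reject double registration */
	return 0;
}

int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *cbs)
{
	perf_guest_cbs = NULL;
	return 0;
}

/* When a sample is taken: tag it as guest vs. host, user vs. kernel. */
static inline unsigned int perf_misc_flags(struct pt_regs *regs)
{
	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
		return perf_guest_cbs->is_user_mode() ?
			PERF_RECORD_MISC_GUEST_USER :
			PERF_RECORD_MISC_GUEST_KERNEL;

	return user_mode(regs) ? PERF_RECORD_MISC_USER :
				 PERF_RECORD_MISC_KERNEL;
}

/* And report the guest RIP instead of the host RIP for guest samples. */
static inline unsigned long perf_instruction_pointer(struct pt_regs *regs)
{
	if (perf_guest_cbs && perf_guest_cbs->is_in_guest())
		return perf_guest_cbs->get_guest_ip();

	return instruction_pointer(regs);
}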

---

diff -Nraup linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h 
linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h
--- linux-2.6.34-rc1/arch/x86/include/asm/ptrace.h  2010-03-09 
13:04:20.730596079 +0800
+++ linux-2.6.34-rc1_work/arch/x86/include/asm/ptrace.h 2010-03-10 
17:06:34.228953260 +0800
@@ -167,6 +167,15 @@ static inline int user_mode(struct pt_re
 #endif
 }
 
+static inline int user_mode_cs(u16 cs)
+{
+#ifdef CONFIG_X86_32
+   return (cs & SEGMENT_RPL_MASK) == USER_RPL;
+#else
+   return !!(cs & 3);
+#endif
+}
+
 static inline int user_mode_vm(struct pt_regs *regs)
 {
 #ifdef CONFIG_X86_32
diff -Nraup linux-2.6.34-rc1/arch/x86/kvm/vmx.c 
linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c
--- linux-2.6.34-rc1/arch/x86/kvm/vmx.c 2010-03-09 13:04:20.758593132 +0800
+++ linux-2.6.34-rc1_work/arch/x86/kvm/vmx.c2010-03-10 17:11:49.709019136 
+0800
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "kvm_cache_regs.h"
 #include "x86.h"
 
@@ -3632,6 +3633,43 @@ static void update_cr8_intercept(struct 
vmcs_write32(TPR_THRESHOLD, irr);
 }
 
+DEFINE_PER_CPU(int, kvm_in_guest) = {0};
+
+static void kvm_set_in_guest(void)
+{
+   percpu_write(kvm_in_guest, 1);
+}
+
+static int kvm_is_in_guest(void)
+{
+   return percpu_read(kvm_in_guest);
+}
+
+static int kvm_is_user_mode(void)
+{
+   int user_mode;
+   user_mode = user_mode_cs(vmcs_read16(GUEST_CS_SELECTOR));
+   return user_mode;
+}
+
+static u64 kvm_get_guest_ip(void)
+{
+   return vmcs_readl(GUEST_RIP);
+}
+
+static void kvm_reset_in_guest(void)
+{
+   if (percpu_read(kvm_in_guest))
+   percpu_write(kvm_in_guest, 0);
+}
+
+static struct perf_guest_info_callbacks kvm_guest_cbs = {
+   .is_in_guest= kvm_is_in_guest,
+   .is_user_mode   = kvm_is_user_mode,
+   .get_guest_ip   = kvm_get_guest_ip,
+   .reset_in_guest = kvm_reset_in_guest
+};
+
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
u32 exit_intr_info;
@@ -3653,8 +3691,11 @@ static void vmx_complete_interrupts(stru
 
/* We need to handle NMIs before interrupts are enabled */
if ((exit_intr_info & INTR_INFO_INTR_TYPE_MASK) == INTR_TYPE_NMI_INTR &&
-   (exit_intr_info & INTR_INFO_VALID_MASK))
+   (exit_intr_info & INTR_INFO_VALID_MASK)) {
+   kvm_set_in_guest();
asm("int $2");
+   kvm_reset_in_guest();
+   }
 
idtv_info_valid = idt_vectoring_info & VECTORING_INFO_VALID_MASK;
 
@@ -4251,6 +4292,8 @@ static int __init vmx_init(void)
if (bypass_guest_pf)
kvm_mmu_set_nonpresent_ptes(~0xffeull, 0ull);
 
+   perf_register_guest_info_callbacks(&kvm_guest_cbs);
+
return 0;
 
 out3:
@@ -4266,6 +4309,8 @@ out:
 
 static void __exit vmx_exit(void)
 {
+   perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
+
free_page((unsigned long)vmx_msr_bitmap_legacy);
free_page((unsigned long)vmx_msr_bitmap_longmode);
free_page((unsigned long)vmx_io_bitmap_b);
diff -Nraup linux-2.6.34-rc1/include/linux/perf_event.h 
linux-2.6.34-rc1_work/include/linux/perf_event.h
--- linux-2.6.34-rc1/include/li

Re: KVM PMU virtualization

2010-03-08 Thread Avi Kivity

On 02/26/2010 04:42 PM, Peter Zijlstra wrote:


Also, intel debugstore things requires a host linear address,


It requires a linear address, not a host linear address.  Of course, it 
might not like the linear address mappings changing under its feet.  If 
it has a private tlb, then this won't work.



  again, not
something a vcpu can easily provide (although that might be worked
around with an msr trap, but that still limits you to 1 page data sizes,
not a limitation all software will respect).
   


If you're willing to pin pages, you can map the guest's buffer.  That 
won't work if BTS can happen in parallel with a #VMEXIT, or if there are 
interactions with npt/ept.  Will have to ask the vendors.



All that said, what we really want is for Intel+AMD to come up with
proper hw PMU virtualization support that makes it easy to rotate the
full PMU in and out for a guest. Then this whole discussion will become
a non issue.
 

As it stands there simply are a number of PMU features that defy being
virtualized, simply because the virt stuff doesn't do system topology.
So even if they were to support a virtualized pmu, it would likely be a
different beast than the native hardware is, and it will be several
hardware models in the future, coming up with a paravirt interface and
getting !linux hosts to adapt and !linux guests to use is probably as
'easy'.
   


!linux hosts are someone else's problem, but how would we get !linux 
guests to use a soft pmu?


The only way I see that happening is if a soft pmu is standardized 
across hypervisors, which is unfortunately unlikely.


--
error compiling committee.c: too many arguments to function



Re: KVM PMU virtualization

2010-03-08 Thread Avi Kivity

On 03/01/2010 07:17 PM, Peter Zijlstra wrote:




2. For every emulated performance counter the guest activates kvm
allocates a perf_event and configures it for the guest (we may allow
kvm to specify the counter index, the guest would be able to use
rdpmc unintercepted then). Event filtering is also done in this step.
 

rdpmc can never be used unintercepted, for perf might be multiplexing
the actual hw.
   


How often is rdpmc used?  If it is invoked on high frequency 
software-only events (like context switches), then this may be a 
performance issue.  If it is only issued on perf interrupts, we may be 
able to live with it (since we already took an exit for the interrupt).


--
error compiling committee.c: too many arguments to function



Re: KVM PMU virtualization

2010-03-03 Thread Zhang, Yanmin
On Wed, 2010-03-03 at 11:15 +0100, Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> > -#ifndef perf_misc_flags
> > -#define perf_misc_flags(regs)  (user_mode(regs) ? PERF_RECORD_MISC_USER : \
> > -PERF_RECORD_MISC_KERNEL)
> > -#define perf_instruction_pointer(regs) instruction_pointer(regs)
> > -#endif 
> 
> Ah, that #ifndef is for powerpc, which I think you just broke.
Thanks for the reminder. I deleted the powerpc code when building the cscope
library.

It seems the perf_save_virt_ip/perf_reset_virt_ip interfaces are ugly. I plan to
change them to a struct of callback functions, which kvm registers with perf.

Something like:
struct perf_guest_info_callbacks {
int (*is_in_guest)();
u64 (*get_guest_ip)();
int (*copy_guest_stack)();
int (*reset_in_guest)();
...
};
int perf_register_guest_info_callbacks(struct perf_guest_info_callbacks *);
int perf_unregister_guest_info_callbacks(struct perf_guest_info_callbacks *);

It's more scalable and neater.




Re: KVM PMU virtualization

2010-03-03 Thread Zhang, Yanmin
On Wed, 2010-03-03 at 11:13 +0100, Peter Zijlstra wrote:
> On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> > +static inline u64 perf_instruction_pointer(struct pt_regs *regs)
> > +{
> > +   u64 ip;
> > +   ip = percpu_read(perf_virt_ip.ip);
> > +   if (!ip)
> > +   ip = instruction_pointer(regs);
> > +   else
> > +   perf_reset_virt_ip();
> > +   return ip;
> > +}
> > +
> > +static inline unsigned int perf_misc_flags(struct pt_regs *regs)
> > +{
> > +   if (percpu_read(perf_virt_ip.ip)) {
> > +   return percpu_read(perf_virt_ip.user_mode) ?
> > +   PERF_RECORD_MISC_GUEST_USER :
> > +   PERF_RECORD_MISC_GUEST_KERNEL;
> > +   } else
> > +   return user_mode(regs) ? PERF_RECORD_MISC_USER :
> > +PERF_RECORD_MISC_KERNEL;
> > +} 
> 
> This encodes the assumption that perf_misc_flags() must only be called
> before perf_instruction_pointer(), which is currently true, but you
> might want to put a comment nearby to remind us of this.
I will change the logic to do an explicit reset operation in the caller.



Re: KVM PMU virtualization

2010-03-03 Thread Peter Zijlstra
On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> -#ifndef perf_misc_flags
> -#define perf_misc_flags(regs)  (user_mode(regs) ? PERF_RECORD_MISC_USER : \
> -PERF_RECORD_MISC_KERNEL)
> -#define perf_instruction_pointer(regs) instruction_pointer(regs)
> -#endif 

Ah, that #ifndef is for powerpc, which I think you just broke.



Re: KVM PMU virtualization

2010-03-03 Thread Peter Zijlstra
On Wed, 2010-03-03 at 17:27 +0800, Zhang, Yanmin wrote:
> +static inline u64 perf_instruction_pointer(struct pt_regs *regs)
> +{
> +   u64 ip;
> +   ip = percpu_read(perf_virt_ip.ip);
> +   if (!ip)
> +   ip = instruction_pointer(regs);
> +   else
> +   perf_reset_virt_ip();
> +   return ip;
> +}
> +
> +static inline unsigned int perf_misc_flags(struct pt_regs *regs)
> +{
> +   if (percpu_read(perf_virt_ip.ip)) {
> +   return percpu_read(perf_virt_ip.user_mode) ?
> +   PERF_RECORD_MISC_GUEST_USER :
> +   PERF_RECORD_MISC_GUEST_KERNEL;
> +   } else
> +   return user_mode(regs) ? PERF_RECORD_MISC_USER :
> +PERF_RECORD_MISC_KERNEL;
> +} 

This encodes the assumption that perf_misc_flags() must only be called
before perf_instruction_pointer(), which is currently true, but you
might want to put a comment nearby to remind us of this.



Re: KVM PMU virtualization

2010-03-03 Thread Zhang, Yanmin
On Wed, 2010-03-03 at 11:32 +0800, Zhang, Yanmin wrote:
> On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote:
> > * Zhang, Yanmin  wrote:
> > 
> > > On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> > 
> > > > My suggestion, as always, would be to start very simple and very 
> > > > minimal:
> > > > 
> > > > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel 
> > > > image both as a host and as guest (for testing), to not have to deal 
> > > > with 
> > > > the symbol space transport problem initially. Enable 'perf kvm record' 
> > > > to 
> > > > only record guest events by default. Etc.
> > > > 
> > > > This alone will be a quite useful result already - and gives a basis 
> > > > for 
> > > > further work. No need to spend months to do the big grand design 
> > > > straight 
> > > > away, all of this can be done gradually and in the order of usefulness 
> > > > - 
> > > > and you'll always have something that actually works (and helps your 
> > > > other 
> > > > KVM projects) along the way.
> > >
> > > It took me for a couple of hours to read the emails on the topic. Based 
> > > on 
> > > above idea, I worked out a prototype which is ugly, but does work with 
> > > top/record when both guest side and host side use the same kernel image, 
> > > while compiling most needed modules into kernel directly..
> > > 
> > > The commands are:
> > > perf kvm top
> > > perf kvm record
> > > perf kvm report
> > > 
> > > They just collect guest kernel hot functions.
> > 
> > Fantastic, and there's some really interesting KVM guest/host comparison 
> > profiles you've done with this prototype!
> > 
> > > With my patch, I collected dbench data on Nehalem machine (2*4*2 logical 
> > > cpu).
> > >
> > > 1) Vanilla host kernel (6G memory):
> > > 
> > >PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 
> > > CPUs)
> > > 
> > > 
> > >  samples  pcnt functionDSO
> > >  ___ _ ___ 
> > > 
> > > 
> > > 99376.00 40.5% ext3_test_allocatable   
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 41239.00 16.8% bitmap_search_next_usable_block 
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  7019.00  2.9% __ticket_spin_lock  
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  5350.00  2.2% copy_user_generic_string
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  5208.00  2.1% do_get_write_access 
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  4484.00  1.8% journal_dirty_metadata  
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  4078.00  1.7% ext3_free_blocks_sb 
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  3856.00  1.6% ext3_new_blocks 
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  3485.00  1.4% journal_get_undo_access 
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  2803.00  1.1% ext3_try_to_allocate
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  2241.00  0.9% __find_get_block
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  1957.00  0.8% find_revoke_record  
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 
> > > 2) guest os: start one guest os with 4GB memory.
> > > 
> > >PerfTop: 827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 
> > > CPUs)
> > > 
> > > 
> > >  samples  pcnt functionDSO
> > >  ___ _ ___ 
> > > 
> > > 
> > > 41701.00 28.1% __ticket_spin_lock  
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 33843.00 22.8% ext3_test_allocatable   
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > > 16862.00 11.4% bitmap_search_next_usable_block 
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  3278.00  2.2% native_flush_tlb_others 
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  3200.00  2.2% copy_user_generic_string
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  3009.00  2.0% do_get_write_access 
> > > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > >  2834.00  1.9% journal_d

Re: KVM PMU virtualization

2010-03-02 Thread Zhang, Yanmin
On Tue, 2010-03-02 at 10:36 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin  wrote:
> 
> > On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> 
> > > My suggestion, as always, would be to start very simple and very minimal:
> > > 
> > > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel 
> > > image both as a host and as guest (for testing), to not have to deal with 
> > > the symbol space transport problem initially. Enable 'perf kvm record' to 
> > > only record guest events by default. Etc.
> > > 
> > > This alone will be a quite useful result already - and gives a basis for 
> > > further work. No need to spend months to do the big grand design straight 
> > > away, all of this can be done gradually and in the order of usefulness - 
> > > and you'll always have something that actually works (and helps your 
> > > other 
> > > KVM projects) along the way.
> >
> > It took me for a couple of hours to read the emails on the topic. Based on 
> > above idea, I worked out a prototype which is ugly, but does work with 
> > top/record when both guest side and host side use the same kernel image, 
> > while compiling most needed modules into kernel directly..
> > 
> > The commands are:
> > perf kvm top
> > perf kvm record
> > perf kvm report
> > 
> > They just collect guest kernel hot functions.
> 
> Fantastic, and there's some really interesting KVM guest/host comparison 
> profiles you've done with this prototype!
> 
> > With my patch, I collected dbench data on Nehalem machine (2*4*2 logical 
> > cpu).
> >
> > 1) Vanilla host kernel (6G memory):
> > 
> >PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)
> > 
> > 
> >  samples  pcnt functionDSO
> >  ___ _ ___ 
> > 
> > 
> > 99376.00 40.5% ext3_test_allocatable   
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 41239.00 16.8% bitmap_search_next_usable_block 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  7019.00  2.9% __ticket_spin_lock  
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  5350.00  2.2% copy_user_generic_string
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  5208.00  2.1% do_get_write_access 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  4484.00  1.8% journal_dirty_metadata  
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  4078.00  1.7% ext3_free_blocks_sb 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  3856.00  1.6% ext3_new_blocks 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  3485.00  1.4% journal_get_undo_access 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  2803.00  1.1% ext3_try_to_allocate
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  2241.00  0.9% __find_get_block
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  1957.00  0.8% find_revoke_record  
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 
> > 2) guest os: start one guest os with 4GB memory.
> > 
> >PerfTop: 827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 CPUs)
> > 
> > 
> >  samples  pcnt functionDSO
> >  ___ _ ___ 
> > 
> > 
> > 41701.00 28.1% __ticket_spin_lock  
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 33843.00 22.8% ext3_test_allocatable   
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> > 16862.00 11.4% bitmap_search_next_usable_block 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  3278.00  2.2% native_flush_tlb_others 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  3200.00  2.2% copy_user_generic_string
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  3009.00  2.0% do_get_write_access 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  2834.00  1.9% journal_dirty_metadata  
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  1965.00  1.3% journal_get_undo_access 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  1907.00  1.3% ext3_new_blocks 
> > /lib/modules/2.6.33-kvmymz/build/vmlinux
> >  1790.00  1.2%

Re: KVM PMU virtualization

2010-03-02 Thread Peter Zijlstra
On Tue, 2010-03-02 at 15:09 +0800, Zhang, Yanmin wrote:
> With vanilla host kernel, perf top data is stable and spinlock doesn't take 
> too much cpu time.
> With guest os, __ticket_spin_lock consumes 28% cpu time, and sometimes it 
> fluctuates between 9%~28%.
> 
> Another interesting finding is aim7. If I start aim7 on tmpfs testing in 
> guest os with 1GB memory,
> the login hangs and cpu is busy. With the new patch, I could check what 
> happens in guest os, where
> spinlock is busy and kernel is shrinking memory mostly from slab.

Hehe, you've just discovered the reason for paravirt spinlocks ;-)

But neat stuff. Although I don't think you need PERF_SAMPLE_KVM; it
should simply always report the guest sample if it came from the guest,
and you can extend PERF_RECORD_MISC_CPUMODE_MASK to add guest states.
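
For reference, this is roughly how guest states could slot into the existing
cpumode bits in the perf_event header; a sketch only, the numeric values are
illustrative and not quoted from any particular tree:

#define PERF_RECORD_MISC_CPUMODE_MASK		(7 << 0)
#define PERF_RECORD_MISC_CPUMODE_UNKNOWN	(0 << 0)
#define PERF_RECORD_MISC_KERNEL			(1 << 0)
#define PERF_RECORD_MISC_USER			(2 << 0)
#define PERF_RECORD_MISC_HYPERVISOR		(3 << 0)
/* possible guest additions, values illustrative: */
#define PERF_RECORD_MISC_GUEST_KERNEL		(4 << 0)
#define PERF_RECORD_MISC_GUEST_USER		(5 << 0)

perf report/top could then attribute a sample to guest kernel or guest user
space from the event header alone, without any extra sample type.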





Re: KVM PMU virtualization

2010-03-02 Thread Ingo Molnar

* Zhang, Yanmin  wrote:

> On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:

> > My suggestion, as always, would be to start very simple and very minimal:
> > 
> > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel 
> > image both as a host and as guest (for testing), to not have to deal with 
> > the symbol space transport problem initially. Enable 'perf kvm record' to 
> > only record guest events by default. Etc.
> > 
> > This alone will be a quite useful result already - and gives a basis for 
> > further work. No need to spend months to do the big grand design straight 
> > away, all of this can be done gradually and in the order of usefulness - 
> > and you'll always have something that actually works (and helps your other 
> > KVM projects) along the way.
>
> It took me for a couple of hours to read the emails on the topic. Based on 
> above idea, I worked out a prototype which is ugly, but does work with 
> top/record when both guest side and host side use the same kernel image, 
> while compiling most needed modules into kernel directly..
> 
> The commands are:
> perf kvm top
> perf kvm record
> perf kvm report
> 
> They just collect guest kernel hot functions.

Fantastic, and there's some really interesting KVM guest/host comparison 
profiles you've done with this prototype!

> With my patch, I collected dbench data on Nehalem machine (2*4*2 logical 
> cpu).
>
> 1) Vanilla host kernel (6G memory):
> 
>PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)
> 
> 
>  samples  pcnt functionDSO
>  ___ _ ___ 
> 
> 
> 99376.00 40.5% ext3_test_allocatable   
> /lib/modules/2.6.33-kvmymz/build/vmlinux
> 41239.00 16.8% bitmap_search_next_usable_block 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  7019.00  2.9% __ticket_spin_lock  
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  5350.00  2.2% copy_user_generic_string
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  5208.00  2.1% do_get_write_access 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  4484.00  1.8% journal_dirty_metadata  
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  4078.00  1.7% ext3_free_blocks_sb 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  3856.00  1.6% ext3_new_blocks 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  3485.00  1.4% journal_get_undo_access 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  2803.00  1.1% ext3_try_to_allocate
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  2241.00  0.9% __find_get_block
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  1957.00  0.8% find_revoke_record  
> /lib/modules/2.6.33-kvmymz/build/vmlinux
> 
> 2) guest os: start one guest os with 4GB memory.
> 
>PerfTop: 827 irqs/sec  kernel: 0.0% [1000Hz cycles],  (all, 16 CPUs)
> 
> 
>  samples  pcnt functionDSO
>  ___ _ ___ 
> 
> 
> 41701.00 28.1% __ticket_spin_lock  
> /lib/modules/2.6.33-kvmymz/build/vmlinux
> 33843.00 22.8% ext3_test_allocatable   
> /lib/modules/2.6.33-kvmymz/build/vmlinux
> 16862.00 11.4% bitmap_search_next_usable_block 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  3278.00  2.2% native_flush_tlb_others 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  3200.00  2.2% copy_user_generic_string
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  3009.00  2.0% do_get_write_access 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  2834.00  1.9% journal_dirty_metadata  
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  1965.00  1.3% journal_get_undo_access 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  1907.00  1.3% ext3_new_blocks 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  1790.00  1.2% ext3_free_blocks_sb 
> /lib/modules/2.6.33-kvmymz/build/vmlinux
>  1741.00  1.2% find_revoke_record  
> /lib/modules/2.6.33-kvmymz/build/vmlinux
> 
> 
> With vanilla host kernel, perf top data is stable and spinlock d

Re: KVM PMU virtualization

2010-03-01 Thread Zhang, Yanmin
On Fri, 2010-02-26 at 10:17 +0100, Ingo Molnar wrote:
> * Joerg Roedel  wrote:
> 
> > On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> > > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> > > > 
> > > > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > > >host.
> > > > 
> > > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> > > > configured to count only when in guest mode. Perf needs to be aware of
> > > > that and fetch the rip from a different place when monitoring a guest.
> > 
> > > The idea is we want to measure both host and guest at the same time, and
> > > compare all the hot functions fairly.
> > 
> > So you want to measure while the guest vcpu is running and the vmexit
> > path of that vcpu (including qemu userspace part) together? The
> > challenge here is to find out if a performance event originated in guest
> > mode or in host mode.
> > But we can check for that in the nmi-protected part of the vmexit path.
> 
> As far as instrumentation goes, virtualization is simply another 'PID 
> dimension' of measurement.
> 
> Today we can isolate system performance measurements/events to the following 
> domains:
> 
>  - per system
>  - per cpu
>  - per task
> 
> ( Note that PowerPC already supports certain sorts of 
> 'hypervisor/kernel/user' 
>   domain separation, and we have some ABI details for all that but it's by no 
>   means complete. Anton is using the PowerPC bits AFAIK, so it already works 
>   to a certain degree. )
> 
> When extending measurements to KVM, we want two things:
> 
>  - user friendliness: instead of having to check 'ps' and figure out which 
>Qemu thread is the KVM thread we want to profile, just give a convenience
>namespace to access guest profiling info. -G ought to map to the first
>currently running KVM guest it can find. (which would match like 90% of the
>cases) - etc. No ifs and when. If 'perf kvm top' doesnt show something 
>useful by default the whole effort is for naught.
> 
>  - Extend core facilities and enable the following measurement dimensions:
> 
>  host-kernel-space
>  host-user-space
>  guest-kernel-space
>  guest-user-space
> 
>on a per guest basis. We want to be able to measure just what the guest 
>does, and we want to be able to measure just what the host does.
> 
>Some of this the hardware helps us with (say only measuring host kernel 
>events is possible), some has to be done by fiddling with event 
>enable/disable at vm-exit / vm-entry time.
> 
> My suggestion, as always, would be to start very simple and very minimal:
> 
> Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image 
> both as a host and as guest (for testing), to not have to deal with the 
> symbol 
> space transport problem initially. Enable 'perf kvm record' to only record 
> guest events by default. Etc.
> 
> This alone will be a quite useful result already - and gives a basis for 
> further work. No need to spend months to do the big grand design straight 
> away, all of this can be done gradually and in the order of usefulness - and 
> you'll always have something that actually works (and helps your other KVM 
> projects) along the way.
It took me a couple of hours to read the emails on the topic.
Based on the above idea, I worked out a prototype which is ugly, but does work
with top/record when both the guest side and the host side use the same kernel
image, with most needed modules compiled into the kernel directly.

The commands are:
perf kvm top
perf kvm record
perf kvm report

They just collect guest kernel hot functions.

> 
> [ And, as so often, once you walk that path, that grand scheme you are 
>   thinking about right now might easily become last year's really bad idea 
> ;-) ]
> 
> So please start walking the path and experience the challenges first-hand.
With my patch, I collected dbench data on a Nehalem machine (2*4*2 logical CPUs).
1) Vanilla host kernel (6G memory):

   PerfTop:   15491 irqs/sec  kernel:93.6% [1000Hz cycles],  (all, 16 CPUs)


 samples  pcnt function                         DSO
 ________ _____ ________________________________ ________________________________________

 99376.00 40.5% ext3_test_allocatable            /lib/modules/2.6.33-kvmymz/build/vmlinux
 41239.00 16.8% bitmap_search_next_usable_block  /lib/modules/2.6.33-kvmymz/build/vmlinux
  7019.00  2.9% __ticket_spin_lock               /lib/modules/2.6.33-kvmymz/build/vmlinux
  5350.00  2.2% copy_user_generic_string         /lib/modules/2.6.33-kvmymz/build/vmlinux
 520

Re: KVM PMU virtualization

2010-03-01 Thread Zachary Amsden

On 02/26/2010 05:55 AM, Peter Zijlstra wrote:

On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
   

On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
 

That's 7 more than what we support now, and 7 more than what we can
guarantee without it.

 

Again, what windows software uses only those 7? Does it pay to only have
access to those 7 or does it limit the usability to exactly the same
subset a paravirt interface would?

   

Good question.  Would be interesting to try out VTune with the non-arch
pmu masked out.
 

Also, the ANY bit is part of the intel arch pmu, but you still have to
mask it out.

BTW, just wondering, why would a developer be running VTune in a guest
anyway? I'd think that a developer that windows oriented would simply
run windows on his desktop and VTune there.
   


What if you want to run on 10 different variations of Windows 32 / 64 / 
server / desktop configurations?  Do you maintain 10 installed pieces of 
hardware?


A virtual machine is a better solution.  And you might want to 
performance-tune all 10 of those configurations as well.  It would be nice 
if that were possible.



Re: KVM PMU virtualization

2010-03-01 Thread Zachary Amsden

On 02/26/2010 03:37 AM, Jes Sorensen wrote:

On 02/26/10 14:31, Ingo Molnar wrote:

You are missing two big things wrt. compatibility here:

  1) The first upgrade overhead a one time overhead only.

  2) Once a Linux guest has upgraded, it will work in the future, 
with _any_

 future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system 
on new

hardware, without having to upgrade that guest for the new CPU support.


That would only work if you are guaranteed to be able to emulate old
hardware on new hardware. Not going to be feasible, so then we are in a
real mess.

With the 'steal the PMU' messy approach the guest OS has to be 
upgraded to the

new CPU type all the time. Ad infinitum.


The way the Perfmon architecture is specified by Intel, that is what we
are stuck with. It's not going to be possible via software emulation to
count cache misses, unless you run it in a micro architecture emulator.


Sure you can count cache misses.

Step 1. Declare KVM to possess a virtual cache heretofore unseen by guest VCPUs.
Step 2. Use micro-architecture rules to add to cache misses in an 
undefined, micro-architecture-specific way.

Step 3. 
Step 4.  PROFIT!

The point being, there are no rules required to follow for 
architecturally unspecified events.  Instructions issued is well defined 
architecturally, one of very few such counters, while things like cache 
strides and organization are deliberately left to the implementation.


So returning zero is a perfectly valid choice for emulating cache misses.

Zach


Re: KVM PMU virtualization

2010-03-01 Thread Joerg Roedel
On Mon, Mar 01, 2010 at 06:17:40PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-03-01 at 12:11 +0100, Joerg Roedel wrote:
> > 
> > 1. Enhance perf to count pmu events only when cpu is in guest mode.
> 
> No enhancements needed, only hardware support for Intel doesn't provide
> this iirc.

At least the guest-bit for AMD perfctl registers is not supported yet
;-) Implementing this eliminates the requirement to write the perfctl
msrs on every vmrun and every vmexit. But that's a minor change.
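
For concreteness, the bits in question are the guest-only/host-only filter bits
in the AMD PerfEvtSeln MSRs (see the AMD APM); a small sketch, with the macro
names chosen here purely for illustration:

#include <linux/types.h>

/* Host/guest filtering bits of the AMD PerfEvtSeln MSRs; names illustrative. */
#define AMD64_EVENTSEL_GUESTONLY	(1ULL << 40)
#define AMD64_EVENTSEL_HOSTONLY		(1ULL << 41)

/* Restrict an event-select value so the counter only ticks in guest mode. */
static u64 evtsel_count_guest_only(u64 evtsel)
{
	evtsel &= ~AMD64_EVENTSEL_HOSTONLY;
	return evtsel | AMD64_EVENTSEL_GUESTONLY;
}

Until perf sets these bits itself, the perfctl MSRs have to be rewritten around
every vmrun/vmexit, as noted above.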

> > 4. Some additional magic to reinject pmu events into the guest
> 
> Right, that is needed, and might be 'interesting' since we get them from
> NMI context.

I imagine some kind of callback which sets a flag in the kvm vcpu
structure. Since the NMI already triggered a vmexit the kvm code checks
for this bit on its path to re-entry.
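
A rough sketch of that flag-and-reinject idea; the request bit and the injection
helper are hypothetical names, not existing kvm symbols:

/* Called from the perf overflow callback, i.e. NMI context: only record the
 * overflow and make sure the vcpu drops out of guest mode soon. */
static void kvm_vpmu_mark_overflow(struct kvm_vcpu *vcpu)
{
	set_bit(KVM_REQ_VPMU_PMI, &vcpu->requests);	/* hypothetical request bit */
	kvm_vcpu_kick(vcpu);
}

/* On the normal re-entry path, outside NMI context, before vmrun: */
static void kvm_vpmu_flush_pmi(struct kvm_vcpu *vcpu)
{
	if (test_and_clear_bit(KVM_REQ_VPMU_PMI, &vcpu->requests))
		kvm_inject_guest_pmi(vcpu);	/* hypothetical: raise the guest's LVTPC interrupt */
}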

Joerg



Re: KVM PMU virtualization

2010-03-01 Thread Peter Zijlstra
On Mon, 2010-03-01 at 12:11 +0100, Joerg Roedel wrote:
> 
> 1. Enhance perf to count pmu events only when cpu is in guest mode.

No enhancements needed, only hardware support for Intel doesn't provide
this iirc.

> 2. For every emulated performance counter the guest activates kvm
>allocates a perf_event and configures it for the guest (we may allow
>kvm to specify the counter index, the guest would be able to use
>rdpmc unintercepted then). Event filtering is also done in this step.

rdpmc can never be used unintercepted, for perf might be multiplexing
the actual hw.

> 3. Before vmrun the guest activates all its counters, 

Right, this could be used to approximate guest-only counting. I'm
not sure how the OS and USR bits interact with guest stuff - if the PMU
isn't aware of the virtualized priv levels then those will not work as
expected.

> this can fail if
>the host uses them or the requested pmc index is not available for some
>reason.

perf doesn't know about pmc indexes at the interface level, nor is that
needed I think.

> 4. Some additional magic to reinject pmu events into the guest

Right, that is needed, and might be 'interesting' since we get them from
NMI context.



Re: KVM PMU virtualization

2010-03-01 Thread Zachary Amsden

On 02/26/2010 01:42 AM, Ingo Molnar wrote:

* Jes Sorensen  wrote:

   

On 02/26/10 11:44, Ingo Molnar wrote:
 

Direct access to counters is not something that is a big issue. [ Given that i
sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
is the biggest of performance challenges right now ;-) ]

By far the biggest instrumentation issue is:

  - availability
  - usability
  - flexibility

Exposing the raw hw is a step backwards in many regards. The same way we dont
want to expose chipsets to the guest to allow them to do RAS. The same way we
dont want to expose most raw PCI devices to guest in general, but have all
these virt driver abstractions.
   

I have to say I disagree on that. When you run perfmon on a system, it is
normally to measure a specific application. You want to see accurate numbers
for cache misses, mul instructions or whatever else is selected.
 

You can still get those. You can even enable RDPMC access and avoid VM exits.

What you _cannot_ do is to 'steal' the PMU and just give it to the guest.

   

Emulating the PMU rather than using the real one, makes the numbers far less
useful. The most useful way to provide PMU support in a guest is to expose
the real PMU and let the guest OS program it.
 

Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.

Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)

   

We can do this in a reasonable way today, if we allow to take the PMU away
from the host, and only let guests access it when it's in use. [...]
 

You get my sure-fire NAK for that kind of crap though. Interfering with the
host PMU and stealing it, is not a technical approach that has acceptable
quality.

You need to integrate it properly so that host PMU functionality still works
fine. (Within hardware constraints)
   


I have to agree strongly with Ingo here.

If you can't reset, restore or offset the perf counters in hardware, 
then you can't expose them to the guest.  There is too much rich 
information about host state that can be derived and considered an 
information leak or covert channel, and you can't allow the guest to 
trample host PMU state.


On some architectures, bank-switching these perf counters is possible 
since you can read and write the full-size counter MSRs.  However, it is 
a cumbersome task that must be done at every preemption point.  There 
are many ways to do it as lazily as possible so that overhead only 
happens in a guest which actively uses the PMU.  With careful 
bookkeeping, you can even compound the guest PMU counters back into the 
host counters if the host is using the PMU.


Sorting out the details of whom to deliver the PMU exception to, 
the host or the guest, when an overflow occurs is a nasty, 
ugly dilemma, as is properly programming the counters so that overflow 
happens in a controlled fashion when both the host and the guest are 
attempting to use this feature.  So supporting "step ahead 13 
instructions and then give me an interrupt so I can signal my debugger" 
simultaneously and correctly in both the host and guest is a very hard 
task, perhaps untenable.


Zach


Re: KVM PMU virtualization

2010-03-01 Thread Joerg Roedel
On Mon, Mar 01, 2010 at 09:44:50AM +0100, Ingo Molnar wrote:

> There's a world of difference between "will not use in certain usecases" 
> and 
> "cannot use at all because we've designed it so". By doing the latter we 
> guarantee that sane shared usage of the PMU will never occur - which is bad.

I think we can emulate a real hardware pmu for guests using the perf
infrastructure. The emulation will not be complete but powerful enough
for most usecases. Some steps towards this might be:

1. Enhance perf to count pmu events only when cpu is in guest mode.

2. For every emulated performance counter the guest activates kvm
   allocates a perf_event and configures it for the guest (we may allow
   kvm to specify the counter index, the guest would be able to use
   rdpmc unintercepted then). Event filtering is also done in this step.

3. Before vmrun the guest activates all its counters, this can fail if
   the host uses them or the requested pmc index is not available for some
   reason.

4. Some additional magic to reinject pmu events into the guest
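
A rough sketch of step 2 using the in-kernel counter API; the helper is made up,
and the exact prototype of perf_event_create_kernel_counter() varies between
kernel versions, so treat this purely as an illustration:

#include <linux/perf_event.h>
#include <linux/sched.h>

/* Back one guest-programmed counter with a host perf_event.  'config' is
 * the event-select/umask value the guest wrote, already filtered. */
static struct perf_event *kvm_vpmu_alloc_counter(u64 config, bool count_user,
						 bool count_kernel)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_RAW,
		.size		= sizeof(attr),
		.config		= config,
		.exclude_user	= !count_user,
		.exclude_kernel	= !count_kernel,
		.pinned		= 1,	/* fail rather than silently multiplex */
	};

	/*
	 * 2.6.3x-era prototype assumed here: (attr, cpu, pid, overflow_handler);
	 * newer kernels differ.  A NULL handler gives a pure counting event;
	 * sampling needs a callback that flags the vcpu (step 4).
	 */
	return perf_event_create_kernel_counter(&attr, -1, current->pid, NULL);
}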

 
> Think about it: this whole Linux thing is about 'sharing' resources. That 
> concept really works and permeates everything we do in the kernel. Yes, it's 
> somewhat hard for the PMU but we've done it on the host side via perf events 
> and we really don't want to look back ...

As I learnt at the university this whole operating system thing is about
managing resource sharing and hardware abstraction ;-)

> My experience is that once the right profiling/tracing tools are there, 
> people 
> will use them in every which way. The bigger a box is, the more likely shared 
> usage will occur - just statistically. Which coincides with KVM's "the bigger 
> the box, the better for virtualization" general mantra.

With the above approach the only point of conflict occurs when the host
wants to monitor the qemu-processes executing the vcpus which want to do
performance monitoring of their own or cpu-wide counting.

Joerg



Re: KVM PMU virtualization

2010-03-01 Thread Ingo Molnar

* Joerg Roedel  wrote:

> On Mon, Mar 01, 2010 at 09:39:04AM +0100, Ingo Molnar wrote:
> > > What do you mean by software events?
> > 
> > Things like:
> > 
> > aldebaran:~> perf stat -a sleep 1
> > 
> >  Performance counter stats for 'sleep 1':
> > 
> >15995.719133  task-clock-msecs # 15.981 CPUs 
> >5787  context-switches #  0.000 M/sec
> > 210  CPU-migrations   #  0.000 M/sec
> >  193909  page-faults  #  0.012 M/sec
> > 28704833507  cycles   #   1794.532 M/sec  (scaled from 
> > 78.69%)
> > 14387445668  instructions #  0.501 IPC(scaled from 
> > 90.71%)
> >   736644616  branches # 46.053 M/sec  (scaled from 
> > 90.52%)
> >   695884659  branch-misses# 94.467 %  (scaled from 
> > 90.70%)
> >   727070678  cache-references # 45.454 M/sec  (scaled from 
> > 88.11%)
> >  1305560420  cache-misses # 81.619 M/sec  (scaled from 
> > 52.00%)
> > 
> > 1.000942399  seconds time elapsed
> > 
> > These lines:
> > 
> >15995.719133  task-clock-msecs # 15.981 CPUs 
> >5787  context-switches #  0.000 M/sec
> > 210  CPU-migrations   #  0.000 M/sec
> >  193909  page-faults  #  0.012 M/sec
> > 
> > Are software events of the host - a subset of which could be transparently 
> > exposed to the guest. Same for tracepoints, probes, etc. Those are not 
> > exposed 
> > by the hardware PMU. So by doing a 'soft' PMU (or even better: a paravirt 
> > channel to perf events) we gain a lot more than just raw PMU functionality.
> > 
> > 'performance events' are about a lot more than just the PMU, it's a 
> > coherent 
> > system health / system events / structured logging framework.
> 
> Yeah I know. But these event should be available in the guest already, no? 
> [...]

How would an old Linux or Windows guest know about them?

Also, even for new-Linux guests, they'd only have access to their own internal 
events - not to any host events.

My suggestion (admittedly not explained in any detail) was to allow guest 
access to certain _host_ events. I.e. a guest could profile its own impact on 
the host (such as VM exits, IO done on the host side, scheduling, etc.), 
without it having any (other) privileged access to the host.

This would be a powerful concept: you could profile your guest for host 
efficiency, _without_ having access to the host - beyond those events 
themselves. (which would be set up in a carefully filtered-to-guest manner.)

> [...] They don't need any kind of hardware support from the pmu. A paravirt 
> perf channel from the guest to the host would be definitly a win. It would 
> be a powerful tool for kvm/linux-guest analysis (e.g. trace host-kvm and 
> guest-events together on the host)

Yeah.

Ingo


Re: KVM PMU virtualization

2010-03-01 Thread Joerg Roedel
On Mon, Mar 01, 2010 at 09:39:04AM +0100, Ingo Molnar wrote:
> > What do you mean by software events?
> 
> Things like:
> 
> aldebaran:~> perf stat -a sleep 1
> 
>  Performance counter stats for 'sleep 1':
> 
>15995.719133  task-clock-msecs # 15.981 CPUs 
>5787  context-switches #  0.000 M/sec
> 210  CPU-migrations   #  0.000 M/sec
>  193909  page-faults  #  0.012 M/sec
> 28704833507  cycles   #   1794.532 M/sec  (scaled from 
> 78.69%)
> 14387445668  instructions #  0.501 IPC(scaled from 
> 90.71%)
>   736644616  branches # 46.053 M/sec  (scaled from 
> 90.52%)
>   695884659  branch-misses# 94.467 %  (scaled from 
> 90.70%)
>   727070678  cache-references # 45.454 M/sec  (scaled from 
> 88.11%)
>  1305560420  cache-misses # 81.619 M/sec  (scaled from 
> 52.00%)
> 
> 1.000942399  seconds time elapsed
> 
> These lines:
> 
>15995.719133  task-clock-msecs # 15.981 CPUs 
>5787  context-switches #  0.000 M/sec
> 210  CPU-migrations   #  0.000 M/sec
>  193909  page-faults  #  0.012 M/sec
> 
> Are software events of the host - a subset of which could be transparently 
> exposed to the guest. Same for tracepoints, probes, etc. Those are not 
> exposed 
> by the hardware PMU. So by doing a 'soft' PMU (or even better: a paravirt 
> channel to perf events) we gain a lot more than just raw PMU functionality.
> 
> 'performance events' are about a lot more than just the PMU, it's a coherent 
> system health / system events / structured logging framework.

Yeah I know. But these events should be available in the guest already,
no? They don't need any kind of hardware support from the pmu.
A paravirt perf channel from the guest to the host would definitely be a
win. It would be a powerful tool for kvm/linux-guest analysis (e.g.
tracing host kvm events and guest events together on the host).

Joerg



Re: KVM PMU virtualization

2010-03-01 Thread Ingo Molnar

* Joerg Roedel  wrote:

> On Fri, Feb 26, 2010 at 02:44:00PM +0100, Ingo Molnar wrote:
> >  - A paravirt event driver is more compatible and more transparent in the 
> > long 
> >run: it allows hardware upgrade and upgraded PMU functionality (for 
> > Linux) 
> >without having to upgrade the guest OS. Via that a guest OS could even be
> >live-migrated to a different PMU, without noticing anything about it.
> > 
> >In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS 
> >always assumes the guest OS is upgraded to the host. Also, 'raw' PMU 
> > state 
> >cannot be live-migrated. (save/restore doesnt help)
> 
> I agree with your arguments, having this soft-pmu for the guest has some 
> advantages over raw pmu access. It has a lot of advantages if the guest 
> migrated between hosts with different hardware.
> 
> But I still think we should have both, a soft-pmu and a pmu-emulation (which 
> is a more accurate term than 'raw guest pmu access') that looks to the guest 
> as a real hardware pmu would look like.  On a linux host that is dedicated 
> to executing virtual kvm machines there is little point in sharing the pmu 
> between guest and host because the host will probably never use it.

There's a world of difference between "will not use in certain usecases" and 
"cannot use at all because we've designed it so". By doing the latter we 
guarantee that sane shared usage of the PMU will never occur - which is bad.

Really, similar arguments have been made in the past about different domains 
of system usage: "one profiling session per system is more than enough, who 
needs transparent, per user profilers", etc. Such restrictions have been 
broken through again and again.

Think about it: this whole Linux thing is about 'sharing' resources. That 
concept really works and permeates everything we do in the kernel. Yes, it's 
somewhat hard for the PMU but we've done it on the host side via perf events 
and we really don't want to look back ...

My experience is that once the right profiling/tracing tools are there, people 
will use them in every which way. The bigger a box is, the more likely shared 
usage will occur - just statistically. Which coincides with KVM's "the bigger 
the box, the better for virtualization" general mantra.

Ingo


Re: KVM PMU virtualization

2010-03-01 Thread Ingo Molnar

* Joerg Roedel  wrote:

> >  - It's more secure: the host can have a finegrained policy about what 
> > kinds of
> >events it exposes to the guest. It might chose to only expose software 
> >events for example.
> 
> What do you mean by software events?

Things like:

aldebaran:~> perf stat -a sleep 1

 Performance counter stats for 'sleep 1':

    15995.719133  task-clock-msecs         #     15.981 CPUs
            5787  context-switches         #      0.000 M/sec
             210  CPU-migrations           #      0.000 M/sec
          193909  page-faults              #      0.012 M/sec
     28704833507  cycles                   #   1794.532 M/sec  (scaled from 78.69%)
     14387445668  instructions             #      0.501 IPC    (scaled from 90.71%)
       736644616  branches                 #     46.053 M/sec  (scaled from 90.52%)
       695884659  branch-misses            #     94.467 %      (scaled from 90.70%)
       727070678  cache-references         #     45.454 M/sec  (scaled from 88.11%)
      1305560420  cache-misses             #     81.619 M/sec  (scaled from 52.00%)

     1.000942399  seconds time elapsed

These lines:

    15995.719133  task-clock-msecs         #     15.981 CPUs
            5787  context-switches         #      0.000 M/sec
             210  CPU-migrations           #      0.000 M/sec
          193909  page-faults              #      0.012 M/sec

Are software events of the host - a subset of which could be transparently 
exposed to the guest. Same for tracepoints, probes, etc. Those are not exposed 
by the hardware PMU. So by doing a 'soft' PMU (or even better: a paravirt 
channel to perf events) we gain a lot more than just raw PMU functionality.

'performance events' are about a lot more than just the PMU, it's a coherent 
system health / system events / structured logging framework.

Ingo


Re: KVM PMU virtualization

2010-02-28 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 04:14:08PM +0100, Peter Zijlstra wrote:
> On Fri, 2010-02-26 at 16:53 +0200, Avi Kivity wrote:
> > > If you give a full PMU to a guest it's a whole different dimension and 
> > > quality
> > > of information. Literally hundreds of different events about all sorts of
> > > aspects of the CPU and the hardware in general.
> > >
> > 
> > Well, we filter out the bad events then. 
> 
> Which requires trapping the MSR access, at which point a soft-PMU is
> almost there, right?

The perfctl msrs need to be trapped anyway. Otherwise the guest could
generate NMIs in host context. But access to the perfctr registers could
be given to the guest.

Joerg



Re: KVM PMU virtualization

2010-02-28 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 03:12:29PM +0100, Ingo Molnar wrote:
> You mean Windows?
> 
> For heaven's sake, why dont you think like Linus thought 20 years ago. To the 
> hell with Windows suckiness and lets make sure our stuff works well. Then the 
> users will come, developers will come, and people will profile Linux under 
> Linux and maybe the tools will be so good that they'll profile under Linux 
> using Wine just to be able to use those good tools...

That's not a good comparison. Linux is nothing completely new; it was, and
still is, a new implementation of an existing operating system concept, and
thus at least mostly source-compatible with other operating systems
implementing this concept.
Linux would never have had this success if it were not POSIX-compliant and
able to run applications like X or gcc, which were written for other
operating systems.

Joerg



Re: KVM PMU virtualization

2010-02-28 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 02:44:00PM +0100, Ingo Molnar wrote:
>  - A paravirt event driver is more compatible and more transparent in the 
> long 
>run: it allows hardware upgrade and upgraded PMU functionality (for Linux) 
>without having to upgrade the guest OS. Via that a guest OS could even be
>live-migrated to a different PMU, without noticing anything about it.
> 
>In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS 
>always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state 
>cannot be live-migrated. (save/restore doesnt help)

I agree with your arguments; having this soft-pmu for the guest has some
advantages over raw pmu access. It has a lot of advantages if the guest
is migrated between hosts with different hardware.

But I still think we should have both, a soft-pmu and a pmu-emulation
(which is a more accurate term than 'raw guest pmu access') that looks
to the guest like a real hardware pmu would.  On a linux host
that is dedicated to executing virtual kvm machines there is little
point in sharing the pmu between guest and host because the host will
probably never use it.

This pmu-emulation will still use the perf infrastructure for scheduling
the pmu registers, programming the pmu registers and things like that.
This could be used, for example, to emulate 48-bit counters for the guest
even if the host only supports 32-bit counters. We even need the perf
infrastructure when we need to reinject pmu events into the guest.

>  - It's more secure: the host can have a finegrained policy about what kinds 
> of
>events it exposes to the guest. It might chose to only expose software 
>events for example.

What do you mean by software events?


Joerg



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 06:03 PM, Avi Kivity wrote:

Note, I'll be away for a week, so will not be responsive for a while

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 05:55 PM, Peter Zijlstra wrote:

BTW, just wondering, why would a developer be running VTune in a guest
anyway? I'd think that a developer that windows oriented would simply
run windows on his desktop and VTune there.
   


Cloud.

You have an app running somewhere on a cloud, internally or externally 
(you may not even know).  It's running a production workload and it 
isn't doing well.  You can't reproduce it on your desktop ("works for 
me, now go away").  So you rdesktop to your guest and monitor it.


You can't run anything on the host - you don't have access to it, you 
don't know who admins it (it's a program anyway), "the host" doesn't 
even exist, the guest moves around whenever the cloud feels like it.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:37 PM, Ingo Molnar wrote:

* Avi Kivity  wrote:

   

Certainly guests that we don't port won't be able to use this.  I doubt
we'll be able to make Windows work with this - the only performance tool I'm
familiar with on Windows is Intel's VTune, and that's proprietary.
 

Dont you see the extreme irony of your wish to limit Linux kernel design
decisions and features based on ... Windows and other proprietary
software?
   

Not at all.  Virtualization is a hardware compatibility game.  To see what
happens if you don't play it, see Xen.  Eventually they to implemented
hardware support even though the pv approach is so wonderful.
 

That's not quite equivalent though.

KVM used to be the clean, integrate-code-with-Linux virtualization approach,
designed specifically for CPUs that can be virtualized properly. (VMX support
first, then SVM, etc.)

KVM virtualized ages-old concepts with relatively straightforward hardware
ABIs: x86 execution, IRQ abstractions, device abstractions, etc.

Now you are in essence turning that all around:

  - the PMU is by no means properly virtualized nor really virtualizable by
direct access. There's no virtual PMU that ticks independently of the host
PMU.
   


There are no guest debug registers that can be programmed independently of 
the host debug registers, but we manage somehow.  It's not perfect, but 
better than nothing.


For the common case of host-only or guest-only monitoring, things will 
work, perhaps without socketwide counters in security concious 
environments.  When both are used at the same time, something will have 
to give.



  - the PMU hardware itself is not a well standardized piece of hardware. It's
very vendor dependent and very limiting.
   


That's life.  If we force standardization by having a soft pmu, we'll be 
very limited as well.  If we don't, we reduce hardware independence 
which is a strong point of virtualization.  Clearly we need to make a 
trade-off here.


In favour of hardware dependence is that tools and users are already 
used to it.  There is also the architectural pmu that can provide a 
limited form of hardware independence.


Going pv trades off hardware dependence for software dependence.  
Suddenly only guests that you have control over can use the pmu.



So to some degree you are playing the role of Xen in this specific affair. You
are pushing for something that shouldnt be done in that form. You want to
interfere with the host PMU by going via the fast&  easy short-term hack to
just let the guest OS have the PMU, without any regard to how this impacts
long-term feasible solutions.
   


Maybe.  And maybe the vendors will improve virtualization support for 
the pmu, rendering the pv approach obsolete on new hardware.



I.e. you are a bit like the guy who would have told Linus in 1994:

  " Dude, why dont you use the Windows APIs? It's far more compatible and
that's the only way you could run any serious apps. Besides, it requires
no upgrade. Admittedly it's a bit messy and 16-bit but hey, that's our
installed base after all. "
   


Hey, maybe we'd have significant desktop market share if he'd done this 
(though a replay of the wine history is much more likely).


But what are you suggesting?  That we make Windows a second class 
guest?  Most users run a mix of workloads, that will not go down well 
with them.  The choice is between first-class Windows support vs 
becoming a hobby hypervisor.


Let's make a kernel/user analogy again.  Would you be in favour of 
GPL-only-ing new syscalls, to give open source applications an edge over 
proprietary apps (technically known as "crap" among some)?


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
> On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
> >> That's 7 more than what we support now, and 7 more than what we can
> >> guarantee without it.
> >>  
> > Again, what windows software uses only those 7? Does it pay to only have
> > access to those 7 or does it limit the usability to exactly the same
> > subset a paravirt interface would?
> >
> 
> Good question.  Would be interesting to try out VTune with the non-arch 
> pmu masked out.

Also, the ANY bit is part of the intel arch pmu, but you still have to
mask it out.
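
For example (purely illustrative, not part of any posted patch), a
hypervisor that traps writes to IA32_PERFEVTSELx could strip the ANY bit
(bit 21) before programming the hardware; kvm_pmu_sanitize_evtsel() below
is a hypothetical helper:

#include <linux/types.h>

#define ARCH_PERFMON_EVENTSEL_ANY	(1ULL << 21)	/* count both HT siblings */

/* hypothetical: the guest wrote 'data' to IA32_PERFEVTSELx, sanitize it */
static u64 kvm_pmu_sanitize_evtsel(u64 data)
{
	/* never let a guest count events from the sibling thread */
	data &= ~ARCH_PERFMON_EVENTSEL_ANY;
	return data;
}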

BTW, just wondering, why would a developer be running VTune in a guest
anyway? I'd think that a developer that is windows oriented would simply
run windows on his desktop and VTune there.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote:
> On 02/26/2010 05:08 PM, Peter Zijlstra wrote:
> >> That's 7 more than what we support now, and 7 more than what we can
> >> guarantee without it.
> >>  
> > Again, what windows software uses only those 7? Does it pay to only have
> > access to those 7 or does it limit the usability to exactly the same
> > subset a paravirt interface would?
> >
> 
> Good question.  Would be interesting to try out VTune with the non-arch 
> pmu masked out.

From what I understood VTune uses PEBS+LBR, although I suppose they have
simple PMU modes too, never actually seen the software.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 16:53 +0200, Avi Kivity wrote:
> > If you give a full PMU to a guest it's a whole different dimension and 
> > quality
> > of information. Literally hundreds of different events about all sorts of
> > aspects of the CPU and the hardware in general.
> >
> 
> Well, we filter out the bad events then. 

Which requires trapping the MSR access, at which point a soft-PMU is
almost there, right?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 05:08 PM, Peter Zijlstra wrote:

That's 7 more than what we support now, and 7 more than what we can
guarantee without it.
 

Again, what windows software uses only those 7? Does it pay to only have
access to those 7 or does it limit the usability to exactly the same
subset a paravirt interface would?
   


Good question.  Would be interesting to try out VTune with the non-arch 
pmu masked out.
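
One way to try that (a sketch under the assumption that the guest's CPUID
is set up from userspace via KVM_SET_CPUID2; mask_non_arch_pmu() is a
hypothetical helper) would be to clamp CPUID leaf 0xA before handing it to
the guest, so only a minimal architectural pmu is advertised:

#include <linux/kvm.h>

/*
 * Hypothetical userspace helper: advertise only architectural perfmon
 * version 1 with two generic counters in CPUID leaf 0xA.
 */
static void mask_non_arch_pmu(struct kvm_cpuid_entry2 *ent, int nent)
{
	int i;

	for (i = 0; i < nent; i++) {
		if (ent[i].function != 0xa)
			continue;
		/* keep counter width / EBX vector length, clamp version and count */
		ent[i].eax = (ent[i].eax & 0xffff0000) | (2 << 8) | 1;
		ent[i].ebx = 0;		/* no architectural events marked unavailable */
		ent[i].edx = 0;		/* no fixed-function counters */
	}
}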


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 16:54 +0200, Avi Kivity wrote:
> On 02/26/2010 04:27 PM, Peter Zijlstra wrote:
> > On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
> >
> >> That actually works on the Intel-only architectural pmu.  I'm beginning
> >> to like it more and more.
> >>  
> > Only for the arch defined events, all _7_ of them.
> >
> 
> That's 7 more than what we support now, and 7 more than what we can 
> guarantee without it.

Again, what windows software uses only those 7? Does it pay to only have
access to those 7 or does it limit the usability to exactly the same
subset a paravirt interface would?

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:27 PM, Peter Zijlstra wrote:

On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
   

That actually works on the Intel-only architectural pmu.  I'm beginning
to like it more and more.
 

Only for the arch defined events, all _7_ of them.
   


That's 7 more than what we support now, and 7 more than what we can 
guarantee without it.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:12 PM, Ingo Molnar wrote:


Again you are making an incorrect assumption: that information leakage via the
PMU only occurs while the host is running on that CPU. It does not - the PMU
can leak general system details _while the guest is running_.
   

You mean like bus transactions on a multicore?  Well, we're already
exposed to cache timing attacks.
 

If you give a full PMU to a guest it's a whole different dimension and quality
of information. Literally hundreds of different events about all sorts of
aspects of the CPU and the hardware in general.
   


Well, we filter out the bad events then.


So for this and for the many other reasons we dont want to give a raw PMU to
guests:

  - A paravirt event driver is more compatible and more transparent in the long
run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
without having to upgrade the guest OS. Via that a guest OS could even be
live-migrated to a different PMU, without noticing anything about it.
   

What about Windows?
 

What is your question? Why should i limit Linux kernel design decisions based
on any aspect of Windows? You might want to support it, but _please_ dont let
the design be dictated by it ...
   


In our case the quality of implementation is judged by how well we 
support workloads that users run, and that means we have to support 
Windows well.  And that more or less means we can't have a pv-only pmu.


Which part of this do you disagree with?


In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
cannot be live-migrated. (save/restore doesnt help)
   

Why not?  So long as the source and destination are compatible?
 

'As long as it works' is certainly a good enough filter for quality ;-)
   


We already have this.  If you expose sse4.2 to the guest, you can't 
migrate to a host which doesn't support it.  If you expose a Nehalem pmu 
to the guest, you can't migrate to a host which doesn't support it.  Users and 
tools already understand this.


It's true that the pmu case is more difficult since you can't migrate 
forwards as well as backwards, but that's life.



No, we can hide insecure events with a full pmu.  Trap the control register
and don't pass it on to the hardware.
 

So you basically concede partial emulation ...
   


Yes.  Still appears to follow the spec to the guest, though.  And with 
the option of full emulation for those who need it and sign on the 
dotted line.
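
A sketch of what such a trap could look like (illustrative only; the
allowlist and helper names are made up for this example): on a trapped
IA32_PERFEVTSELx write, inspect the event select and unit mask fields and
refuse to enable anything not explicitly allowed.

#include <linux/types.h>

#define EVTSEL_EVENT_MASK	0x000000ffULL	/* bits 7:0  event select */
#define EVTSEL_UMASK_MASK	0x0000ff00ULL	/* bits 15:8 unit mask */
#define EVTSEL_ENABLE		(1ULL << 22)

/* hypothetical allowlist: the architectural events only, (umask << 8) | event */
static const u16 allowed_events[] = {
	0x003c, 0x013c, 0x00c0, 0x4f2e, 0x412e, 0x00c4, 0x00c5,
};

static bool evtsel_allowed(u64 data)
{
	u16 key = data & (EVTSEL_UMASK_MASK | EVTSEL_EVENT_MASK);
	unsigned int i;

	for (i = 0; i < sizeof(allowed_events) / sizeof(allowed_events[0]); i++)
		if (allowed_events[i] == key)
			return true;
	return false;
}

/* trap handler: drop the enable bit for events the policy forbids */
static u64 filter_guest_evtsel(u64 data)
{
	if (!evtsel_allowed(data))
		data &= ~EVTSEL_ENABLE;
	return data;
}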



  - There's proper event scheduling and event allocation. Time-slicing, etc.


The thing is, we made quite similar arguments in the past, during the
perfmon vs. perfcounters discussions. There's really a big advantage to
proper abstractions, both on the host and on the guest side.
   

We only control half of the equation.  That's very different compared to
tools/perf.
 

You mean Windows?

For heaven's sake, why dont you think like Linus thought 20 years ago. To the
hell with Windows suckiness and lets make sure our stuff works well.


In our case, making our stuff work well means making sure guests of the 
user's choice run well.  Not ours.  Currently users mostly choose 
Windows and Linux, so we have to make them both work.


(btw, the analogy would be, 'To hell with Unix suckiness, let's make 
sure our stuff works well'; where Linux reimplemented the Unix APIs, 
ensuring source compatibility with applications, kvm reimplements the 
hardware interface, ensuring binary compatibility with guests).



  Then the
users will come, developers will come, and people will profile Linux under
Linux and maybe the tools will be so good that they'll profile under Linux
using Wine just to be able to use those good tools...
   


If we don't support Windows well, users will walk away, followed by 
starving developers.



If you gut Linux capabilities like that to accommodate the suckiness of
Windows, without giving a technological edge to Linux, then we are bound
to fail in the long run ...
   


I'm all for abusing the tight relationship between Linux-as-a-host and 
Linux-as-a-guest to gain an advantage for both.  One fruitful area would 
be asynchronous page faults, which has the potential to increase memory 
overcommit, for example.  But first of all we need to make sure that 
there is a baseline of support for all commonly used guests.


I think of it this way: once kvm deployment becomes widespread, 
Linux-as-a-guest gains an advantage.  But in order for kvm deployment to 
become widespread, it needs excellent support for all guests users 
actually use.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote:
> 
> Scheduling at event granularity would be a good thing.  However we need 
> to be able to handle the guest using the full pmu. 

Does the full PMU include things like LBR, PEBS and uncore? in that
case, there is no way you're going to get that properly and securely
virtualized by using raw access.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote:
> 
> Even if there were no security considerations, if the guest can observe 
> host data in the pmu, it means the pmu is inaccurate.  We should expose 
> guest data only in the guest pmu.  That's not difficult to do, you stop 
> the pmu on exit and swap the counters on context switches. 

That's not enough, memory node wide counters are impossible to isolate
like that, the same for core wide (ANY flag) counters.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 14:51 +0100, Jes Sorensen wrote:
> 
> > Furthermore, when KVM doesn't virtualize the physical system topology,
> > some PMU features cannot even be sanely used from a vcpu.
> 
> That is definitely an issue, and there is nothing we can really do about
> that. Having two guests running in parallel under KVM means that they
> are going to see more cache misses than they would if they ran barebone
> on the hardware.
> 
> However even with all of this, we have to keep in mind who is going to
> use the performance monitoring in a guest. It is going to be application
> writers, mostly people writing analytical/scientific applications. They
> rarely have control over the OS they are running on, but are given
> systems and told to work on what they are given. Driver upgrades and
> things like that don't come quickly. However they also tend to
> understand limitations like these and will be able to still benefit from
> perf on a system like that.

What I meant was things like memory controller bound counters, intel
uncore and amd northbridge, without knowing what node the vcpu got
scheduled to there is no way they can program the raw hardware in a
meaningful way, amd nb in particular is interesting in that you could
choose not to offer the intel uncore msrs, but the amd nb are shadowed
over the generic pmcs, so you have no way to filter those out.

Same goes for stuff like the intel ANY flag, LBR filter control and
similar muck, a vcpu can't make use of those things in a meaningful
manner.

Also, intel debugstore things requires a host linear address, again, not
something a vcpu can easily provide (although that might be worked
around with an msr trap, but that still limits you to 1 page data sizes,
not a limitation all software will respect).

> All that said, what we really want is for Intel+AMD to come up with
> proper hw PMU virtualization support that makes it easy to rotate the
> full PMU in and out for a guest. Then this whole discussion will become
> a non issue.

As it stands there simply are a number of PMU features that defy being
virtualized, simply because the virt stuff doesn't do system topology.
So even if they were to support a virtualized pmu, it would likely be a
different beast than the native hardware, and it will be several
hardware models in the future; coming up with a paravirt interface and
getting !linux hosts to adapt and !linux guests to use it is probably
just as 'easy'.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> >> Certainly guests that we don't port won't be able to use this.  I doubt
> >> we'll be able to make Windows work with this - the only performance tool 
> >> I'm
> >> familiar with on Windows is Intel's VTune, and that's proprietary.
> >
> > Dont you see the extreme irony of your wish to limit Linux kernel design 
> > decisions and features based on ... Windows and other proprietary 
> > software?
> 
> Not at all.  Virtualization is a hardware compatibility game.  To see what 
> happens if you don't play it, see Xen.  Eventually they too implemented 
> hardware support even though the pv approach is so wonderful.

That's not quite equivalent though.

KVM used to be the clean, integrate-code-with-Linux virtualization approach, 
designed specifically for CPUs that can be virtualized properly. (VMX support 
first, then SVM, etc.)

KVM virtualized ages-old concepts with relatively straightforward hardware 
ABIs: x86 execution, IRQ abstractions, device abstractions, etc.

Now you are in essence turning that all around:

 - the PMU is by no means properly virtualized nor really virtualizable by 
   direct access. There's no virtual PMU that ticks independently of the host 
   PMU.

 - the PMU hardware itself is not a well standardized piece of hardware. It's 
   very vendor dependent and very limiting.

So to some degree you are playing the role of Xen in this specific affair. You 
are pushing for something that shouldnt be done in that form. You want to 
interfere with the host PMU by going via the fast & easy short-term hack to 
just let the guest OS have the PMU, without any regard to how this impacts 
long-term feasible solutions.

I.e. you are a bit like the guy who would have told Linus in 1994:

 " Dude, why dont you use the Windows APIs? It's far more compatible and 
   that's the only way you could run any serious apps. Besides, it requires 
   no upgrade. Admittedly it's a bit messy and 16-bit but hey, that's our 
   installed base after all. "

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote:
> 
> That actually works on the Intel-only architectural pmu.  I'm beginning 
> to like it more and more. 

Only for the arch defined events, all _7_ of them.
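
For reference, the seven pre-defined architectural events (event select and
unit mask values as documented in the Intel SDM) can be written down as
follows; this is just an illustrative enumeration, not code from the thread:

/* Intel architectural performance events, encoded as (umask << 8) | event */
enum arch_perfmon_event {
	ARCH_EVT_CORE_CYCLES	= 0x003c,	/* UnHalted Core Cycles */
	ARCH_EVT_INSTRUCTIONS	= 0x00c0,	/* Instructions Retired */
	ARCH_EVT_REF_CYCLES	= 0x013c,	/* UnHalted Reference Cycles */
	ARCH_EVT_LLC_REFERENCES	= 0x4f2e,	/* Last Level Cache References */
	ARCH_EVT_LLC_MISSES	= 0x412e,	/* Last Level Cache Misses */
	ARCH_EVT_BRANCHES	= 0x00c4,	/* Branch Instructions Retired */
	ARCH_EVT_BRANCH_MISSES	= 0x00c5,	/* Branch Mispredicts Retired */
};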

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:01 PM, Ingo Molnar wrote:

* Avi Kivity  wrote:

   

On 02/26/2010 03:31 PM, Ingo Molnar wrote:
 

* Avi Kivity   wrote:

   

Or do you mean to define a new, kvm-specific pmu model and feed it off the
host pmu?  In this case all the guests will need to be taught about it,
which raises the compatibility problem.
 

You are missing two big things wrt. compatibility here:

  1) The first upgrade overhead is a one time overhead only.
   

May be one too many, for certain guests.  Of course it may be argued
that if the guest wants performance monitoring that much, they will
upgrade.
 

Yes, that can certainly be argued.

Note another logical inconsistency: you are assuming reluctance to upgrade for
a set of users who are doing _performance analysis_.

In fact those types of users are amongst the most upgrade-happy. Often they'll
run modern hardware and modern software. Most of the time they are developers
themselves who try to make sure their stuff works on the latest&  greatest
hardware _and_ software.
   


I wouldn't go as far, but I agree there is less resistance to change 
here.  A Windows user certainly ought to be willing to install a new 
VTune release, and a RHEL user can be convinced to upgrade from (say) 
5.4 to 5.6 with new backported paravirt pmu support.


I wouldn't like to force them to upgrade to 2.6.3x though.  Many of 
those users will be developers of in-house applications who are trying 
to understand their applications under production loads.



Certainly guests that we don't port won't be able to use this.  I doubt
we'll be able to make Windows work with this - the only performance tool I'm
familiar with on Windows is Intel's VTune, and that's proprietary.
 

Dont you see the extreme irony of your wish to limit Linux kernel design
decisions and features based on ... Windows and other proprietary software?
   


Not at all.  Virtualization is a hardware compatibility game.  To see 
what happens if you don't play it, see Xen.  Eventually they too 
implemented hardware support even though the pv approach is so wonderful.


If we go the pv route, we'll limit the usefulness of Linux in this 
scenario to a subset of guests.  Users will simply walk away and choose 
a hypervisor whose authors have less interest in irony and more in 
providing the features they want.


A pv approach can come after we have a baseline that is useful to all users.


  2) Once a Linux guest has upgraded, it will work in the future, with _any_
 future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system on new
hardware, without having to upgrade that guest for the new CPU support.
   

That also works for the architectural pmu, of course that's Intel
only.  And there you don't need to upgrade the guest even once.
 

Besides being Intel only, it only exposes a limited sub-set of hw events. (far
fewer than the generic ones offered by perf events)

   


Things aren't mutually exclusive.  Offer the arch pmu for maximum future 
compatibility (Intel only, alas), the full pmu for maximum features, and 
the pv pmu for flexibility.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> On 02/26/2010 03:44 PM, Ingo Molnar wrote:
> >* Avi Kivity  wrote:
> >
> >>On 02/26/2010 03:06 PM, Ingo Molnar wrote:
> >Firstly, an emulated PMU was only the second-tier option i suggested. By 
> >far
> >the best approach is native API to the host regarding performance events 
> >and
> >good guest side integration.
> >
> >Secondly, the PMU cannot be 'given' to the guest in the general case. 
> >Those
> >are privileged registers. They can expose sensitive host execution 
> >details,
> >etc. etc. So if you emulate a PMU you have to exit out of most PMU 
> >accesses
> >anyway for a secure solution. (RDPMC can still be supported, but in close
> >cooperation with the host)
> There is nothing secret in the host PMU, and it's easy to clear out the
> counters before passing them off to the guest.
> >>>That's wrong. On some CPUs the host PMU can be used to say sample aspects 
> >>>of
> >>>another CPU, allowing statistical attacks to recover crypto keys. It can be
> >>>used to sample memory access patterns of another node.
> >>>
> >>>There's a good reason PMU configuration registers are privileged and 
> >>>there's
> >>>good value in only giving a certain sub-set to less privileged entities by
> >>>default.
> >>Even if there were no security considerations, if the guest can observe host
> >>data in the pmu, it means the pmu is inaccurate.  We should expose guest
> >>data only in the guest pmu.  That's not difficult to do, you stop the pmu on
> >>exit and swap the counters on context switches.
> >Again you are making an incorrect assumption: that information leakage via 
> >the
> >PMU only occurs while the host is running on that CPU. It does not - the PMU
> >can leak general system details _while the guest is running_.
> 
> You mean like bus transactions on a multicore?  Well, we're already
> exposed to cache timing attacks.

If you give a full PMU to a guest it's a whole different dimension and quality 
of information. Literally hundreds of different events about all sorts of 
aspects of the CPU and the hardware in general.

> >So for this and for the many other reasons we dont want to give a raw PMU to
> >guests:
> >
> >  - A paravirt event driver is more compatible and more transparent in the 
> > long
> >run: it allows hardware upgrade and upgraded PMU functionality (for 
> > Linux)
> >without having to upgrade the guest OS. Via that a guest OS could even be
> >live-migrated to a different PMU, without noticing anything about it.
> 
> What about Windows?

What is your question? Why should i limit Linux kernel design decisions based 
on any aspect of Windows? You might want to support it, but _please_ dont let 
the design be dictated by it ...

> >In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
> >always assumes the guest OS is upgraded to the host. Also, 'raw' PMU 
> > state
> >cannot be live-migrated. (save/restore doesnt help)
> 
> Why not?  So long as the source and destination are compatible?

'As long as it works' is certainly a good enough filter for quality ;-)

> >  - It's far cleaner on the host side as well: more granular, per event usage
> >is possible. The guest can use portion of the PMU (managed by the host),
> >and the host can use a portion too.
> >
> >In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
> >precludes the host OS from running some different piece of 
> > instrumentation
> >at the same time.
> 
> Right, time slicing is something we want.
> 
> >  - It's more secure: the host can have a finegrained policy about what 
> > kinds of
> >events it exposes to the guest. It might choose to only expose software
> >events for example.
> >
> >In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is
> >an all-or-nothing policy affair: either you fully allow the guest (and 
> > live
> >with whatever consequences the piece of hardware that takes up a fair 
> > chunk
> >on the CPU die causes), or you allow none of it.
> 
> No, we can hide insecure events with a full pmu.  Trap the control register 
> and don't pass it on to the hardware.

So you basically concede partial emulation ...

> >  - A proper paravirt event driver gives more features as well: it can 
> > exposes
> >host software events and tracepoints, probes - not restricting itself to
> >the 'hardware PMU' abstraction.
> 
> But it is limited to whatever the host stack supports.  At least
> that's our control, but things like PEBS will take a ton of work.

PEBS support is being implemented for perf, as a transparent feature. So once 
it's available, PEBS support will magically improve the quality of guest OS 
samples, if a paravirt driver approach is used and if sys_perf_event_open() is 
taught about that driver. Without any other change needed on the guest side.

> >  - There's proper event scheduling and event allocation. Time-slicing, etc.

Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 04:07 PM, Jes Sorensen wrote:

On 02/26/10 14:27, Ingo Molnar wrote:


* Jes Sorensen  wrote:
You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon 
v2,

whereas Nehalem and Atom are v3 if I remember correctly. [...]


Of course you can emulate a good portion of it, as long as there's perf
support on the host side for P4.


Actually P4 is pretty uninteresting in this discussion due to the lack
of VMX support, it's the same issue for Nehalem vs Core2. The problem
is the same though, we cannot tell the guest that yes P4 has this
event, but no, we are going to feed you bogus data.


The Pentium D, which is a P4 derivative, has vmx support.  However it is 
so slow I'm fine with ignoring it for this feature.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:27, Ingo Molnar wrote:


* Jes Sorensen  wrote:

You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2,
whereas Nehalem and Atom are v3 if I remember correctly. [...]


Of course you can emulate a good portion of it, as long as there's perf
support on the host side for P4.


Actually P4 is pretty uninteresting in this discussion due to the lack
of VMX support, it's the same issue for Nehalem vs Core2. The problem
is the same though, we cannot tell the guest that yes P4 has this
event, but no, we are going to feed you bogus data.


If the guest programs a cachemiss event, you program a cachemiss perf event on
the host and feed its values to the emulated MSR state. You _dont_ program the
raw PMU on the host side - just use the API i outlined to get struct
perf_event.

The emulation wont be perfect: not all events will count and not all events
will be available in a P4 (and some Core2 events might not even make sense in
a P4), but that is reality as well: often documented events dont count, and
often non-documented events count.

What matters to 99.9% of people who actually use this stuff is a few core sets
of events - which are available in P4s and in Core2 as well. Cycles,
instructions, branches, maybe cache-misses. Sometimes FPU stuff.


I really do not like to make guesses about how people use this stuff.
The things you and I look for as kernel hackers are often very different
than application authors look for and use. That is one thing I learned
from being exposed to strange Fortran programmers at SGI.

It makes me very uncomfortable telling a guest OS that we offer features
X, Y, Z and then start lying, feeding back numbers that do not match what
was requested, and there is no way to tell the guest that.


For Linux<->Linux the sanest, tier-1 approach would be to map sys_perf_open()
on the guest side over to the host, transparently, via a paravirt driver.


Paravirt is a nice optimization, but is and will always be an
optimization. Fact of the matter is that the bulk of usage of
virtualization is for running distributions with slow kernel
upgrade rates, like SLES and RHEL, and other proprietary operating
systems which we have no control over. Para-virt will do little good for
either of these groups.

Cheers,
Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> On 02/26/2010 03:31 PM, Ingo Molnar wrote:
> >* Avi Kivity  wrote:
> >
> >>Or do you mean to define a new, kvm-specific pmu model and feed it off the
> >>host pmu?  In this case all the guests will need to be taught about it,
> >>which raises the compatibility problem.
> >You are missing two big things wrt. compatibility here:
> >
> >  1) The first upgrade overhead is a one time overhead only.
> 
> May be one too many, for certain guests.  Of course it may be argued
> that if the guest wants performance monitoring that much, they will
> upgrade.

Yes, that can certainly be argued.

Note another logical inconsistency: you are assuming reluctance to upgrade for 
a set of users who are doing _performance analysis_.

In fact those types of users are amongst the most upgrade-happy. Often they'll 
run modern hardware and modern software. Most of the time they are developers 
themselves who try to make sure their stuff works on the latest & greatest 
hardware _and_ software.

So people running P4's trying to tune their stuff under Red Hat Linux 9 and 
trying to use the PMU under KVM is not really a concern rooted overly deeply in 
reality.

> Certainly guests that we don't port won't be able to use this.  I doubt 
> we'll be able to make Windows work with this - the only performance tool I'm 
> familiar with on Windows is Intel's VTune, and that's proprietary.

Dont you see the extreme irony of your wish to limit Linux kernel design 
decisions and features based on ... Windows and other proprietary software?

> >  2) Once a Linux guest has upgraded, it will work in the future, with _any_
> > future CPU - _without_ having to upgrade the guest!
> >
> >Dont you see the advantage of that? You can instrument an old system on new
> >hardware, without having to upgrade that guest for the new CPU support.
> 
> That also works for the architectural pmu, of course that's Intel
> only.  And there you don't need to upgrade the guest even once.

Besides being Intel only, it only exposes a limited sub-set of hw events. (far 
fewer than the generic ones offered by perf events)

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:37 PM, Jes Sorensen wrote:

On 02/26/10 14:31, Ingo Molnar wrote:

You are missing two big things wrt. compatibility here:

  1) The first upgrade overhead is a one time overhead only.

  2) Once a Linux guest has upgraded, it will work in the future, with _any_
 future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system on new
hardware, without having to upgrade that guest for the new CPU support.


That would only work if you are guaranteed to be able to emulate old
hardware on new hardware. Not going to be feasible, so then we are in a
real mess.



That actually works on the Intel-only architectural pmu.  I'm beginning 
to like it more and more.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:44 PM, Ingo Molnar wrote:

* Avi Kivity  wrote:

   

On 02/26/2010 03:06 PM, Ingo Molnar wrote:
 
   

Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.

Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)
   

There is nothing secret in the host PMU, and it's easy to clear out the
counters before passing them off to the guest.
 

That's wrong. On some CPUs the host PMU can be used to say sample aspects of
another CPU, allowing statistical attacks to recover crypto keys. It can be
used to sample memory access patterns of another node.

There's a good reason PMU configuration registers are privileged and there's
good value in only giving a certain sub-set to less privileged entities by
default.
   

Even if there were no security considerations, if the guest can observe host
data in the pmu, it means the pmu is inaccurate.  We should expose guest
data only in the guest pmu.  That's not difficult to do, you stop the pmu on
exit and swap the counters on context switches.
 

Again you are making an incorrect assumption: that information leakage via the
PMU only occurs while the host is running on that CPU. It does not - the PMU
can leak general system details _while the guest is running_.
   


You mean like bus transactions on a multicore?  Well, we're already 
exposed to cache timing attacks.



So for this and for the many other reasons we dont want to give a raw PMU to
guests:

  - A paravirt event driver is more compatible and more transparent in the long
run: it allows hardware upgrade and upgraded PMU functionality (for Linux)
without having to upgrade the guest OS. Via that a guest OS could even be
live-migrated to a different PMU, without noticing anything about it.
   


What about Windows?


In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state
cannot be live-migrated. (save/restore doesnt help)
   


Why not?  So long as the source and destination are compatible?


  - It's far cleaner on the host side as well: more granular, per event usage
is possible. The guest can use portion of the PMU (managed by the host),
and the host can use a portion too.

In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
precludes the host OS from running some different piece of instrumentation
at the same time.
   


Right, time slicing is something we want.


  - It's more secure: the host can have a finegrained policy about what kinds of
events it exposes to the guest. It might choose to only expose software
events for example.

In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is
an all-or-nothing policy affair: either you fully allow the guest (and live
with whatever consequences the piece of hardware that takes up a fair chunk
on the CPU die causes), or you allow none of it.
   


No, we can hide insecure events with a full pmu.  Trap the control 
register and don't pass it on to the hardware.



  - A proper paravirt event driver gives more features as well: it can exposes
host software events and tracepoints, probes - not restricting itself to
the 'hardware PMU' abstraction.
   


But it is limited to whatever the host stack supports.  At least that's 
our control, but things like PEBS will take a ton of work.



  - There's proper event scheduling and event allocation. Time-slicing, etc.


The thing is, we made quite similar arguments in the past, during the perfmon
vs. perfcounters discussions. There's really a big advantage to proper
abstractions, both on the host and on the guest side.
   


We only control half of the equation.  That's very different compared to 
tools/perf.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:28, Peter Zijlstra wrote:

On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:


It would be the other way round - the host would steal the pmu from the
guest.  Later we can try to time-slice and extrapolate, though that's
not going to be easy.


Right, so perf already does the time slicing and interpolating thing, so
a soft-pmu gets that for free.


What I don't like here is that without rewriting the guest OS, there
will be two layers of time-slicing and extrapolation. That is going to
make the reported numbers close to useless.


Anyway, this discussion seems somewhat in a stale-mate position.

The KVM folks basically demand a full PMU MSR shadow with PMI
passthrough so that their $legacy shit works without modification.

My question with that is how $legacy muck can ever know how the current
PMU works, you can't even properly emulate a core2 pmu on a nehalem
because intel keeps messing with the event codes for every new model.

So basically for this to work means the guest can't run legacy stuff
anyway, but needs to run very up-to-date software, so we might as well
create a soft-pmu/paravirt interface now and have all up-to-date
software support that for the next generation.


That is the problem. Today there is a large install base out there of
core2 users who wish to measure their stuff on the hardware they have.
The same will be true for Nehalem based stuff, when whatever replaces
Nehalem comes out makes that incompatible.

Since we are unable to emulate Core2 on Nehalem, and almost certainly
will be unable to emulate Nehalem on it's successor, we are stuck with
this.

A para-virt interface is a nice idea, but since we cannot emulate an
old CPU properly it still means there isn't much we can do as we're
stuck with the same limitations. I simply don't see the value of introducing
a para-virt interface for this.


Furthermore, when KVM doesn't virtualize the physical system topology,
some PMU features cannot even be sanely used from a vcpu.


That is definitely an issue, and there is nothing we can really do about
that. Having two guests running in parallel under KVM means that they
are going to see more cache misses than they would if they ran barebone
on the hardware.

However even with all of this, we have to keep in mind who is going to
use the performance monitoring in a guest. It is going to be application
writers, mostly people writing analytical/scientific applications. They
rarely have control over the OS they are running on, but are given
systems and told to work on what they are given. Driver upgrades and
things like that don't come quickly. However they also tend to
understand limitations like these and will be able to still benefit from
perf on a system like that.


So while currently a root user can already tie up all of the pmu using
perf, simply using that to hand the full pmu off to the guest still
leaves lots of issues.


Well isn't that the case with the current setup anyway? If enough user
apps start requesting PMU resources, the hw is going to run out of
counters very quickly anyway.

The real issue here IMHO is whether or not it is possible to use a PMU
to count anything on a different CPU. If that is really possible, sharing
the PMU is not an option :(

All that said, what we really want is for Intel+AMD to come up with
proper hw PMU virtualization support that makes it easy to rotate the
full PMU in and out for a guest. Then this whole discussion will become
a non issue.

Cheers,
Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:28 PM, Peter Zijlstra wrote:

On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:

   

It would be the other way round - the host would steal the pmu from the
guest.  Later we can try to time-slice and extrapolate, though that's
not going to be easy.
 

Right, so perf already does the time slicing and interpolating thing, so
a soft-pmu gets that for free.
   


True.


Anyway, this discussion seems somewhat in a stale-mate position.

The KVM folks basically demand a full PMU MSR shadow with PMI
passthrough so that their $legacy shit works without modification.

My question with that is how $legacy muck can ever know how the current
PMU works, you can't even properly emulate a core2 pmu on a nehalem
because intel keeps messing with the event codes for every new model.
   


Right, this is pretty bad.  For Windows it's probably acceptable to 
upgrade your performance tools (since that's separate from the OS).  In 
Linux it is integrated into the kernel, and it's fairly unacceptable to 
demand a kernel upgrade when your host is upgraded underneath you.



So basically for this to work means the guest can't run legacy stuff
anyway, but needs to run very up-to-date software, so we might as well
create a soft-pmu/paravirt interface now and have all up-to-date
software support that for the next generation.
   


Still that leaves us with no Windows / non-Linux solution.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> On 02/26/2010 03:06 PM, Ingo Molnar wrote:
> >
> >>>Firstly, an emulated PMU was only the second-tier option i suggested. By 
> >>>far
> >>>the best approach is native API to the host regarding performance events 
> >>>and
> >>>good guest side integration.
> >>>
> >>>Secondly, the PMU cannot be 'given' to the guest in the general case. Those
> >>>are privileged registers. They can expose sensitive host execution details,
> >>>etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
> >>>anyway for a secure solution. (RDPMC can still be supported, but in close
> >>>cooperation with the host)
> >>There is nothing secret in the host PMU, and it's easy to clear out the
> >>counters before passing them off to the guest.
> >That's wrong. On some CPUs the host PMU can be used to say sample aspects of
> >another CPU, allowing statistical attacks to recover crypto keys. It can be
> >used to sample memory access patterns of another node.
> >
> >There's a good reason PMU configuration registers are privileged and there's
> >good value in only giving a certain sub-set to less privileged entities by
> >default.
> 
> Even if there were no security considerations, if the guest can observe host 
> data in the pmu, it means the pmu is inaccurate.  We should expose guest 
> data only in the guest pmu.  That's not difficult to do, you stop the pmu on 
> exit and swap the counters on context switches.

Again you are making an incorrect assumption: that information leakage via the 
PMU only occurs while the host is running on that CPU. It does not - the PMU 
can leak general system details _while the guest is running_.

So for this and for the many other reasons we dont want to give a raw PMU to 
guests:

 - A paravirt event driver is more compatible and more transparent in the long 
   run: it allows hardware upgrade and upgraded PMU functionality (for Linux) 
   without having to upgrade the guest OS. Via that a guest OS could even be
   live-migrated to a different PMU, without noticing anything about it.

   In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS 
   always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state 
   cannot be live-migrated. (save/restore doesnt help)

 - It's far cleaner on the host side as well: more granular, per event usage
   is possible. The guest can use portion of the PMU (managed by the host), 
   and the host can use a portion too.

   In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS
   precludes the host OS from running some different piece of instrumentation
   at the same time.

 - It's more secure: the host can have a finegrained policy about what kinds of
   events it exposes to the guest. It might choose to only expose software 
   events for example.

   In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is
   an all-or-nothing policy affair: either you fully allow the guest (and live
   with whatever consequences the piece of hardware that takes up a fair chunk
   on the CPU die causes), or you allow none of it.

 - A proper paravirt event driver gives more features as well: it can exposes 
   host software events and tracepoints, probes - not restricting itself to 
   the 'hardware PMU' abstraction.

 - There's proper event scheduling and event allocation. Time-slicing, etc.


The thing is, we made quite similar arguments in the past, during the perfmon 
vs. perfcounters discussions. There's really a big advantage to proper 
abstractions, both on the host and on the guest side.
 
Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:31 PM, Ingo Molnar wrote:

* Avi Kivity  wrote:

   

Or do you mean to define a new, kvm-specific pmu model and feed it off the
host pmu?  In this case all the guests will need to be taught about it,
which raises the compatibility problem.
 

You are missing two big things wrt. compatibility here:

  1) The first upgrade overhead is a one time overhead only.
   


May be one too many, for certain guests.  Of course it may be argued 
that if the guest wants performance monitoring that much, they will upgrade.


Certainly guests that we don't port won't be able to use this.  I doubt 
we'll be able to make Windows work with this - the only performance tool 
I'm familiar with on Windows is Intel's VTune, and that's proprietary.



  2) Once a Linux guest has upgraded, it will work in the future, with _any_
 future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system on new
hardware, without having to upgrade that guest for the new CPU support.
   


That also works for the architectural pmu, of course that's Intel only.  
And there you don't need to upgrade the guest even once.


The arch pmu seems nicely done - there's a bit for every counter that 
can be enabled and disabled at will, and the number of counters is also 
determined from cpuid.
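
A rough sketch of how a guest could size the arch pmu from cpuid (the field
layout is the documented one for leaf 0xA; the helper name is made up, and
the per-counter enable bits would then live in IA32_PERF_GLOBAL_CTRL):

#include <stdio.h>
#include <cpuid.h>	/* GCC's __get_cpuid() */

/* decode CPUID leaf 0xA, the architectural performance monitoring leaf */
static void probe_arch_pmu(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx))
		return;

	printf("arch perfmon version: %u\n", eax & 0xff);
	printf("generic counters:     %u (%u bits wide)\n",
	       (eax >> 8) & 0xff, (eax >> 16) & 0xff);
	printf("fixed counters:       %u (%u bits wide)\n",
	       edx & 0x1f, (edx >> 5) & 0xff);
	printf("unavailable-events mask: 0x%x\n", ebx);
}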


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:31, Ingo Molnar wrote:

You are missing two big things wrt. compatibility here:

  1) The first upgrade overhead is a one time overhead only.

  2) Once a Linux guest has upgraded, it will work in the future, with _any_
 future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system on new
hardware, without having to upgrade that guest for the new CPU support.


That would only work if you are guaranteed to be able to emulate old
hardware on new hardware. Not going to be feasible, so then we are in a
real mess.


With the 'steal the PMU' messy approach the guest OS has to be upgraded to the
new CPU type all the time. Ad infinitum.


The way the Perfmon architecture is specified by Intel, that is what we
are stuck with. It's not going to be possible via software emulation to
count cache misses, unless you run it in a micro architecture emulator.

Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:18, Ingo Molnar wrote:


* Avi Kivity  wrote:


Can you emulate the Core 2 pmu on, say, a P4? [...]


How about the Pentium? Or the i486?

As long as there's perf events support, the CPU can be supported in a soft
PMU. You can even cross-map exotic hw events if need to be - but most of the
tooling (in just about any OS) uses just a handful of core events ...


This is only possible if all future CPU perfmon events are guaranteed
to be a superset of previous versions. Otherwise you end up emulating
events and providing randomly generated numbers back.

The perfmon revision and size we present to a guest has to match the
current host.

Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Peter Zijlstra
On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote:

> It would be the other way round - the host would steal the pmu from the 
> guest.  Later we can try to time-slice and extrapolate, though that's 
> not going to be easy. 

Right, so perf already does the time slicing and interpolating thing, so
a soft-pmu gets that for free.
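
That interpolation is visible to userspace through perf's read format; a
minimal sketch of the scaling (the event setup via sys_perf_event_open()
with PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING is
elided here):

#include <stdint.h>

/* layout returned by read() with the two TOTAL_TIME format bits set */
struct read_format {
	uint64_t value;		/* raw count while the event was scheduled */
	uint64_t time_enabled;	/* ns the event was enabled */
	uint64_t time_running;	/* ns the event actually ran on the PMU */
};

static uint64_t scaled_count(const struct read_format *rf)
{
	if (!rf->time_running)
		return 0;
	/* extrapolate for the time the event was descheduled */
	return (uint64_t)((double)rf->value *
			  rf->time_enabled / rf->time_running);
}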

Anyway, this discussion seems somewhat in a stale-mate position.

The KVM folks basically demand a full PMU MSR shadow with PMI
passthrough so that their $legacy shit works without modification. 

My question with that is how $legacy muck can ever know how the current
PMU works, you can't even properly emulate a core2 pmu on a nehalem
because intel keeps messing with the event codes for every new model.

So basically for this to work means the guest can't run legacy stuff
anyway, but needs to run very up-to-date software, so we might as well
create a soft-pmu/paravirt interface now and have all up-to-date
software support that for the next generation.

Furthermore, when KVM doesn't virtualize the physical system topology,
some PMU features cannot even be sanely used from a vcpu.

So while currently a root user can already tie up all of the pmu using
perf, simply using that to hand the full pmu off to the guest still
leaves lots of issues.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:30, Avi Kivity wrote:

On 02/26/2010 03:06 PM, Ingo Molnar wrote:

That's precisely my point: the guest should obviously not get raw
access to
the PMU. (except where it might matter to performance, such as RDPMC)


That's doable if all counters are steerable. IIRC some counters are
fixed function, but I'm not certain about that.


I am not an expert, but from what I learned from Peter, there are
constraints on some of the counters. Ie. certain types of events can
only be counted on certain counters, which limits the already very
limited number of counters even further.

Cheers,
Jes
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:27 PM, Ingo Molnar wrote:


For Linux<->Linux the sanest, tier-1 approach would be to map sys_perf_open()
on the guest side over to the host, transparently, via a paravirt driver.
   


Let us for the purpose of this discussion assume that we are also 
interested in supporting Windows and older Linux.  Paravirt 
optimizations can be added after we have the basic functionality, if 
they prove necessary.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> Or do you mean to define a new, kvm-specific pmu model and feed it off the 
> host pmu?  In this case all the guests will need to be taught about it, 
> which raises the compatibility problem.

You are missing two big things wrt. compatibility here:

 1) The first upgrade overhead is a one time overhead only.

 2) Once a Linux guest has upgraded, it will work in the future, with _any_ 
future CPU - _without_ having to upgrade the guest!

Dont you see the advantage of that? You can instrument an old system on new 
hardware, without having to upgrade that guest for the new CPU support.

With the 'steal the PMU' messy approach the guest OS has to be upgraded to the 
new CPU type all the time. Ad infinitum.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:06, Ingo Molnar wrote:


* Jes Sorensen  wrote:

Well you cannot steal the PMU without collaborating with perf_event.c, but
thats quite feasible. Sharing the PMU between the guest and the host is very
costly and guarantees incorrect results in the host. Unless you completely
emulate the PMU by faking it and then allocating PMU counters one by one at
the host level. However that means trapping a lot of MSR access.


It's not that many MSR accesses.


Well it's more than enough to double the number of MSRs KVM has to track
on switches.


There is nothing secret in the host PMU, and it's easy to clear out the
counters before passing them off to the guest.


That's wrong. On some CPUs the host PMU can be used to say sample aspects of
another CPU, allowing statistical attacks to recover crypto keys. It can be
used to sample memory access patterns of another node.

There's a good reason PMU configuration registers are privileged and there's
good value in only giving a certain sub-set to less privileged entities by
default.


If a PMU can really count stuff on another CPU, then we shouldn't allow
PMU access to any application at all. It's more than just a KVM guest vs
a KVM guest issue then, but also a thread to thread issue.

My idea was obviously not to expose host timings to a guest. Save the
counters when a guest exits, and reload them when it's restarted. Not
just when switching to another task, but also when entering KVM, to
avoid the guest seeing overhead spent within KVM.
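
A minimal sketch of that save/reload idea (hypothetical per-vcpu storage, only
the two architectural general-purpose counters, fixed counters and overflow
state ignored) might look like:

#include <linux/types.h>
#include <asm/msr.h>

#define NR_GP_COUNTERS	2			/* just the two arch counters */

struct vcpu_pmu_state {				/* hypothetical per-vcpu storage */
	u64 global_ctrl;
	u64 eventsel[NR_GP_COUNTERS];
	u64 counter[NR_GP_COUNTERS];
};

/* VM exit: stop the guest's counters so KVM/host work is not counted */
static void guest_pmu_save(struct vcpu_pmu_state *s)
{
	int i;

	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, s->global_ctrl);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
	for (i = 0; i < NR_GP_COUNTERS; i++) {
		rdmsrl(MSR_P6_EVNTSEL0 + i, s->eventsel[i]);
		rdmsrl(MSR_IA32_PERFCTR0 + i, s->counter[i]);
	}
}

/* VM entry: reload the guest's programming and let it count again */
static void guest_pmu_restore(struct vcpu_pmu_state *s)
{
	int i;

	for (i = 0; i < NR_GP_COUNTERS; i++) {
		wrmsrl(MSR_P6_EVNTSEL0 + i, s->eventsel[i]);
		wrmsrl(MSR_IA32_PERFCTR0 + i, s->counter[i]);
	}
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, s->global_ctrl);
}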


Having an allocation scheme and sharing it with the host, is a perfectly
legitimate and very clean way to do it. Once it's given to the guest, the
host knows not to touch it until it's been released again.


'Full PMU' is not the granularity i find acceptable though: please do what i
suggested, event granularity allocation and scheduling.


As I wrote earlier, at that level we have to do it all emulated. In
this case, providing any of this to a guest seems to be a waste of time
since the interface will cost way too much in trapping back and forth
and you have contention with the very limited resources in the PMU with
just 5 counters to pick from on Core2.

The guest PMU will think it's running on top of real hardware, and
scaling/estimating numbers like the perf_event.c code does today,
except that it will be using already scaled and estimated numbers for
its calculations. Application users will have little use for this.


Well with the hardware currently available, there is no such thing as clean
sharing between the host and the guest. It cannot be done without messing up
the host measurements, which effectively renders measuring at the host side
useless while a guest is allowed access to the PMU.


That's precisely my point: the guest should obviously not get raw access to
the PMU. (except where it might matter to performance, such as RDPMC)


Well either you allow access to the PMU or you don't. If you allow
direct access to the PMU counters, but not the control registers, you
have to specify the counter sizes to match that of the host, making it
impossible to really emulate a Core2 on a non-Core2 architecture, etc.

Jes


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 03:06 PM, Ingo Molnar wrote:



Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.

Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)
   

There is nothing secret in the host PMU, and it's easy to clear out the
counters before passing them off to the guest.
 

That's wrong. On some CPUs the host PMU can be used to say sample aspects of
another CPU, allowing statistical attacks to recover crypto keys. It can be
used to sample memory access patterns of another node.

There's a good reason PMU configuration registers are privileged and there's
good value in only giving a certain sub-set to less privileged entities by
default.
   


Even if there were no security considerations, if the guest can observe 
host data in the pmu, it means the pmu is inaccurate.  We should expose 
guest data only in the guest pmu.  That's not difficult to do, you stop 
the pmu on exit and swap the counters on context switches.



Having an allocation scheme and sharing it with the host, is a perfectly
legitimate and very clean way to do it. Once it's given to the guest, the
host knows not to touch it until it's been released again.
 

'Full PMU' is not the granularity i find acceptable though: please do what i
suggested, event granularity allocation and scheduling.

We are rehashing the whole 'perfmon versus perf events/counters' design
arguments again here really.
   


Scheduling at event granularity would be a good thing.  However we need 
to be able to handle the guest using the full pmu.


Note that scheduling is only needed if both the guest and host want the 
pmu at the same time - and that should be a rare case and not the one to 
optimize for.



You need to integrate it properly so that host PMU functionality still
works fine. (Within hardware constraints)
   

Well with the hardware currently available, there is no such thing as clean
sharing between the host and the guest. It cannot be done without messing up
the host measurements, which effectively renders measuring at the host side
useless while a guest is allowed access to the PMU.
 

That's precisely my point: the guest should obviously not get raw access to
the PMU. (except where it might matter to performance, such as RDPMC)
   


That's doable if all counters are steerable.  IIRC some counters are 
fixed function, but I'm not certain about that.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Jes Sorensen  wrote:

> > Agree about favouring modern processors.
> 
> You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2, 
> whereas Nehalem and Atom are v3 if I remember correctly. [...]

Of course you can emulate a good portion of it, as long as there's perf 
support on the host side for P4.

If the guest programs a cachemiss event, you program a cachemiss perf event on 
the host and feed its values to the emulated MSR state. You _dont_ program the 
raw PMU on the host side - just use the API i outlined to get struct 
perf_event.
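
A rough sketch of that flow, assuming the 2.6.33-era
perf_event_create_kernel_counter()/perf_event_read_value() signatures (the
struct and helper names here are invented for illustration, not from any
posted patch):

#include <linux/err.h>
#include <linux/sched.h>
#include <linux/perf_event.h>

struct emu_counter {			/* hypothetical per-vcpu emulated counter */
	struct perf_event *event;
	u64 guest_eventsel;		/* what the guest wrote to its EVTSEL MSR */
};

/* WRMSR exit: the guest asked for cache misses, so open a host perf event */
static int emu_counter_program(struct emu_counter *c, u64 guest_eventsel)
{
	struct perf_event_attr attr = {
		.type	= PERF_TYPE_HARDWARE,
		.config	= PERF_COUNT_HW_CACHE_MISSES,
		.size	= sizeof(attr),
	};

	c->guest_eventsel = guest_eventsel;
	/* bind to the current (vcpu) task, any cpu */
	c->event = perf_event_create_kernel_counter(&attr, -1, current->pid, NULL);
	if (IS_ERR(c->event)) {
		c->event = NULL;
		return -EINVAL;
	}
	return 0;
}

/* RDMSR exit: feed the host event's value back as the emulated counter MSR */
static u64 emu_counter_read(struct emu_counter *c)
{
	u64 enabled, running;

	if (!c->event)
		return 0;
	return perf_event_read_value(c->event, &enabled, &running);
}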

The emulation wont be perfect: not all events will count and not all events 
will be available in a P4 (and some Core2 events might not even make sense in 
a P4), but that is reality as well: often documented events dont count, and 
often non-documented events count.

What matters to 99.9% of people who actually use this stuff is a few core sets 
of events - which are available in P4s and in Core2 as well. Cycles, 
instructions, branches, maybe cache-misses. Sometimes FPU stuff.

For Linux<->Linux the sanest, tier-1 approach would be to map sys_perf_open() 
on the guest side over to the host, transparently, via a paravirt driver.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> Can you emulate the Core 2 pmu on, say, a P4? [...]

How about the Pentium? Or the i486?

As long as there's perf events support, the CPU can be supported in a soft 
PMU. You can even cross-map exotic hw events if need be - but most of the 
tooling (in just about any OS) uses just a handful of core events ...

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 14:04, Avi Kivity wrote:

On 02/26/2010 02:38 PM, Ingo Molnar wrote:

Yes, something like Core2 with 2 generic events.

That would leave 2 extra generic events on Nehalem and better. (which is
really the target CPU type for any new feature we are talking about
right now.
Plus performance analysis tends to skew towards more modern CPU types as
well.)


Can you emulate the Core 2 pmu on, say, a P4? Those P4s have very
different instruction caches so I imagine the events are very different
as well.

Agree about favouring modern processors.


You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2,
whereas Nehalem and Atom are v3 if I remember correctly. I am not even
100% sure a v3 is capable of emulating a v2, though I expect v3 to have
bigger counters than v2, but I don't think that is guaranteed. I can
only handle so many hours of reading Intel manuals per day, before I end
up in a padded cell, so I could be wrong on some of this.


Plus the emulation can be smart about it and only use up a given
number. Most
guest OSs dont use the full PMU - they use a single counter.


But you have to expose all of the counters, no? Unless you go with a
kvm-specific pmu as described below.


You have to expose at least all the fixed ones (3 on Core2) and the two
arch ones. That's the minimum, and any guest being told it's running on a
Core2 will expect to find those.
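
For reference, these are the architectural fixed-function counter MSRs such a
guest will go looking for (values as listed in the SDM and the kernel's
msr-index.h; shown here only for illustration):

#define MSR_CORE_PERF_FIXED_CTR0	0x309	/* retired instructions */
#define MSR_CORE_PERF_FIXED_CTR1	0x30a	/* unhalted core cycles */
#define MSR_CORE_PERF_FIXED_CTR2	0x30b	/* unhalted reference cycles */
#define MSR_CORE_PERF_FIXED_CTR_CTRL	0x38d	/* per-fixed-counter enable/ring bits */
#define MSR_CORE_PERF_GLOBAL_CTRL	0x38f	/* global enable for fixed + generic */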

Cheers,
Jes


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Jes Sorensen  wrote:

> On 02/26/10 12:42, Ingo Molnar wrote:
> >
> >* Jes Sorensen  wrote:
> >>
> >> I have to say I disagree on that. When you run perfmon on a system, it is 
> >> normally to measure a specific application. You want to see accurate 
> >> numbers for cache misses, mul instructions or whatever else is selected.
> >
> > You can still get those. You can even enable RDPMC access and avoid VM 
> > exits.
> >
> > What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
> 
> Well you cannot steal the PMU without collaborating with perf_event.c, but 
> thats quite feasible. Sharing the PMU between the guest and the host is very 
> costly and guarantees incorrect results in the host. Unless you completely 
> emulate the PMU by faking it and then allocating PMU counters one by one at 
> the host level. However that means trapping a lot of MSR access.

It's not that many MSR accesses.

> >Firstly, an emulated PMU was only the second-tier option i suggested. By far
> >the best approach is native API to the host regarding performance events and
> >good guest side integration.
> >
> >Secondly, the PMU cannot be 'given' to the guest in the general case. Those
> >are privileged registers. They can expose sensitive host execution details,
> >etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
> >anyway for a secure solution. (RDPMC can still be supported, but in close
> >cooperation with the host)
> 
> There is nothing secret in the host PMU, and it's easy to clear out the 
> counters before passing them off to the guest.

That's wrong. On some CPUs the host PMU can be used to say sample aspects of 
another CPU, allowing statistical attacks to recover crypto keys. It can be 
used to sample memory access patterns of another node.

There's a good reason PMU configuration registers are privileged and there's 
good value in only giving a certain sub-set to less privileged entities by 
default.

> >>We can do this in a reasonable way today, if we allow to take the PMU away
> >>from the host, and only let guests access it when it's in use. [...]
> >
> >You get my sure-fire NAK for that kind of crap though. Interfering with the
> >host PMU and stealing it, is not a technical approach that has acceptable
> >quality.
> 
> Having an allocation scheme and sharing it with the host, is a perfectly 
> legitimate and very clean way to do it. Once it's given to the guest, the 
> host knows not to touch it until it's been released again.

'Full PMU' is not the granularity i find acceptable though: please do what i 
suggested, event granularity allocation and scheduling.

We are rehashing the whole 'perfmon versus perf events/counters' design 
arguments again here really.

> > You need to integrate it properly so that host PMU functionality still 
> > works fine. (Within hardware constraints)
> 
> Well with the hardware currently available, there is no such thing as clean 
> sharing between the host and the guest. It cannot be done without messing up 
> the host measurements, which effectively renders measuring at the host side 
> useless while a guest is allowed access to the PMU.

That's precisely my point: the guest should obviously not get raw access to 
the PMU. (except where it might matter to performance, such as RDPMC)

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 02:38 PM, Ingo Molnar wrote:

* Avi Kivity  wrote:

   

On 02/26/2010 02:07 PM, Ingo Molnar wrote:
 

* Avi Kivity   wrote:

   

A native API to the host will lock out 100% of the install base now, and a
large section of any future install base.
 

... which is why i suggested the soft-PMU approach.
   

Not sure I understand it completely.

Do you mean to take the model specific host pmu events, and expose them to
the guest via trap'n'emulate?  In that case we may as well assign the host
pmu to the guest if the host isn't using it, and avoid the traps.
 

You are making the incorrect assumption that the emulated PMU uses up all host
PMU resources ...
   


Well, in the general case, it may?  If it doesn't, the host may use 
them.  We do a similar thing with debug breakpoints.


Sharing the pmu will mean trapping control msr writes at least, though.


Do you mean to choose some older pmu and emulate it using whatever pmu model
the host has?  I haven't checked, but aren't there mutually exclusive events
in every model pair?  The closest thing would be the architectural pmu
thing.
 

Yes, something like Core2 with 2 generic events.

That would leave 2 extra generic events on Nehalem and better. (which is
really the target CPU type for any new feature we are talking about right now.
Plus performance analysis tends to skew towards more modern CPU types as
well.)
   


Can you emulate the Core 2 pmu on, say, a P4?  Those P4s have very 
different instruction caches so I imagine the events are very different 
as well.


Agree about favouring modern processors.


Plus the emulation can be smart about it and only use up a given number. Most
guest OSs dont use the full PMU - they use a single counter.
   


But you have to expose all of the counters, no?  Unless you go with a 
kvm-specific pmu as described below.



Ideally for Linux<->Linux there would be a PMU paravirt driver that allocates
events on an as-needed basis.
   


Or we could watch the control register and see how the guest programs 
it, provided it doesn't do that a lot.



Or do you mean to define a new, kvm-specific pmu model and feed it off the
host pmu?  In this case all the guests will need to be taught about it,
which raises the compatibility problem.

 

And note that _any_ solution we offer locks out 100% of the installed base
right now, as no solution is in the kernel yet. The only question is what
kind of upgrade effort is needed for users to make use of the feature.
   

I meant the guest installed base.  Hosts can be upgraded transparently to
the guests (not even a shutdown/reboot).
 

The irony: this time guest-transparent solutions that need no configuration
are good? ;-)

The very same argument holds for the file server thing: a guest transparent
solution is easier wrt. the upgrade path.
   


If we add pmu support, guests can begin to use it immediately.  If we 
add the file server support, guests need to install drivers before they 
can use it, while guest admins have no motivation to do so (it helps the 
host, not the guest).


Is something wrong with just using sshfs?  Seems a lot less hassle to me.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 13:20, Avi Kivity wrote:

On 02/26/2010 02:07 PM, Ingo Molnar wrote:

... which is why i suggested the soft-PMU approach.


Not sure I understand it completely.

Do you mean to take the model specific host pmu events, and expose them
to the guest via trap'n'emulate? In that case we may as well assign the
host pmu to the guest if the host isn't using it, and avoid the traps.

Do you mean to choose some older pmu and emulate it using whatever pmu
model the host has? I haven't checked, but aren't there mutually
exclusive events in every model pair? The closest thing would be the
architectural pmu thing.


You cannot do this; as you say, there is no guarantee that there are no
overlaps, and the current host may have different counter sizes too,
which makes emulating it even more costly.

The cpuid bits basically tell you which version of the counters is
available, how many counters there are, the word size of the counters,
and I believe there are also bits stating which optional features are
available to be counted.
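
As a concrete illustration, a small user-space decoder of those cpuid 0x0a
fields (field layout as documented for architectural perfmon; the program
itself is just a sketch):

#include <stdio.h>
#include <cpuid.h>		/* gcc/clang __get_cpuid() helper */

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0x0a, &eax, &ebx, &ecx, &edx))
		return 1;	/* no leaf 0x0a: no architectural perfmon */

	printf("perfmon version   : %u\n",  eax        & 0xff);
	printf("generic counters  : %u\n", (eax >>  8) & 0xff);
	printf("counter bit width : %u\n", (eax >> 16) & 0xff);
	printf("fixed counters    : %u\n",  edx        & 0x1f);
	printf("fixed ctr width   : %u\n", (edx >>  5) & 0xff);
	printf("events unavailable: 0x%x\n", ebx);
	return 0;
}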


Or do you mean to define a new, kvm-specific pmu model and feed it off
the host pmu? In this case all the guests will need to be taught about
it, which raises the compatibility problem.


Cannot be done in a reasonable manner due to the above.

The key to all of this is that guest OSes, including that other OS,
should be able to use the performance counters without needing special
paravirt drivers or other OS modifications. If we start requiring that
kind of stuff, the whole point of having the feature goes down the
toilet.

Cheers,
Jes


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 12:42, Ingo Molnar wrote:


* Jes Sorensen  wrote:

I have to say I disagree on that. When you run perfmon on a system, it is
normally to measure a specific application. You want to see accurate numbers
for cache misses, mul instructions or whatever else is selected.


You can still get those. You can even enable RDPMC access and avoid VM exits.

What you _cannot_ do is to 'steal' the PMU and just give it to the guest.


Well you cannot steal the PMU without collaborating with perf_event.c,
but thats quite feasible. Sharing the PMU between the guest and the host
is very costly and guarantees incorrect results in the host. Unless you
completely emulate the PMU by faking it and then allocating PMU counters
one by one at the host level. However that means trapping a lot of MSR
access.


Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.

Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)


There is nothing secret in the host PMU, and it's easy to clear out the
counters before passing them off to the guest.


We can do this in a reasonable way today, if we allow to take the PMU away
from the host, and only let guests access it when it's in use. [...]


You get my sure-fire NAK for that kind of crap though. Interfering with the
host PMU and stealing it, is not a technical approach that has acceptable
quality.


Having an allocation scheme and sharing it with the host, is a perfectly
legitimate and very clean way to do it. Once it's given to the guest,
the host knows not to touch it until it's been released again.


You need to integrate it properly so that host PMU functionality still works
fine. (Within hardware constraints)


Well with the hardware currently available, there is no such thing as
clean sharing between the host and the guest. It cannot be done without
messing up the host measurements, which effectively renders measuring at
the host side useless while a guest is allowed access to the PMU.

Jes


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> On 02/26/2010 02:07 PM, Ingo Molnar wrote:
> >* Avi Kivity  wrote:
> >
> >>A native API to the host will lock out 100% of the install base now, and a
> >>large section of any future install base.
> >... which is why i suggested the soft-PMU approach.
> 
> Not sure I understand it completely.
> 
> Do you mean to take the model specific host pmu events, and expose them to 
> the guest via trap'n'emulate?  In that case we may as well assign the host 
> pmu to the guest if the host isn't using it, and avoid the traps.

You are making the incorrect assumption that the emulated PMU uses up all host 
PMU resources ...

> Do you mean to choose some older pmu and emulate it using whatever pmu model 
> the host has?  I haven't checked, but aren't there mutually exclusive events 
> in every model pair?  The closest thing would be the architectural pmu 
> thing.

Yes, something like Core2 with 2 generic events.

That would leave 2 extra generic events on Nehalem and better. (which is 
really the target CPU type for any new feature we are talking about right now. 
Plus performance analysis tends to skew towards more modern CPU types as 
well.)

Plus the emulation can be smart about it and only use up a given number. Most 
guest OSs dont use the full PMU - they use a single counter.

Ideally for Linux<->Linux there would be a PMU paravirt driver that allocates 
events on an as-needed basis.
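
Nothing like that driver exists; purely as an illustration of what the guest
side of such an 'allocate events as needed' interface could look like, with a
made-up hypercall number and ABI:

#include <asm/kvm_para.h>		/* kvm_hypercall2() */

#define KVM_HC_PERF_OPEN	42	/* made-up hypercall number */

/*
 * Hypothetical guest stub: hand the host a guest-physical address of a
 * perf_event_attr and get back a host-side event handle, instead of the
 * guest programming PMU MSRs itself.
 */
static long pv_perf_open(unsigned long attr_gpa, unsigned long flags)
{
	return kvm_hypercall2(KVM_HC_PERF_OPEN, attr_gpa, flags);
}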

> Or do you mean to define a new, kvm-specific pmu model and feed it off the 
> host pmu?  In this case all the guests will need to be taught about it, 
> which raises the compatibility problem.
>
> > And note that _any_ solution we offer locks out 100% of the installed base 
> > right now, as no solution is in the kernel yet. The only question is what 
> > kind of upgrade effort is needed for users to make use of the feature.
> 
> I meant the guest installed base.  Hosts can be upgraded transparently to 
> the guests (not even a shutdown/reboot).

The irony: this time guest-transparent solutions that need no configuration 
are good? ;-)

The very same argument holds for the file server thing: a guest transparent 
solution is easier wrt. the upgrade path.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 02:07 PM, Ingo Molnar wrote:

* Avi Kivity  wrote:

   

A native API to the host will lock out 100% of the install base now, and a
large section of any future install base.
 

... which is why i suggested the soft-PMU approach.
   


Not sure I understand it completely.

Do you mean to take the model specific host pmu events, and expose them 
to the guest via trap'n'emulate?  In that case we may as well assign the 
host pmu to the guest if the host isn't using it, and avoid the traps.


Do you mean to choose some older pmu and emulate it using whatever pmu 
model the host has?  I haven't checked, but aren't there mutually 
exclusive events in every model pair?  The closest thing would be the 
architectural pmu thing.


Or do you mean to define a new, kvm-specific pmu model and feed it off 
the host pmu?  In this case all the guests will need to be taught about 
it, which raises the compatibility problem.



And note that _any_ solution we offer locks out 100% of the installed base
right now, as no solution is in the kernel yet. The only question is what kind
of upgrade effort is needed for users to make use of the feature.
   


I meant the guest installed base.  Hosts can be upgraded transparently 
to the guests (not even a shutdown/reboot).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> A native API to the host will lock out 100% of the install base now, and a 
> large section of any future install base.

... which is why i suggested the soft-PMU approach.

And note that _any_ solution we offer locks out 100% of the installed base 
right now, as no solution is in the kernel yet. The only question is what kind 
of upgrade effort is needed for users to make use of the feature.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 01:42 PM, Ingo Molnar wrote:

* Jes Sorensen  wrote:

   

On 02/26/10 11:44, Ingo Molnar wrote:
 

Direct access to counters is not something that is a big issue. [ Given that i
sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
is the biggest of performance challenges right now ;-) ]

By far the biggest instrumentation issue is:

  - availability
  - usability
  - flexibility

Exposing the raw hw is a step backwards in many regards. The same way we dont
want to expose chipsets to the guest to allow them to do RAS. The same way we
dont want to expose most raw PCI devices to guest in general, but have all
these virt driver abstractions.
   

I have to say I disagree on that. When you run perfmon on a system, it is
normally to measure a specific application. You want to see accurate numbers
for cache misses, mul instructions or whatever else is selected.
 

You can still get those. You can even enable RDPMC access and avoid VM exits.

What you _cannot_ do is to 'steal' the PMU and just give it to the guest.
   


Agreed - if both the host and guest want the pmu, the host wins.  This 
is what we do with debug registers - if both the host and guest contend 
for them, the host wins.



Emulating the PMU rather than using the real one, makes the numbers far less
useful. The most useful way to provide PMU support in a guest is to expose
the real PMU and let the guest OS program it.
 

Firstly, an emulated PMU was only the second-tier option i suggested. By far
the best approach is native API to the host regarding performance events and
good guest side integration.
   


A native API to the host will lock out 100% of the install base now, and 
a large section of any future install base.



Secondly, the PMU cannot be 'given' to the guest in the general case. Those
are privileged registers. They can expose sensitive host execution details,
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses
anyway for a secure solution. (RDPMC can still be supported, but in close
cooperation with the host)
   


No, stop and restart the counters on every exit/entry, so the guest 
doesn't observe any host data.



We can do this in a reasonable way today, if we allow to take the PMU away
from the host, and only let guests access it when it's in use. [...]
 

You get my sure-fire NAK for that kind of crap though. Interfering with the
host PMU and stealing it, is not a technical approach that has acceptable
quality.

   


It would be the other way round - the host would steal the pmu from the 
guest.  Later we can try to time-slice and extrapolate, though that's 
not going to be easy.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 01:26 PM, Ingo Molnar wrote:



By far the biggest instrumentation issue is:

  - availability
  - usability
  - flexibility

Exposing the raw hw is a step backwards in many regards.
   

In a way, virtualization as a whole is a step backwards.  We take the nice
filesystem/timer/network/scheduler APIs, and expose them as raw hardware.
The pmu isn't any different.
 

Uhm, it's obviously very different. A fake NE2000 will work on both Intel and
AMD CPUs. Same for a fake PIT. PMU drivers are fundamentally per CPU vendor
though.

So there's no "generic hardware" to emulate.
   


That's true, and it reduces the usability of the feature (you have to 
restrict your migration pools or not expose the pmu), but the general 
points still stand.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Jes Sorensen  wrote:

> On 02/26/10 11:44, Ingo Molnar wrote:
> >Direct access to counters is not something that is a big issue. [ Given that 
> >i
> >sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
> >is the biggest of performance challenges right now ;-) ]
> >
> >By far the biggest instrumentation issue is:
> >
> >  - availability
> >  - usability
> >  - flexibility
> >
> >Exposing the raw hw is a step backwards in many regards. The same way we dont
> >want to expose chipsets to the guest to allow them to do RAS. The same way we
> >dont want to expose most raw PCI devices to guest in general, but have all
> >these virt driver abstractions.
> 
> I have to say I disagree on that. When you run perfmon on a system, it is 
> normally to measure a specific application. You want to see accurate numbers 
> for cache misses, mul instructions or whatever else is selected.

You can still get those. You can even enable RDPMC access and avoid VM exits.
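
On VMX that boils down to one execution-control bit: with 'RDPMC exiting'
cleared, the guest reads counters without a VM exit while the control MSRs
keep trapping. A sketch (the helper is hypothetical; the control bit comes
from the VMX definitions in asm/vmx.h):

/* somewhere in arch/x86/kvm/vmx.c - hypothetical helper */
static void vmx_allow_guest_rdpmc(void)
{
	u32 exec_ctl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);

	exec_ctl &= ~CPU_BASED_RDPMC_EXITING;	/* RDPMC no longer causes an exit */
	vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, exec_ctl);
}

(The guest still has to set CR4.PCE itself before its user space can use
RDPMC.)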

What you _cannot_ do is to 'steal' the PMU and just give it to the guest.

> Emulating the PMU rather than using the real one, makes the numbers far less 
> useful. The most useful way to provide PMU support in a guest is to expose 
> the real PMU and let the guest OS program it.

Firstly, an emulated PMU was only the second-tier option i suggested. By far 
the best approach is native API to the host regarding performance events and 
good guest side integration.

Secondly, the PMU cannot be 'given' to the guest in the general case. Those 
are privileged registers. They can expose sensitive host execution details, 
etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses 
anyway for a secure solution. (RDPMC can still be supported, but in close 
cooperation with the host)

> We can do this in a reasonable way today, if we allow to take the PMU away 
> from the host, and only let guests access it when it's in use. [...]

You get my sure-fire NAK for that kind of crap though. Interfering with the 
host PMU and stealing it, is not a technical approach that has acceptable 
quality.

You need to integrate it properly so that host PMU functionality still works 
fine. (Within hardware constraints)

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> On 02/26/2010 12:44 PM, Ingo Molnar wrote:
> >>>Far cleaner would be to expose it via hypercalls to guest OSs that are
> >>>interested in instrumentation.
> >>It's also slower - you can give the guest direct access to the various
> >>counters so no exits are taken when reading the counters (though perhaps
> >>many tools are only interested in the interrupts, not the counter values).
> >Direct access to counters is not something that is a big issue. [ Given that 
> >i
> >sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
> >is the biggest of performance challenges right now ;-) ]
> 
> Outside 4-bit vga mode, this shouldn't happen.  Can you describe
> your scenario?
> 
> >By far the biggest instrumentation issue is:
> >
> >  - availability
> >  - usability
> >  - flexibility
> >
> >Exposing the raw hw is a step backwards in many regards.
> 
> In a way, virtualization as a whole is a step backwards.  We take the nice 
> filesystem/timer/network/scheduler APIs, and expose them as raw hardware.  
> The pmu isn't any different.

Uhm, it's obviously very different. A fake NE2000 will work on both Intel and 
AMD CPUs. Same for a fake PIT. PMU drivers are fundamentally per CPU vendor 
though.

So there's no "generic hardware" to emulate.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 12:24, Ingo Molnar wrote:

There is a way to query the CPU for 'architectural perfmon' though, via CPUID
alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut. The logic
is:

	if (c->cpuid_level > 9) {
		unsigned eax = cpuid_eax(10);
		/* Check for version and the number of counters */
		if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
			set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
	}

But emulating that doesnt solve the problem: as OSs generally dont key their
PMU drivers off the relatively new 'architectural perfmon' CPUID detail, but
based on much higher level CPUID attributes. (like Intel/AMD)


Right, there is far more to it than just the arch-perfmon feature. They
still need to query cpuid 0x0a for counter size, number of counters and
stuff like that.

Cheers,
Jes


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Jes Sorensen  wrote:

> On 02/26/10 12:06, Joerg Roedel wrote:
>
> > Isn't there a cpuid bit indicating the availability of architectural 
> > perfmon?
> 
> Nope, the perfmon flag is a fake Linux flag, set based on the contents on 
> cpuid 0x0a

There is a way to query the CPU for 'architectural perfmon' though, via CPUID 
alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut. The logic 
is:

if (c->cpuid_level > 9) {
unsigned eax = cpuid_eax(10);
/* Check for version and the number of counters */
if ((eax & 0xff) && (((eax>>8) & 0xff) > 1))
set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
}

But emulating that doesnt solve the problem: as OSs generally dont key their 
PMU drivers off the relatively new 'architectural perfmon' CPUID detail, but 
based on much higher level CPUID attributes. (like Intel/AMD)

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 11:44, Ingo Molnar wrote:

Direct access to counters is not something that is a big issue. [ Given that i
sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
is the biggest of performance challenges right now ;-) ]

By far the biggest instrumentation issue is:

  - availability
  - usability
  - flexibility

Exposing the raw hw is a step backwards in many regards. The same way we dont
want to expose chipsets to the guest to allow them to do RAS. The same way we
dont want to expose most raw PCI devices to guest in general, but have all
these virt driver abstractions.


I have to say I disagree on that. When you run perfmon on a system, it
is normally to measure a specific application. You want to see accurate
numbers for cache misses, mul instructions or whatever else is selected.
Emulating the PMU rather than using the real one, makes the numbers far
less useful. The most useful way to provide PMU support in a guest is
to expose the real PMU and let the guest OS program it.

We can do this in a reasonable way today, if we allow to take the PMU
away from the host, and only let guests access it when it's in use.
Hopefully Intel and AMD will come up with proper hw PMU virtualization
support that allows us to do it 100% for both guest and host at some point.

Cheers,
Jes



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Joerg Roedel  wrote:

> On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote:
> > 
> > * Joerg Roedel  wrote:
> > 
> > > On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
> > > > On 02/26/2010 10:42 AM, Ingo Molnar wrote:
> > > >> Note that the 'soft PMU' still sucks from a design POV as there's no 
> > > >> generic
> > > >> hw interface to the PMU. So there would have to be a 'soft AMD' and a 
> > > >> 'soft
> > > >> Intel' PMU driver at minimum.
> > > >>
> > > >
> > > > Right, this will severely limit migration domains to hosts of the same  
> > > > vendor and processor generation.  There is a  middle ground, though,  
> > > > Intel has recently moved to define an "architectural pmu" which is not  
> > > > model specific.  I don't know if AMD adopted it.  We could offer both  
> > > > options - native host capabilities, with a loss of compatibility, and  
> > > > the architectural pmu, with loss of model specific counters.
> > > 
> > > I only had a quick look yet on the architectural pmu from intel but it 
> > > looks 
> > > like it can be emulated for a guest on amd using existing features.
> > 
> > AMD CPUs dont have enough events for that, they cannot do the 3 fixed 
> > events 
> > in addition to the 2 generic ones.
> 
> Good point. Maybe we can emulate that with some counter round-robin
> usage if the guest really uses all 5 counters.
> 
> > Nor do you really want to standardize on KVM guests on returning 
> > 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel 
> > PMU 
> > drivers, right?
> 
> Isn't there a cpuid bit indicating the availability of architectural 
> perfmon?

there is, but can you rely on all guest OSs keying off their PMU drivers based 
purely on the CPUID bit and not on any other CPUID aspects?

Guest OSs like ... Linux v2.6.33:

void __init init_hw_perf_events(void)
{
int err;

pr_info("Performance Events: ");

switch (boot_cpu_data.x86_vendor) {
case X86_VENDOR_INTEL:
err = intel_pmu_init();
break;
case X86_VENDOR_AMD:
err = amd_pmu_init();
break;
default:

Really, if you want to emulate a single Intel PMU driver model you need to 
pretend that you are an Intel CPU, throughout. This cannot be had both ways.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/26/10 12:06, Joerg Roedel wrote:

Isn't there a cpuid bit indicating the availability of architectural
perfmon?


Nope, the perfmon flag is a fake Linux flag, set based on the contents
on cpuid 0x0a

Jes



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 12:44 PM, Ingo Molnar wrote:

Far cleaner would be to expose it via hypercalls to guest OSs that are
interested in instrumentation.
   

It's also slower - you can give the guest direct access to the various
counters so no exits are taken when reading the counters (though perhaps
many tools are only interested in the interrupts, not the counter values).
 

Direct access to counters is not something that is a big issue. [ Given that i
sometimes can see KVM redraw the screen of a guest OS real-time i doubt this
is the biggest of performance challenges right now ;-) ]
   


Outside 4-bit vga mode, this shouldn't happen.  Can you describe your 
scenario?



By far the biggest instrumentation issue is:

  - availability
  - usability
  - flexibility

Exposing the raw hw is a step backwards in many regards.


In a way, virtualization as a whole is a step backwards.  We take the 
nice filesystem/timer/network/scheduler APIs, and expose them as raw 
hardware.  The pmu isn't any different.




The same way we dont
want to expose chipsets to the guest to allow them to do RAS. The same way we
dont want to expose most raw PCI devices to guest in general, but have all
these virt driver abstractions.
   


Whenever we have a choice, we expose raw hardware (usually emulated, but 
in some cases real).  Raw hardware has the huge advantage of being 
already supported.  Write a software abstraction, and you get to (a) 
write and maintain the spec (b) write drivers for all guests (c) mumble 
something to users of OSes to which you haven't ported your driver (d) 
explain to users that they need to install those drivers.


For networking and block, it is simply impossible to obtain good 
performance without introducing a new interface, but for other stuff, 
that may not be the case.



That way it could also transparently integrate with tracing, probes, etc.
It would also be wiser to first concentrate on improving Linux<->Linux
guest/host combos before gutting the design just to fit Windows into the
picture ...
   

"gutting the design"?
 

Yes, gutting the design of a sane instrumentation API and moving it back 10-20
years by squeezing it through non-standardized and incompatible PMU drivers.
   


Any new interface will be incompatible with all the existing guests out 
there; and unlike networking, you can't retrofit a pmu interface to an 
existing guest.



When it comes to design my main interest is the Linux<->Linux combo.
   


My main interest is the OSes that users actually install, and those are 
Windows and non-bleeding-edge Linux.


Look at guests as you do at userspace: you don't want to inflict changes 
upon them.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote:
> 
> * Joerg Roedel  wrote:
> 
> > On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
> > > On 02/26/2010 10:42 AM, Ingo Molnar wrote:
> > >> Note that the 'soft PMU' still sucks from a design POV as there's no 
> > >> generic
> > >> hw interface to the PMU. So there would have to be a 'soft AMD' and a 
> > >> 'soft
> > >> Intel' PMU driver at minimum.
> > >>
> > >
> > > Right, this will severely limit migration domains to hosts of the same  
> > > vendor and processor generation.  There is a  middle ground, though,  
> > > Intel has recently moved to define an "architectural pmu" which is not  
> > > model specific.  I don't know if AMD adopted it.  We could offer both  
> > > options - native host capabilities, with a loss of compatibility, and  
> > > the architectural pmu, with loss of model specific counters.
> > 
> > I only had a quick look yet on the architectural pmu from intel but it 
> > looks 
> > like it can be emulated for a guest on amd using existing features.
> 
> AMD CPUs dont have enough events for that, they cannot do the 3 fixed events 
> in addition to the 2 generic ones.

Good point. Maybe we can emulate that with some counter round-robin
usage if the guest really uses all 5 counters.

> Nor do you really want to standardize on KVM guests on returning 
> 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU 
> drivers, right?

Isn't there a cpuid bit indicating the availability of architectural
perfmon?

Joerg



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/25/10 18:34, Joerg Roedel wrote:

The biggest problem I see here is teaching the guest about the available
events. The available event sets are dependent on the processor family
(at least on AMD).
A simple approach would be shadowing the perf msrs which is a simple
thing to do. More problematic is the reinjection of performance
interrupts and performance nmis.


IMHO the only real solution here is to map it to the host CPU, and
require -cpu host for PMU support. There is no point in trying to
emulate PMU features which we don't have in the hardware. Ie. you cannot
count cache misses if the hardware doesn't support it.

Cheers,
Jes



Re: KVM PMU virtualization

2010-02-26 Thread Jes Sorensen

On 02/25/10 17:26, Ingo Molnar wrote:

Given that perf can apply the PMU to individual host tasks, I don't see
fundamental problems multiplexing it between individual guests (which can
then internally multiplex it again).


In terms of how to expose it to guests, a 'soft PMU' might be a usable
approach. Although to Linux guests you could expose much more functionality
and a non-PMU-limited number of instrumentation events, via a more
intelligent interface.

But note that in terms of handling it on the host side the PMU approach is not
acceptable: instead it should map to proper perf_events, not try to muck with
the PMU itself.


I am not keen on emulating the PMU, if we do that we end up having to
emulate a large number of MSR accesses, which is really costly. It makes
a lot more sense to give the guest direct access to the PMU. The problem
here is how to manage it without too much overhead.

Cheers,
Jes



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Joerg Roedel  wrote:

> On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote:
> > My suggestion, as always, would be to start very simple and very minimal:
> > 
> > Enable 'perf kvm top' to show guest overhead. Use the exact same kernel 
> > image 
> > both as a host and as guest (for testing), to not have to deal with the 
> > symbol 
> > space transport problem initially. Enable 'perf kvm record' to only record 
> > guest events by default. Etc.
> > 
> > This alone will be a quite useful result already - and gives a basis for 
> > further work. No need to spend months to do the big grand design straight 
> > away, all of this can be done gradually and in the order of usefulness - 
> > and 
> > you'll always have something that actually works (and helps your other KVM 
> > projects) along the way.
> > 
> > [ And, as so often, once you walk that path, that grand scheme you are 
> >   thinking about right now might easily become last year's really bad idea 
> > ;-) ]
> > 
> > So please start walking the path and experience the challenges first-hand.
> 
> That sounds like a good approach for the 'measure-guest-from-host'
> problem. It is also not very hard to implement. Where does perf fetch
> the rip of the nmi from, stack only or is this configurable?

The host semantics are that it takes the stack from the regs, and with 
call-graph recording (perf record -g) it will walk down the exception stack, 
irq stack, kernel stack, and user-space stack as well. (up to the point the 
pages are present - it stops on a non-present page. An app that is being 
profiled has its stack present so it's not an issue in practice.)

I'd suggest to leave out call graph sampling initially, and just get 'perf kvm 
top' to work with guest RIPs, simply sampled from the VM exit state.

See arch/x86/kernel/cpu/perf_event.c:

static void
perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
{
callchain_store(entry, PERF_CONTEXT_KERNEL);
callchain_store(entry, regs->ip);

dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
}

If you have easy access to the VM state from NMI context right there then just 
hack in the guest RIP and you should have some prototype that samples the 
guest. (assuming you use the same kernel image for both the host and the guest)

This would be the easiest way to prototype it all.
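
A rough version of that hack, with invented per-cpu variables standing in for
whatever KVM would export (set around vmentry/vmexit), could be:

/* arch/x86/kernel/cpu/perf_event.c - prototype only */
DECLARE_PER_CPU(int, perf_in_guest);	/* hypothetical, maintained by KVM */
DECLARE_PER_CPU(u64, perf_guest_rip);	/* guest RIP saved at VM exit */

static void
perf_callchain_kernel(struct pt_regs *regs, struct perf_callchain_entry *entry)
{
	callchain_store(entry, PERF_CONTEXT_KERNEL);

	if (percpu_read(perf_in_guest))
		/* report the guest RIP instead of the host NMI RIP */
		callchain_store(entry, percpu_read(perf_guest_rip));
	else
		callchain_store(entry, regs->ip);

	dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
}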

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 12:46 PM, Ingo Molnar wrote:



Right, this will severely limit migration domains to hosts of the same
vendor and processor generation.  There is a  middle ground, though,
Intel has recently moved to define an "architectural pmu" which is not
model specific.  I don't know if AMD adopted it.  We could offer both
options - native host capabilities, with a loss of compatibility, and
the architectural pmu, with loss of model specific counters.
   

I only had a quick look yet on the architectural pmu from intel but it looks
like it can be emulated for a guest on amd using existing features.
 

AMD CPUs dont have enough events for that, they cannot do the 3 fixed events
in addition to the 2 generic ones.

Nor do you really want to standardize on KVM guests on returning
'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU
drivers, right?

   


No - that would only work if AMD also adopted the architectural pmu.

Note virtualization clusters are typically split into 'migration pools' 
consisting of hosts with similar processor features, so that you can 
expose those features and yet live migrate guests at will.  It's likely 
that all hosts have the same pmu anyway, so the only downside is that we 
now have to expose the host's processor family and model.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Joerg Roedel  wrote:

> On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
> > On 02/26/2010 10:42 AM, Ingo Molnar wrote:
> >> Note that the 'soft PMU' still sucks from a design POV as there's no 
> >> generic
> >> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
> >> Intel' PMU driver at minimum.
> >>
> >
> > Right, this will severely limit migration domains to hosts of the same  
> > vendor and processor generation.  There is a  middle ground, though,  
> > Intel has recently moved to define an "architectural pmu" which is not  
> > model specific.  I don't know if AMD adopted it.  We could offer both  
> > options - native host capabilities, with a loss of compatibility, and  
> > the architectural pmu, with loss of model specific counters.
> 
> I only had a quick look yet on the architectural pmu from intel but it looks 
> like it can be emulated for a guest on amd using existing features.

AMD CPUs dont have enough events for that, they cannot do the 3 fixed events 
in addition to the 2 generic ones.

Nor do you really want to standardize on KVM guests on returning 
'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU 
drivers, right?

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Avi Kivity  wrote:

> Right, this will severely limit migration domains to hosts of the same 
> vendor and processor generation.  There is a middle ground, though, Intel 
> has recently moved to define an "architectural pmu" which is not model 
> specific.  I don't know if AMD adopted it. [...]

Nope. It's "architectural" the following way: Intel wont change it with future 
CPU models, outside of the definitions of the hw-ABI. PMUs were model specific 
prior to that time.

I'd say there's near zero chance the MSR spaces will unify. All the 'advanced' 
PMU features are wildly incompatible, and the gap is increasing not 
decreasing.

> > Far cleaner would be to expose it via hypercalls to guest OSs that are 
> > interested in instrumentation.
> 
> It's also slower - you can give the guest direct access to the various 
> counters so no exits are taken when reading the counters (though perhaps 
> many tools are only interested in the interrupts, not the counter values).

Direct access to counters is not something that is a big issue. [ Given that i 
sometimes can see KVM redraw the screen of a guest OS real-time i doubt this 
is the biggest of performance challenges right now ;-) ]

By far the biggest instrumentation issue is:

 - availability
 - usability
 - flexibility

Exposing the raw hw is a step backwards in many regards. The same way we dont 
want to expose chipsets to the guest to allow them to do RAS. The same way we 
dont want to expose most raw PCI devices to guest in general, but have all 
these virt driver abstractions.

> > That way it could also transparently integrate with tracing, probes, etc. 
> > It would also be wiser to first concentrate on improving Linux<->Linux 
> > guest/host combos before gutting the design just to fit Windows into the 
> > picture ...
> 
> "gutting the design"?

Yes, gutting the design of a sane instrumentation API and moving it back 10-20 
years by squeezing it through non-standardized and incompatible PMU drivers.

When it comes to design my main interest is the Linux<->Linux combo.

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote:
> My suggestion, as always, would be to start very simple and very minimal:
> 
> Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image 
> both as a host and as guest (for testing), to not have to deal with the 
> symbol 
> space transport problem initially. Enable 'perf kvm record' to only record 
> guest events by default. Etc.
> 
> This alone will be a quite useful result already - and gives a basis for 
> further work. No need to spend months to do the big grand design straight 
> away, all of this can be done gradually and in the order of usefulness - and 
> you'll always have something that actually works (and helps your other KVM 
> projects) along the way.
> 
> [ And, as so often, once you walk that path, that grand scheme you are 
>   thinking about right now might easily become last year's really bad idea 
> ;-) ]
> 
> So please start walking the path and experience the challenges first-hand.

That sounds like a good approach for the 'measure-guest-from-host'
problem. It is also not very hard to implement. Where does perf fetch
the rip of the nmi from, stack only or is this configurable?

Joerg



Re: KVM PMU virtualization

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote:
> On 02/26/2010 10:42 AM, Ingo Molnar wrote:
>> Note that the 'soft PMU' still sucks from a design POV as there's no generic
>> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
>> Intel' PMU driver at minimum.
>>
>
> Right, this will severely limit migration domains to hosts of the same  
> vendor and processor generation.  There is a  middle ground, though,  
> Intel has recently moved to define an "architectural pmu" which is not  
> model specific.  I don't know if AMD adopted it.  We could offer both  
> options - native host capabilities, with a loss of compatibility, and  
> the architectural pmu, with loss of model specific counters.

I have only had a quick look at the architectural PMU from Intel so far, but
it looks like it can be emulated for a guest on AMD using existing
features.

Joerg



Re: KVM PMU virtualization

2010-02-26 Thread Avi Kivity

On 02/26/2010 10:42 AM, Ingo Molnar wrote:
> * Joerg Roedel  wrote:
>
> > I personally don't like a self-defined event-set as the only solution
> > because that would probably only work with linux and perf. [...]
>
> The 'soft-PMU' i suggested is transparent on the guest side - if you want to
> enable non-Linux and legacy-Linux.
>
> It's basically a PMU interface provided to the guest by catching the right MSR
> accesses, implemented via perf_event_create_kernel_counter()/etc. on the host
> side.

That only works if the software interface is 100% lossless - we can
recreate every single hardware configuration through the API.  Is this
the case?

> Note that the 'soft PMU' still sucks from a design POV as there's no generic
> hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft
> Intel' PMU driver at minimum.

Right, this will severely limit migration domains to hosts of the same
vendor and processor generation.  There is a middle ground, though,
Intel has recently moved to define an "architectural pmu" which is not
model specific.  I don't know if AMD adopted it.  We could offer both
options - native host capabilities, with a loss of compatibility, and
the architectural pmu, with loss of model specific counters.

> Far cleaner would be to expose it via hypercalls to guest OSs that are
> interested in instrumentation.

It's also slower - you can give the guest direct access to the various
counters so no exits are taken when reading the counters (though perhaps
many tools are only interested in the interrupts, not the counter values).

> That way it could also transparently integrate with tracing, probes, etc.
> It would also be wiser to first concentrate on improving Linux<->Linux
> guest/host combos before gutting the design just to fit Windows into the
> picture ...

"gutting the design"?

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Joerg Roedel  wrote:

> On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> > On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> > > 
> > > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > >host.
> > > 
> > > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> > > configured to count only when in guest mode. Perf needs to be aware of
> > > that and fetch the rip from a different place when monitoring a guest.
> 
> > The idea is we want to measure both host and guest at the same time, and
> > compare all the hot functions fairly.
> 
> So you want to measure while the guest vcpu is running and the vmexit
> path of that vcpu (including qemu userspace part) together? The
> challenge here is to find out if a performance event originated in guest
> mode or in host mode.
> But we can check for that in the nmi-protected part of the vmexit path.

As far as instrumentation goes, virtualization is simply another 'PID 
dimension' of measurement.

Today we can isolate system performance measurements/events to the following 
domains:

 - per system
 - per cpu
 - per task

( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user' 
  domain separation, and we have some ABI details for all that but it's by no 
  means complete. Anton is using the PowerPC bits AFAIK, so it already works 
  to a certain degree. )

When extending measurements to KVM, we want two things:

 - user friendliness: instead of having to check 'ps' and figure out which 
   Qemu thread is the KVM thread we want to profile, just give a convenient
   namespace to access guest profiling info. -G ought to map to the first
   currently running KVM guest it can find (which would match something like
   90% of the cases), etc. No ifs and buts. If 'perf kvm top' doesn't show
   something useful by default, the whole effort is for naught.

 - Extend core facilities and enable the following measurement dimensions:

 host-kernel-space
 host-user-space
 guest-kernel-space
 guest-user-space

   on a per guest basis. We want to be able to measure just what the guest 
   does, and we want to be able to measure just what the host does.

   Some of this the hardware helps us with (say only measuring host kernel 
   events is possible), some has to be done by fiddling with event 
   enable/disable at vm-exit / vm-entry time.
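(A hedged sketch of that enable/disable fiddling, assuming the in-kernel
perf_event_enable()/perf_event_disable() helpers can be called from the KVM
entry/exit path for an event the host created; the wrapper names are invented
for illustration:)

#include <linux/perf_event.h>

/*
 * Illustration only: pause a host-only counter while the guest runs and
 * resume it on vmexit; a guest-only counter would be toggled the other
 * way around across the same window.
 */
static void host_only_counter_vmentry(struct perf_event *event)
{
	perf_event_disable(event);	/* stop counting before entering the guest */
}

static void host_only_counter_vmexit(struct perf_event *event)
{
	perf_event_enable(event);	/* resume counting after the vmexit */
}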

My suggestion, as always, would be to start very simple and very minimal:

Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image 
both as a host and as guest (for testing), to not have to deal with the symbol 
space transport problem initially. Enable 'perf kvm record' to only record 
guest events by default. Etc.

This alone will be a quite useful result already - and gives a basis for 
further work. No need to spend months to do the big grand design straight 
away, all of this can be done gradually and in the order of usefulness - and 
you'll always have something that actually works (and helps your other KVM 
projects) along the way.

[ And, as so often, once you walk that path, that grand scheme you are 
  thinking about right now might easily become last year's really bad idea ;-) ]

So please start walking the path and experience the challenges first-hand.

Thanks,

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Joerg Roedel
On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote:
> On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> > On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> > 
> > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > >host.
> > 
> > This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> > configured to count only when in guest mode. Perf needs to be aware of
> > that and fetch the rip from a different place when monitoring a guest.

> The idea is we want to measure both host and guest at the same time, and
> compare all the hot functions fairly.

So you want to measure the period while the guest vcpu is running and the
vmexit path of that vcpu (including the qemu userspace part) together? The
challenge here is to find out whether a performance event originated in
guest mode or in host mode.
But we can check for that in the NMI-protected part of the vmexit path.
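(For reference, the Fam10 facility mentioned above is a pair of host/guest
mode bits in each PERF_CTL/EVNTSEL MSR. A hedged sketch of restricting an
event selector to guest-mode counting - the constant names are invented here,
the bit positions are those documented for AMD family 10h:)

#include <stdint.h>

#define EVNTSEL_GUESTONLY	(1ULL << 40)	/* count only while in guest mode */
#define EVNTSEL_HOSTONLY	(1ULL << 41)	/* count only while in host mode  */

/* Make an event select value count guest-mode activity only. */
static inline uint64_t evntsel_guest_only(uint64_t evntsel)
{
	evntsel &= ~EVNTSEL_HOSTONLY;
	return evntsel | EVNTSEL_GUESTONLY;
}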

Joerg



Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Zhang, Yanmin  wrote:

> On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote:
> > * Jan Kiszka  wrote:
> > 
> > > Jes Sorensen wrote:
> > > > Hi,
> > > > 
> > > > It looks like several of us have been looking at how to use the PMU
> > > > for virtualization. Rather than continuing to have discussions in
> > > > smaller groups, I think it is a good idea we move it to the mailing
> > > > lists to see what we can share and avoid duplicate efforts.
> > > > 
> > > > There are really two separate things to handle:
> > > > 
> > > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > > >host.
> > > > 
> > > > 2) Allow guests access to the PMU (or an emulated PMU), making it
> > > >possible to run perf on applications running within the guest.
> > > > 
> > > > I know some of you have been looking at 1) and I am currently working
> > > > on 2). I have been looking at various approaches, including whether it
> > > > is feasible to share the PMU between the host and multiple guests. For
> > > > now I am going to focus on allowing one guest to take control of the
> > > > PMU, then later hopefully adding support for multiplexing it between
> > > > multiple guests.
> > > 
> > > Given that perf can apply the PMU to individual host tasks, I don't see 
> > > fundamental problems multiplexing it between individual guests (which can 
> > > then internally multiplex it again).
> > 
> > In terms of how to expose it to guests, a 'soft PMU' might be a usable 
> > approach. Although to Linux guests you could expose much more 
> > functionality and an non-PMU-limited number of instrumentation events, via 
> > a more intelligent interface.
> > 
> > But note that in terms of handling it on the host side the PMU approach is 
> > not acceptable: instead it should map to proper perf_events, not try to 
> > muck with the PMU itself.
> 
> 
> > That, besides integrating properly with perf usage on the host, will also 
> > allow interesting 'PMU' features on guests: you could set up the host side 
> > to trace block IO requests (or VM exits) for example, and expose that as 
> > 'PMC
> > #0' on the guest side.
>
> So virtualization becomes non-transparent to guest os? I know virtio is an 
> optimization on guest side.

The 'soft PMU' is transparent. The 'count IO events' kind of feature could be 
transparent too: you could re-configure (on the host) a given 'hardware' event 
to really count some software event.

That would make it compatible with whatever guest side tooling (without having 
to change that tooling) - while still allowing interesting new things to be 
measured.

Thanks,

Ingo


Re: KVM PMU virtualization

2010-02-26 Thread Ingo Molnar

* Joerg Roedel  wrote:

> I personally don't like a self-defined event-set as the only solution 
> because that would probably only work with linux and perf. [...]

The 'soft-PMU' i suggested is transparent on the guest side - if you want to 
enable non-Linux and legacy-Linux.

It's basically a PMU interface provided to the guest by catching the right MSR 
accesses, implemented via perf_event_create_kernel_counter()/etc. on the host 
side.
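(A rough sketch of that host side, assuming the 2.6.3x-era
perf_event_create_kernel_counter() signature; the translation of the trapped
event-select value into a raw config is simplified and the function name is
invented for illustration:)

#include <linux/perf_event.h>
#include <linux/sched.h>
#include <linux/string.h>

/*
 * Illustration only: when the guest writes a PMC event-select MSR, build a
 * perf_event_attr from it and let the host perf core do the real counting.
 */
static struct perf_event *soft_pmu_program_counter(u64 guest_evntsel)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type	= PERF_TYPE_RAW;
	attr.size	= sizeof(attr);
	attr.config	= guest_evntsel & 0xffff;	/* event + unit mask, simplified */

	/* cpu == -1: follow the task; bind to the vcpu thread (current). */
	return perf_event_create_kernel_counter(&attr, -1, current->pid, NULL);
}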

Note that the 'soft PMU' still sucks from a design POV as there's no generic 
hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft 
Intel' PMU driver at minimum.

Far cleaner would be to expose it via hypercalls to guest OSs that are 
interested in instrumentation. That way it could also transparently integrate 
with tracing, probes, etc. It would also be wiser to first concentrate on 
improving Linux<->Linux guest/host combos before gutting the design just to 
fit Windows into the picture ...

Ingo


Re: KVM PMU virtualization

2010-02-25 Thread Zhang, Yanmin
On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote:
> On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote:
> 
> > 1) Add support to perf to allow it to monitor a KVM guest from the
> >host.
> 
> This shouldn't be a big problem. The PMU of AMD Fam10 processors can be
> configured to count only when in guest mode. Perf needs to be aware of
> that and fetch the rip from a different place when monitoring a guest.
The idea is we want to measure both host and guest at the same time, and
compare all the hot functions fairly.


> 
> > 2) Allow guests access to the PMU (or an emulated PMU), making it
> >possible to run perf on applications running within the guest.
> 
> The biggest problem I see here is teaching the guest about the available
> events. The available event sets are dependent on the processor family
> (at least on AMD).
> A simple approach would be shadowing the perf msrs which is a simple
> thing to do. More problematic is the reinjection of performance
> interrupts and performance nmis.
> 
> I personally don't like a self-defined event-set as the only solution
> because that would probably only work with linux and perf. I think we
> should have a way (additionally to a soft-event interface) which allows
> to expose the host pmu events to the guest.
> 
>   Joerg
> 




Re: KVM PMU virtualization

2010-02-25 Thread Zhang, Yanmin
On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote:
> * Jan Kiszka  wrote:
> 
> > Jes Sorensen wrote:
> > > Hi,
> > > 
> > > It looks like several of us have been looking at how to use the PMU
> > > for virtualization. Rather than continuing to have discussions in
> > > smaller groups, I think it is a good idea we move it to the mailing
> > > lists to see what we can share and avoid duplicate efforts.
> > > 
> > > There are really two separate things to handle:
> > > 
> > > 1) Add support to perf to allow it to monitor a KVM guest from the
> > >host.
> > > 
> > > 2) Allow guests access to the PMU (or an emulated PMU), making it
> > >possible to run perf on applications running within the guest.
> > > 
> > > I know some of you have been looking at 1) and I am currently working
> > > on 2). I have been looking at various approaches, including whether it
> > > is feasible to share the PMU between the host and multiple guests. For
> > > now I am going to focus on allowing one guest to take control of the
> > > PMU, then later hopefully adding support for multiplexing it between
> > > multiple guests.
> > 
> > Given that perf can apply the PMU to individual host tasks, I don't see 
> > fundamental problems multiplexing it between individual guests (which can 
> > then internally multiplex it again).
> 
> In terms of how to expose it to guests, a 'soft PMU' might be a usable 
> approach. Although to Linux guests you could expose much more functionality 
> and an non-PMU-limited number of instrumentation events, via a more 
> intelligent interface.
> 
> But note that in terms of handling it on the host side the PMU approach is 
> not 
> acceptable: instead it should map to proper perf_events, not try to muck with 
> the PMU itself.
> 


> That, besides integrating properly with perf usage on the host, will also 
> allow interesting 'PMU' features on guests: you could set up the host side to 
> trace block IO requests (or VM exits) for example, and expose that as 'PMC
> #0' on the guest side.
So virtualization becomes non-transparent to the guest OS? I know virtio is an
optimization on the guest side.





Re: KVM PMU virtualization

2010-02-25 Thread Ingo Molnar

* Jan Kiszka  wrote:

> Jes Sorensen wrote:
> > Hi,
> > 
> > It looks like several of us have been looking at how to use the PMU
> > for virtualization. Rather than continuing to have discussions in
> > smaller groups, I think it is a good idea we move it to the mailing
> > lists to see what we can share and avoid duplicate efforts.
> > 
> > There are really two separate things to handle:
> > 
> > 1) Add support to perf to allow it to monitor a KVM guest from the
> >host.
> > 
> > 2) Allow guests access to the PMU (or an emulated PMU), making it
> >possible to run perf on applications running within the guest.
> > 
> > I know some of you have been looking at 1) and I am currently working
> > on 2). I have been looking at various approaches, including whether it
> > is feasible to share the PMU between the host and multiple guests. For
> > now I am going to focus on allowing one guest to take control of the
> > PMU, then later hopefully adding support for multiplexing it between
> > multiple guests.
> 
> Given that perf can apply the PMU to individual host tasks, I don't see 
> fundamental problems multiplexing it between individual guests (which can 
> then internally multiplex it again).

In terms of how to expose it to guests, a 'soft PMU' might be a usable 
approach - although to Linux guests you could expose much more functionality, 
and a non-PMU-limited number of instrumentation events, via a more 
intelligent interface.

But note that in terms of handling it on the host side the PMU approach is not 
acceptable: instead it should map to proper perf_events, not try to muck with 
the PMU itself.

That, besides integrating properly with perf usage on the host, will also 
allow interesting 'PMU' features on guests: you could set up the host side to 
trace block IO requests (or VM exits) for example, and expose that as 'PMC
#0' on the guest side.
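(A hedged sketch of what that re-routing amounts to on the host: the only
difference from backing the guest's 'PMC #0' with real hardware is the
perf_event_attr that gets created for it. Here a software event stands in -
context switches as a placeholder; a tracepoint such as kvm:kvm_exit would
work the same way once its id is looked up:)

#include <linux/perf_event.h>
#include <linux/string.h>

/* Illustration only: back the guest-visible 'PMC #0' with a software event. */
static void pmc0_attr_as_sw_event(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->type   = PERF_TYPE_SOFTWARE;
	attr->size   = sizeof(*attr);
	attr->config = PERF_COUNT_SW_CONTEXT_SWITCHES;	/* stand-in for "VM exits" */
}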

That's a neat feature: the guest profiling tools would immediately (and 
transparently) be able to measure VM exits or IO heaviness, on a per guest 
basis, as seen on the host side.

More would be possible too.

Thanks,

Ingo


Re: KVM PMU virtualization

2010-02-25 Thread Jan Kiszka
Jes Sorensen wrote:
> Hi,
> 
> It looks like several of us have been looking at how to use the PMU
> for virtualization. Rather than continuing to have discussions in
> smaller groups, I think it is a good idea we move it to the mailing
> lists to see what we can share and avoid duplicate efforts.
> 
> There are really two separate things to handle:
> 
> 1) Add support to perf to allow it to monitor a KVM guest from the
>host.
> 
> 2) Allow guests access to the PMU (or an emulated PMU), making it
>possible to run perf on applications running within the guest.
> 
> I know some of you have been looking at 1) and I am currently working
> on 2). I have been looking at various approaches, including whether it
> is feasible to share the PMU between the host and multiple guests. For
> now I am going to focus on allowing one guest to take control of the
> PMU, then later hopefully adding support for multiplexing it between
> multiple guests.

Given that perf can apply the PMU to individual host tasks, I don't see
fundamental problems multiplexing it between individual guests (which
can then internally multiplex it again).

Then the next challenge might be how to handle the case of both host and
guest trying to use PMU resources at the same time. For the sparse debug
register resources I simply disable the effect of guest-injected
breakpoints once the host wants to use them. The guest still sees its
programmed values, though. One could try to schedule free registers
between both, but given how rare such use cases are, I decided to go for
a simple approach. The situation is probably not that different for the PMU.

> 
> Eventually we will see proper hardware PMU virtualization from Intel and
> AMD (admittedly I have only looked at the Intel specs so far), and by
> then be able to allow the host as well as the guests to share the PMU.
> 
> If anybody else is working on this, I'd love to hear about it so we can
> coordinate our efforts. The main purpose of this mail was really to
> bring the discussion to the mailing list to avoid duplicated efforts.

I think I've seen quite a bit of code for PMU virtualization in Xen's HVM
code. It might be worth studying what they already do and adopting/extending
it for KVM.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


KVM PMU virtualization

2010-02-25 Thread Jes Sorensen

Hi,

It looks like several of us have been looking at how to use the PMU
for virtualization. Rather than continuing to have discussions in
smaller groups, I think it is a good idea we move it to the mailing
lists to see what we can share and avoid duplicate efforts.

There are really two separate things to handle:

1) Add support to perf to allow it to monitor a KVM guest from the
   host.

2) Allow guests access to the PMU (or an emulated PMU), making it
   possible to run perf on applications running within the guest.

I know some of you have been looking at 1) and I am currently working
on 2). I have been looking at various approaches, including whether it
is feasible to share the PMU between the host and multiple guests. For
now I am going to focus on allowing one guest to take control of the
PMU, then later hopefully adding support for multiplexing it between
multiple guests.

Eventually we will see proper hardware PMU virtualization from Intel and
AMD (admittedly I have only looked at the Intel specs so far), and by
then we will be able to allow the host as well as the guests to share the PMU.

If anybody else is working on this, I'd love to hear about it so we can
coordinate our efforts. The main purpose of this mail was really to bring
the discussion to the mailing list to avoid duplicated efforts.

Cheers,
Jes

PS: I'll be AFK all of next week, so it may take a few days for me to
reply to follow-up discussions.