Re: Add savevm/loadvm support for MCE
Huang Ying wrote:
> MCE registers are saved/loaded into/from CPUState in
> kvm_arch_save/load_regs. Because all MCE registers except for
> MCG_STATUS should be preserved, MCE registers are saved before
> kvm_arch_load_regs in kvm_arch_cpu_reset. To simulate the MCG_STATUS
> clearing upon reset, env->mcg_status is set to 0 after saving.

That should be solved differently on top of [1]: Write back
MSR_MCG_STATUS on KVM_PUT_RESET_STATE, write all MCE MSRs on
KVM_PUT_FULL_STATE. Then you can also unfold kvm_load/save_mce_regs to
avoid duplicating its infrastructure (becomes even more obvious when
looking at kvm_get/put_msrs in upstream).

> Signed-off-by: Huang Ying <ying.hu...@intel.com>
> ---
>  qemu-kvm-x86.c |   54 ++
>  1 file changed, 54 insertions(+)
>
> --- a/qemu-kvm-x86.c
> +++ b/qemu-kvm-x86.c
> @@ -803,6 +803,27 @@ static void get_seg(SegmentCache *lhs, c
>          | (rhs->avl * DESC_AVL_MASK);
>  }
>
> +static void kvm_load_mce_regs(CPUState *env)
> +{
> +#ifdef KVM_CAP_MCE
> +    struct kvm_msr_entry msrs[100];
> +    int rc, n, i;
> +
> +    if (!env->mcg_cap)
> +        return;
> +
> +    n = 0;
> +    set_msr_entry(&msrs[n++], MSR_MCG_STATUS, env->mcg_status);
> +    set_msr_entry(&msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
> +    for (i = 0; i < (env->mcg_cap & 0xff) * 4; i++)
> +        set_msr_entry(&msrs[n++], MSR_MC0_CTL + i, env->mce_banks[i]);
> +
> +    rc = kvm_set_msrs(env, msrs, n);
> +    if (rc == -1)
> +        perror("kvm_set_msrs FAILED");
> +#endif
> +}
> +
>  void kvm_arch_load_regs(CPUState *env)
>  {
>      struct kvm_regs regs;
> @@ -922,6 +943,8 @@ void kvm_arch_load_regs(CPUState *env)
>      if (rc == -1)
>          perror("kvm_set_msrs FAILED");
>
> +    kvm_load_mce_regs(env);
> +
>      /*
>       * Kernels before 2.6.33 (which correlates with !kvm_has_vcpu_events())
>       * overwrote flags.TF injected via SET_GUEST_DEBUG while updating GP regs.
> @@ -991,6 +1014,33 @@ void kvm_arch_load_mpstate(CPUState *env
>  #endif
>  }
>
> +static void kvm_save_mce_regs(CPUState *env)
> +{
> +#ifdef KVM_CAP_MCE
> +    struct kvm_msr_entry msrs[100];
> +    int rc, n, i;
> +
> +    if (!env->mcg_cap)
> +        return;
> +
> +    msrs[0].index = MSR_MCG_STATUS;
> +    msrs[1].index = MSR_MCG_CTL;
> +    n = (env->mcg_cap & 0xff) * 4;
> +    for (i = 0; i < n; i++)
> +        msrs[2 + i].index = MSR_MC0_CTL + i;
> +
> +    rc = kvm_get_msrs(env, msrs, n + 2);
> +    if (rc == -1)
> +        perror("kvm_set_msrs FAILED");
> +    else {
> +        env->mcg_status = msrs[0].data;
> +        env->mcg_ctl = msrs[1].data;
> +        for (i = 0; i < n; i++)
> +            env->mce_banks[i] = msrs[2 + i].data;
> +    }
> +#endif
> +}
> +
>  void kvm_arch_save_regs(CPUState *env)
>  {
>      struct kvm_regs regs;
> @@ -1148,6 +1198,7 @@ void kvm_arch_save_regs(CPUState *env)
>          }
>      }
>      kvm_arch_save_mpstate(env);
> +    kvm_save_mce_regs(env);
>  }
>
>  static void do_cpuid_ent(struct kvm_cpuid_entry2 *e, uint32_t function,
> @@ -1385,6 +1436,9 @@ void kvm_arch_push_nmi(void *opaque)
>  void kvm_arch_cpu_reset(CPUState *env)
>  {
>      kvm_arch_reset_vcpu(env);
> +    /* MCE registers except MCG_STATUS should be unchanged across reset */
> +    kvm_save_mce_regs(env);
> +    env->mcg_status = 0;
>      kvm_arch_load_regs(env);
>      kvm_put_vcpu_events(env);
>      if (!cpu_is_bsp(env)) {

Jan

[1] http://thread.gmane.org/gmane.comp.emulators.kvm.devel/47411

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
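Jan's suggestion above maps, in sketch form, onto a helper that appends the MCE MSRs to the regular MSR list and is keyed on the save level. The sketch below is illustrative only: it borrows set_msr_entry() and the KVM_PUT_RESET_STATE/KVM_PUT_FULL_STATE levels from the thread, assumes the usual ordering runtime < reset < full, and is not the actual qemu-kvm code.

/* Hypothetical helper: append MCE MSR entries to the caller's list,
 * depending on how much state this sync is supposed to write back. */
static int kvm_put_mce_msrs(CPUState *env, struct kvm_msr_entry *msrs,
                            int n, int level)
{
#ifdef KVM_CAP_MCE
    int i, banks;

    if (!env->mcg_cap)
        return n;

    /* MCG_STATUS is the only MCE register a reset clears, so write it
     * back on every reset-level (and higher) sync. */
    if (level >= KVM_PUT_RESET_STATE)
        set_msr_entry(&msrs[n++], MSR_MCG_STATUS, env->mcg_status);

    /* Everything else survives reset; push it only when the full state
     * is (re)loaded, e.g. after loadvm or incoming migration. */
    if (level >= KVM_PUT_FULL_STATE) {
        set_msr_entry(&msrs[n++], MSR_MCG_CTL, env->mcg_ctl);
        banks = (env->mcg_cap & 0xff) * 4;
        for (i = 0; i < banks; i++)
            set_msr_entry(&msrs[n++], MSR_MC0_CTL + i, env->mce_banks[i]);
    }
#endif
    return n;  /* caller hands the grown list to kvm_set_msrs() */
}

With that split, kvm_arch_cpu_reset() no longer needs to save the MCE banks just to restore them a moment later; zeroing env->mcg_status before the reset-level sync is enough.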
Re: KVM PMU virtualization
* Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: On Thu, 2010-02-25 at 17:26 +0100, Ingo Molnar wrote: * Jan Kiszka jan.kis...@siemens.com wrote: Jes Sorensen wrote: Hi, It looks like several of us have been looking at how to use the PMU for virtualization. Rather than continuing to have discussions in smaller groups, I think it is a good idea we move it to the mailing lists to see what we can share and avoid duplicate efforts. There are really two separate things to handle: 1) Add support to perf to allow it to monitor a KVM guest from the host. 2) Allow guests access to the PMU (or an emulated PMU), making it possible to run perf on applications running within the guest. I know some of you have been looking at 1) and I am currently working on 2). I have been looking at various approaches, including whether it is feasible to share the PMU between the host and multiple guests. For now I am going to focus on allowing one guest to take control of the PMU, then later hopefully adding support for multiplexing it between multiple guests. Given that perf can apply the PMU to individual host tasks, I don't see fundamental problems multiplexing it between individual guests (which can then internally multiplex it again). In terms of how to expose it to guests, a 'soft PMU' might be a usable approach. Although to Linux guests you could expose much more functionality and an non-PMU-limited number of instrumentation events, via a more intelligent interface. But note that in terms of handling it on the host side the PMU approach is not acceptable: instead it should map to proper perf_events, not try to muck with the PMU itself. That, besides integrating properly with perf usage on the host, will also allow interesting 'PMU' features on guests: you could set up the host side to trace block IO requests (or VM exits) for example, and expose that as 'PMC #0' on the guest side. So virtualization becomes non-transparent to guest os? I know virtio is an optimization on guest side. The 'soft PMU' is transparent. The 'count IO events' kind of feature could be transparent too: you could re-configure (on the host) a given 'hardware' event to really count some software event. That would make it compatible with whatever guest side tooling (without having to change that tooling) - while still allowing interesting new things to be measured. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote: On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote: On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote: 1) Add support to perf to allow it to monitor a KVM guest from the host. This shouldn't be a big problem. The PMU of AMD Fam10 processors can be configured to count only when in guest mode. Perf needs to be aware of that and fetch the rip from a different place when monitoring a guest. The idea is we want to measure both host and guest at the same time, and compare all the hot functions fairly. So you want to measure while the guest vcpu is running and the vmexit path of that vcpu (including qemu userspace part) together? The challenge here is to find out if a performance event originated in guest mode or in host mode. But we can check for that in the nmi-protected part of the vmexit path. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
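The "check in the NMI-protected part of the vmexit path" idea can be sketched with a per-CPU flag that brackets the guest-running window; the perf NMI handler then only has to read the flag. All symbols below are made up for illustration - the point is the ordering: the flag is cleared before NMIs (and thus PMIs) can be delivered again after #VMEXIT, so a late PMI cannot be misattributed to the guest.

#include <linux/percpu.h>

static DEFINE_PER_CPU(int, pmu_in_guest);

/* Called immediately before VMRUN, with interrupts disabled. */
static inline void pmu_mark_guest_entry(void)
{
	__this_cpu_write(pmu_in_guest, 1);
}

/* Called in the NMI-protected section of the vmexit path, before NMIs
 * are unblocked again. */
static inline void pmu_mark_guest_exit(void)
{
	__this_cpu_write(pmu_in_guest, 0);
}

/* Called from the perf NMI handler to attribute the sample. */
static inline int pmu_sample_in_guest(void)
{
	return __this_cpu_read(pmu_in_guest);
}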
Re: Enhance perf to support KVM
* Zhang, Yanmin yanmin_zh...@linux.intel.com wrote: 2) We couldn't get guest os kernel/user stack data in an easy way, so we might not support callchain feature of tool perf. A work around is KVM copies kernel stack data out, so we could at least support guest os kernel callchain. If the guest is Linux, KVM can get all the info we need. While the PMU event itself might trigger in an NMI (where we cannot access most of KVM's data structures safely), for this specific case of KVM instrumentation we can delay the processing to a more appropriate time - in fact we can do it in the KVM thread itself. We can do that because we just triggered a VM exit, so the VM state is for all purposes frozen (as far as this virtual CPU goes). Which egives us plenty of time and opportunity to piggy back to the KVM thread, look up the guest stack, process/fill the MMU cache as we walk the guest page tables, etc. etc. It would need some minimal callback facility towards KVM, triggered by a perf event PMI. One additional step needed is to get symbol information from the guest, and to integrate it into the symbol cache on the host side in ~/.debug. We already support cross-arch symbols and 'perf archive', so the basic facilities are there for that. So you can profile on 32-bit PA-RISC and type 'perf report' on 64-bit x86 and get all the right info. For this to work across a guest, a gateway is needed towards the guest. There's several ways to achieve this. The most practical would be two steps: - a user-space facility to access guest images/libraries. (say via ssh, or just a plain TCP port) This would be useful for general 'remote profiling' sessions as well, so it's not KVM specific - it would be useful for remote debugging. - The guest /proc/kallsyms (and vmlinux) could be accessed via that channel as well. (Note that this is purely for guest symbol space access - all the profiling data itself comes via the host kernel.) In theory we could build some sort of 'symbol server' facility into the kernel, which could be enabled in guest kernels too - but i suspect existing, user-space transports go most of the way already. (the only disadvantage of existing transports is that they all have to be configured, enabled and made user-accessible, which is one of the few weak points of KVM in general.) Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Joerg Roedel j...@8bytes.org wrote: On Fri, Feb 26, 2010 at 10:55:17AM +0800, Zhang, Yanmin wrote: On Thu, 2010-02-25 at 18:34 +0100, Joerg Roedel wrote: On Thu, Feb 25, 2010 at 04:04:28PM +0100, Jes Sorensen wrote: 1) Add support to perf to allow it to monitor a KVM guest from the host. This shouldn't be a big problem. The PMU of AMD Fam10 processors can be configured to count only when in guest mode. Perf needs to be aware of that and fetch the rip from a different place when monitoring a guest. The idea is we want to measure both host and guest at the same time, and compare all the hot functions fairly. So you want to measure while the guest vcpu is running and the vmexit path of that vcpu (including qemu userspace part) together? The challenge here is to find out if a performance event originated in guest mode or in host mode. But we can check for that in the nmi-protected part of the vmexit path. As far as instrumentation goes, virtualization is simply another 'PID dimension' of measurement. Today we can isolate system performance measurements/events to the following domains: - per system - per cpu - per task ( Note that PowerPC already supports certain sorts of 'hypervisor/kernel/user' domain separation, and we have some ABI details for all that but it's by no means complete. Anton is using the PowerPC bits AFAIK, so it already works to a certain degree. ) When extending measurements to KVM, we want two things: - user friendliness: instead of having to check 'ps' and figure out which Qemu thread is the KVM thread we want to profile, just give a convenience namespace to access guest profiling info. -G ought to map to the first currently running KVM guest it can find. (which would match like 90% of the cases) - etc. No ifs and when. If 'perf kvm top' doesnt show something useful by default the whole effort is for naught. - Extend core facilities and enable the following measurement dimensions: host-kernel-space host-user-space guest-kernel-space guest-user-space on a per guest basis. We want to be able to measure just what the guest does, and we want to be able to measure just what the host does. Some of this the hardware helps us with (say only measuring host kernel events is possible), some has to be done by fiddling with event enable/disable at vm-exit / vm-entry time. My suggestion, as always, would be to start very simple and very minimal: Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image both as a host and as guest (for testing), to not have to deal with the symbol space transport problem initially. Enable 'perf kvm record' to only record guest events by default. Etc. This alone will be a quite useful result already - and gives a basis for further work. No need to spend months to do the big grand design straight away, all of this can be done gradually and in the order of usefulness - and you'll always have something that actually works (and helps your other KVM projects) along the way. [ And, as so often, once you walk that path, that grand scheme you are thinking about right now might easily become last year's really bad idea ;-) ] So please start walking the path and experience the challenges first-hand. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 10:42 AM, Ingo Molnar wrote: * Joerg Roedelj...@8bytes.org wrote: I personally don't like a self-defined event-set as the only solution because that would probably only work with linux and perf. [...] The 'soft-PMU' i suggested is transparent on the guest side - if you want to enable non-Linux and legacy-Linux. It's basically a PMU interface provided to the guest by catching the right MSR accesses, implemented via perf_event_create_kernel_counter()/etc. on the host side. That only works if the software interface is 100% lossless - we can recreate every single hardware configuration through the API. Is this the case? Note that the 'soft PMU' still sucks from a design POV as there's no generic hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft Intel' PMU driver at minimum. Right, this will severely limit migration domains to hosts of the same vendor and processor generation. There is a middle ground, though, Intel has recently moved to define an architectural pmu which is not model specific. I don't know if AMD adopted it. We could offer both options - native host capabilities, with a loss of compatibility, and the architectural pmu, with loss of model specific counters. Far cleaner would be to expose it via hypercalls to guest OSs that are interested in instrumentation. It's also slower - you can give the guest direct access to the various counters so no exits are taken when reading the counters (though perhaps many tools are only interested in the interrupts, not the counter values). That way it could also transparently integrate with tracing, probes, etc. It would also be wiser to first concentrate on improving Linux-Linux guest/host combos before gutting the design just to fit Windows into the picture ... gutting the design? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
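To make the "soft PMU via trapped MSR accesses" idea concrete, here is a rough host-side sketch: a guest write to an event-select MSR is translated into a kernel-owned perf event bound to the vcpu thread. guest_pmc, soft_pmu_write_evtsel() and the simplified EVTSEL decoding are invented for illustration; perf_event_create_kernel_counter() is the real API mentioned above, though its exact signature has varied between kernel versions (older kernels take a pid instead of a task pointer).

#include <linux/perf_event.h>
#include <linux/err.h>

struct guest_pmc {
	struct perf_event *event;	/* backing host counter, NULL if idle */
	u64 evtsel;			/* last EVTSEL value the guest wrote */
};

static int soft_pmu_write_evtsel(struct guest_pmc *pmc, u64 data)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_RAW,
		.size		= sizeof(attr),
		.config		= data & 0xffff,	  /* event + umask (simplified) */
		.exclude_user	= !(data & (1ULL << 16)), /* USR bit */
		.exclude_kernel	= !(data & (1ULL << 17)), /* OS bit */
		.pinned		= 1,
	};

	/* Tear down the previous backing event, if any. */
	if (pmc->event) {
		perf_event_release_kernel(pmc->event);
		pmc->event = NULL;
	}
	pmc->evtsel = data;

	if (!(data & (1ULL << 22)))	/* EN bit clear: counter disabled */
		return 0;

	/* Bind the counter to the current (vcpu) thread, so the guest is
	 * multiplexed against other perf users like any host task. */
	pmc->event = perf_event_create_kernel_counter(&attr, -1, current,
						      NULL, NULL);
	if (IS_ERR(pmc->event)) {
		int err = PTR_ERR(pmc->event);
		pmc->event = NULL;
		return err;
	}
	return 0;
}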
Re: [PATCH 1/5] KVM: SVM: Move msrpm offset calculation to seperate function
On 02/25/2010 07:15 PM, Joerg Roedel wrote:
> The algorithm to find the offset in the msrpm for a given msr is
> needed at other places too. Move that logic to its own function.
>
>   #define MAX_INST_SIZE 15
> @@ -417,23 +439,22 @@ err_1:
>   static void set_msr_interception(u32 *msrpm, unsigned msr,
>   				 int read, int write)
>   {
> -	int i;
> +	u8 bit_read, bit_write;
> +	unsigned long tmp;
> +	u32 offset;
>
> -	for (i = 0; i < NUM_MSR_MAPS; i++) {
> -		if (msr >= msrpm_ranges[i] &&
> -		    msr < msrpm_ranges[i] + MSRS_IN_RANGE) {
> -			u32 msr_offset = (i * MSRS_IN_RANGE + msr -
> -					  msrpm_ranges[i]) * 2;
> -
> -			u32 *base = msrpm + (msr_offset / 32);
> -			u32 msr_shift = msr_offset % 32;
> -			u32 mask = ((write) ? 0 : 2) | ((read) ? 0 : 1);
> -			*base = (*base & ~(0x3 << msr_shift)) |
> -				(mask << msr_shift);
> -			return;
> -		}
> -	}
> -	BUG();
> +	offset    = svm_msrpm_offset(msr);
> +	bit_read  = 2 * (msr & 0x0f);
> +	bit_write = 2 * (msr & 0x0f) + 1;
> +
> +	BUG_ON(offset == MSR_INVALID);
> +
> +	tmp = msrpm[offset];
> +
> +	read  ? clear_bit(bit_read,  &tmp) : set_bit(bit_read,  &tmp);
> +	write ? clear_bit(bit_write, &tmp) : set_bit(bit_write, &tmp);
> +
> +	msrpm[offset] = tmp;
>   }

This can fault - set_bit() accesses an unsigned long, which can be 8
bytes, while offset can point into the last u32 of msrpm. So this needs
either to revert to u32 shift/mask ops or msrpm be changed to a ulong
array (actually better, since bitmaps in general are defined as arrays
of ulongs).

btw, the top-level ternary expression is terrible, relying solely on
*_bit()'s side effects. Please convert to an ordinary if.

btw2, use __set_bit() since an atomic operation is not needed.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
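For reference, one way set_msr_interception() could look with all three comments applied - plain u32 arithmetic so no unsigned long access can run past the bitmap, and ordinary ifs instead of side-effect ternaries. This is a sketch against the names used in the patch, not necessarily the fix that eventually went upstream.

static void set_msr_interception(u32 *msrpm, unsigned msr,
				 int read, int write)
{
	u8 bit_read, bit_write;
	u32 offset, tmp;

	offset = svm_msrpm_offset(msr);
	BUG_ON(offset == MSR_INVALID);

	/* Two bits per MSR within one u32: even bit = read, odd bit = write. */
	bit_read  = 2 * (msr & 0x0f);
	bit_write = 2 * (msr & 0x0f) + 1;

	tmp = msrpm[offset];

	if (read)
		tmp &= ~(1u << bit_read);	/* no intercept on reads */
	else
		tmp |= (1u << bit_read);	/* intercept reads */

	if (write)
		tmp &= ~(1u << bit_write);	/* no intercept on writes */
	else
		tmp |= (1u << bit_write);	/* intercept writes */

	msrpm[offset] = tmp;
}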
Re: [PATCH 1/5] KVM: SVM: Move msrpm offset calculation to seperate function
On Fri, Feb 26, 2010 at 12:20:10PM +0200, Avi Kivity wrote:
> On 02/25/2010 07:15 PM, Joerg Roedel wrote:
> > The algorithm to find the offset in the msrpm for a given msr is
> > needed at other places too. Move that logic to its own function.
> >
> >   #define MAX_INST_SIZE 15
> > @@ -417,23 +439,22 @@ err_1:
> >   static void set_msr_interception(u32 *msrpm, unsigned msr,
> >   				 int read, int write)
> >   {
> > -	int i;
> > +	u8 bit_read, bit_write;
> > +	unsigned long tmp;
> > +	u32 offset;
> >
> > -	for (i = 0; i < NUM_MSR_MAPS; i++) {
> > -		if (msr >= msrpm_ranges[i] &&
> > -		    msr < msrpm_ranges[i] + MSRS_IN_RANGE) {
> > -			u32 msr_offset = (i * MSRS_IN_RANGE + msr -
> > -					  msrpm_ranges[i]) * 2;
> > -
> > -			u32 *base = msrpm + (msr_offset / 32);
> > -			u32 msr_shift = msr_offset % 32;
> > -			u32 mask = ((write) ? 0 : 2) | ((read) ? 0 : 1);
> > -			*base = (*base & ~(0x3 << msr_shift)) |
> > -				(mask << msr_shift);
> > -			return;
> > -		}
> > -	}
> > -	BUG();
> > +	offset    = svm_msrpm_offset(msr);
> > +	bit_read  = 2 * (msr & 0x0f);
> > +	bit_write = 2 * (msr & 0x0f) + 1;
> > +
> > +	BUG_ON(offset == MSR_INVALID);
> > +
> > +	tmp = msrpm[offset];
> > +
> > +	read  ? clear_bit(bit_read,  &tmp) : set_bit(bit_read,  &tmp);
> > +	write ? clear_bit(bit_write, &tmp) : set_bit(bit_write, &tmp);
> > +
> > +	msrpm[offset] = tmp;
> >   }
>
> This can fault - set_bit() accesses an unsigned long, which can be 8
> bytes, while offset can point into the last u32 of msrpm. So this needs
> either to revert to u32 shift/mask ops or msrpm be changed to a ulong
> array (actually better, since bitmaps in general are defined as arrays
> of ulongs).

Ah true, I will fix that. Thanks.

> btw, the top-level ternary expression is terrible, relying solely on
> *_bit()'s side effects. Please convert to an ordinary if.
>
> btw2, use __set_bit() since an atomic operation is not needed.

Right, will switch to __set_bit and __clear_bit.

	Joerg
Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging
On 02/25/2010 07:15 PM, Joerg Roedel wrote:
> This patch optimizes the way the msrpm of the host and the guest are
> merged. The old code merged the 2 msrpm pages completely. This code
> needed to touch 24kb of memory for that operation. The optimized
> variant this patch introduces merges only the parts where the host
> msrpm may contain zero bits. This reduces the amount of memory which
> is touched to 48 bytes.
>
> Signed-off-by: Joerg Roedel <joerg.roe...@amd.com>
> ---
>  arch/x86/kvm/svm.c |   67 +---
>  1 files changed, 58 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d8d4e35..d15e0ea 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -92,6 +92,9 @@ struct nested_state {
>
>  };
>
> +#define MSRPM_OFFSETS 16
> +static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;
> +
>  struct vcpu_svm {
>  	struct kvm_vcpu vcpu;
>  	struct vmcb *vmcb;
> @@ -436,6 +439,34 @@ err_1:
>
>  }
>
> +static void add_msr_offset(u32 offset)
> +{
> +	u32 old;
> +	int i;
> +
> +again:
> +	for (i = 0; i < MSRPM_OFFSETS; ++i) {
> +		old = msrpm_offsets[i];
> +
> +		if (old == offset)
> +			return;
> +
> +		if (old != MSR_INVALID)
> +			continue;
> +
> +		if (cmpxchg(&msrpm_offsets[i], old, offset) != old)
> +			goto again;
> +
> +		return;
> +	}
> +
> +	/*
> +	 * If this BUG triggers the msrpm_offsets table has an overflow. Just
> +	 * increase MSRPM_OFFSETS in this case.
> +	 */
> +	BUG();
> +}

Why all this atomic cleverness? The possible offsets are all determined
statically. Even if you do them dynamically (makes sense when
considering pmu passthrough), it's per-vcpu and therefore single
threaded (just move msrpm_offsets into vcpu context).

> @@ -1846,20 +1882,33 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
>
>  static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm)
>  {
> -	u32 *nested_msrpm;
> -	struct page *page;
> +	/*
> +	 * This function merges the msr permission bitmaps of kvm and the
> +	 * nested vmcb. It is optimized in that it only merges the parts where
> +	 * the kvm msr permission bitmap may contain zero bits
> +	 */

A comment that describes the entire function can be moved above the
function, freeing a whole tab stop for contents.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
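For comparison, the same table can be filled without the cmpxchg/goto loop when the offsets are registered from a single-threaded context (module init, or per-vcpu setup as suggested above). A sketch:

static void add_msr_offset(u32 offset)
{
	int i;

	for (i = 0; i < MSRPM_OFFSETS; ++i) {
		/* Offset already in the list? */
		if (msrpm_offsets[i] == offset)
			return;

		/* First free slot: record the offset and stop. */
		if (msrpm_offsets[i] == MSR_INVALID) {
			msrpm_offsets[i] = offset;
			return;
		}
	}

	/*
	 * If this BUG triggers, the msrpm_offsets table has overflowed.
	 * Just increase MSRPM_OFFSETS in that case.
	 */
	BUG();
}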
Re: [PATCH 3/5] KVM: SVM: Use svm_msrpm_offset in nested_svm_exit_handled_msr
On 02/25/2010 07:15 PM, Joerg Roedel wrote: There is a generic function now to calculate msrpm offsets. Use that function in nested_svm_exit_handled_msr() remove the duplicate logic. Hm, if the function would also calculate the mask, then it would be useful for set_msr_interception() as well. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
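A possible shape of that combined helper - one lookup that yields the u32 offset plus the read/write bit positions, so set_msr_interception() and nested_svm_exit_handled_msr() share all of the logic. The name and signature below are invented for illustration:

static u32 svm_msrpm_offset_and_bits(u32 msr, u8 *bit_read, u8 *bit_write)
{
	u32 offset = svm_msrpm_offset(msr);

	if (offset != MSR_INVALID) {
		*bit_read  = 2 * (msr & 0x0f);		/* even bit: read intercept */
		*bit_write = 2 * (msr & 0x0f) + 1;	/* odd bit: write intercept */
	}

	return offset;
}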
Re: [PATCH 4/5] KVM: SVM: Add correct handling of nested iopm
On 02/25/2010 07:15 PM, Joerg Roedel wrote:
> This patch adds the correct handling of the nested io permission
> bitmap. Old behavior was to not lookup the port in the iopm but only
> reinject an io intercept to the guest.
>
> Signed-off-by: Joerg Roedel <joerg.roe...@amd.com>
> ---
>  arch/x86/kvm/svm.c |   25 +
>  1 files changed, 25 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index bb75a44..3859e2c 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -78,6 +78,7 @@ struct nested_state {
>
>  	/* gpa pointers to the real vectors */
>  	u64 vmcb_msrpm;
> +	u64 vmcb_iopm;
>
>  	/* A VMEXIT is required but not yet emulated */
>  	bool exit_required;
> @@ -1603,6 +1604,26 @@ static void nested_svm_unmap(struct page *page)
>
>  	kvm_release_page_dirty(page);
>  }
>
> +static int nested_svm_intercept_ioio(struct vcpu_svm *svm)
> +{
> +	unsigned port;
> +	u8 val, bit;
> +	u64 gpa;
> +
> +	if (!(svm->nested.intercept & (1ULL << INTERCEPT_IOIO_PROT)))
> +		return NESTED_EXIT_HOST;
> +
> +	port = svm->vmcb->control.exit_info_1 >> 16;
> +	gpa  = svm->nested.vmcb_iopm + (port / 8);
> +	bit  = port % 8;
> +	val  = 0;
> +
> +	if (kvm_read_guest(svm->vcpu.kvm, gpa, &val, 1))
> +		val &= (1 << bit);
> +
> +	return val ? NESTED_EXIT_DONE : NESTED_EXIT_HOST;
> +}
> +

A kvm_{test,set,clear}_guest_bit() would be useful, we have several
users already (not a requirement for this patchset).

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
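The suggested kvm_test_guest_bit() helper could be a thin wrapper around the existing kvm_read_guest(); the sketch below is hypothetical (no such helper exists in the tree), and it deliberately fails towards "bit set" so callers err on the side of intercepting when the gpa cannot be read.

static int kvm_test_guest_bit(struct kvm *kvm, gpa_t base, unsigned int bit)
{
	u8 byte;

	/* Read the guest byte that contains the bit. */
	if (kvm_read_guest(kvm, base + bit / 8, &byte, 1))
		return 1;	/* unreadable: report the bit as set */

	return (byte >> (bit % 8)) & 1;
}

With such a helper, nested_svm_intercept_ioio() would reduce to the intercept-bit check plus a single kvm_test_guest_bit(svm->vcpu.kvm, svm->nested.vmcb_iopm, port) call.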
Re: Enhance perf to support KVM
* Avi Kivity a...@redhat.com wrote: On 02/26/2010 11:01 AM, Ingo Molnar wrote: * Zhang, Yanminyanmin_zh...@linux.intel.com wrote: 2) We couldn't get guest os kernel/user stack data in an easy way, so we might not support callchain feature of tool perf. A work around is KVM copies kernel stack data out, so we could at least support guest os kernel callchain. If the guest is Linux, KVM can get all the info we need. While the PMU event itself might trigger in an NMI (where we cannot access most of KVM's data structures safely), for this specific case of KVM instrumentation we can delay the processing to a more appropriate time - in fact we can do it in the KVM thread itself. The nmi will be a synchronous event: it happens in guest context, and we program the hardware to intercept nmis, so we just get an exit telling us that an nmi has happened. (would also be interesting to allow the guest to process the nmi directly in some scenarios, though that would require that there be no nmi sources on the host). We can do that because we just triggered a VM exit, so the VM state is for all purposes frozen (as far as this virtual CPU goes). Yes. Which egives us plenty of time and opportunity to piggy back to the KVM thread, look up the guest stack, process/fill the MMU cache as we walk the guest page tables, etc. etc. It would need some minimal callback facility towards KVM, triggered by a perf event PMI. Since the event is synchronous and kvm is aware of it we don't need a callback; kvm can call directly into perf with all the information. Yes - it's still a callback in the abstract sense. Much of it already all existing. One additional step needed is to get symbol information from the guest, and to integrate it into the symbol cache on the host side in ~/.debug. We already support cross-arch symbols and 'perf archive', so the basic facilities are there for that. So you can profile on 32-bit PA-RISC and type 'perf report' on 64-bit x86 and get all the right info. For this to work across a guest, a gateway is needed towards the guest. There's several ways to achieve this. The most practical would be two steps: - a user-space facility to access guest images/libraries. (say via ssh, or just a plain TCP port) This would be useful for general 'remote profiling' sessions as well, so it's not KVM specific - it would be useful for remote debugging. - The guest /proc/kallsyms (and vmlinux) could be accessed via that channel as well. (Note that this is purely for guest symbol space access - all the profiling data itself comes via the host kernel.) In theory we could build some sort of 'symbol server' facility into the kernel, which could be enabled in guest kernels too - but i suspect existing, user-space transports go most of the way already. There is also vmchannel aka virtio-serial, a guest-to-host communication channel. Basically what is needed is plain filesystem access - properly privileged. So doing this via a vmchannel would be nice, but for the symbol extraction it would be a glorified NFS server in essence. Do you have (or plan) any turn-key 'access to all files of the guest' kind of guest-transparent facility that could be used for such purposes? 
That would have various advantages over a traditional explicit file server approach: - it would not contaminate the guest port space - no guest side configuration needed (the various oprofile remote daemons always sucked as they needed extra setup) - it might even be used with a guest that does no networking - if done fully in the kernel it could be done with a fully 'unaware' guest, etc. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote: On 02/26/2010 10:42 AM, Ingo Molnar wrote: Note that the 'soft PMU' still sucks from a design POV as there's no generic hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft Intel' PMU driver at minimum. Right, this will severely limit migration domains to hosts of the same vendor and processor generation. There is a middle ground, though, Intel has recently moved to define an architectural pmu which is not model specific. I don't know if AMD adopted it. We could offer both options - native host capabilities, with a loss of compatibility, and the architectural pmu, with loss of model specific counters. I only had a quick look yet on the architectural pmu from intel but it looks like it can be emulated for a guest on amd using existing features. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote: My suggestion, as always, would be to start very simple and very minimal: Enable 'perf kvm top' to show guest overhead. Use the exact same kernel image both as a host and as guest (for testing), to not have to deal with the symbol space transport problem initially. Enable 'perf kvm record' to only record guest events by default. Etc. This alone will be a quite useful result already - and gives a basis for further work. No need to spend months to do the big grand design straight away, all of this can be done gradually and in the order of usefulness - and you'll always have something that actually works (and helps your other KVM projects) along the way. [ And, as so often, once you walk that path, that grand scheme you are thinking about right now might easily become last year's really bad idea ;-) ] So please start walking the path and experience the challenges first-hand. That sounds like a good approach for the 'measure-guest-from-host' problem. It is also not very hard to implement. Where does perf fetch the rip of the nmi from, stack only or is this configurable? Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Avi Kivity a...@redhat.com wrote: Right, this will severely limit migration domains to hosts of the same vendor and processor generation. There is a middle ground, though, Intel has recently moved to define an architectural pmu which is not model specific. I don't know if AMD adopted it. [...] Nope. It's architectural the following way: Intel wont change it with future CPU models, outside of the definitions of the hw-ABI. PMUs were model specific prior that time. I'd say there's near zero chance the MSR spaces will unify. All the 'advanced' PMU features are wildly incompatible, and the gap is increasing not decreasing. Far cleaner would be to expose it via hypercalls to guest OSs that are interested in instrumentation. It's also slower - you can give the guest direct access to the various counters so no exits are taken when reading the counters (though perhaps many tools are only interested in the interrupts, not the counter values). Direct access to counters is not something that is a big issue. [ Given that i sometimes can see KVM redraw the screen of a guest OS real-time i doubt this is the biggest of performance challenges right now ;-) ] By far the biggest instrumentation issue is: - availability - usability - flexibility Exposing the raw hw is a step backwards in many regards. The same way we dont want to expose chipsets to the guest to allow them to do RAS. The same way we dont want to expose most raw PCI devices to guest in general, but have all these virt driver abstractions. That way it could also transparently integrate with tracing, probes, etc. It would also be wiser to first concentrate on improving Linux-Linux guest/host combos before gutting the design just to fit Windows into the picture ... gutting the design? Yes, gutting the design of a sane instrumentation API and moving it back 10-20 years by squeezing it through non-standardized and incompatible PMU drivers. When it comes to design my main interest is the Linux-Linux combo. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Joerg Roedel j...@8bytes.org wrote: On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote: On 02/26/2010 10:42 AM, Ingo Molnar wrote: Note that the 'soft PMU' still sucks from a design POV as there's no generic hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft Intel' PMU driver at minimum. Right, this will severely limit migration domains to hosts of the same vendor and processor generation. There is a middle ground, though, Intel has recently moved to define an architectural pmu which is not model specific. I don't know if AMD adopted it. We could offer both options - native host capabilities, with a loss of compatibility, and the architectural pmu, with loss of model specific counters. I only had a quick look yet on the architectural pmu from intel but it looks like it can be emulated for a guest on amd using existing features. AMD CPUs dont have enough events for that, they cannot do the 3 fixed events in addition to the 2 generic ones. Nor do you really want to standardize on KVM guests on returning 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU drivers, right? Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enhance perf to support KVM
On 02/26/2010 12:35 PM, Ingo Molnar wrote: One additional step needed is to get symbol information from the guest, and to integrate it into the symbol cache on the host side in ~/.debug. We already support cross-arch symbols and 'perf archive', so the basic facilities are there for that. So you can profile on 32-bit PA-RISC and type 'perf report' on 64-bit x86 and get all the right info. For this to work across a guest, a gateway is needed towards the guest. There's several ways to achieve this. The most practical would be two steps: - a user-space facility to access guest images/libraries. (say via ssh, or just a plain TCP port) This would be useful for general 'remote profiling' sessions as well, so it's not KVM specific - it would be useful for remote debugging. - The guest /proc/kallsyms (and vmlinux) could be accessed via that channel as well. (Note that this is purely for guest symbol space access - all the profiling data itself comes via the host kernel.) In theory we could build some sort of 'symbol server' facility into the kernel, which could be enabled in guest kernels too - but i suspect existing, user-space transports go most of the way already. There is also vmchannel aka virtio-serial, a guest-to-host communication channel. Basically what is needed is plain filesystem access - properly privileged. So doing this via a vmchannel would be nice, but for the symbol extraction it would be a glorified NFS server in essence. Well, we could run an nfs server over vmchannel, or over a private network interface. Do you have (or plan) any turn-key 'access to all files of the guest' kind of guest-transparent facility that could be used for such purposes? Not really. The guest and host admins are usually different people, who may, being admins, even actively hate each other. The guest admin would probably regard it as a security hole. It's probably useful for the single-host scenario, and of course for developers. I guess sshfs can fill this role, with one command it gives you secure access to all guest files, provided you have the proper credentials. That would have various advantages over a traditional explicit file server approach: - it would not contaminate the guest port space - no guest side configuration needed (the various oprofile remote daemons always sucked as they needed extra setup) - it might even be used with a guest that does no networking - if done fully in the kernel it could be done with a fully 'unaware' guest, etc. Seems sshfs fulfils the first two. For the latter, we could do a vmchannelfs, but it seems quite a bit of work, and would require fairly new guest kernels, whereas sshfs would work out of the box on 10 year old guests and can be easily made to work on Windows. Somewhat related, see libguestfs/guestfish, though that provides offline access only. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 12:46 PM, Ingo Molnar wrote: Right, this will severely limit migration domains to hosts of the same vendor and processor generation. There is a middle ground, though, Intel has recently moved to define an architectural pmu which is not model specific. I don't know if AMD adopted it. We could offer both options - native host capabilities, with a loss of compatibility, and the architectural pmu, with loss of model specific counters. I only had a quick look yet on the architectural pmu from intel but it looks like it can be emulated for a guest on amd using existing features. AMD CPUs dont have enough events for that, they cannot do the 3 fixed events in addition to the 2 generic ones. Nor do you really want to standardize on KVM guests on returning 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU drivers, right? No - that would only work if AMD also adopted the architectural pmu. Note virtualization clusters are typically split into 'migration pools' consisting of hosts with similar processor features, so that you can expose those features and yet live migrate guests at will. It's likely that all hosts have the same pmu anyway, so the only downside is that we now have to expose the host's processor family and model. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Joerg Roedel <j...@8bytes.org> wrote:

> On Fri, Feb 26, 2010 at 10:17:32AM +0100, Ingo Molnar wrote:
> > My suggestion, as always, would be to start very simple and very
> > minimal:
> >
> > Enable 'perf kvm top' to show guest overhead. Use the exact same
> > kernel image both as a host and as guest (for testing), to not have
> > to deal with the symbol space transport problem initially. Enable
> > 'perf kvm record' to only record guest events by default. Etc.
> >
> > This alone will be a quite useful result already - and gives a basis
> > for further work. No need to spend months to do the big grand design
> > straight away, all of this can be done gradually and in the order of
> > usefulness - and you'll always have something that actually works
> > (and helps your other KVM projects) along the way.
> >
> > [ And, as so often, once you walk that path, that grand scheme you
> >   are thinking about right now might easily become last year's
> >   really bad idea ;-) ]
> >
> > So please start walking the path and experience the challenges
> > first-hand.
>
> That sounds like a good approach for the 'measure-guest-from-host'
> problem. It is also not very hard to implement. Where does perf fetch
> the rip of the nmi from, stack only or is this configurable?

The host semantics are that it takes the stack from the regs, and with
call-graph recording (perf record -g) it will walk down the exception
stack, irq stack, kernel stack, and user-space stack as well. (Up to
the point the pages are present - it stops on a non-present page. An
app that is being profiled has its stack present, so it's not an issue
in practice.)

I'd suggest to leave out call graph sampling initially, and just get
'perf kvm top' to work with guest RIPs, simply sampled from the VM exit
state.

See arch/x86/kernel/cpu/perf_event.c:

static void perf_callchain_kernel(struct pt_regs *regs,
				  struct perf_callchain_entry *entry)
{
	callchain_store(entry, PERF_CONTEXT_KERNEL);
	callchain_store(entry, regs->ip);

	dump_trace(NULL, regs, NULL, regs->bp, &backtrace_ops, entry);
}

If you have easy access to the VM state from NMI context right there
then just hack in the guest RIP and you should have some prototype that
samples the guest. (Assuming you use the same kernel image for both the
host and the guest.)

This would be the easiest way to prototype it all.

	Ingo
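In sketch form, the prototype described here boils down to overriding the sample IP when the PMI interrupted guest execution. kvm_guest_mode() and kvm_guest_rip() are placeholders for whatever KVM-side hook would provide this (for instance a RIP snapshot taken in the vmexit path); they are not existing kernel symbols.

static unsigned long perf_sample_ip(struct pt_regs *regs)
{
	/* PMI hit while a vcpu was in guest mode: report the guest RIP
	 * saved at #VMEXIT instead of the host interrupt frame's IP. */
	if (kvm_guest_mode())
		return kvm_guest_rip();

	return regs->ip;	/* ordinary host sample */
}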
Re: KVM PMU virtualization
On 02/25/10 17:26, Ingo Molnar wrote: Given that perf can apply the PMU to individual host tasks, I don't see fundamental problems multiplexing it between individual guests (which can then internally multiplex it again). In terms of how to expose it to guests, a 'soft PMU' might be a usable approach. Although to Linux guests you could expose much more functionality and an non-PMU-limited number of instrumentation events, via a more intelligent interface. But note that in terms of handling it on the host side the PMU approach is not acceptable: instead it should map to proper perf_events, not try to muck with the PMU itself. I am not keen on emulating the PMU, if we do that we end up having to emulate a large number of MSR accesses, which is really costly. It makes a lot more sense to give the guest direct access to the PMU. The problem here is how to manage it without too much overhead. Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote: * Joerg Roedel j...@8bytes.org wrote: On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote: On 02/26/2010 10:42 AM, Ingo Molnar wrote: Note that the 'soft PMU' still sucks from a design POV as there's no generic hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft Intel' PMU driver at minimum. Right, this will severely limit migration domains to hosts of the same vendor and processor generation. There is a middle ground, though, Intel has recently moved to define an architectural pmu which is not model specific. I don't know if AMD adopted it. We could offer both options - native host capabilities, with a loss of compatibility, and the architectural pmu, with loss of model specific counters. I only had a quick look yet on the architectural pmu from intel but it looks like it can be emulated for a guest on amd using existing features. AMD CPUs dont have enough events for that, they cannot do the 3 fixed events in addition to the 2 generic ones. Good point. Maybe we can emulate that with some counter round-robin usage if the guest really uses all 5 counters. Nor do you really want to standardize on KVM guests on returning 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU drivers, right? Isn't there a cpuid bit indicating the availability of architectural perfmon? Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 12:44 PM, Ingo Molnar wrote: Far cleaner would be to expose it via hypercalls to guest OSs that are interested in instrumentation. It's also slower - you can give the guest direct access to the various counters so no exits are taken when reading the counters (though perhaps many tools are only interested in the interrupts, not the counter values). Direct access to counters is not something that is a big issue. [ Given that i sometimes can see KVM redraw the screen of a guest OS real-time i doubt this is the biggest of performance challenges right now ;-) ] Outside 4-bit vga mode, this shouldn't happen. Can you describe your scenario? By far the biggest instrumentation issue is: - availability - usability - flexibility Exposing the raw hw is a step backwards in many regards. In a way, virtualization as a whole is a step backwards. We take the nice firesystem/timer/network/scheduler APIs, and expose them as raw hardware. The pmu isn't any different. The same way we dont want to expose chipsets to the guest to allow them to do RAS. The same way we dont want to expose most raw PCI devices to guest in general, but have all these virt driver abstractions. Whenever we have a choice, we expose raw hardware (usually emulated, but in some cases real). Raw hardware has the huge advantage of being already supported. Write a software abstraction, and you get to (a) write and maintain the spec (b) write drivers for all guests (c) mumble something to users of OSes to which you haven't ported your driver (d) explain to users that they need to install those drivers. For networking and block, it is simply impossible to obtain good performance without introducing a new interface, but for other stuff, that may not be the case. That way it could also transparently integrate with tracing, probes, etc. It would also be wiser to first concentrate on improving Linux-Linux guest/host combos before gutting the design just to fit Windows into the picture ... gutting the design? Yes, gutting the design of a sane instrumentation API and moving it back 10-20 years by squeezing it through non-standardized and incompatible PMU drivers. Any new interface will be incompatible to all the exiting guests out there; and unlike networking, you can't retrofit a pmu interface to an existing guest. When it comes to design my main interest is the Linux-Linux combo. My main interest is the OSes that users actually install, and those are Windows and non-bleeding-edge Linux. Look at guests as you do at userspace: you don't want to inflict changes upon them. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enhance perf to support KVM
* Avi Kivity a...@redhat.com wrote: Do you have (or plan) any turn-key 'access to all files of the guest' kind of guest-transparent facility that could be used for such purposes? Not really. The guest and host admins are usually different people, who may, being admins, even actively hate each other. The guest admin would probably regard it as a security hole. It's probably useful for the single-host scenario, and of course for developers. Sounds like an exceedingly silly argument to me - the host admin is the king in any case. Your argument boils down to: 'dont offer transparent, turn-key solutions because some might object to the functionality they offer for all the wrong reasons'. Which does not withstand elementary scrutiny. This is a basic usability issue, and affects many parts of the KVM universe. Really, it's by far the most fubar-ed notion of KVM. You are pushing _way_ too much to user-space into different modules and maintenance domains, and user-space forks those bits, fragments, diverts, delays and messes up basic features in the usual fashion. The result is a basic out-of-box virtualization experience that sucks even these days. Nobody is really 'in charge' of how KVM gets delivered to the user. You isolated the fun kernel part for you and pushed out the boring bits to user-space. So if mundane things like mouse integration sucks 'hey that's a user-space tooling problem', if file integration sucks then 'hey, that's an admin problem', if it cannot be used over the network 'hey, that's an Xorg problem', etc. etc. You basically have given up control over the quality of KVM by pushing so many aspects of it to user-space and letting it rot there. Sure the design looks somewhat cleaner on paper, but if the end result is not helped by it then over-modularization sure can hurt ... ( Note that i dont mind user-space tooling per se, as long as it sits together with the kernel bits and gets developed, packaged and given to the user in the same domain. ) And that's a key conceptual area were tools/perf/ differs: it's an integrated, turn-key solution that you can really rely on. We take responsibility for the full thing, no ifs and when. And if you cannot rely on your instrumentation tooling as a single unit you cannot use it, simple as that. (that is a key mistake Oprofile made a decade ago too btw.) So i can see some upcoming culture friction with standing KVM principles there ;-) Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: list_add corruption?
On 02/26/2010 06:57 AM, Zachary Amsden wrote: Anyone seeing list_add corruption running qemu-kvm with -smp 2 on Intel hardware? Debugging some local changes, which don't appear related. Running module from latest git on F12. Can you post a trace? Which list appears to be involved? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 12:06, Joerg Roedel wrote: Isn't there a cpuid bit indicating the availability of architectural perfmon? Nope, the perfmon flag is a fake Linux flag, set based on the contents on cpuid 0x0a Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Joerg Roedel j...@8bytes.org wrote: On Fri, Feb 26, 2010 at 11:46:59AM +0100, Ingo Molnar wrote: * Joerg Roedel j...@8bytes.org wrote: On Fri, Feb 26, 2010 at 11:46:34AM +0200, Avi Kivity wrote: On 02/26/2010 10:42 AM, Ingo Molnar wrote: Note that the 'soft PMU' still sucks from a design POV as there's no generic hw interface to the PMU. So there would have to be a 'soft AMD' and a 'soft Intel' PMU driver at minimum. Right, this will severely limit migration domains to hosts of the same vendor and processor generation. There is a middle ground, though, Intel has recently moved to define an architectural pmu which is not model specific. I don't know if AMD adopted it. We could offer both options - native host capabilities, with a loss of compatibility, and the architectural pmu, with loss of model specific counters. I only had a quick look yet on the architectural pmu from intel but it looks like it can be emulated for a guest on amd using existing features. AMD CPUs dont have enough events for that, they cannot do the 3 fixed events in addition to the 2 generic ones. Good point. Maybe we can emulate that with some counter round-robin usage if the guest really uses all 5 counters. Nor do you really want to standardize on KVM guests on returning 'GenuineIntel' in CPUID, so that the various guest side OSs use the Intel PMU drivers, right? Isn't there a cpuid bit indicating the availability of architectural perfmon? there is, but can you rely on all guest OSs keying off their PMU drivers based purely on the CPUID bit and not on any other CPUID aspects? Guest OSs like ... Linux v2.6.33: void __init init_hw_perf_events(void) { int err; pr_info(Performance Events: ); switch (boot_cpu_data.x86_vendor) { case X86_VENDOR_INTEL: err = intel_pmu_init(); break; case X86_VENDOR_AMD: err = amd_pmu_init(); break; default: Really, if you want to emulate a single Intel PMU driver model you need to pretend that you are an Intel CPU, throughout. This cannot be had both ways. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Jes Sorensen <jes.soren...@redhat.com> wrote:

> On 02/26/10 12:06, Joerg Roedel wrote:
> > Isn't there a cpuid bit indicating the availability of architectural
> > perfmon?
>
> Nope, the perfmon flag is a fake Linux flag, set based on the contents
> of cpuid 0x0a

There is a way to query the CPU for 'architectural perfmon' though, via
CPUID alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut.
The logic is:

	if (c->cpuid_level > 9) {
		unsigned eax = cpuid_eax(10);
		/* Check for version and the number of counters */
		if ((eax & 0xff) && (((eax >> 8) & 0xff) > 1))
			set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON);
	}

But emulating that doesn't solve the problem: OSs generally don't key
their PMU drivers off the relatively new 'architectural perfmon' CPUID
detail, but off much higher level CPUID attributes (like Intel/AMD).

	Ingo
Re: KVM PMU virtualization
On 02/26/10 11:44, Ingo Molnar wrote: Direct access to counters is not something that is a big issue. [ Given that i sometimes can see KVM redraw the screen of a guest OS real-time i doubt this is the biggest of performance challenges right now ;-) ] By far the biggest instrumentation issue is: - availability - usability - flexibility Exposing the raw hw is a step backwards in many regards. The same way we dont want to expose chipsets to the guest to allow them to do RAS. The same way we dont want to expose most raw PCI devices to guest in general, but have all these virt driver abstractions. I have to say I disagree on that. When you run perfmon on a system, it is normally to measure a specific application. You want to see accurate numbers for cache misses, mul instructions or whatever else is selected. Emulating the PMU rather than using the real one, makes the numbers far less useful. The most useful way to provide PMU support in a guest is to expose the real PMU and let the guest OS program it. We can do this in a reasonable way today, if we allow to take the PMU away from the host, and only let guests access it when it's in use. Hopefully Intel and AMD will come up with proper hw PMU virtualization support that allows us to do it 100% guest and host at some point. Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 12:24, Ingo Molnar wrote: There is a way to query the CPU for 'architectural perfmon' though, via CPUID alone - that is how we set the X86_FEATURE_ARCH_PERFMON shortcut. The logic is: if (c-cpuid_level 9) { unsigned eax = cpuid_eax(10); /* Check for version and the number of counters */ if ((eax 0xff) (((eax8) 0xff) 1)) set_cpu_cap(c, X86_FEATURE_ARCH_PERFMON); } But emulating that doesnt solve the problem: as OSs generally dont key their PMU drivers off the relatively new 'architectural perfmon' CPUID detail, but based on much higher level CPUID attributes. (like Intel/AMD) Right, there is far more to it than just the arch-perfmon feature. They still need to query cpuid 0x0a for counter size, number of counters and stuff like that. Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Avi Kivity a...@redhat.com wrote: On 02/26/2010 12:44 PM, Ingo Molnar wrote: Far cleaner would be to expose it via hypercalls to guest OSs that are interested in instrumentation. It's also slower - you can give the guest direct access to the various counters so no exits are taken when reading the counters (though perhaps many tools are only interested in the interrupts, not the counter values). Direct access to counters is not something that is a big issue. [ Given that i sometimes can see KVM redraw the screen of a guest OS real-time i doubt this is the biggest of performance challenges right now ;-) ] Outside 4-bit vga mode, this shouldn't happen. Can you describe your scenario? By far the biggest instrumentation issue is: - availability - usability - flexibility Exposing the raw hw is a step backwards in many regards. In a way, virtualization as a whole is a step backwards. We take the nice firesystem/timer/network/scheduler APIs, and expose them as raw hardware. The pmu isn't any different. Uhm, it's obviously very different. A fake NE2000 will work on both Intel and AMD CPUs. Same for a fake PIT. PMU drivers are fundamentally per CPU vendor though. So there's no generic hardware to emulate. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Jes Sorensen jes.soren...@redhat.com wrote: On 02/26/10 11:44, Ingo Molnar wrote: Direct access to counters is not something that is a big issue. [ Given that i sometimes can see KVM redraw the screen of a guest OS real-time i doubt this is the biggest of performance challenges right now ;-) ] By far the biggest instrumentation issue is: - availability - usability - flexibility Exposing the raw hw is a step backwards in many regards. The same way we dont want to expose chipsets to the guest to allow them to do RAS. The same way we dont want to expose most raw PCI devices to guest in general, but have all these virt driver abstractions. I have to say I disagree on that. When you run perfmon on a system, it is normally to measure a specific application. You want to see accurate numbers for cache misses, mul instructions or whatever else is selected. You can still get those. You can even enable RDPMC access and avoid VM exits. What you _cannot_ do is to 'steal' the PMU and just give it to the guest. Emulating the PMU rather than using the real one, makes the numbers far less useful. The most useful way to provide PMU support in a guest is to expose the real PMU and let the guest OS program it. Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) We can do this in a reasonable way today, if we allow to take the PMU away from the host, and only let guests access it when it's in use. [...] You get my sure-fire NAK for that kind of crap though. Interfering with the host PMU and stealing it, is not a technical approach that has acceptable quality. You need to integrate it properly so that host PMU functionality still works fine. (Within hardware constraints) Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enhance perf to support KVM
On 02/26/2010 01:17 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: Do you have (or plan) any turn-key 'access to all files of the guest' kind of guest-transparent facility that could be used for such purposes? Not really. The guest and host admins are usually different people, who may, being admins, even actively hate each other. The guest admin would probably regard it as a security hole. It's probably useful for the single-host scenario, and of course for developers. Sounds like an exceedingly silly argument to me - the host admin is the king in any case. Your argument boils down to: 'dont offer transparent, turn-key solutions because some might object to the functionality they offer for all the wrong reasons'. Which does not withstand elementary scrutiny. Again, the host admin and the guest admin are different people. What would the host admin do with guest files? Why would the guest admin want to run any code that exposes their files? This is a basic usability issue, and affects many parts of the KVM universe. Really, it's by far the most fubar-ed notion of KVM. You are pushing _way_ too much to user-space into different modules and maintenance domains, and user-space forks those bits, fragments, diverts, delays and messes up basic features in the usual fashion. The result is a basic out-of-box virtualization experience that sucks even these days. Nobody is really 'in charge' of how KVM gets delivered to the user. You isolated the fun kernel part for you and pushed out the boring bits to user-space. So if mundane things like mouse integration sucks 'hey that's a user-space tooling problem', if file integration sucks then 'hey, that's an admin problem', if it cannot be used over the network 'hey, that's an Xorg problem', etc. etc. What would you have me do? Push 200K lines of device emulation code into the kernel? Write an X client, toolkit, and display in the kernel so that mouse integration works out of the box when you install Linux 2.6.653? As to nobody is in charge, that's really insulting to the people who are in charge of the userspace components. Perhaps the problems that we see are not the same problems that you see. It might be that direct access to guest files from the host is only a pressing problem for you, but nobody else. If there are features that you miss, post patches, if you will deign to code for lowly user space. You basically have given up control over the quality of KVM by pushing so many aspects of it to user-space and letting it rot there. That's wrong on so many levels. First, nothing is rotting in userspace, qemu is evolving faster than kvm is. If I pushed it into the kernel then development pace would be much slower (since kernel development is harder), quality would be lower (less infrastructure, any bug is a host crash or security issue), and I personally would be totally swamped. Sure the design looks somewhat cleaner on paper, but if the end result is not helped by it then over-modularization sure can hurt ... Run 'rpm -qa' one of these days. Modern software is modular, that's the only way to manage it. ( Note that i dont mind user-space tooling per se, as long as it sits together with the kernel bits and gets developed, packaged and given to the user in the same domain. ) Call me when glibc, the X servers and clients, and everything else qemu now uses is developed, packaged, and given to the user in the same domain. And that's a key conceptual area were tools/perf/ differs: it's an integrated, turn-key solution that you can really rely on. 
We take responsibility for the full thing, no ifs and when. And if you cannot rely on your instrumentation tooling as a single unit you cannot use it, simple as that. (that is a key mistake Oprofile made a decade ago too btw.) perf is a tool written by developers for developers. kvm is written for users (most of them hidden behind management interfaces). There's no point at all in shipping it as part of the kernel, users don't install and use kernels, they install and use distributions. So i can see some upcoming culture friction with standing KVM principles there ;-) No friction at all - I don't think any kvm developer agrees with you (but if anyone does please speak up). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enhance perf to support KVM
On Fri, 2010-02-26 at 12:47 +0200, Avi Kivity wrote: Not really. The guest and host admins are usually different people, who may, being admins, even actively hate each other. The guest admin would probably regard it as a security hole. It's probably useful for the single-host scenario, and of course for developers. LOL, let me be the malicious host admin, then you can be the guest, there is no way you can protect yourself. If you don't trust the host, don't use it. All your IO flows through the host, all your sekrit keys are in memory, there is no security. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 01:26 PM, Ingo Molnar wrote: By far the biggest instrumentation issue is: - availability - usability - flexibility Exposing the raw hw is a step backwards in many regards. In a way, virtualization as a whole is a step backwards. We take the nice filesystem/timer/network/scheduler APIs, and expose them as raw hardware. The pmu isn't any different. Uhm, it's obviously very different. A fake NE2000 will work on both Intel and AMD CPUs. Same for a fake PIT. PMU drivers are fundamentally per CPU vendor though. So there's no generic hardware to emulate. That's true, and it reduces the usability of the feature (you have to restrict your migration pools or not expose the pmu), but the general points still stand. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 01:42 PM, Ingo Molnar wrote: * Jes Sorensenjes.soren...@redhat.com wrote: On 02/26/10 11:44, Ingo Molnar wrote: Direct access to counters is not something that is a big issue. [ Given that i sometimes can see KVM redraw the screen of a guest OS real-time i doubt this is the biggest of performance challenges right now ;-) ] By far the biggest instrumentation issue is: - availability - usability - flexibility Exposing the raw hw is a step backwards in many regards. The same way we dont want to expose chipsets to the guest to allow them to do RAS. The same way we dont want to expose most raw PCI devices to guest in general, but have all these virt driver abstractions. I have to say I disagree on that. When you run perfmon on a system, it is normally to measure a specific application. You want to see accurate numbers for cache misses, mul instructions or whatever else is selected. You can still get those. You can even enable RDPMC access and avoid VM exits. What you _cannot_ do is to 'steal' the PMU and just give it to the guest. Agreed - if both the host and guest want the pmu, the host wins. This is what we do with debug registers - if both the host and guest contend for them, the host wins. Emulating the PMU rather than using the real one, makes the numbers far less useful. The most useful way to provide PMU support in a guest is to expose the real PMU and let the guest OS program it. Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. A native API to the host will lock out 100% of the install base now, and a large section of any future install base. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) No, stop and restart the counters on every exit/entry, so the guest doesn't observe any host data. We can do this in a reasonable way today, if we allow to take the PMU away from the host, and only let guests access it when it's in use. [...] You get my sure-fire NAK for that kind of crap though. Interfering with the host PMU and stealing it, is not a technical approach that has acceptable quality. It would be the other way round - the host would steal the pmu from the guest. Later we can try to time-slice and extrapolate, though that's not going to be easy. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
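[To make the stop/restart idea concrete, here is a rough sketch of parking and restoring guest counter state around VM-exit/VM-entry for two architectural general-purpose counters. It is illustrative only: the MSR names are the standard kernel ones (asm/perf_event.h, asm/msr-index.h), but the structure, the hooks and the two-counter assumption are hypothetical rather than actual KVM code.

#include <asm/msr.h>
#include <asm/msr-index.h>
#include <asm/perf_event.h>

/* Illustrative only: guest-visible state for two architectural GP counters. */
struct guest_pmu_state {
	u64 global_ctrl;	/* guest's IA32_PERF_GLOBAL_CTRL value */
	u64 eventsel[2];
	u64 counter[2];
};

/* Hypothetical VM-exit hook: stop counting, park the guest state. */
static void pmu_save_guest(struct guest_pmu_state *s)
{
	int i;

	/* Stop all counters first so nothing ticks while host code runs. */
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);

	for (i = 0; i < 2; i++) {
		rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, s->eventsel[i]);
		rdmsrl(MSR_ARCH_PERFMON_PERFCTR0 + i, s->counter[i]);
	}
}

/* Hypothetical VM-entry hook: restore and re-enable the guest state. */
static void pmu_load_guest(struct guest_pmu_state *s)
{
	int i;

	for (i = 0; i < 2; i++) {
		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, s->eventsel[i]);
		wrmsrl(MSR_ARCH_PERFMON_PERFCTR0 + i, s->counter[i]);
	}

	/* Re-enable only once all guest values are back in place. */
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, s->global_ctrl);
}

The host's own perf state would have to be reprogrammed symmetrically on each transition, which is exactly the per-exit cost being debated in this thread.]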
Re: Enhance perf to support KVM
On 02/26/2010 01:48 PM, Peter Zijlstra wrote: On Fri, 2010-02-26 at 12:47 +0200, Avi Kivity wrote: Not really. The guest and host admins are usually different people, who may, being admins, even actively hate each other. The guest admin would probably regard it as a security hole. It's probably useful for the single-host scenario, and of course for developers. LOL, let me be the malicious host admin, then you can be the guest, there is no way you can protect yourself. If you don't trust the host, don't use it. All your IO flows through the host, all your sekrit keys are in memory, there is no security. That's true. But guest admins are going to be unhappy about a file server serving their data to the host all the same. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Avi Kivity a...@redhat.com wrote: A native API to the host will lock out 100% of the install base now, and a large section of any future install base. ... which is why i suggested the soft-PMU approach. And note that _any_ solution we offer locks out 100% of the installed base right now, as no solution is in the kernel yet. The only question is what kind of upgrade effort is needed for users to make use of the feature. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 02:07 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: A native API to the host will lock out 100% of the install base now, and a large section of any future install base. ... which is why i suggested the soft-PMU approach. Not sure I understand it completely. Do you mean to take the model specific host pmu events, and expose them to the guest via trap'n'emulate? In that case we may as well assign the host pmu to the guest if the host isn't using it, and avoid the traps. Do you mean to choose some older pmu and emulate it using whatever pmu model the host has? I haven't checked, but aren't there mutually exclusive events in every model pair? The closest thing would be the architectural pmu thing. Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem. And note that _any_ solution we offer locks out 100% of the installed base right now, as no solution is in the kernel yet. The only question is what kind of upgrade effort is needed for users to make use of the feature. I meant the guest installed base. Hosts can be upgraded transparently to the guests (not even a shutdown/reboot). -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging
On 26.02.2010, at 13:25, Joerg Roedel wrote: On Fri, Feb 26, 2010 at 12:28:24PM +0200, Avi Kivity wrote: +static void add_msr_offset(u32 offset) +{ + u32 old; + int i; + +again: + for (i = 0; i MSRPM_OFFSETS; ++i) { + old = msrpm_offsets[i]; + + if (old == offset) + return; + + if (old != MSR_INVALID) + continue; + + if (cmpxchg(msrpm_offsets[i], old, offset) != old) + goto again; + + return; + } + + /* +* If this BUG triggers the msrpm_offsets table has an overflow. Just +* increase MSRPM_OFFSETS in this case. +*/ + BUG(); +} Why all this atomic cleverness? The possible offsets are all determined statically. Even if you do them dynamically (makes sense when considering pmu passthrough), it's per-vcpu and therefore single threaded (just move msrpm_offsets into vcpu context). The msr_offset table is the same for all guests. It doesn't make sense to keep it per vcpu because it will currently look the same for all vcpus. For standard guests this array contains 3 entrys. It is marked with __read_mostly for the same reason. I'm still not convinced on this way of doing things. If it's static, make it static. If it's dynamic, make it dynamic. Dynamically generating a static list just sounds plain wrong to me. Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/8] use eventfd for iothread
When this was merged in qemu-kvm/master (commit 6249f61a891b6b003531ca4e459c3a553faa82bc) it removed Avi's compile fix when !CONFIG_EVENTFD (db311e8619d310bd7729637b702581d3d8565049). So current master fails to build: CCosdep.o cc1: warnings being treated as errors osdep.c: In function 'qemu_eventfd': osdep.c:296: error: unused variable 'ret' make: *** [osdep.o] Error 1 On 22 févr. 2010, at 22:26, Marcelo Tosatti wrote: From: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Avi Kivity a...@redhat.com --- osdep.c | 32 qemu-common.h |1 + vl.c |9 + 3 files changed, 38 insertions(+), 4 deletions(-) diff --git a/osdep.c b/osdep.c index 9059f01..9e4b17b 100644 --- a/osdep.c +++ b/osdep.c @@ -37,6 +37,10 @@ #include sys/statvfs.h #endif +#ifdef CONFIG_EVENTFD +#include sys/eventfd.h +#endif + #ifdef _WIN32 #include windows.h #elif defined(CONFIG_BSD) @@ -281,6 +285,34 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count) #ifndef _WIN32 /* + * Creates an eventfd that looks like a pipe and has EFD_CLOEXEC set. + */ +int qemu_eventfd(int fds[2]) +{ +int ret; + +#ifdef CONFIG_EVENTFD +ret = eventfd(0, 0); +if (ret = 0) { +fds[0] = ret; +qemu_set_cloexec(ret); +if ((fds[1] = dup(ret)) == -1) { +close(ret); +return -1; +} +qemu_set_cloexec(fds[1]); +return 0; +} + +if (errno != ENOSYS) { +return -1; +} +#endif + +return qemu_pipe(fds); +} + +/* * Creates a pipe with FD_CLOEXEC set on both file descriptors */ int qemu_pipe(int pipefd[2]) diff --git a/qemu-common.h b/qemu-common.h index b09f717..c941006 100644 --- a/qemu-common.h +++ b/qemu-common.h @@ -170,6 +170,7 @@ ssize_t qemu_write_full(int fd, const void *buf, size_t count) void qemu_set_cloexec(int fd); #ifndef _WIN32 +int qemu_eventfd(int pipefd[2]); int qemu_pipe(int pipefd[2]); #endif diff --git a/vl.c b/vl.c index 98918ac..1957018 100644 --- a/vl.c +++ b/vl.c @@ -3211,14 +3211,15 @@ static int io_thread_fd = -1; static void qemu_event_increment(void) { -static const char byte = 0; +/* Write 8 bytes to be compatible with eventfd. */ +static uint64_t val = 1; ssize_t ret; if (io_thread_fd == -1) return; do { -ret = write(io_thread_fd, byte, sizeof(byte)); +ret = write(io_thread_fd, val, sizeof(val)); } while (ret 0 errno == EINTR); /* EAGAIN is fine, a read must be pending. */ @@ -3235,7 +3236,7 @@ static void qemu_event_read(void *opaque) ssize_t len; char buffer[512]; -/* Drain the notify pipe */ +/* Drain the notify pipe. For eventfd, only 8 bytes will be read. */ do { len = read(fd, buffer, sizeof(buffer)); } while ((len == -1 errno == EINTR) || len == sizeof(buffer)); @@ -3246,7 +3247,7 @@ static int qemu_event_init(void) int err; int fds[2]; -err = qemu_pipe(fds); +err = qemu_eventfd(fds); if (err == -1) return -errno; -- 1.6.6 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Pierre Riteau -- PhD student, Myriads team, IRISA, Rennes, France http://perso.univ-rennes1.fr/pierre.riteau/ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
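[For what it's worth, one minimal way to avoid the warning (an untested sketch, not necessarily the fix that was actually applied) is to scope 'ret' inside the CONFIG_EVENTFD branch, so the !CONFIG_EVENTFD build declares no unused variable. qemu_set_cloexec() and qemu_pipe() are the existing helpers already used in osdep.c:

#include <errno.h>
#include <unistd.h>
#ifdef CONFIG_EVENTFD
#include <sys/eventfd.h>
#endif
#include "qemu-common.h"

/*
 * Creates an eventfd that looks like a pipe and has EFD_CLOEXEC set;
 * falls back to a pipe when eventfd is unavailable.
 */
int qemu_eventfd(int fds[2])
{
#ifdef CONFIG_EVENTFD
    int ret = eventfd(0, 0);

    if (ret >= 0) {
        fds[0] = ret;
        qemu_set_cloexec(ret);
        if ((fds[1] = dup(ret)) == -1) {
            close(ret);
            return -1;
        }
        qemu_set_cloexec(fds[1]);
        return 0;
    }
    if (errno != ENOSYS) {
        return -1;
    }
#endif

    return qemu_pipe(fds);
}
]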
Re: KVM PMU virtualization
* Avi Kivity a...@redhat.com wrote: On 02/26/2010 02:07 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: A native API to the host will lock out 100% of the install base now, and a large section of any future install base. ... which is why i suggested the soft-PMU approach. Not sure I understand it completely. Do you mean to take the model specific host pmu events, and expose them to the guest via trap'n'emulate? In that case we may as well assign the host pmu to the guest if the host isn't using it, and avoid the traps. You are making the incorrect assumption that the emulated PMU uses up all host PMU resources ... Do you mean to choose some older pmu and emulate it using whatever pmu model the host has? I haven't checked, but aren't there mutually exclusive events in every model pair? The closest thing would be the architectural pmu thing. Yes, something like Core2 with 2 generic events. That would leave 2 extra generic events on Nehalem and better. (which is really the target CPU type for any new feature we are talking about right now. Plus performance analysis tends to skew towards more modern CPU types as well.) Plus the emulation can be smart about it and only use up a given number. Most guest OSs dont use the full PMU - they use a single counter. Ideally for Linux-Linux there would be a PMU paravirt driver that allocates events on an as-needed basis. Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem. And note that _any_ solution we offer locks out 100% of the installed base right now, as no solution is in the kernel yet. The only question is what kind of upgrade effort is needed for users to make use of the feature. I meant the guest installed base. Hosts can be upgraded transparently to the guests (not even a shutdown/reboot). The irony: this time guest-transparent solutions that need no configuration are good? ;-) The very same argument holds for the file server thing: a guest transparent solution is easier wrt. the upgrade path. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
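[As a rough illustration of how such a soft PMU could feed off host perf events: a trapped guest write to an IA32_PERFEVTSELx MSR can be translated into a perf_event_attr and handed to the host perf core. The helper below is hypothetical; only perf_event_attr, PERF_TYPE_RAW and the ARCH_PERFMON_EVENTSEL_* bits are real kernel interfaces, and the backing event would be created with perf_event_create_kernel_counter(), whose exact signature varies across kernel versions.

#include <linux/string.h>
#include <linux/perf_event.h>
#include <asm/perf_event.h>

/*
 * Hypothetical helper: map a guest IA32_PERFEVTSELx value onto a
 * perf_event_attr that the host perf core can schedule like any other event.
 */
static void guest_eventsel_to_attr(u64 eventsel, struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));

	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_RAW;

	/* event code and unit mask pass through as a raw config */
	attr->config = eventsel & 0xffffULL;

	/* honour the guest's USR/OS filtering bits */
	attr->exclude_user   = !(eventsel & ARCH_PERFMON_EVENTSEL_USR);
	attr->exclude_kernel = !(eventsel & ARCH_PERFMON_EVENTSEL_OS);

	/* a counting (not sampling) event; overflow injection handled elsewhere */
	attr->pinned = 1;
}

Because the host perf core schedules the resulting event like any other, the guest's counters can be multiplexed against host users instead of requiring exclusive ownership of the hardware.]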
Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging
On Fri, Feb 26, 2010 at 12:28:24PM +0200, Avi Kivity wrote: +static void add_msr_offset(u32 offset) +{ +u32 old; +int i; + +again: +for (i = 0; i MSRPM_OFFSETS; ++i) { +old = msrpm_offsets[i]; + +if (old == offset) +return; + +if (old != MSR_INVALID) +continue; + +if (cmpxchg(msrpm_offsets[i], old, offset) != old) +goto again; + +return; +} + +/* + * If this BUG triggers the msrpm_offsets table has an overflow. Just + * increase MSRPM_OFFSETS in this case. + */ +BUG(); +} Why all this atomic cleverness? The possible offsets are all determined statically. Even if you do them dynamically (makes sense when considering pmu passthrough), it's per-vcpu and therefore single threaded (just move msrpm_offsets into vcpu context). The msr_offset table is the same for all guests. It doesn't make sense to keep it per vcpu because it will currently look the same for all vcpus. For standard guests this array contains 3 entrys. It is marked with __read_mostly for the same reason. @@ -1846,20 +1882,33 @@ static int nested_svm_vmexit(struct vcpu_svm *svm) static bool nested_svm_vmrun_msrpm(struct vcpu_svm *svm) { -u32 *nested_msrpm; -struct page *page; +/* + * This function merges the msr permission bitmaps of kvm and the + * nested vmcb. It is omptimized in that it only merges the parts where + * the kvm msr permission bitmap may contain zero bits + */ A comment that describes the entire function can be moved above the function, freeing a whole tab stop for contents. Ok, will move it out of the function. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enhance perf to support KVM
* Avi Kivity a...@redhat.com wrote: You basically have given up control over the quality of KVM by pushing so many aspects of it to user-space and letting it rot there. That's wrong on so many levels. First, nothing is rotting in userspace, qemu is evolving faster than kvm is. If I pushed it into the kernel then development pace would be much slower (since kernel development is harder), quality would be lower (less infrastructure, any bug is a host crash or security issue), and I personally would be totally swamped. That was not what i suggested tho. tools/kvm/ would work plenty fine. As i said: [...] You are pushing _way_ too much to user-space into different modules and maintenance domains, [...] ( Note that i dont mind user-space tooling per se, as long as it sits together with the kernel bits and gets developed, packaged and given to the user in the same domain. ) [...] Sure the design looks somewhat cleaner on paper, but if the end result is not helped by it then over-modularization sure can hurt ... Run 'rpm -qa' one of these days. Modern software is modular, that's the only way to manage it. Of course rpm -qa shows cases where modularization works. But my point was over-modularization, which due to the KVM/qemu split we all suffer from. Modularizing along the wrong interface is worse than not modularizing something that could be. So when designing software you generally want to err on the side of _under_-modularizing. It's always very easy to split stuff up, when there's a really strong technical argument for it. It's very hard to pull the broken pieces back together though once they are in difference domains of maintanence - as then it's usually social integration that has to happen, which is always harder than a technical split-up. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 12:42, Ingo Molnar wrote: * Jes Sorensenjes.soren...@redhat.com wrote: I have to say I disagree on that. When you run perfmon on a system, it is normally to measure a specific application. You want to see accurate numbers for cache misses, mul instructions or whatever else is selected. You can still get those. You can even enable RDPMC access and avoid VM exits. What you _cannot_ do is to 'steal' the PMU and just give it to the guest. Well you cannot steal the PMU without collaborating with perf_event.c, but thats quite feasible. Sharing the PMU between the guest and the host is very costly and guarantees incorrect results in the host. Unless you completely emulate the PMU by faking it and then allocating PMU counters one by one at the host level. However that means trapping a lot of MSR access. Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest. We can do this in a reasonable way today, if we allow to take the PMU away from the host, and only let guests access it when it's in use. [...] You get my sure-fire NAK for that kind of crap though. Interfering with the host PMU and stealing it, is not a technical approach that has acceptable quality. Having an allocation scheme and sharing it with the host, is a perfectly legitimate and very clean way to do it. Once it's given to the guest, the host knows not to touch it until it's been released again. You need to integrate it properly so that host PMU functionality still works fine. (Within hardware constraints) Well with the hardware currently available, there is no such thing as clean sharing between the host and the guest. It cannot be done without messing up the host measurements, which effectively renders measuring at the host side useless while a guest is allowed access to the PMU. Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enhance perf to support KVM
On 02/26/2010 02:46 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: You basically have given up control over the quality of KVM by pushing so many aspects of it to user-space and letting it rot there. That's wrong on so many levels. First, nothing is rotting in userspace, qemu is evolving faster than kvm is. If I pushed it into the kernel then development pace would be much slower (since kernel development is harder), quality would be lower (less infrastructure, any bug is a host crash or security issue), and I personally would be totally swamped. That was not what i suggested tho. tools/kvm/ would work plenty fine. I'll wait until we have tools/libc and tools/X. After all, they affect a lot more people and are concerned with a lot more kernel/user interfaces than kvm. As i said: [...] You are pushing _way_ too much to user-space into different modules and maintenance domains, [...] ( Note that i dont mind user-space tooling per se, as long as it sits together with the kernel bits and gets developed, packaged and given to the user in the same domain. ) [...] Sure the design looks somewhat cleaner on paper, but if the end result is not helped by it then over-modularization sure can hurt ... Run 'rpm -qa' one of these days. Modern software is modular, that's the only way to manage it. Of course rpm -qa shows cases where modularization works. But my point was over-modularization, which due to the KVM/qemu split we all suffer from. You're the only one who suffers from it. Everyone else is happy with adding features in the modules that implements them, be it kvm, qemu, libvirt, or virt-manager (to name one tool stack out of several). Modularizing along the wrong interface is worse than not modularizing something that could be. So when designing software you generally want to err on the side of _under_-modularizing. It's always very easy to split stuff up, when there's a really strong technical argument for it. It's very hard to pull the broken pieces back together though once they are in difference domains of maintanence - as then it's usually social integration that has to happen, which is always harder than a technical split-up. As it happens, the kvm and qemu development community has a large overlap. Many developers read both lists, contribute to both projects, and participate on the same weekly call. While we had difficulties pushing patches to qemu in the past, that's behind us, and qemu is now accepting patches at a much higher rate than kvm. Technically, it is obvious that the userspace and kernel components are separate projects. All that remains is the social divide. Since everyone (except you) is mostly happy, I see no reason to change. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 13:20, Avi Kivity wrote: On 02/26/2010 02:07 PM, Ingo Molnar wrote: ... which is why i suggested the soft-PMU approach. Not sure I understand it completely. Do you mean to take the model specific host pmu events, and expose them to the guest via trap'n'emulate? In that case we may as well assign the host pmu to the guest if the host isn't using it, and avoid the traps. Do you mean to choose some older pmu and emulate it using whatever pmu model the host has? I haven't checked, but aren't there mutually exclusive events in every model pair? The closest thing would be the architectural pmu thing. You cannot do this, as you say there is no guarantee that there are no overlaps, and the current host may have different counter sizes too, which makes emulating it even more costly. The cpuid bits basically tell you which version of the counters are available, how many counters are there, word size of the counters and I believe there are bits also stating which optional features are available to be counted. Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem. Cannot be done in a reasonable manner due to the above. The key to all of this is that guest OSes, including that other OS, should be able to use the performance counters without needing special paravirt drivers or other OS modifications. If we start requiring that kind of stuff, the whole point of having the feature goes down the toilet. Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging
On Fri, Feb 26, 2010 at 01:28:29PM +0100, Alexander Graf wrote: On 26.02.2010, at 13:25, Joerg Roedel wrote: On Fri, Feb 26, 2010 at 12:28:24PM +0200, Avi Kivity wrote: +static void add_msr_offset(u32 offset) +{ + u32 old; + int i; + +again: + for (i = 0; i MSRPM_OFFSETS; ++i) { + old = msrpm_offsets[i]; + + if (old == offset) + return; + + if (old != MSR_INVALID) + continue; + + if (cmpxchg(msrpm_offsets[i], old, offset) != old) + goto again; + + return; + } + + /* + * If this BUG triggers the msrpm_offsets table has an overflow. Just + * increase MSRPM_OFFSETS in this case. + */ + BUG(); +} Why all this atomic cleverness? The possible offsets are all determined statically. Even if you do them dynamically (makes sense when considering pmu passthrough), it's per-vcpu and therefore single threaded (just move msrpm_offsets into vcpu context). The msr_offset table is the same for all guests. It doesn't make sense to keep it per vcpu because it will currently look the same for all vcpus. For standard guests this array contains 3 entrys. It is marked with __read_mostly for the same reason. I'm still not convinced on this way of doing things. If it's static, make it static. If it's dynamic, make it dynamic. Dynamically generating a static list just sounds plain wrong to me. Stop. I had a static list in the first version of the patch. This list was fine except the fact that a developer needs to remember to update this list if the list of non-intercepted msrs is expanded. The whole reason for a dynamically built list is to take the task of maintaining the list away from the developer and remove a possible source of hard to find bugs. This is what the current approach does. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 02:38 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: On 02/26/2010 02:07 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: A native API to the host will lock out 100% of the install base now, and a large section of any future install base. ... which is why i suggested the soft-PMU approach. Not sure I understand it completely. Do you mean to take the model specific host pmu events, and expose them to the guest via trap'n'emulate? In that case we may as well assign the host pmu to the guest if the host isn't using it, and avoid the traps. You are making the incorrect assumption that the emulated PMU uses up all host PMU resources ... Well, in the general case, it may? If it doesn't, the host may use them. We do a similar thing with debug breakpoints. Sharing the pmu will mean trapping control msr writes at least, though. Do you mean to choose some older pmu and emulate it using whatever pmu model the host has? I haven't checked, but aren't there mutually exclusive events in every model pair? The closest thing would be the architectural pmu thing. Yes, something like Core2 with 2 generic events. That would leave 2 extra generic events on Nehalem and better. (which is really the target CPU type for any new feature we are talking about right now. Plus performance analysis tends to skew towards more modern CPU types as well.) Can you emulate the Core 2 pmu on, say, a P4? Those P4s have very different instruction caches so I imagine the events are very different as well. Agree about favouring modern processors. Plus the emulation can be smart about it and only use up a given number. Most guest OSs dont use the full PMU - they use a single counter. But you have to expose all of the counters, no? Unless you go with a kvm-specific pmu as described below. Ideally for Linux-Linux there would be a PMU paravirt driver that allocates events on an as-needed basis. Or we could watch the control register and see how the guest programs it, provided it doesn't do that a lot. Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem. And note that _any_ solution we offer locks out 100% of the installed base right now, as no solution is in the kernel yet. The only question is what kind of upgrade effort is needed for users to make use of the feature. I meant the guest installed base. Hosts can be upgraded transparently to the guests (not even a shutdown/reboot). The irony: this time guest-transparent solutions that need no configuration are good? ;-) The very same argument holds for the file server thing: a guest transparent solution is easier wrt. the upgrade path. If we add pmu support, guests can begin to use if immediately. If we add the file server support, guests need to install drivers before they can use it, while guest admins have no motivation to do so (it helps the host, not the guest). Is something wrong with just using sshfs? Seems a lot less hassle to me. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Jes Sorensen jes.soren...@redhat.com wrote: On 02/26/10 12:42, Ingo Molnar wrote: * Jes Sorensenjes.soren...@redhat.com wrote: I have to say I disagree on that. When you run perfmon on a system, it is normally to measure a specific application. You want to see accurate numbers for cache misses, mul instructions or whatever else is selected. You can still get those. You can even enable RDPMC access and avoid VM exits. What you _cannot_ do is to 'steal' the PMU and just give it to the guest. Well you cannot steal the PMU without collaborating with perf_event.c, but thats quite feasible. Sharing the PMU between the guest and the host is very costly and guarantees incorrect results in the host. Unless you completely emulate the PMU by faking it and then allocating PMU counters one by one at the host level. However that means trapping a lot of MSR access. It's not that many MSR accesses. Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest. That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default. We can do this in a reasonable way today, if we allow to take the PMU away from the host, and only let guests access it when it's in use. [...] You get my sure-fire NAK for that kind of crap though. Interfering with the host PMU and stealing it, is not a technical approach that has acceptable quality. Having an allocation scheme and sharing it with the host, is a perfectly legitimate and very clean way to do it. Once it's given to the guest, the host knows not to touch it until it's been released again. 'Full PMU' is not the granularity i find acceptable though: please do what i suggested, event granularity allocation and scheduling. We are rehashing the whole 'perfmon versus perf events/counters' design arguments again here really. You need to integrate it properly so that host PMU functionality still works fine. (Within hardware constraints) Well with the hardware currently available, there is no such thing as clean sharing between the host and the guest. It cannot be done without messing up the host measurements, which effectively renders measuring at the host side useless while a guest is allowed access to the PMU. That's precisely my point: the guest should obviously not get raw access to the PMU. (except where it might matter to performance, such as RDPMC) Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging
On 26.02.2010, at 14:04, Joerg Roedel wrote: On Fri, Feb 26, 2010 at 01:28:29PM +0100, Alexander Graf wrote: On 26.02.2010, at 13:25, Joerg Roedel wrote: On Fri, Feb 26, 2010 at 12:28:24PM +0200, Avi Kivity wrote: +static void add_msr_offset(u32 offset) +{ + u32 old; + int i; + +again: + for (i = 0; i MSRPM_OFFSETS; ++i) { + old = msrpm_offsets[i]; + + if (old == offset) + return; + + if (old != MSR_INVALID) + continue; + + if (cmpxchg(msrpm_offsets[i], old, offset) != old) + goto again; + + return; + } + + /* + * If this BUG triggers the msrpm_offsets table has an overflow. Just + * increase MSRPM_OFFSETS in this case. + */ + BUG(); +} Why all this atomic cleverness? The possible offsets are all determined statically. Even if you do them dynamically (makes sense when considering pmu passthrough), it's per-vcpu and therefore single threaded (just move msrpm_offsets into vcpu context). The msr_offset table is the same for all guests. It doesn't make sense to keep it per vcpu because it will currently look the same for all vcpus. For standard guests this array contains 3 entrys. It is marked with __read_mostly for the same reason. I'm still not convinced on this way of doing things. If it's static, make it static. If it's dynamic, make it dynamic. Dynamically generating a static list just sounds plain wrong to me. Stop. I had a static list in the first version of the patch. This list was fine except the fact that a developer needs to remember to update this list if the list of non-intercepted msrs is expanded. The whole reason for a dynamically built list is to take the task of maintaining the list away from the developer and remove a possible source of hard to find bugs. This is what the current approach does. I was more thinking of replacing the function calls with a list of MSRs. You can then take that list on module init, generate the MSR bitmap once and be good. Later you can use the same list for the nested bitmap. Alex-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging
On 02/26/2010 03:04 PM, Joerg Roedel wrote: I'm still not convinced on this way of doing things. If it's static, make it static. If it's dynamic, make it dynamic. Dynamically generating a static list just sounds plain wrong to me. Stop. I had a static list in the first version of the patch. This list was fine except the fact that a developer needs to remember to update this list if the list of non-intercepted msrs is expanded. The whole reason for a dynamically built list is to take the task of maintaining the list away from the developer and remove a possible source of hard to find bugs. This is what the current approach does. The problem was the two lists. If you had a static struct svm_direct_access_msrs = { u32 index; bool longmode_only; } direct_access_msrs = { ... }; You could generate static unsigned *msrpm_offsets_longmode, *msrpm_offsets_legacy; as well as the original bitmaps at module init, no? -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging
On Fri, Feb 26, 2010 at 02:26:32PM +0100, Alexander Graf wrote: On 26.02.2010, at 14:21, Joerg Roedel wrote: On Fri, Feb 26, 2010 at 03:10:13PM +0200, Avi Kivity wrote: On 02/26/2010 03:04 PM, Joerg Roedel wrote: I'm still not convinced on this way of doing things. If it's static, make it static. If it's dynamic, make it dynamic. Dynamically generating a static list just sounds plain wrong to me. Stop. I had a static list in the first version of the patch. This list was fine except the fact that a developer needs to remember to update this list if the list of non-intercepted msrs is expanded. The whole reason for a dynamically built list is to take the task of maintaining the list away from the developer and remove a possible source of hard to find bugs. This is what the current approach does. The problem was the two lists. If you had a static struct svm_direct_access_msrs = { u32 index; bool longmode_only; } direct_access_msrs = { ... }; You could generate static unsigned *msrpm_offsets_longmode, *msrpm_offsets_legacy; as well as the original bitmaps at module init, no? True for the msrs the guest always has access to. But for the lbr-msrs the intercept bits may change at runtime. So an additional flag is required to indicate if the bits should be cleared initially. So the msrpm bitmap changes dynamically for each vcpu? Great, make it fully dynamic then, changing the vcpu->arch.msrpm only from within its vcpu context. No need for atomic ops. The msrpm_offsets table is global. But I think I will follow Avi's suggestion and create a static direct_access_msrs list and generate the msrpm_offsets at module_init. This solves the problem of two independent lists too. Joerg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
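[A hedged sketch of the direction agreed on here: one static table from which the offset list (and the initial permission bitmap) is generated once at module init. The field names, the sample entries and the svm_msrpm_offset() helper (assumed here to map an MSR number to its byte offset in the permission bitmap) are illustrative, not the final patch:

/* Illustrative sketch only -- not the actual patch series. */
static const struct svm_direct_access_msr {
	u32  index;          /* MSR number */
	bool longmode_only;  /* only direct-access for 64-bit guests */
	bool always;         /* false: intercept toggled at runtime (LBR MSRs) */
} direct_access_msrs[] = {
	{ .index = MSR_STAR,             .longmode_only = false, .always = true },
	{ .index = MSR_IA32_SYSENTER_CS, .longmode_only = false, .always = true },
	/* ... entries for the LBR MSRs would set .always = false ... */
};

static u32 msrpm_offsets[MSRPM_OFFSETS] __read_mostly;

static void __init init_msrpm_offsets(void)
{
	int i;

	/* assuming MSR_INVALID is the all-ones pattern */
	memset(msrpm_offsets, 0xff, sizeof(msrpm_offsets));

	for (i = 0; i < ARRAY_SIZE(direct_access_msrs); i++) {
		u32 offset = svm_msrpm_offset(direct_access_msrs[i].index);
		int j;

		/* same logic as add_msr_offset(), but single-threaded at init,
		 * so no cmpxchg is needed */
		for (j = 0; j < MSRPM_OFFSETS; j++) {
			if (msrpm_offsets[j] == offset)
				break;			/* already recorded */
			if (msrpm_offsets[j] == MSR_INVALID) {
				msrpm_offsets[j] = offset;
				break;
			}
		}
	}
}

The per-vcpu msrpm and the nested merge path would then both consume the same table, which addresses the two-lists concern.]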
Re: KVM PMU virtualization
On 02/26/2010 03:06 PM, Ingo Molnar wrote: Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest. That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default. Even if there were no security considerations, if the guest can observe host data in the pmu, it means the pmu is inaccurate. We should expose guest data only in the guest pmu. That's not difficult to do, you stop the pmu on exit and swap the counters on context switches. Having an allocation scheme and sharing it with the host, is a perfectly legitimate and very clean way to do it. Once it's given to the guest, the host knows not to touch it until it's been released again. 'Full PMU' is not the granularity i find acceptable though: please do what i suggested, event granularity allocation and scheduling. We are rehashing the whole 'perfmon versus perf events/counters' design arguments again here really. Scheduling at event granularity would be a good thing. However we need to be able to handle the guest using the full pmu. Note that scheduling is only needed if both the guest and host want the pmu at the same time - and that should be a rare case and not the one to optimize for. You need to integrate it properly so that host PMU functionality still works fine. (Within hardware constraints) Well with the hardware currently available, there is no such thing as clean sharing between the host and the guest. It cannot be done without messing up the host measurements, which effectively renders measuring at the host side useless while a guest is allowed access to the PMU. That's precisely my point: the guest should obviously not get raw access to the PMU. (except where it might matter to performance, such as RDPMC) That's doable if all counters are steerable. IIRC some counters are fixed function, but I'm not certain about that. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 14:06, Ingo Molnar wrote: * Jes Sorensenjes.soren...@redhat.com wrote: Well you cannot steal the PMU without collaborating with perf_event.c, but thats quite feasible. Sharing the PMU between the guest and the host is very costly and guarantees incorrect results in the host. Unless you completely emulate the PMU by faking it and then allocating PMU counters one by one at the host level. However that means trapping a lot of MSR access. It's not that many MSR accesses. Well it's more than enough to double the number of MSRs KVM has to track on switches. There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest. That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default. If a PMU can really count stuff on another CPU, then we shouldn't allow PMU access to any application at all. It's more than just a KVM guest vs a KVM guest issue then, but also a thread to thread issue. My idea was obviously not to expose host timings to a guest. Save the counters when a guest exits, and reload them when it's restarted. Not just when switching to another task, but also when entering KVM, to avoid the guest seeing overhead spent within KVM. Having an allocation scheme and sharing it with the host, is a perfectly legitimate and very clean way to do it. Once it's given to the guest, the host knows not to touch it until it's been released again. 'Full PMU' is not the granularity i find acceptable though: please do what i suggested, event granularity allocation and scheduling. As I wrote earlier, at that level we have to do it all emulated. In this case, providing any of this to a guest seems to be a waste of time since the interface will cost way too much in trapping back and forth and you have contention with the very limited resources in the PMU with just 5 counters to pick from on Core2. The guest PMU will think it's running on top of real hardware, and scaling/estimating numbers like the perf_event.c code does today, except that it will be using already scaled and estimated numbers for it's calculations. Application users will have little use for this. Well with the hardware currently available, there is no such thing as clean sharing between the host and the guest. It cannot be done without messing up the host measurements, which effectively renders measuring at the host side useless while a guest is allowed access to the PMU. That's precisely my point: the guest should obviously not get raw access to the PMU. (except where it might matter to performance, such as RDPMC) Well either you allow access to the PMU or you don't. If you allow direct access to the PMU counters, but not the control registers, you have to specify the counter sizes to match that of the host, making it impossible to really emulate core2 on a non core2 architecture etc. Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Avi Kivity a...@redhat.com wrote: Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem. You are missing two big things wrt. compatibility here: 1) The first upgrade overhead a one time overhead only. 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Dont you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support. With the 'steal the PMU' messy approach the guest OS has to be upgraded to the new CPU type all the time. Ad infinitum. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 03:27 PM, Ingo Molnar wrote: For Linux-Linux the sanest, tier-1 approach would be to map sys_perf_open() on the guest side over to the host, transparently, via a paravirt driver. Let us for the purpose of this discussion assume that we are also interested in supporting Windows and older Linux. Paravirt optimizations can be added after we have the basic functionality, if they prove necessary. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 14:30, Avi Kivity wrote: On 02/26/2010 03:06 PM, Ingo Molnar wrote: That's precisely my point: the guest should obviously not get raw access to the PMU. (except where it might matter to performance, such as RDPMC) That's doable if all counters are steerable. IIRC some counters are fixed function, but I'm not certain about that. I am not an expert, but from what I learned from Peter, there are constraints on some of the counters. Ie. certain types of events can only be counted on certain counters, which limits the already very limited number of counters even further. Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote: It would be the other way round - the host would steal the pmu from the guest. Later we can try to time-slice and extrapolate, though that's not going to be easy. Right, so perf already does the time slicing and interpolating thing, so a soft-pmu gets that for free. Anyway, this discussion seems somewhat in a stale-mate position. The KVM folks basically demand a full PMU MSR shadow with PMI passthrough so that their $legacy shit works without modification. My question with that is how $legacy muck can ever know how the current PMU works, you can't even properly emulate a core2 pmu on a nehalem because intel keeps messing with the event codes for every new model. So basically for this to work means the guest can't run legacy stuff anyway, but needs to run very up-to-date software, so we might as well create a soft-pmu/paravirt interface now and have all up-to-date software support that for the next generation. Furthermore, when KVM doesn't virtualize the physical system topology, some PMU features cannot even be sanely used from a vcpu. So while currently a root user can already tie up all of the pmu using perf, simply using that to hand the full pmu off to the guest still leaves lots of issues. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 14:18, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: Can you emulate the Core 2 pmu on, say, a P4? [...] How about the Pentium? Or the i486? As long as there's perf events support, the CPU can be supported in a soft PMU. You can even cross-map exotic hw events if need to be - but most of the tooling (in just about any OS) uses just a handful of core events ... This is only possible if all future CPU perfmon events are guaranteed to be a superset of previous versions. Otherwise you end up emulating events and providing randomly generated numbers back. The perfmon revision and size we present to a guest has to match the current host. Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 14:31, Ingo Molnar wrote: You are missing two big things wrt. compatibility here: 1) The first upgrade overhead a one time overhead only. 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Dont you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support. That would only work if you are guaranteed to be able to emulate old hardware on new hardware. Not going to be feasible, so then we are in a real mess. With the 'steal the PMU' messy approach the guest OS has to be upgraded to the new CPU type all the time. Ad infinitum. The way the Perfmon architecture is specified by Intel, that is what we are stuck with. It's not going to be possible via software emulation to count cache misses, unless you run it in a micro architecture emulator. Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 03:31 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem. You are missing two big things wrt. compatibility here: 1) The first upgrade overhead a one time overhead only. May be one too many, for certain guests. Of course it may be argued that if the guest wants performance monitoring that much, they will upgrade. Certainly guests that we don't port won't be able to use this. I doubt we'll be able to make Windows work with this - the only performance tool I'm familiar with on Windows is Intel's VTune, and that's proprietary. 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Dont you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support. That also works for the architectural pmu, of course that's Intel only. And there you don't need to upgrade the guest even once. The arch pmu seems nicely done - there's a bit for every counter that can be enabled and disabled at will, and the number of counters is also determined from cpuid. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
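As background on the architectural pmu Avi mentions: it is enumerated through CPUID leaf 0xA, which reports a version id, the number and width of general-purpose counters, a bit vector of architectural events that are not available, and the number of fixed-function counters. A small stand-alone sketch (not from any patch in this thread) that dumps those fields:

#include <stdio.h>
#include <cpuid.h>              /* GCC's __get_cpuid() */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0x0a, &eax, &ebx, &ecx, &edx) || (eax & 0xff) == 0) {
        printf("no architectural PMU reported\n");
        return 1;
    }
    printf("arch PMU version id      : %u\n", eax & 0xff);
    printf("GP counters per thread   : %u\n", (eax >> 8) & 0xff);
    printf("GP counter width (bits)  : %u\n", (eax >> 16) & 0xff);
    printf("unavailable-event bitmap : 0x%02x\n", ebx & 0xff);
    printf("fixed-function counters  : %u\n", edx & 0x1f);
    return 0;
}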
Re: KVM PMU virtualization
* Avi Kivity a...@redhat.com wrote: On 02/26/2010 03:06 PM, Ingo Molnar wrote: Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest. That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default. Even if there were no security considerations, if the guest can observe host data in the pmu, it means the pmu is inaccurate. We should expose guest data only in the guest pmu. That's not difficult to do, you stop the pmu on exit and swap the counters on context switches. Again you are making an incorrect assumption: that information leakage via the PMU only occurs while the host is running on that CPU. It does not - the PMU can leak general system details _while the guest is running_. So for this and for the many other reasons we dont want to give a raw PMU to guests: - A paravirt event driver is more compatible and more transparent in the long run: it allows hardware upgrade and upgraded PMU functionality (for Linux) without having to upgrade the guest OS. Via that a guest OS could even be live-migrated to a different PMU, without noticing anything about it. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state cannot be live-migrated. (save/restore doesnt help) - It's far cleaner on the host side as well: more granular, per event usage is possible. The guest can use portion of the PMU (managed by the host), and the host can use a portion too. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS precludes the host OS from running some different piece of instrumentation at the same time. - It's more secure: the host can have a finegrained policy about what kinds of events it exposes to the guest. It might chose to only expose software events for example. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is an all-or-nothing policy affair: either you fully allow the guest (and live with whatever consequences the piece of hardware that takes up a fair chunk on the CPU die causes), or you allow none of it. - A proper paravirt event driver gives more features as well: it can exposes host software events and tracepoints, probes - not restricting itself to the 'hardware PMU' abstraction. - There's proper event scheduling and event allocation. Time-slicing, etc. The thing is, we made quite similar arguments in the past, during the perfmon vs. perfcounters discussions. There's really a big advantage to proper abstractions, both on the host and on the guest side. 
Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
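To make the "stop the pmu on exit and swap the counters on context switches" idea quoted above concrete, here is a purely illustrative host-side sketch - not KVM code - of saving and disabling the guest's generic counter state on VM-exit and restoring it before VM-entry. It assumes four generic counters and uses the architectural control MSR names from msr-index.h:

#include <linux/types.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

#define NR_GP_COUNTERS 4		/* assumption for this sketch */

struct vpmu_regs {
	u64 global_ctrl;
	u64 eventsel[NR_GP_COUNTERS];
	u64 counter[NR_GP_COUNTERS];
};

/* On VM-exit: remember the guest's programming and stop the counters. */
static void vpmu_save_and_disable(struct vpmu_regs *r)
{
	int i;

	rdmsrl(MSR_CORE_PERF_GLOBAL_CTRL, r->global_ctrl);
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
	for (i = 0; i < NR_GP_COUNTERS; i++) {
		rdmsrl(MSR_P6_EVNTSEL0 + i, r->eventsel[i]);
		rdmsrl(MSR_P6_PERFCTR0 + i, r->counter[i]);
	}
}

/* Before VM-entry: put the guest's counter state back and re-enable it. */
static void vpmu_restore_and_enable(struct vpmu_regs *r)
{
	int i;

	for (i = 0; i < NR_GP_COUNTERS; i++) {
		wrmsrl(MSR_P6_EVNTSEL0 + i, r->eventsel[i]);
		wrmsrl(MSR_P6_PERFCTR0 + i, r->counter[i]);
	}
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, r->global_ctrl);
}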
Re: KVM PMU virtualization
On 02/26/2010 03:28 PM, Peter Zijlstra wrote: On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote: It would be the other way round - the host would steal the pmu from the guest. Later we can try to time-slice and extrapolate, though that's not going to be easy. Right, so perf already does the time slicing and interpolating thing, so a soft-pmu gets that for free. True. Anyway, this discussion seems somewhat in a stale-mate position. The KVM folks basically demand a full PMU MSR shadow with PMI passthrough so that their $legacy shit works without modification. My question with that is how $legacy muck can ever know how the current PMU works, you can't even properly emulate a core2 pmu on a nehalem because intel keeps messing with the event codes for every new model. Right, this is pretty bad. For Windows it's probably acceptable to upgrade your performance tools (since that's separate from the OS). In Linux it is integrated into the kernel, and it's fairly unacceptable to demand a kernel upgrade when your host is upgraded underneath you. So basically for this to work means the guest can't run legacy stuff anyway, but needs to run very up-to-date software, so we might as well create a soft-pmu/paravirt interface now and have all up-to-date software support that for the next generation. Still that leaves us with no Windows / non-Linux solution. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 14:28, Peter Zijlstra wrote: On Fri, 2010-02-26 at 13:51 +0200, Avi Kivity wrote: It would be the other way round - the host would steal the pmu from the guest. Later we can try to time-slice and extrapolate, though that's not going to be easy. Right, so perf already does the time slicing and interpolating thing, so a soft-pmu gets that for free. What I don't like here is that without rewriting the guest OS, there will be two layers of time-slicing and extrapolation. That is going to make the reported numbers close to useless. Anyway, this discussion seems somewhat in a stale-mate position. The KVM folks basically demand a full PMU MSR shadow with PMI passthrough so that their $legacy shit works without modification. My question with that is how $legacy muck can ever know how the current PMU works, you can't even properly emulate a core2 pmu on a nehalem because intel keeps messing with the event codes for every new model. So basically for this to work means the guest can't run legacy stuff anyway, but needs to run very up-to-date software, so we might as well create a soft-pmu/paravirt interface now and have all up-to-date software support that for the next generation. That is the problem. Today there is a large install base out there of core2 users who wish to measure their stuff on the hardware they have. The same will be true for Nehalem based stuff when whatever replaces Nehalem comes out and makes it incompatible. Since we are unable to emulate Core2 on Nehalem, and almost certainly will be unable to emulate Nehalem on its successor, we are stuck with this. A para-virt interface is a nice idea, but since we cannot emulate an old CPU properly it still means there isn't much we can do, as we're stuck with the same limitations. I simply don't see the value of introducing a para-virt interface for this. Furthermore, when KVM doesn't virtualize the physical system topology, some PMU features cannot even be sanely used from a vcpu. That is definitely an issue, and there is nothing we can really do about that. Having two guests running in parallel under KVM means that they are going to see more cache misses than they would if they ran barebone on the hardware. However, even with all of this, we have to keep in mind who is going to use the performance monitoring in a guest. It is going to be application writers, mostly people writing analytical/scientific applications. They rarely have control over the OS they are running on, but are given systems and told to work on what they are given. Driver upgrades and things like that don't come quickly. However, they also tend to understand limitations like these and will still be able to benefit from perf on a system like that. So while currently a root user can already tie up all of the pmu using perf, simply using that to hand the full pmu off to the guest still leaves lots of issues. Well, isn't that the case with the current setup anyway? If enough user apps start requesting PMU resources, the hw is going to run out of counters very quickly anyway. The real issue here IMHO is whether or not it is possible to use a PMU to count anything on a different CPU. If that is really possible, sharing the PMU is not an option :( All that said, what we really want is for Intel+AMD to come up with proper hw PMU virtualization support that makes it easy to rotate the full PMU in and out for a guest. Then this whole discussion will become a non issue.
Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 03:44 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: On 02/26/2010 03:06 PM, Ingo Molnar wrote: Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest. That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default. Even if there were no security considerations, if the guest can observe host data in the pmu, it means the pmu is inaccurate. We should expose guest data only in the guest pmu. That's not difficult to do, you stop the pmu on exit and swap the counters on context switches. Again you are making an incorrect assumption: that information leakage via the PMU only occurs while the host is running on that CPU. It does not - the PMU can leak general system details _while the guest is running_. You mean like bus transactions on a multicore? Well, we're already exposed to cache timing attacks. So for this and for the many other reasons we dont want to give a raw PMU to guests: - A paravirt event driver is more compatible and more transparent in the long run: it allows hardware upgrade and upgraded PMU functionality (for Linux) without having to upgrade the guest OS. Via that a guest OS could even be live-migrated to a different PMU, without noticing anything about it. What about Windows? In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state cannot be live-migrated. (save/restore doesnt help) Why not? So long as the source and destination are compatible? - It's far cleaner on the host side as well: more granular, per event usage is possible. The guest can use portion of the PMU (managed by the host), and the host can use a portion too. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS precludes the host OS from running some different piece of instrumentation at the same time. Right, time slicing is something we want. - It's more secure: the host can have a finegrained policy about what kinds of events it exposes to the guest. It might chose to only expose software events for example. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is an all-or-nothing policy affair: either you fully allow the guest (and live with whatever consequences the piece of hardware that takes up a fair chunk on the CPU die causes), or you allow none of it. No, we can hide insecure events with a full pmu. Trap the control register and don't pass it on to the hardware. - A proper paravirt event driver gives more features as well: it can exposes host software events and tracepoints, probes - not restricting itself to the 'hardware PMU' abstraction. 
But it is limited to whatever the host stack supports. At least that's our control, but things like PEBS will take a ton of work. - There's proper event scheduling and event allocation. Time-slicing, etc. The thing is, we made quite similar arguments in the past, during the perfmon vs. perfcounters discussions. There's really a big advantage to proper abstractions, both on the host and on the guest side. We only control half of the equation. That's very different compared to tools/perf. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 03:37 PM, Jes Sorensen wrote: On 02/26/10 14:31, Ingo Molnar wrote: You are missing two big things wrt. compatibility here: 1) The first upgrade overhead a one time overhead only. 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Dont you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support. That would only work if you are guaranteed to be able to emulate old hardware on new hardware. Not going to be feasible, so then we are in a real mess. That actually works on the Intel-only architectural pmu. I'm beginning to like it more and more. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enhance perf to support KVM
On 02/26/10 14:16, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: That was not what i suggested tho. tools/kvm/ would work plenty fine. I'll wait until we have tools/libc and tools/X. After all, they affect a lot more people and are concerned with a lot more kernel/user interfaces than kvm. So your answer can be summed up as: 'we wont do what makes sense technically because others suck even more' ? Well, in this discussion what makes sense technically differs depending on who you ask. I will argue that emulating the MSR access doesn't make sense technically because there is no fixed specification we can rely on, since the spec seems to change randomly with every cpu family release from Intel. In addition, the overhead makes the resulting numbers less interesting, if interesting at all. Jes
Re: [PATCH 2/5] KVM: SVM: Optimize nested svm msrpm merging
On 02/26/2010 03:30 PM, Joerg Roedel wrote: So the msrpm bitmap changes dynamically for each vcpu? Great, make it fully dynamic then, changing the vcpu->arch.msrpm only from within its vcpu context. No need for atomic ops. The msrpm_offsets table is global. But I think I will follow Avi's suggestions and create a static direct_access_msrs list and generate the msrpm_offsets at module_init. This solves the problem of two independent lists too. But with LBR virt, maybe a fully dynamic approach is better. Just have static lists for updating the msrpm and offset table dynamically. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
* Avi Kivity a...@redhat.com wrote: On 02/26/2010 03:31 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem. You are missing two big things wrt. compatibility here: 1) The first upgrade overhead a one time overhead only. May be one too many, for certain guests. Of course it may be argued that if the guest wants performance monitoring that much, they will upgrade. Yes, that can certainly be argued. Note another logical inconsistency: you are assuming reluctance to upgrade for a set of users who are doing _performance analysis_. In fact those types of users are amongst the most upgrade-happy. Often they'll run modern hardware and modern software. Most of the time they are developers themselves who try to make sure their stuff works on the latest greatest hardware _and_ software. So people running P4's trying to tune their stuff under Red Hat Linux 9 and trying to use the PMU uner KVM is not really a concern rooted overly deeply in reality. Certainly guests that we don't port won't be able to use this. I doubt we'll be able to make Windows work with this - the only performance tool I'm familiar with on Windows is Intel's VTune, and that's proprietary. Dont you see the extreme irony of your wish to limit Linux kernel design decisions and features based on ... Windows and other proprietary software? 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Dont you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support. That also works for the architectural pmu, of course that's Intel only. And there you don't need to upgrade the guest even once. Besides being Intel only, it only exposes a limited sub-set of hw events. (far fewer than the generic ones offered by perf events) Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enhance perf to support KVM
On 02/26/2010 03:16 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: That was not what i suggested tho. tools/kvm/ would work plenty fine. I'll wait until we have tools/libc and tools/X. After all, they affect a lot more people and are concerned with a lot more kernel/user interfaces than kvm. So your answer can be summed up as: 'we wont do what makes sense technically because others suck even more' ? I can sum up your this remark as 'whenever you disagree with me, I will rephrase your words to make you look like an idiot'. If you believe I'm an idiot, there's no need to have this (or any) conversation. If not, please refrain from this type of verbal gymnastics. And it's not just the kernel-user interface (which btw., for the case of X is far narrower than what KVM currently has to Qemu). The issue is a basic question of software design: does kvm-qemu really make as much sense without the kernel component as with it? The answer is: it will borderline-work with CPU emulation (and i'm sure there are people making use of it that way), but 90%+ of the userbase uses it with KVM and vice versa. It is really a single logical component as far as maintenance goes, and tools/kvm/ would make quite a bit of sense. There are two separate questions. Is there room for a kvm-only userspace component? I believe so, but throwing away the momentum behind qemu would be foolish. Does it make sense for such a component to live in linux.git? IMO, no, and certainly a lot less than libc and X. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/10 14:27, Ingo Molnar wrote: * Jes Sorensenjes.soren...@redhat.com wrote: You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2, whereas Nehalem and Atom are v3 if I remember correctly. [...] Of course you can emulate a good portion of it, as long as there's perf support on the host side for P4. Actually P4 is pretty uninteresting in this discussion due to the lack of VMX support, it's the same issue for Nehalem vs Core2. The problem is the same though, we cannot tell the guest that yes P4 has this event, but no, we are going to feed you bogus data. If the guest programs a cachemiss event, you program a cachemiss perf event on the host and feed its values to the emulated MSR state. You _dont_ program the raw PMU on the host side - just use the API i outlined to get struct perf_event. The emulation wont be perfect: not all events will count and not all events will be available in a P4 (and some Core2 events might not even make sense in a P4), but that is reality as well: often documented events dont count, and often non-documented events count. What matters to 99.9% of people who actually use this stuff is a few core sets of events - which are available in P4s and in Core2 as well. Cycles, instructions, branches, maybe cache-misses. Sometimes FPU stuff. I really do not like to make guesses about how people use this stuff. The things you and I look for as kernel hackers are often very different from what application authors look for and use. That is one thing I learned from being exposed to strange Fortran programmers at SGI. It makes me very uncomfortable telling a guest OS that we offer features X, Y, Z and then start lying, feeding back numbers that do not match what was requested, and there is no way to tell the guest that. For Linux-Linux the sanest, tier-1 approach would be to map sys_perf_open() on the guest side over to the host, transparently, via a paravirt driver. Paravirt is a nice optimization, but is and will always be an optimization. Fact of the matter is that the bulk of usage of virtualization is for running distributions with slow kernel upgrade rates, like SLES and RHEL, and other proprietary operating systems which we have no control over. Para-virt will do little good for either of these groups. Cheers, Jes
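For reference, a rough sketch of the soft-PMU mapping Ingo describes, where a trapped guest write to an event-select MSR is turned into a host perf event instead of touching the real hardware. Everything below is illustrative: the helper name is invented, the event translation/filtering is omitted, and the exact signature of perf_event_create_kernel_counter() has changed between kernel versions (the form used here is roughly the 2.6.33-era one):

#include <linux/perf_event.h>
#include <linux/smp.h>

#define EVTSEL_EN	(1ULL << 22)	/* enable bit in IA32_PERFEVTSELx */

/* Invented helper: called when a guest write to an event-select MSR traps. */
static struct perf_event *vpmu_program_counter(u64 guest_eventsel)
{
	struct perf_event_attr attr = {
		.type	= PERF_TYPE_RAW,
		.size	= sizeof(attr),
		/* pass the guest's event select + unit mask through as a raw
		 * event; a real implementation would translate/filter first */
		.config	= guest_eventsel & 0xffff,
		.pinned	= 1,
	};

	if (!(guest_eventsel & EVTSEL_EN))
		return NULL;		/* counter not enabled, nothing to do */

	/* ~2.6.33-era signature; newer kernels take a task pointer and an
	 * overflow context instead of a pid */
	return perf_event_create_kernel_counter(&attr, smp_processor_id(),
						-1 /* cpu-wide */, NULL);
}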
PCI hotplug broken?
Hi list, While trying to upgrade some internal infrastructure to qemu-kvm-0.12 I stumbled across this really weird problem that I see with current qemu-kvm git too: I start qemu-kvm using: ./qemu-system-x86_64 -L ../pc-bios/ -m 512 -net nic,model=virtio -net tap,ifname=tap0,script=/bin/true -snapshot sles11.qcow2 -vnc :0 -monitor stdio The system boots up just fine, networking works. On the qemu monitor I then issue: (qemu) pci_add auto storage file=/tmp/image.raw,if=virtio after which I get a fully functional virtio block device, but the network stops sending/receiving packets. The same thing with qemu-kvm-0.10 works just fine. Has anyone seen this before? Alex-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 04:07 PM, Jes Sorensen wrote: On 02/26/10 14:27, Ingo Molnar wrote: * Jes Sorensenjes.soren...@redhat.com wrote: You certainly cannot emulate the Core2 on a P4. The Core2 is Perfmon v2, whereas Nehalem and Atom are v3 if I remember correctly. [...] Of course you can emulate a good portion of it, as long as there's perf support on the host side for P4. Actually P4 is pretty uninteresting in this discussion due to the lack of VMX support, it's the same issue for Nehalem vs Core2. The problem is the same though, we cannot tell the guest that yes P4 has this event, but no, we are going to feed you bogus data. The Pentium D which is a P4 derivative has vmx support. However it is so slow I'm fine with ignoring it for this feature. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
* Avi Kivity a...@redhat.com wrote: On 02/26/2010 03:44 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: On 02/26/2010 03:06 PM, Ingo Molnar wrote: Firstly, an emulated PMU was only the second-tier option i suggested. By far the best approach is native API to the host regarding performance events and good guest side integration. Secondly, the PMU cannot be 'given' to the guest in the general case. Those are privileged registers. They can expose sensitive host execution details, etc. etc. So if you emulate a PMU you have to exit out of most PMU accesses anyway for a secure solution. (RDPMC can still be supported, but in close cooperation with the host) There is nothing secret in the host PMU, and it's easy to clear out the counters before passing them off to the guest. That's wrong. On some CPUs the host PMU can be used to say sample aspects of another CPU, allowing statistical attacks to recover crypto keys. It can be used to sample memory access patterns of another node. There's a good reason PMU configuration registers are privileged and there's good value in only giving a certain sub-set to less privileged entities by default. Even if there were no security considerations, if the guest can observe host data in the pmu, it means the pmu is inaccurate. We should expose guest data only in the guest pmu. That's not difficult to do, you stop the pmu on exit and swap the counters on context switches. Again you are making an incorrect assumption: that information leakage via the PMU only occurs while the host is running on that CPU. It does not - the PMU can leak general system details _while the guest is running_. You mean like bus transactions on a multicore? Well, we're already exposed to cache timing attacks. If you give a full PMU to a guest it's a whole different dimension and quality of information. Literally hundreds of different events about all sorts of aspects of the CPU and the hardware in general. So for this and for the many other reasons we dont want to give a raw PMU to guests: - A paravirt event driver is more compatible and more transparent in the long run: it allows hardware upgrade and upgraded PMU functionality (for Linux) without having to upgrade the guest OS. Via that a guest OS could even be live-migrated to a different PMU, without noticing anything about it. What about Windows? What is your question? Why should i limit Linux kernel design decisions based on any aspect of Windows? You might want to support it, but _please_ dont let the design be dictated by it ... In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state cannot be live-migrated. (save/restore doesnt help) Why not? So long as the source and destination are compatible? 'As long as it works' is certainly a good enough filter for quality ;-) - It's far cleaner on the host side as well: more granular, per event usage is possible. The guest can use portion of the PMU (managed by the host), and the host can use a portion too. In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS precludes the host OS from running some different piece of instrumentation at the same time. Right, time slicing is something we want. - It's more secure: the host can have a finegrained policy about what kinds of events it exposes to the guest. It might chose to only expose software events for example. 
In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS is an all-or-nothing policy affair: either you fully allow the guest (and live with whatever consequences the piece of hardware that takes up a fair chunk on the CPU die causes), or you allow none of it. No, we can hide insecure events with a full pmu. Trap the control register and don't pass it on to the hardware. So you basically concede partial emulation ... - A proper paravirt event driver gives more features as well: it can exposes host software events and tracepoints, probes - not restricting itself to the 'hardware PMU' abstraction. But it is limited to whatever the host stack supports. At least that's our control, but things like PEBS will take a ton of work. PEBS support is being implemented for perf, as a transparent feature. So once it's available, PEBS support will magically improve the quality of guest OS samples, if a paravirt driver approach is used and if sys_perf_event_open() is taught about that driver. Without any other change needed on the guest side. - There's proper event scheduling and event allocation. Time-slicing, etc. The thing is, we made quite similar arguments in the past, during the perfmon vs. perfcounters discussions. There's really a big advantage to proper abstractions, both on the host
Re: KVM PMU virtualization
On 02/26/2010 04:01 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: On 02/26/2010 03:31 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: Or do you mean to define a new, kvm-specific pmu model and feed it off the host pmu? In this case all the guests will need to be taught about it, which raises the compatibility problem. You are missing two big things wrt. compatibility here: 1) The first upgrade overhead a one time overhead only. May be one too many, for certain guests. Of course it may be argued that if the guest wants performance monitoring that much, they will upgrade. Yes, that can certainly be argued. Note another logical inconsistency: you are assuming reluctance to upgrade for a set of users who are doing _performance analysis_. In fact those types of users are amongst the most upgrade-happy. Often they'll run modern hardware and modern software. Most of the time they are developers themselves who try to make sure their stuff works on the latest greatest hardware _and_ software. I wouldn't go as far, but I agree there is less resistance to change here. A Windows user certainly ought to be willing to install a new VTune release, and a RHEL user can be convinced to upgrade from (say) 5.4 to 5.6 with new backported paravirt pmu support. I wouldn't like to force them to upgrade to 2.6.3x though. Many of those users will be developers of in-house applications who are trying to understand their applications under production loads. Certainly guests that we don't port won't be able to use this. I doubt we'll be able to make Windows work with this - the only performance tool I'm familiar with on Windows is Intel's VTune, and that's proprietary. Dont you see the extreme irony of your wish to limit Linux kernel design decisions and features based on ... Windows and other proprietary software? Not at all. Virtualization is a hardware compatibility game. To see what happens if you don't play it, see Xen. Eventually they to implemented hardware support even though the pv approach is so wonderful. If we go the pv route, we'll limit the usefulness of Linux in this scenario to a subset of guests. Users will simply walk away and choose a hypervisor whose authors have less interest in irony and more in providing the features they want. A pv approach can come after we have a baseline that is useful to all users. 2) Once a Linux guest has upgraded, it will work in the future, with _any_ future CPU - _without_ having to upgrade the guest! Dont you see the advantage of that? You can instrument an old system on new hardware, without having to upgrade that guest for the new CPU support. That also works for the architectural pmu, of course that's Intel only. And there you don't need to upgrade the guest even once. Besides being Intel only, it only exposes a limited sub-set of hw events. (far fewer than the generic ones offered by perf events) Things aren't mutually exclusive. Offer the arch pmu for maximum future compatibility (Intel only, alas), the full pmu for maximum features, and the pv pmu for flexibility. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Enhance perf to support KVM
* Avi Kivity a...@redhat.com wrote: On 02/26/2010 03:16 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: That was not what i suggested tho. tools/kvm/ would work plenty fine. I'll wait until we have tools/libc and tools/X. After all, they affect a lot more people and are concerned with a lot more kernel/user interfaces than kvm. So your answer can be summed up as: 'we wont do what makes sense technically because others suck even more' ? I can sum up your this remark as 'whenever you disagree with me, I will rephrase your words to make you look like an idiot'. Two points: 1) You can try to ridicule me if you want, but do you actually claim that my summary is inaccurate? I do claim it's a substantially accurate summary: you said you will (quote:) wait with tools/kvm/ until we have tools/libc and tools/X. I do think tools/X and tools/libc would make quite a bit of sense - this is one of the better design aspects of FreeBSD et al. It's a mistake that it's not being done. 2) I used a question mark (the sentence was not a statement of fact), and you have no obligation to agree with the summary i provided. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote: That actually works on the Intel-only architectural pmu. I'm beginning to like it more and more. Only for the arch defined events, all _7_ of them.
Re: KVM PMU virtualization
* Avi Kivity a...@redhat.com wrote: Certainly guests that we don't port won't be able to use this. I doubt we'll be able to make Windows work with this - the only performance tool I'm familiar with on Windows is Intel's VTune, and that's proprietary. Dont you see the extreme irony of your wish to limit Linux kernel design decisions and features based on ... Windows and other proprietary software? Not at all. Virtualization is a hardware compatibility game. To see what happens if you don't play it, see Xen. Eventually they to implemented hardware support even though the pv approach is so wonderful. That's not quite equivalent though. KVM used to be the clean, integrate-code-with-Linux virtualization approach, designed specifically for CPUs that can be virtualized properly. (VMX support first, then SVM, etc.) KVM virtualized ages-old concepts with relatively straightforward hardware ABIs: x86 execution, IRQ abstractions, device abstractions, etc. Now you are in essence turning that all around: - the PMU is by no means properly virtualized nor really virtualizable by direct access. There's no virtual PMU that ticks independently of the host PMU. - the PMU hardware itself is not a well standardized piece of hardware. It's very vendor dependent and very limiting. So to some degree you are playing the role of Xen in this specific affair. You are pushing for something that shouldnt be done in that form. You want to interfere with the host PMU by going via the fast easy short-term hack to just let the guest OS have the PMU, without any regard to how this impacts long-term feasible solutions. I.e. you are a bit like the guy who would have told Linus in 1994: Dude, why dont you use the Windows APIs? It's far more compatible and that's the only way you could run any serious apps. Besides, it requires no upgrade. Admittedly it's a bit messy and 16-bit but hey, that's our installed base after all. Ingo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 14:51 +0100, Jes Sorensen wrote: Furthermore, when KVM doesn't virtualize the physical system topology, some PMU features cannot even be sanely used from a vcpu. That is definitely an issue, and there is nothing we can really do about that. Having two guests running in parallel under KVM means that they are going to see more cache misses than they would if they ran barebone on the hardware. However even with all of this, we have to keep in mind who is going to use the performance monitoring in a guest. It is going to be application writers, mostly people writing analytical/scientific applications. They rarely have control over the OS they are running on, but are given systems and told to work on what they are given. Driver upgrades and things like that don't come quickly. However they also tend to understand limitations like these and will be able to still benefit from perf on a system like that. What I meant was things like memory controller bound counters, intel uncore and amd northbridge, without knowing what node the vcpu got scheduled to there is no way they can program the raw hardware in a meaningful way, amd nb in particular is interesting in that you could choose not to offer the intel uncore msrs, but the amd nb are shadowed over the generic pmcs, so you have no way to filter those out. Same goes for stuff like the intel ANY flag, LBR filter control and similar muck, a vcpu can't make use of those things in a meaningful manner. Also, intel debugstore things requires a host linear address, again, not something a vcpu can easily provide (although that might be worked around with an msr trap, but that still limits you to 1 page data sizes, not a limitation all software will respect). All that said, what we really want is for Intel+AMD to come up with proper hw PMU virtualization support that makes it easy to rotate the full PMU in and out for a guest. Then this whole discussion will become a non issue. As it stands there simply are a number of PMU features that defy being virtualized, simply because the virt stuff doesn't do system topology. So even if they were to support a virtualized pmu, it would likely be a different beast than the native hardware is, and it will be several hardware models in the future, coming up with a paravirt interface and getting !linux hosts to adapt and !linux guests to use is probably as 'easy'. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
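For readers following the ANY-flag point: the control bits being argued about live in the IA32_PERFEVTSELx registers. The layout below is taken from the Intel SDM and is listed only to show which bits a hypervisor would have to police - the ANY bit (21) is the one that lets a counter observe events from the sibling hardware thread:

/* IA32_PERFEVTSELx layout (Intel SDM); definitions only, for reference. */
#define EVTSEL_EVENT	0x000000ffULL	/* event select code */
#define EVTSEL_UMASK	0x0000ff00ULL	/* unit mask */
#define EVTSEL_USR	(1ULL << 16)	/* count when CPL > 0 */
#define EVTSEL_OS	(1ULL << 17)	/* count when CPL == 0 */
#define EVTSEL_EDGE	(1ULL << 18)	/* edge detect */
#define EVTSEL_PC	(1ULL << 19)	/* pin control */
#define EVTSEL_INT	(1ULL << 20)	/* PMI on counter overflow */
#define EVTSEL_ANY	(1ULL << 21)	/* count both hw threads of a core */
#define EVTSEL_EN	(1ULL << 22)	/* counter enable */
#define EVTSEL_INV	(1ULL << 23)	/* invert counter-mask comparison */
#define EVTSEL_CMASK	0xff000000ULL	/* counter mask */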
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote: Even if there were no security considerations, if the guest can observe host data in the pmu, it means the pmu is inaccurate. We should expose guest data only in the guest pmu. That's not difficult to do, you stop the pmu on exit and swap the counters on context switches. That's not enough, memory node wide counters are impossible to isolate like that, the same for core wide (ANY flag) counters. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 15:30 +0200, Avi Kivity wrote: Scheduling at event granularity would be a good thing. However we need to be able to handle the guest using the full pmu. Does the full PMU include things like LBR, PEBS and uncore? in that case, there is no way you're going to get that properly and securely virtualized by using raw access. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: PCI hotplug broken?
On 26.02.2010, at 15:12, Alexander Graf wrote: Hi list, While trying to upgrade some internal infrastructure to qemu-kvm-0.12 I stumbled across this really weird problem that I see with current qemu-kvm git too: I start qemu-kvm using: ./qemu-system-x86_64 -L ../pc-bios/ -m 512 -net nic,model=virtio -net tap,ifname=tap0,script=/bin/true -snapshot sles11.qcow2 -vnc :0 -monitor stdio The system boots up just fine, networking works. On the qemu monitor I then issue: (qemu) pci_add auto storage file=/tmp/image.raw,if=virtio after which I get a fully functional virtio block device, but the network stops sending/receiving packets. The same thing with qemu-kvm-0.10 works just fine. Has anyone seen this before? Same thing happens when hotplug only: pci_add auto nic model=virtio,vlan=0 - network works pci_add auto storage file=/tmp/image.raw,if=virtio - network stops working pci_add auto nic model=virtio,vlan=0 - network works again on the new device Alex-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PMU virtualization
On 02/26/2010 04:12 PM, Ingo Molnar wrote: Again you are making an incorrect assumption: that information leakage via the PMU only occurs while the host is running on that CPU. It does not - the PMU can leak general system details _while the guest is running_. You mean like bus transactions on a multicore? Well, we're already exposed to cache timing attacks. If you give a full PMU to a guest it's a whole different dimension and quality of information. Literally hundreds of different events about all sorts of aspects of the CPU and the hardware in general. Well, we filter out the bad events then. So for this and for the many other reasons we dont want to give a raw PMU to guests: - A paravirt event driver is more compatible and more transparent in the long run: it allows hardware upgrade and upgraded PMU functionality (for Linux) without having to upgrade the guest OS. Via that a guest OS could even be live-migrated to a different PMU, without noticing anything about it. What about Windows? What is your question? Why should i limit Linux kernel design decisions based on any aspect of Windows? You might want to support it, but _please_ dont let the design be dictated by it ... In our case the quality of implementation is judged by how well we support workloads that users run, and that means we have to support Windows well. And that more or less means we can't have a pv-only pmu. Which part of this do you disagree with? In contrast, a 'stolen', 'raw' PMU directly programmed by the guest OS always assumes the guest OS is upgraded to the host. Also, 'raw' PMU state cannot be live-migrated. (save/restore doesnt help) Why not? So long as the source and destination are compatible? 'As long as it works' is certainly a good enough filter for quality ;-) We already have this. If you expose sse4.2 to the guest, you can't migrate to a host which doesn't support it. If you expose a Nehalem pmu to the guest, you can't migrate to a host which doesn't support it. Users and tools already understand this. It's true that the pmu case is more difficult since you can't migrate forwards as well as backwards, but that's life. No, we can hide insecure events with a full pmu. Trap the control register and don't pass it on to the hardware. So you basically concede partial emulation ... Yes. Still appears to follow the spec to the guest, though. And with the option of full emulation for those who need it and sign on the dotted line. - There's proper event scheduling and event allocation. Time-slicing, etc. The thing is, we made quite similar arguments in the past, during the perfmon vs. perfcounters discussions. There's really a big advantage to proper abstractions, both on the host and on the guest side. We only control half of the equation. That's very different compared to tools/perf. You mean Windows? For heaven's sake, why dont you think like Linus thought 20 years ago. To the hell with Windows suckiness and lets make sure our stuff works well. In our case, making our stuff work well means making sure guests of the user's choice run well. Not ours. Currently users mostly choose Windows and Linux, so we have to make them both work. (btw, the analogy would be, 'To hell with Unix suckiness, let's make sure our stuff works well'; where Linux reimplemented the Unix APIs, ensuring source compatibility with applications, kvm reimplements the hardware interface, ensuring binary compatibility with guests).
Then the users will come, developers will come, and people will profile Linux under Linux and maybe the tools will be so good that they'll profile under Linux using Wine just to be able to use those good tools... If we don't support Windows well, users will walk away, followed by starving developers. If you gut Linux capabilities like that to accommodate the suckiness of Windows, without giving a technological edge to Linux, then we are bound to fail in the long run ... I'm all for abusing the tight relationship between Linux-as-a-host and Linux-as-a-guest to gain an advantage for both. One fruitful area would be asynchronous page faults, which have the potential to increase memory overcommit, for example. But first of all we need to make sure that there is a baseline of support for all commonly used guests. I think of it this way: once kvm deployment becomes widespread, Linux-as-a-guest gains an advantage. But in order for kvm deployment to become widespread, it needs excellent support for all guests users actually use. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
On 02/26/2010 04:27 PM, Peter Zijlstra wrote: On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote: That actually works on the Intel-only architectural pmu. I'm beginning to like it more and more. Only for the arch defined events, all _7_ of them. That's 7 more than what we support now, and 7 more than what we can guarantee without it. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: Enhance perf to support KVM
On 02/26/2010 04:23 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: On 02/26/2010 03:16 PM, Ingo Molnar wrote: * Avi Kivitya...@redhat.com wrote: That was not what i suggested tho. tools/kvm/ would work plenty fine. I'll wait until we have tools/libc and tools/X. After all, they affect a lot more people and are concerned with a lot more kernel/user interfaces than kvm. So your answer can be summed up as: 'we wont do what makes sense technically because others suck even more' ? I can sum up your this remark as 'whenever you disagree with me, I will rephrase your words to make you look like an idiot'. Two points: 1) You can try to ridicule me if you want, I'd much prefer it if no ridiculing was employed on either side. but do you actually claim that my summary is inaccurate? I do claim it's a substantially accurate summary: you said you will (quote:) wait with tools/kvm/ until we have tools/libc and tools/X. I do think tools/X and tools/libc would make quite a bit of sense - this is one of the better design aspects of FreeBSD et al. It's a mistake that it's not being done. There are arguments for libc to be developed in linux-2.6.git, and arguments against. The fact is that they are not, so presumably the arguments against plus inertia outweigh the arguments for. The same logic holds for kvm, except that there are fewer arguments for development in linux-2.6.git. Only a small part of qemu is actually concerned with kvm; most of it is mucking around with X, emulating old devices, emulating instruction sets (irrelevant for tools/kvm) and doing boring managementy stuff. Do we really want to add several hundred thousand lines to Linux, only a few thousand of which talk to the kernel? 2) I used a question mark (the sentence was not a statement of fact), and you have no obligation to agree with the summary i provided. Thanks. I hope you don't agree with it either. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 16:54 +0200, Avi Kivity wrote: On 02/26/2010 04:27 PM, Peter Zijlstra wrote: On Fri, 2010-02-26 at 15:55 +0200, Avi Kivity wrote: That actually works on the Intel-only architectural pmu. I'm beginning to like it more and more. Only for the arch defined events, all _7_ of them. That's 7 more than what we support now, and 7 more than what we can guarantee without it. Again, what windows software uses only those 7? Does it pay to only have access to those 7 or does it limit the usability to exactly the same subset a paravirt interface would?
Re: KVM PMU virtualization
On 02/26/2010 05:08 PM, Peter Zijlstra wrote: That's 7 more than what we support now, and 7 more than what we can guarantee without it. Again, what windows software uses only those 7? Does it pay to only have access to those 7 or does it limit the usability to exactly the same subset a paravirt interface would? Good question. Would be interesting to try out VTune with the non-arch pmu masked out. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
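One way to attempt the experiment Avi suggests is to shrink what the guest sees in CPUID leaf 0xA before exposing the leaf; per the SDM, a set bit in EBX marks an architectural event as unavailable. A hypothetical filter - struct and function names invented, not existing qemu/kvm code - could look like this:

#include <stdint.h>

struct cpuid_leaf { uint32_t eax, ebx, ecx, edx; };

/* Hypothetical filter applied to CPUID leaf 0xA before it reaches a guest. */
static void mask_non_arch_pmu(struct cpuid_leaf *leaf_a)
{
    /* advertise only two general-purpose counters (EAX[15:8]) ... */
    leaf_a->eax = (leaf_a->eax & ~0x0000ff00u) | (2u << 8);

    /* ... mark every architectural event except core-cycles (bit 0) and
     * instructions-retired (bit 1) as unavailable (set bit = unavailable) */
    leaf_a->ebx |= 0x7cu;

    /* ... and hide the fixed-function counters (EDX[4:0]) */
    leaf_a->edx &= ~0x1fu;
}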
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 16:53 +0200, Avi Kivity wrote: If you give a full PMU to a guest it's a whole different dimension and quality of information. Literally hundreds of different events about all sorts of aspects of the CPU and the hardware in general. Well, we filter out the bad events then. Which requires trapping the MSR access, at which point a soft-PMU is almost there, right?
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote: On 02/26/2010 05:08 PM, Peter Zijlstra wrote: That's 7 more than what we support now, and 7 more than what we can guarantee without it. Again, what windows software uses only those 7? Does it pay to only have access to those 7 or does it limit the usability to exactly the same subset a paravirt interface would? Good question. Would be interesting to try out VTune with the non-arch pmu masked out. From what I understood VTune uses PEBS+LBR, although I suppose they have simple PMU modes too, never actually seen the software.
Re: [ANNOUNCE] kvm-kmod-2.6.33
Hello Jan, I can compile kvm-kmod-2.6.32.9 under Ubuntu 9.1 64-Bit, but 'make install' fails with ing...@nexoc:~/KVM/kvm-kmod-2.6.32.9$ sudo make install [sudo] password for ingmar: mkdir -p ///usr/local/include/kvm-kmod/asm/ install -m 644 usr/include/asm-x86/{kvm,kvm_para}.h ///usr/local/include/kvm-kmod/asm/ install: cannot stat `usr/include/asm-x86/{kvm,kvm_para}.h': No such file or directory make: *** [install-hdr] Error 1 Before I used kvm-kmod-2.6.32.3 which installs just fine: ing...@nexoc:~/KVM/kvm-kmod-2.6.32.3$ sudo make install mkdir -p ///lib/modules/2.6.31-19-generic/extra cp x86/*.ko ///lib/modules/2.6.31-19-generic/extra for i in ///lib/modules/2.6.31-19-generic/kernel/drivers/kvm/*.ko \ ///lib/modules/2.6.31-19-generic/kernel/arch/x86/kvm/*.ko; do \ if [ -f $i ]; then mv $i $i.orig; fi; \ done /sbin/depmod -a 2.6.31-19-generic -b / install -m 644 -D scripts/65-kvm.rules //etc/udev/rules.d/65-kvm.rules install -m 644 -D usr/include/asm-x86/kvm.h ///usr/local/include/kvm-kmod/asm/kvm.h install -m 644 -D usr/include/linux/kvm.h ///usr/local/include/kvm-kmod/linux/kvm.h sed 's|PREFIX|/usr/local|; s/VERSION/kvm-kmod-2.6.32.3/' kvm-kmod.pc .tmp.kvm-kmod.pc install -m 644 -D .tmp.kvm-kmod.pc ///usr/local/lib/pkgconfig/kvm-kmod.pc Any idea what could be wrong? Regards, Ingmar Jan Kiszka wrote: Now that 2.6.33 is out, time to release the corresponding kvm-kmod package as well. Not much has happened since 2.6.33-rc6, though. KVM changes since kvm-kmod-2.6.33-rc6: - PIT: control word is write-only (fixes side-effects of spurious reads) - kvmclock: count total_sleep_time when updating guest clock (requires = 2.6.32.9 as host, falls back to unfixed version otherwise) kvm-kmod changes: - warn about kvmclock issues across host suspend/resume - detect host kernel extra version to make use of fixes in stable series See [1] for the delta to 2.6.32. I also released kvm-kmod-2.6.32.9 with basically the same changes. That may be the last release based on that kernel, but nothing is set in stone yet (specifically as we already maintain kvm-kmod-2.6.32 internally for a customer). Jan [1] https://sourceforge.net/projects/kvm/files/kvm-kmod/2.6.33-rc6/changelog/view -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ANNOUNCE] kvm-kmod-2.6.33
Ingmar Schraub wrote: Hello Jan, I can compile kvm-kmod-2.6.32.9 under Ubuntu 9.1 64-Bit, but 'make install' fails with

ing...@nexoc:~/KVM/kvm-kmod-2.6.32.9$ sudo make install
[sudo] password for ingmar:
mkdir -p ///usr/local/include/kvm-kmod/asm/
install -m 644 usr/include/asm-x86/{kvm,kvm_para}.h ///usr/local/include/kvm-kmod/asm/
install: cannot stat `usr/include/asm-x86/{kvm,kvm_para}.h': No such file or directory
make: *** [install-hdr] Error 1

Before I used kvm-kmod-2.6.32.3 which installs just fine:

ing...@nexoc:~/KVM/kvm-kmod-2.6.32.3$ sudo make install
mkdir -p ///lib/modules/2.6.31-19-generic/extra
cp x86/*.ko ///lib/modules/2.6.31-19-generic/extra
for i in ///lib/modules/2.6.31-19-generic/kernel/drivers/kvm/*.ko \
	///lib/modules/2.6.31-19-generic/kernel/arch/x86/kvm/*.ko; do \
	if [ -f $i ]; then mv $i $i.orig; fi; \
	done
/sbin/depmod -a 2.6.31-19-generic -b /
install -m 644 -D scripts/65-kvm.rules //etc/udev/rules.d/65-kvm.rules
install -m 644 -D usr/include/asm-x86/kvm.h ///usr/local/include/kvm-kmod/asm/kvm.h
install -m 644 -D usr/include/linux/kvm.h ///usr/local/include/kvm-kmod/linux/kvm.h
sed 's|PREFIX|/usr/local|; s/VERSION/kvm-kmod-2.6.32.3/' kvm-kmod.pc .tmp.kvm-kmod.pc
install -m 644 -D .tmp.kvm-kmod.pc ///usr/local/lib/pkgconfig/kvm-kmod.pc

Any idea what could be wrong?

Likely bash'ism of mine (what's your shell?). This should fix it:

diff --git a/Makefile b/Makefile
index 94dde5c..c031701 100644
--- a/Makefile
+++ b/Makefile
@@ -62,9 +62,9 @@ KVM_KMOD_VERSION = $(strip $(if $(wildcard KVM_VERSION), \
 install-hdr:
 	mkdir -p $(DESTDIR)/$(HEADERDIR)/asm/
-	install -m 644 usr/include/asm-$(ARCH_DIR)/{kvm,kvm_para}.h $(DESTDIR)/$(HEADERDIR)/asm/
+	install -m 644 usr/include/asm-$(ARCH_DIR)/*.h $(DESTDIR)/$(HEADERDIR)/asm/
 	mkdir -p $(DESTDIR)/$(HEADERDIR)/linux/
-	install -m 644 usr/include/linux/{kvm,kvm_para}.h $(DESTDIR)/$(HEADERDIR)/linux/
+	install -m 644 usr/include/linux/*.h $(DESTDIR)/$(HEADERDIR)/linux/
 	sed 's|PREFIX|$(PREFIX)|; s/VERSION/$(KVM_KMOD_VERSION)/' kvm-kmod.pc $(tmppc)
 	install -m 644 -D $(tmppc) $(DESTDIR)/$(PKGCONFIGDIR)/kvm-kmod.pc

Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Offline for a week
I will be on vacation and offline, pmu threads included, for a week. Marcelo will handle all kvm issues as usual.

-- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
On Fri, 2010-02-26 at 17:11 +0200, Avi Kivity wrote: On 02/26/2010 05:08 PM, Peter Zijlstra wrote: That's 7 more than what we support now, and 7 more than what we can guarantee without it. Again, what Windows software uses only those 7? Does it pay to only have access to those 7, or does it limit the usability to exactly the same subset a paravirt interface would?

Good question. Would be interesting to try out VTune with the non-arch pmu masked out.

Also, the ANY bit is part of the Intel arch pmu, but you still have to mask it out.

BTW, just wondering, why would a developer be running VTune in a guest anyway? I'd think that a Windows-oriented developer would simply run Windows on his desktop and VTune there.
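To make that masking concrete: AnyThread is bit 21 of IA32_PERFEVTSELx in the Intel architectural PMU, and a hypervisor exposing the arch pmu would have to clear it when emulating guest writes to the event-select MSRs, since it lets a counter observe the sibling hyperthread. A minimal sketch; the bit position is from the Intel SDM, while the constant and helper names are hypothetical:

#include <stdint.h>

/* AnyThread (bit 21 of IA32_PERFEVTSELx): when set, the counter also
 * counts events from the sibling hyperthread, which would let a guest
 * observe activity it does not own. */
#define PERFEVTSEL_ANY_THREAD (1ULL << 21)

/* Hypothetical filter applied to an emulated guest WRMSR to
 * IA32_PERFEVTSELx before the value reaches the real MSR. */
static uint64_t sanitize_guest_evtsel(uint64_t guest_value)
{
    return guest_value & ~PERFEVTSEL_ANY_THREAD;
}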
Re: KVM PMU virtualization
On 02/26/2010 04:37 PM, Ingo Molnar wrote: * Avi Kivity a...@redhat.com wrote: Certainly guests that we don't port won't be able to use this. I doubt we'll be able to make Windows work with this - the only performance tool I'm familiar with on Windows is Intel's VTune, and that's proprietary.

Don't you see the extreme irony of your wish to limit Linux kernel design decisions and features based on ... Windows and other proprietary software?

Not at all. Virtualization is a hardware compatibility game. To see what happens if you don't play it, see Xen. Eventually they too implemented hardware support, even though the pv approach is so wonderful.

That's not quite equivalent though. KVM used to be the clean, integrate-code-with-Linux virtualization approach, designed specifically for CPUs that can be virtualized properly. (VMX support first, then SVM, etc.) KVM virtualized ages-old concepts with relatively straightforward hardware ABIs: x86 execution, IRQ abstractions, device abstractions, etc. Now you are in essence turning that all around:

- the PMU is by no means properly virtualized nor really virtualizable by direct access. There's no virtual PMU that ticks independently of the host PMU.

There are no guest debug registers that can be programmed independently of the host debug registers either, but we manage somehow. It's not perfect, but better than nothing. For the common case of host-only or guest-only monitoring, things will work, perhaps without socket-wide counters in security-conscious environments. When both are used at the same time, something will have to give.

- the PMU hardware itself is not a well standardized piece of hardware. It's very vendor dependent and very limiting.

That's life. If we force standardization by having a soft pmu, we'll be very limited as well. If we don't, we reduce hardware independence, which is a strong point of virtualization. Clearly we need to make a trade-off here. In favour of hardware dependence is that tools and users are already used to it. There is also the architectural pmu that can provide a limited form of hardware independence (see the CPUID sketch below). Going pv trades off hardware dependence for software dependence: suddenly only guests that you have control over can use the pmu.

So to some degree you are playing the role of Xen in this specific affair. You are pushing for something that shouldn't be done in that form. You want to interfere with the host PMU by going via the fast, easy, short-term hack of just letting the guest OS have the PMU, without any regard to how this impacts long-term feasible solutions.

Maybe. And maybe the vendors will improve virtualization support for the pmu, rendering the pv approach obsolete on new hardware.

I.e. you are a bit like the guy who would have told Linus in 1994: Dude, why don't you use the Windows APIs? It's far more compatible and that's the only way you could run any serious apps. Besides, it requires no upgrade. Admittedly it's a bit messy and 16-bit, but hey, that's our installed base after all.

Hey, maybe we'd have significant desktop market share if he'd done this (though a replay of the wine history is much more likely). But what are you suggesting? That we make Windows a second-class guest? Most users run a mix of workloads; that will not go down well with them. The choice is between first-class Windows support and becoming a hobby hypervisor.

Let's make a kernel/user analogy again. Would you be in favour of GPL-only-ing new syscalls, to give open source applications an edge over proprietary apps (technically known as crap among some)?
-- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
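For reference, the architectural pmu mentioned above advertises itself through CPUID leaf 0xA: EAX[7:0] carries the version (0 means absent), EAX[15:8] the number of general-purpose counters, EAX[23:16] their bit width, and EDX[4:0] the number of fixed-function counters. A minimal stand-alone probe using GCC's cpuid.h; the field layout is from the Intel SDM, the program itself is only an illustration:

#include <stdio.h>
#include <cpuid.h>

/* Query the Intel architectural PMU via CPUID leaf 0xA and print
 * the version, counter count, and counter width it reports. */
int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx) || (eax & 0xff) == 0) {
        printf("no architectural PMU\n");
        return 1;
    }
    printf("arch PMU version %u: %u GP counters of %u bits, %u fixed\n",
           eax & 0xff, (eax >> 8) & 0xff, (eax >> 16) & 0xff, edx & 0x1f);
    return 0;
}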
Re: KVM PMU virtualization
On 02/26/2010 05:55 PM, Peter Zijlstra wrote: BTW, just wondering, why would a developer be running VTune in a guest anyway? I'd think that a Windows-oriented developer would simply run Windows on his desktop and VTune there.

Cloud. You have an app running somewhere on a cloud, internally or externally (you may not even know). It's running a production workload and it isn't doing well. You can't reproduce it on your desktop ('works for me, now go away'). So you rdesktop into your guest and monitor it there. You can't run anything on the host: you don't have access to it, you don't know who admins it (it's a program anyway), the host doesn't even exist, and the guest moves around whenever the cloud feels like it.

-- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: KVM PMU virtualization
On 02/26/2010 06:03 PM, Avi Kivity wrote: Note, I'll be away for a week, so will not be responsive for a while.

-- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: repeatable hang with loop mount and heavy IO in guest
1 0 0 98 0 1| 0 0 | 66B 354B| 0 0 | 3011
1 1 0 98 0 0| 0 0 | 66B 354B| 0 0 | 2911

From that point onwards, nothing will happen. The host has disk IO to spare... so what is it waiting for?

Moved to an AMD64 host. No effect. Disabled swap before running the test. No effect. Moved the guest to a fully up-to-date FC12 server (2.6.31.6-145.fc12.x86_64), no effect.

I have narrowed it down to the guest filesystem that backs the loop-mounted disk image: although it was not completely full (and had enough inodes), freeing some space on it prevented the system from misbehaving. FYI: the disk image was clean and was fscked before each test. kvm had been updated to 0.12.3.

The weird thing is that the same filesystem works fine (no system hang) when used directly from the host; it only misbehaves via kvm. So I am not dismissing the possibility that kvm may be at least partly to blame, or that it is exposing a filesystem bug (race?) not normally encountered. (I have backed up the full 32GB virtual disk in case someone suggests further investigation.)

Cheers, Antoine
Re: Enhance perf to support KVM
On 02/26/2010 01:17 PM, Ingo Molnar wrote: Nobody is really 'in charge' of how KVM gets delivered to the user. You isolated the fun kernel part for you and pushed out the boring bits to user-space. So if mundane things like mouse integration suck, 'hey, that's a user-space tooling problem'; if file integration sucks, 'hey, that's an admin problem'; if it cannot be used over the network, 'hey, that's an Xorg problem'; etc. etc.

Btw, mouse integration works with -usbdevice tablet and recent Fedoras; 'it was an X.org driver problem'. Really, I don't understand your problems.

-- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: vlan disable TSO of virtio in KVM
On Fri, 2010-02-26 at 10:51 +0800, David V. Cloud wrote: Hi, I read some kernel source. My basic understanding is that in net/8021q/vlan_dev.c, vlan_dev_init(), the dev->features of the vconfig-created interface is derived as

dev->features |= real_dev->features & real_dev->vlan_features;

However, in drivers/net/virtio_net.c, vlan_features is never set (so I will assume it is 0). Hence the offload bits of dev->features will all be 0 for the ethX.vid interface. I verified this using ethtool -k on each KVM:

# ethtool -k eth0
shows that rx/tx csum, sg, tso, gso are on
# ethtool -k eth0.3003
all offloading features are off

I think that is why TSO is never enabled when running large-packet traffic between two vlan interfaces of different KVMs. I also took a look at VMware's pv implementation, drivers/net/vmxnet3/vmxnet3_drv.c; they enable vlan_features when probing, via

netdev->vlan_features = netdev->features;

I was wondering why vlan_features is not set in virtio_net. Is it a BUG? Or is it due to some constraint? Could anyone explain?

I saw the same issue some time back and submitted a couple of patches to address it, but they were not accepted, as the fix was not done at the right place. Not sure if we can do this right without updating virtio_net_hdr with vlan-specific info.

http://thread.gmane.org/gmane.linux.network/150197/focus=150838
http://thread.gmane.org/gmane.linux.network/150198/focus=150837

Thanks, Sridhar

Thanks, -D

On Thu, Feb 25, 2010 at 6:30 PM, David V. Cloud david.v.cl...@gmail.com wrote: Hi all, I have been deploying two KVMs on my Debian testing box. The two KVMs each use one tap device connecting to the host. When doing netperf with large packet sizes from KVM2 (tap1) to KVM1 (tap0) using ethX on them, I could verify that TSO did happen via

# tcpdump -nt -i tap1

I can see messages like

IP 192.168.101.2.39994 > 192.168.101.1.41316: Flags [P.], seq 7912865:7918657, ack 0, win 92, options [nop,nop,TS val 874151 ecr 874803], length 5792

So, according to the 'length', the skb didn't get segmented. However, when I (1) set up VLANs using vconfig on KVM2 and KVM1, getting the two new interfaces eth1.3003 and eth0.3003 on the two machines, and (2) ran netperf between the two new interfaces, TSO no longer showed up. With

# tcpdump -nt -i tap1

I only got

vlan 3003, p 0, IP 10.214.10.2.42324 > 10.214.10.1.56460: Flags [P.], seq 2127976:2129424, ack 1, win 92, options [nop,nop,TS val 926034 ecr 926686], length 1448

So all the large packets get segmented in virtio (is that right?). My KVM command line options are

kvm -hda $IMG -m 768 -net nic,model=virtio,macaddr=52:54:00:12:34:56 -net tap,ifname=$TAP,script=no

My question is whether this is the expected behavior. Can VLAN tagging coexist with TSO in the virtio_net driver? If this is not the desired result, any hint for fixing the problem?

Thanks, -D
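For concreteness, the change under discussion amounts to one line in virtnet_probe(), mirroring the vmxnet3 hunk quoted above. This is a sketch of the mechanism only; as the thread notes, patches along these lines were not accepted because the host side may not handle GSO frames that carry a VLAN tag, so the placement and safety are assumptions here:

/* Hypothetical addition to virtnet_probe() in drivers/net/virtio_net.c,
 * mirroring vmxnet3: let stacked 802.1q devices inherit the negotiated
 * offloads. vlan_dev_init() then picks them up through
 * real_dev->features & real_dev->vlan_features.
 * Caveat from the thread: this assumes the host can segment GSO
 * packets carrying a VLAN tag, which virtio_net_hdr does not
 * currently describe. */
dev->vlan_features = dev->features;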