Re: Re: [RFC PATCH 0/2] kvm/vmx: Output TSC offset
Thank you for commenting on my patch set.

(2012/11/14 11:31), Steven Rostedt wrote:
> On Tue, 2012-11-13 at 18:03 -0800, David Sharp wrote:
>> On Tue, Nov 13, 2012 at 6:00 PM, Steven Rostedt <rost...@goodmis.org> wrote:
>>> On Wed, 2012-11-14 at 10:36 +0900, Yoshihiro YUNOMAE wrote:
>>>> To merge the data like the previous pattern, we apply this patch set.
>>>> Then, we can get the TSC offset of the guest as follows:
>>>>
>>>>   $ dmesg | grep kvm
>>>>   [   57.717180] kvm: (2687) write TSC offset 18446743360465545001, now clock ##
>>>>                          |                    |                              |
>>>>                         PID              TSC offset              HOST TSC value
>>>
>>> Using printk to export something like this is IMO a nasty hack.
>>> Can't we create a /sys or /proc file to export the same thing?
>>
>> Since the value changes over the course of the trace, and seems to be
>> part of the context of the trace, I think I'd include it as a
>> tracepoint.
>
> I'm fine with that too.

Using a tracepoint is a nice idea, but there is one problem: the event that changes the TSC offset occurs infrequently, yet the buffer must keep that event's data.

There are two ways to use a tracepoint here.

First, we could define a new tracepoint for TSC offset changes. This is simple and the overhead would be low. However, because this event occurs so rarely, its record in the buffer will eventually be overwritten by other trace events.

Second, we could add the TSC offset to a tracepoint that fires frequently; for example, to the arguments of trace_kvm_exit(). That avoids the offset being overwritten by other events, but since the offset rarely changes, the same information would be output many times and almost all of that data would be wasted.

Therefore, a tracepoint alone is not a good idea. I suggest a hybrid method: record TSC offset change events, and also read the last TSC offset from procfs when collecting the trace data. In particular, the method is as follows:

1. Enable the tracepoint for TSC offset changes, recording the value before and after the change
2. Start tracing
3. Stop tracing
4. Collect the trace data and read /proc/pid/kvm/*
5. Check whether any trace event recording the two TSC offsets exists in the trace data:
     if one exists  => use the trace event (step 6)
     otherwise      => use /proc/pid/kvm/* (step 7)
6. Apply the two TSC offsets from the trace event to the trace data and sort the trace data
   (Ex.)  * = tracepoint of TSC offset change, . = other trace event
     [START]....*....[END]
     \--previous--/\--current--/
        TSC offset    TSC offset
7. Apply the TSC offset from /proc/pid/kvm/* to the trace data and sort the trace data
   (Ex.)  . = other trace event (no TSC offset change tracepoint)
     [START]........[END]
     \-----current-----/
          TSC offset

Thanks,

--
Yoshihiro YUNOMAE
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: yoshihiro.yunomae...@hitachi.com
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
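[Editor's sketch] The merge in steps 6-7 boils down to shifting guest timestamps into the host timebase. A minimal userspace sketch, assuming the VMX convention that a guest RDTSC observes host_tsc + tsc_offset; the function names are illustrative, not part of the patch set:

```c
#include <stdint.h>

/* Convert a guest-TSC timestamp into the host TSC timebase.
 * Under VMX, a guest RDTSC reads host_tsc + tsc_offset, so the
 * host-side time of a guest event is guest_tsc - tsc_offset. */
uint64_t guest_to_host_tsc(uint64_t guest_tsc, uint64_t tsc_offset)
{
	return guest_tsc - tsc_offset;
}

/* Choose which offset applies to a guest event when the trace
 * contains a "TSC offset changed" event at change_tsc: events
 * before that point use the previous offset, later ones use the
 * current offset (step 6 above). */
uint64_t offset_for_event(uint64_t event_tsc, uint64_t change_tsc,
			  uint64_t prev_offset, uint64_t cur_offset)
{
	return event_tsc < change_tsc ? prev_offset : cur_offset;
}
```

With every guest event rebased this way, host and guest records can be merge-sorted on a single host-TSC axis; when no change event survives in the buffer, the single offset read from /proc/pid/kvm/* is applied to the whole trace (step 7).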
Re: [PATCH v3 2/2] KVM: make crash_clear_loaded_vmcss valid when loading kvm_intel module
On 2012/11/14 05:22, Marcelo Tosatti wrote:
> On Thu, Nov 01, 2012 at 01:55:04PM +0800, zhangyanfei wrote:
>> On 2012/10/31 17:01, Hatayama, Daisuke wrote:
>>> -----Original Message-----
>>> From: kexec-boun...@lists.infradead.org [mailto:kexec-boun...@lists.infradead.org] On Behalf Of zhangyanfei
>>> Sent: Wednesday, October 31, 2012 12:34 PM
>>> To: x...@kernel.org; ke...@lists.infradead.org; Avi Kivity; Marcelo Tosatti
>>> Cc: linux-ker...@vger.kernel.org; kvm@vger.kernel.org
>>> Subject: [PATCH v3 2/2] KVM: make crash_clear_loaded_vmcss valid when loading kvm_intel module
>>>
>>> Signed-off-by: Zhang Yanfei <zhangyan...@cn.fujitsu.com>
>>> [...]
>>> @@ -7230,6 +7231,10 @@ static int __init vmx_init(void)
>>>  	if (r)
>>>  		goto out3;
>>>
>>> +#ifdef CONFIG_KEXEC
>>> +	crash_clear_loaded_vmcss = vmclear_local_loaded_vmcss;
>>> +#endif
>>> +
>>>
>>> Assignment here cannot cover the case where an NMI is initiated after
>>> VMX is turned on in kvm_init and before vmclear_local_loaded_vmcss is
>>> assigned; rare, but it can happen.
>>
>> By saying "VMX is on in kvm_init", do you mean kvm_init enables the VMX
>> feature in the logical processor? No: only when a vcpu is created does
>> kvm enable the VMX feature. I think it makes no difference whether this
>> assignment comes before or after kvm_init, because the vmcs linked list
>> must be empty before vmx_init has finished.
>
> The list is not initialized before hardware_enable(), though. The
> assignment should be moved after that.
>
> Also, it is possible that the loaded_vmcss_on_cpu list is being modified
> _while_ the crash executes, say via NMI, correct? If that is the case,
> better to flag that the list is under manipulation so the vmclear can be
> skipped.

Thanks for your comments. In the new patch set, I did not move the crash_clear_loaded_vmcss assignment. Instead, I added a new percpu variable, vmclear_skipped, to cover everything:

1. Before the loaded_vmcss_on_cpu list is initialized, vmclear_skipped is 1. This means that if the machine crashes and kdump runs, crash_clear_loaded_vmcss will still not be called.
2. While the loaded_vmcss_on_cpu list is under manipulation, vmclear_skipped is set to 1; after the manipulation is finished, it is set back to 0.
3. After all loaded vmcss have been vmcleared, vmclear_skipped is set to 1, so we need not vmclear the loaded vmcss again in the kdump path.

Please refer to the new version of the patch set I sent. Any suggestions would be helpful.

Thanks
Zhang
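[Editor's sketch] The three states above can be modeled in a few lines. This is a single-cpu userspace sketch of the protocol, not the kernel code; the names follow the patch, but the list manipulation is reduced to a stub:

```c
/* vmclear_skipped == 1 means the kdump path must NOT walk
 * loaded_vmcss_on_cpu (list uninitialized, under manipulation,
 * or already vmcleared); 0 means vmclear is safe. */
static int vmclear_skipped = 1;	/* state 1: before list init */
static int vmcleared;		/* counts vmclear invocations */

static void crash_clear_loaded_vmcss(void)
{
	vmcleared++;		/* stand-in for the real VMCLEAR walk */
	vmclear_skipped = 1;	/* state 3: nothing left to clear */
}

/* What the kdump path runs on each cpu. */
static void cpu_emergency_clear_loaded_vmcss(void)
{
	if (!vmclear_skipped)
		crash_clear_loaded_vmcss();
}

/* State 2: list manipulation flips the flag around the update, so a
 * crash landing inside the critical section skips the walk. */
static void loaded_vmcss_update(void)
{
	vmclear_skipped = 1;
	/* ... list_add()/list_del() would go here ... */
	vmclear_skipped = 0;
}
```

Calling cpu_emergency_clear_loaded_vmcss() before any update is a no-op (state 1); after an update completes it vmclears exactly once, and repeated crashes do not repeat the walk (state 3).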
[PATCH v4 0/2] x86: clear vmcss on all cpus when doing kdump if necessary
Currently, kdump just makes all the logical processors leave VMX operation by executing the VMXOFF instruction, so any VMCSs active on those processors may be corrupted. But sometimes we need the VMCSs to debug guest images contained in the host vmcore. To prevent the corruption, we should VMCLEAR the VMCSs before executing VMXOFF.

This patch set provides a way to VMCLEAR guest-related vmcss on all cpus before executing VMXOFF when doing kdump, ensuring that the VMCSs in the vmcore are up to date and not corrupted.

Changelog from v3 to v4:
1. Add a new percpu variable vmclear_skipped to skip vmclear in kdump under some conditions.

Changelog from v2 to v3:
1. Remove unnecessary conditions in function cpu_emergency_clear_loaded_vmcss, as Marcelo suggested.

Changelog from v1 to v2:
1. Remove the sysctl and clear VMCSs unconditionally.

Zhang Yanfei (2):
  x86/kexec: VMCLEAR vmcss on all cpus if necessary
  KVM: set/unset crash_clear_loaded_vmcss and vmclear_skipped in kvm_intel module

 arch/x86/include/asm/kexec.h |    3 +++
 arch/x86/kernel/crash.c      |   32 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx.c           |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 0 deletions(-)
[PATCH v4 1/2] x86/kexec: VMCLEAR vmcss on all cpus if necessary
crash_clear_loaded_vmcss is added to VMCLEAR vmcss loaded on all cpus. And when loading the kvm_intel module, the function pointer will be made valid. The percpu variable vmclear_skipped is added to flag the case where the loaded_vmcss_on_cpu list is being modified while the machine crashes and kdump runs, so the vmclear there can be skipped.

Signed-off-by: Zhang Yanfei <zhangyan...@cn.fujitsu.com>
---
 arch/x86/include/asm/kexec.h |    3 +++
 arch/x86/kernel/crash.c      |   32 ++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 317ff17..d892211 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -163,6 +163,9 @@ struct kimage_arch {
 };
 #endif
 
+extern void (*crash_clear_loaded_vmcss)(void);
+DECLARE_PER_CPU(int, vmclear_skipped);
+
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_X86_KEXEC_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 13ad899..b9f264e 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -16,6 +16,7 @@
 #include <linux/delay.h>
 #include <linux/elf.h>
 #include <linux/elfcore.h>
+#include <linux/module.h>
 
 #include <asm/processor.h>
 #include <asm/hardirq.h>
@@ -30,6 +31,27 @@
 
 int in_crash_kexec;
 
+/*
+ * This is used to VMCLEAR vmcss loaded on all
+ * cpus. And when loading kvm_intel module, the
+ * function pointer will be made valid.
+ */
+void (*crash_clear_loaded_vmcss)(void) = NULL;
+EXPORT_SYMBOL_GPL(crash_clear_loaded_vmcss);
+
+DEFINE_PER_CPU(int, vmclear_skipped) = 1;
+EXPORT_SYMBOL_GPL(vmclear_skipped);
+
+static void cpu_emergency_clear_loaded_vmcss(void)
+{
+	int cpu = raw_smp_processor_id();
+	int skipped;
+
+	skipped = per_cpu(vmclear_skipped, cpu);
+	if (!skipped && crash_clear_loaded_vmcss)
+		crash_clear_loaded_vmcss();
+}
+
 #if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
 
 static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
@@ -46,6 +68,11 @@ static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
 #endif
 	crash_save_cpu(regs, cpu);
 
+	/*
+	 * VMCLEAR vmcss loaded on all cpus if needed.
+	 */
+	cpu_emergency_clear_loaded_vmcss();
+
 	/* Disable VMX or SVM if needed.
 	 *
 	 * We need to disable virtualization on all CPUs.
@@ -88,6 +115,11 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 
 	kdump_nmi_shootdown_cpus();
 
+	/*
+	 * VMCLEAR vmcss loaded on this cpu if needed.
+	 */
+	cpu_emergency_clear_loaded_vmcss();
+
 	/* Booting kdump kernel with VMX or SVM enabled won't work,
 	 * because (among other limitations) we can't disable paging
 	 * with the virt flags.
-- 
1.7.1
[PATCH v4 2/2] KVM: set/unset crash_clear_loaded_vmcss and vmclear_skipped in kvm_intel module
Signed-off-by: Zhang Yanfei <zhangyan...@cn.fujitsu.com>
---
 arch/x86/kvm/vmx.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 4ff0ab9..029ec7b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -41,6 +41,7 @@
 #include <asm/i387.h>
 #include <asm/xcr.h>
 #include <asm/perf_event.h>
+#include <asm/kexec.h>
 
 #include "trace.h"
 
@@ -963,6 +964,20 @@ static void vmcs_load(struct vmcs *vmcs)
 		       vmcs, phys_addr);
 }
 
+static inline void enable_vmclear_in_kdump(int cpu)
+{
+#ifdef CONFIG_KEXEC
+	per_cpu(vmclear_skipped, cpu) = 0;
+#endif
+}
+
+static inline void disable_vmclear_in_kdump(int cpu)
+{
+#ifdef CONFIG_KEXEC
+	per_cpu(vmclear_skipped, cpu) = 1;
+#endif
+}
+
 static void __loaded_vmcs_clear(void *arg)
 {
 	struct loaded_vmcs *loaded_vmcs = arg;
@@ -972,8 +987,10 @@ static void __loaded_vmcs_clear(void *arg)
 		return; /* vcpu migration can race with cpu offline */
 	if (per_cpu(current_vmcs, cpu) == loaded_vmcs->vmcs)
 		per_cpu(current_vmcs, cpu) = NULL;
+	disable_vmclear_in_kdump(cpu);
 	list_del(&loaded_vmcs->loaded_vmcss_on_cpu_link);
 	loaded_vmcs_init(loaded_vmcs);
+	enable_vmclear_in_kdump(cpu);
 }
 
 static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
@@ -1491,8 +1508,10 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
 		local_irq_disable();
+		disable_vmclear_in_kdump(cpu);
 		list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
 			 &per_cpu(loaded_vmcss_on_cpu, cpu));
+		enable_vmclear_in_kdump(cpu);
 		local_irq_enable();
 
 	/*
@@ -2302,6 +2321,9 @@ static int hardware_enable(void *garbage)
 		return -EBUSY;
 
 	INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
+
+	enable_vmclear_in_kdump(cpu);
+
 	rdmsrl(MSR_IA32_FEATURE_CONTROL, old);
 
 	test_bits = FEATURE_CONTROL_LOCKED;
@@ -2333,6 +2355,8 @@ static void vmclear_local_loaded_vmcss(void)
 	list_for_each_entry_safe(v, n, &per_cpu(loaded_vmcss_on_cpu, cpu),
 				 loaded_vmcss_on_cpu_link)
 		__loaded_vmcs_clear(v);
+
+	disable_vmclear_in_kdump(cpu);
 }
 
@@ -7230,6 +7254,10 @@ static int __init vmx_init(void)
 	if (r)
 		goto out3;
 
+#ifdef CONFIG_KEXEC
+	crash_clear_loaded_vmcss = vmclear_local_loaded_vmcss;
+#endif
+
 	vmx_disable_intercept_for_msr(MSR_FS_BASE, false);
 	vmx_disable_intercept_for_msr(MSR_GS_BASE, false);
 	vmx_disable_intercept_for_msr(MSR_KERNEL_GS_BASE, true);
@@ -7265,6 +7293,10 @@ static void __exit vmx_exit(void)
 	free_page((unsigned long)vmx_io_bitmap_b);
 	free_page((unsigned long)vmx_io_bitmap_a);
 
+#ifdef CONFIG_KEXEC
+	crash_clear_loaded_vmcss = NULL;
+#endif
+
 	kvm_exit();
 }
-- 
1.7.1
Re: [patch 10/16] x86: vdso: pvclock gettime support
On Wed, Oct 31, 2012 at 08:47:06PM -0200, Marcelo Tosatti wrote:
> Improve performance of time system calls when using Linux pvclock,
> by reading time info from fixmap visible copy of pvclock data.
>
> Originally from Jeremy Fitzhardinge.
>
> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
>
> Index: vsyscall/arch/x86/vdso/vclock_gettime.c
> ===================================================================
> --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c
> +++ vsyscall/arch/x86/vdso/vclock_gettime.c
> @@ -22,6 +22,7 @@
>  #include <asm/hpet.h>
>  #include <asm/unistd.h>
>  #include <asm/io.h>
> +#include <asm/pvclock.h>
>  
>  #define gtod (&VVAR(vsyscall_gtod_data))
>  
> @@ -62,6 +63,70 @@ static notrace cycle_t vread_hpet(void)
>  	return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0);
>  }
>  
> +#ifdef CONFIG_PARAVIRT_CLOCK
> +
> +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu)
> +{
> +	const aligned_pvti_t *pvti_base;
> +	int idx = cpu / (PAGE_SIZE/PVTI_SIZE);
> +	int offset = cpu % (PAGE_SIZE/PVTI_SIZE);
> +
> +	BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END);
> +
> +	pvti_base = (aligned_pvti_t *)__fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx);
> +
> +	return &pvti_base[offset].info;
> +}
> +
> +static notrace cycle_t vread_pvclock(int *mode)
> +{
> +	const struct pvclock_vsyscall_time_info *pvti;
> +	cycle_t ret;
> +	u64 last;
> +	u32 version;
> +	u32 migrate_count;
> +	u8 flags;
> +	unsigned cpu, cpu1;
> +
> +	/*
> +	 * When looping to get a consistent (time-info, tsc) pair, we
> +	 * also need to deal with the possibility we can switch vcpus,
> +	 * so make sure we always re-fetch time-info for the current vcpu.
> +	 */
> +	do {
> +		cpu = __getcpu() & VGETCPU_CPU_MASK;
> +		pvti = get_pvti(cpu);
> +
> +		migrate_count = pvti->migrate_count;
> +
> +		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
> +
> +		/*
> +		 * Test we're still on the cpu as well as the version.
> +		 * We could have been migrated just after the first
> +		 * vgetcpu but before fetching the version, so we
> +		 * wouldn't notice a version change.
> +		 */
> +		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
> +	} while (unlikely(cpu != cpu1 ||
> +			  (pvti->pvti.version & 1) ||
> +			  pvti->pvti.version != version ||
> +			  pvti->migrate_count != migrate_count));
> +
We can put the vcpu id into the higher bits of pvti.version. This will save
a couple of cycles by getting rid of the __getcpu() calls.

> +	if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT)))
> +		*mode = VCLOCK_NONE;
> +
> +	/* refer to tsc.c read_tsc() comment for rationale */
> +	last = VVAR(vsyscall_gtod_data).clock.cycle_last;
> +
> +	if (likely(ret >= last))
> +		return ret;
> +
> +	return last;
> +}
> +#endif
> +
>  notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
>  {
>  	long ret;
> @@ -80,7 +145,7 @@ notrace static long vdso_fallback_gtod(s
>  }
>  
> -notrace static inline u64 vgetsns(void)
> +notrace static inline u64 vgetsns(int *mode)
>  {
>  	long v;
>  	cycles_t cycles;
> @@ -88,6 +153,8 @@ notrace static inline u64 vgetsns(void)
>  		cycles = vread_tsc();
>  	else if (gtod->clock.vclock_mode == VCLOCK_HPET)
>  		cycles = vread_hpet();
> +	else if (gtod->clock.vclock_mode == VCLOCK_PVCLOCK)
> +		cycles = vread_pvclock(mode);
>  	else
>  		return 0;
>  	v = (cycles - gtod->clock.cycle_last) & gtod->clock.mask;
> @@ -107,7 +174,7 @@ notrace static int __always_inline do_re
>  		mode = gtod->clock.vclock_mode;
>  		ts->tv_sec = gtod->wall_time_sec;
>  		ns = gtod->wall_time_snsec;
> -		ns += vgetsns();
> +		ns += vgetsns(&mode);
>  		ns >>= gtod->clock.shift;
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  
> @@ -127,7 +194,7 @@ notrace static int do_monotonic(struct t
>  		mode = gtod->clock.vclock_mode;
>  		ts->tv_sec = gtod->monotonic_time_sec;
>  		ns = gtod->monotonic_time_snsec;
> -		ns += vgetsns();
> +		ns += vgetsns(&mode);
>  		ns >>= gtod->clock.shift;
>  	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
>  	timespec_add_ns(ts, ns);
> Index: vsyscall/arch/x86/include/asm/vsyscall.h
> ===================================================================
> --- vsyscall.orig/arch/x86/include/asm/vsyscall.h
> +++ vsyscall/arch/x86/include/asm/vsyscall.h
> @@ -33,6 +33,23 @@ extern void map_vsyscall(void);
>   */
>  extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address);
>  
> +#define VGETCPU_CPU_MASK 0xfff
> +
> +static inline unsigned int __getcpu(void)
> +{
> +	unsigned int p;
> +
> +	if (VVAR(vgetcpu_mode) == VGETCPU_RDTSCP) {
> +		/* Load per CPU data from RDTSCP */
> +
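[Editor's sketch] The loop in vread_pvclock() is a seqcount-style read: a snapshot is valid only if the version was even and unchanged across the read and the reader stayed on the same vcpu. A minimal validity check as a userspace sketch (the function name is illustrative, not part of the patch):

```c
#include <stdint.h>

/* Returns 1 if a (time-info, tsc) snapshot taken between the two
 * version/cpu reads is consistent:
 *  - an odd version means the hypervisor was mid-update,
 *  - a changed version means an update raced with the read,
 *  - a changed cpu means the task migrated between vcpus. */
int pvclock_snapshot_ok(uint32_t version_before, uint32_t version_after,
			unsigned int cpu_before, unsigned int cpu_after)
{
	return !(version_before & 1) &&
	       version_before == version_after &&
	       cpu_before == cpu_after;
}
```

vread_pvclock() additionally re-checks migrate_count, which catches a migration to another vcpu and back between the two __getcpu() calls; Gleb's suggestion of folding the vcpu id into the high bits of pvti.version would make the version comparison subsume the cpu comparison.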
Re: [PATCH] KVM: MMU: lazily drop large spte
On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
> Hi Marcelo,
>
> On 11/13/2012 07:10 AM, Marcelo Tosatti wrote:
>> On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
>>> Do not drop a large spte until it can be replaced by small pages, so
>>> that the guest can happily read memory through it.
>>>
>>> The idea is from Avi:
>>> | As I mentioned before, write-protecting a large spte is a good idea,
>>> | since it moves some work from protect-time to fault-time, so it reduces
>>> | jitter. This removes the need for the return value.
>>>
>>> Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
>>> ---
>>>  arch/x86/kvm/mmu.c | 34 +++++-------------------------
>>>  1 files changed, 9 insertions(+), 25 deletions(-)
>>
>> It's likely that other 4k pages are mapped read-write in the 2mb range
>> covered by a read-only 2mb map. Therefore it's not entirely useful to
>> map it read-only.
>
> It needs a page fault to install a pte even for a read access. After
> the change, that page fault can be avoided.
>
>> Can you measure an improvement with this change?
>
> I have a test case to measure the read time, which has been attached.
> It maps 4k pages at first (dirty-logged), then switches to large sptes
> (stops dirty-logging), and at the end measures the read access time
> after write-protecting the sptes.
>
> Before: 23314111 ns
> After:  11404197 ns

Ok, I'm concerned about cases similar to e49146dce8c3dc6f44 (with shadow), that is:

- a large page must be destroyed when write-protecting due to a shadowed page.
- with shadow, it does not make sense to write-protect large sptes, as mentioned earlier.

So I wonder why this part of your patch

-	if (level > PT_PAGE_TABLE_LEVEL &&
-	    has_wrprotected_page(vcpu->kvm, gfn, level)) {
-		ret = 1;
-		drop_spte(vcpu->kvm, sptep);
-		goto done;
-	}

is necessary (assuming EPT is in use).
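[Editor's sketch] The difference being discussed comes down to which accesses fault after write protection. A toy model, not the kvm code; the spte bits and helper names here are invented for illustration:

```c
#include <stdint.h>

#define SPTE_PRESENT	(1ull << 0)
#define SPTE_WRITABLE	(1ull << 1)

/* Old behaviour: write-protecting dropped the large spte entirely,
 * so even the next read had to fault to rebuild a mapping. */
uint64_t wrprotect_by_drop(uint64_t spte)
{
	return 0;
}

/* Patched behaviour: keep the large spte present but read-only;
 * reads keep hitting, and only the first write faults (where the
 * large mapping is then split into small pages). */
uint64_t wrprotect_lazy(uint64_t spte)
{
	return spte & ~SPTE_WRITABLE;
}

int read_faults(uint64_t spte)
{
	return !(spte & SPTE_PRESENT);
}

int write_faults(uint64_t spte)
{
	return !(spte & SPTE_PRESENT) || !(spte & SPTE_WRITABLE);
}
```

This is why the measured read time roughly halves in Xiao's test: after lazy write protection, reads never leave the fast path, while dirty tracking still sees every first write.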
Re: [PATCH] KVM: MMU: lazily drop large spte
On Wed, Nov 14, 2012 at 12:33:50AM +0900, Takuya Yoshikawa wrote:
> Ccing live migration developers who should be interested in this work,
>
> On Mon, 12 Nov 2012 21:10:32 -0200
> Marcelo Tosatti <mtosa...@redhat.com> wrote:
>
>> On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:
>>> Do not drop a large spte until it can be replaced by small pages, so
>>> that the guest can happily read memory through it.
>>>
>>> The idea is from Avi:
>>> | As I mentioned before, write-protecting a large spte is a good idea,
>>> | since it moves some work from protect-time to fault-time, so it reduces
>>> | jitter. This removes the need for the return value.
>>>
>>> Signed-off-by: Xiao Guangrong <xiaoguangr...@linux.vnet.ibm.com>
>>> ---
>>>  arch/x86/kvm/mmu.c | 34 +++++-------------------------
>>>  1 files changed, 9 insertions(+), 25 deletions(-)
>>
>> It's likely that other 4k pages are mapped read-write in the 2mb range
>> covered by a read-only 2mb map. Therefore it's not entirely useful to
>> map it read-only.
>>
>> Can you measure an improvement with this change?
>
> What we discussed at KVM Forum last week was the jitter we could measure
> right after starting live migration: both Isaku and Chegu reported such
> jitter. So if this patch reduces such jitter for some real workloads, by
> lazily dropping largepage mappings and saving read faults until that
> point, that would be very nice!
>
> But sadly, what they measured included interactions with the outside of
> the guest, and the main cause, they guessed, was the big QEMU lock
> problem. The orders of magnitude are so different that an improvement
> from a kernel-side effort may not be seen easily.
>
> FWIW: I am now changing the initial write protection by
> kvm_mmu_slot_remove_write_access() to be rmap based, as I proposed at
> KVM Forum. ftrace said that 1ms was improved to 250-350us by the change
> for a 10GB guest. My code still drops largepage mappings, so the initial
> write protection time itself may not be such a big issue here, I think.
>
> Again, if we can eliminate read faults to such an extent that guests can
> see measurable improvement, that should be very nice!
>
> Any thoughts?
>
> Thanks,
> 	Takuya

OK, makes sense. I'm worried about shadow / oos interactions with large read-only mappings (trying to remember what the case was exactly; it might be non-existent now).
Re: [PULL 0/3] vfio-pci for 1.3-rc0
Alex Williamson <alex.william...@redhat.com> writes:
> Hi Anthony,
>
> Please pull the tag below. I posted the linux-headers update separately
> on Oct-15; since it hasn't been applied and should be non-controversial,
> I include it again here. Thanks,
>
> Alex

Pulled. Thanks.

Regards,

Anthony Liguori

> The following changes since commit f5022a135e4309a54d433c69b2a056756b2d0d6b:
>
>   aio: fix aio_ctx_prepare with idle bottom halves (2012-11-12 20:02:09 +0400)
>
> are available in the git repository at:
>
>   git://github.com/awilliam/qemu-vfio.git tags/vfio-pci-for-qemu-1.3.0-rc0
>
> for you to fetch changes up to a771c51703cf9f91023c6570426258bdf5ec775b:
>
>   vfio-pci: Use common msi_get_message (2012-11-13 12:27:40 -0700)
>
> ----------------------------------------------------------------
> vfio-pci: KVM INTx accel & common msi_get_message
>
> ----------------------------------------------------------------
> Alex Williamson (3):
>       linux-headers: Update to 3.7-rc5
>       vfio-pci: Add KVM INTx acceleration
>       vfio-pci: Use common msi_get_message
>
>  hw/vfio_pci.c                        | 210 +++++++++++++++++++++++++++------
>  linux-headers/asm-powerpc/kvm_para.h |   6 +-
>  linux-headers/asm-s390/kvm_para.h    |   8 +-
>  linux-headers/asm-x86/kvm.h          |  17 +++
>  linux-headers/linux/kvm.h            |  25 ++++-
>  linux-headers/linux/kvm_para.h       |   6 +-
>  linux-headers/linux/vfio.h           |   6 +-
>  linux-headers/linux/virtio_config.h  |   6 +-
>  linux-headers/linux/virtio_ring.h    |   6 +-
>  9 files changed, 241 insertions(+), 49 deletions(-)
Re: interrupt remapping support
On Wed, Nov 14, 2012 at 05:46:45PM +0100, emdel wrote:
> On Tue, Nov 13, 2012 at 10:29 AM, Gleb Natapov <g...@redhat.com> wrote:
>> KVM does not implement the VT-d spec, if that is your question.
>
> Hello everybody,
> following this link [1] it looks like we can configure pass-through
> devices for KVM guests. If that is the case, and since, as you said,
> KVM doesn't implement any VT-d specification, are there any protections
> in place against DMA attacks? Any help with this will be appreciated.

KVM uses VT-d on the host for device assignment. A guest running inside KVM will not see VT-d, though, since KVM does not emulate it.

--
	Gleb.
Re: interrupt remapping support
On Wed, Nov 14, 2012 at 06:31:30PM +0100, emdel wrote:
> On Wed, Nov 14, 2012 at 6:06 PM, Gleb Natapov <g...@redhat.com> wrote:
>> KVM uses VT-d on the host for device assignment. A guest running inside
>> KVM will not see VT-d, though, since KVM does not emulate it.
>
> So another question comes to mind: what's the purpose of host device
> assignment if I cannot use it for the guest?

I do not think you understand what I am saying. You can use device assignment for a guest.

--
	Gleb.
Re: [PATCH 04/20] KVM/MIPS32: MIPS arch specific APIs for KVM
On Nov 1, 2012, at 11:18 AM, Avi Kivity wrote:
>> +
>> +	/* Set the appropriate status bits based on host CPU features,
>> +	 * before we hit the scheduler */
>> +	kvm_mips_set_c0_status();
>> +
>> +	local_irq_enable();
>
> Ah, so you handle exits with interrupts enabled. But that's not how we
> usually do it; the standard pattern is
>
>     while (can continue)
>         disable interrupts
>         enter guest
>         enable interrupts
>         process exit

A bit more detail here. KVM/MIPS has its own set of exception handlers, which are separate from the host kernel's handlers. We switch between the two sets of handlers by setting the Exception Base Register (EBASE).

We enable host interrupts just before we switch to guest context, so that we trap when the host gets a timer or I/O interrupt. When an exception does occur in guest context, the KVM/MIPS handlers save the guest context and switch back to the default host kernel exception handlers. We enter the C handler (kvm_mips_handle_exit()) with interrupts disabled, per the MIPS architecture, and explicitly enable them there. This allows the host kernel to handle any pending interrupts.

The sequence is as follows:

    while (can continue)
        disable interrupts
        trampoline code to save host kernel context, load guest context
        enable host interrupts
        enter guest context
        KVM/MIPS trap handler (called with interrupts disabled, per MIPS architecture)
        restore host Linux context, set up stack to handle exception
        jump to C handler
        enable interrupts before handling VM exit

Regards
Sanjay
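[Editor's sketch] The ordering Sanjay describes can be checked with a small host-side model. This is not the KVM/MIPS code, only a sketch that records the sequence of steps in one run-loop iteration (all names are illustrative):

```c
#include <string.h>

static char seq[128];

/* Append one named step to the recorded sequence. */
static void step(const char *s)
{
	strcat(seq, s);
	strcat(seq, ";");
}

/* C exit handler: entered with interrupts disabled (per the MIPS
 * architecture) and enables them itself before processing the exit. */
static void kvm_mips_handle_exit_model(void)
{
	step("irq_on");
	step("process_exit");
}

/* One iteration of the run loop as described above. */
const char *run_guest_once(void)
{
	seq[0] = '\0';
	step("irq_off");	/* outer loop disables interrupts */
	step("load_guest");	/* trampoline: save host, load guest context */
	step("irq_on_guest");	/* host interrupts enabled in guest context */
	step("trap");		/* exception: back to host handlers, irqs off */
	kvm_mips_handle_exit_model();
	return seq;
}
```

The point of the model is that, unlike the standard pattern Avi quotes, interrupts are re-enabled inside the exit handler rather than in the outer loop, and they are already enabled while the guest runs.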
Re: [patch 10/16] x86: vdso: pvclock gettime support
On Wed, Nov 14, 2012 at 12:42:48PM +0200, Gleb Natapov wrote:
> On Wed, Oct 31, 2012 at 08:47:06PM -0200, Marcelo Tosatti wrote:
>> Improve performance of time system calls when using Linux pvclock,
>> by reading time info from fixmap visible copy of pvclock data.
>>
>> Originally from Jeremy Fitzhardinge.
>>
>> Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
>>
>> [...]
>>
>> +	do {
>> +		cpu = __getcpu() & VGETCPU_CPU_MASK;
>> +		pvti = get_pvti(cpu);
>> +
>> +		migrate_count = pvti->migrate_count;
>> +
>> +		version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags);
>> +
>> +		/*
>> +		 * Test we're still on the cpu as well as the version.
>> +		 * We could have been migrated just after the first
>> +		 * vgetcpu but before fetching the version, so we
>> +		 * wouldn't notice a version change.
>> +		 */
>> +		cpu1 = __getcpu() & VGETCPU_CPU_MASK;
>> +	} while (unlikely(cpu != cpu1 ||
>> +			  (pvti->pvti.version & 1) ||
>> +			  pvti->pvti.version != version ||
>> +			  pvti->migrate_count != migrate_count));
>
> We can put the vcpu id into the higher bits of pvti.version. This will
> save a couple of cycles by getting rid of the __getcpu() calls.

Yes. Added as a comment in the code.
Re: [PATCH] KVM: MMU: lazily drop large spte
On 11/14/2012 10:37 PM, Marcelo Tosatti wrote:
> On Tue, Nov 13, 2012 at 04:26:16PM +0800, Xiao Guangrong wrote:
>> It needs a page fault to install a pte even for a read access. After
>> the change, that page fault can be avoided.
>>
>> I have a test case to measure the read time, which has been attached.
>> It maps 4k pages at first (dirty-logged), then switches to large sptes
>> (stops dirty-logging), and at the end measures the read access time
>> after write-protecting the sptes.
>>
>> Before: 23314111 ns
>> After:  11404197 ns
>
> Ok, I'm concerned about cases similar to e49146dce8c3dc6f44 (with
> shadow), that is:
> - a large page must be destroyed when write-protecting due to a
>   shadowed page.
> - with shadow, it does not make sense to write-protect large sptes, as
>   mentioned earlier.

This case is removed now. The code when e49146dce8c3dc6f44 was applied was:

|	pt = sp->spt;
|	for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
|		/* avoid RMW */
|		if (is_writable_pte(pt[i]))
|			update_spte(&pt[i], pt[i] & ~PT_WRITABLE_MASK);
|	}

The real problem in this code is that it would write-protect the spte even if it was not a last spte, which caused middle-level shadow page tables to be write-protected. So e49146dce8c3dc6f44 added this code:

|	if (sp->role.level != PT_PAGE_TABLE_LEVEL)
|		continue;

which was good to fix this problem.

Now, the current code is:

|	for (i = 0; i < PT64_ENT_PER_PAGE; ++i) {
|		if (!is_shadow_present_pte(pt[i]) ||
|		      !is_last_spte(pt[i], sp->role.level))
|			continue;
|
|		spte_write_protect(kvm, &pt[i], &flush, false);
|	}

It only write-protects the last spte, so it allows large sptes to exist. (A large spte can still be broken by drop_large_spte() on the page-fault path.)

> So I wonder why this part of your patch
>
> -	if (level > PT_PAGE_TABLE_LEVEL &&
> -	    has_wrprotected_page(vcpu->kvm, gfn, level)) {
> -		ret = 1;
> -		drop_spte(vcpu->kvm, sptep);
> -		goto done;
> -	}
>
> is necessary (assuming EPT is in use).

This is safe; we change that code to:

-	if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
+	if ((level > PT_PAGE_TABLE_LEVEL &&
+	    has_wrprotected_page(vcpu->kvm, gfn, level)) ||
+	      mmu_need_write_protect(vcpu, gfn, can_unsync)) {
 		pgprintk("%s: found shadow page for %llx, marking ro\n",
 			 __func__, gfn);
 		ret = 1;

The spte becomes read-only, which ensures the shadowed gfn cannot be changed. Btw, the original code allows creating a read-only spte in this case if !(pte_access & WRITABLE).
Re: [PATCH] KVM: MMU: lazily drop large spte
On 11/14/2012 10:44 PM, Marcelo Tosatti wrote: On Wed, Nov 14, 2012 at 12:33:50AM +0900, Takuya Yoshikawa wrote: Ccing live migration developers who should be interested in this work, On Mon, 12 Nov 2012 21:10:32 -0200 Marcelo Tosatti mtosa...@redhat.com wrote: On Mon, Nov 05, 2012 at 05:59:26PM +0800, Xiao Guangrong wrote:

Do not drop a large spte until it can be replaced by small pages, so that the guest can happily read memory through it.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter.

This removes the need for the return value.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c | 34 +-
 1 files changed, 9 insertions(+), 25 deletions(-)

It's likely that other 4k pages are mapped read-write in the 2mb range covered by a read-only 2mb map. Therefore it's not entirely useful to map it read-only.

Can you measure an improvement with this change?

What we discussed at KVM Forum last week was the jitter we could measure right after starting live migration: both Isaku and Chegu reported such jitter. So if this patch reduces such jitter for some real workloads, by lazily dropping largepage mappings and saving read faults until that point, that would be very nice!

But sadly, what they measured included interactions with the outside of the guest, and they guessed the main cause was the big QEMU lock problem. The orders of magnitude are so different that an improvement from a kernel-side effort may not be seen easily.

FWIW: I am now changing the initial write protection by kvm_mmu_slot_remove_write_access() to be rmap based, as I proposed at KVM Forum. ftrace said the change improved 1ms to 250-350us for a 10GB guest. My code still drops largepage mappings, so the initial write protection time itself may not be such a big issue here, I think.
Again, if we can eliminate read faults to such an extent that guests see a measurable improvement, that would be very nice! Any thoughts?

Thanks, Takuya

OK, makes sense. I'm worried about shadow / oos interactions with large read-only mappings (trying to remember what the case was exactly; it might be non-existent now).

Marcelo, I guess commit 38187c830cab84daecb41169948467f1f19317e3 is what you mentioned, but I do not know how it can "simplify out of sync shadow". :(
Re: PROBLEM: compilation issue, inline assembly arch/x86/kvm/emulate.c fails at -O0
On 11/14/2012 11:45 AM, Blower, Melanie wrote:
[1.] gcc -O0 assembly of arch/x86/kvm/emulate.c gets a compilation failure -- incorrect register restrictions
[2.] Full description of the problem/report: I'm trying to compile this file at -O0, but gcc chokes during register allocation at the inline assembly. In the ordinary Linux build, this file compiles with gcc at -O2 without compilation errors.

Compiling with -O0 is not really expected to work (although -O1 *is*); that said, what you are reporting is an actual bug ("+a" : "a" should be either "+a" alone or "=a" : "a").

-hpa

-- H. Peter Anvin, Intel Open Source Technology Center. I work for Intel. I don't speak on their behalf.
[PATCH 0/2] [PULL] qemu-kvm.git uq/master queue
The following changes since commit ce34cf72fe508b27a78f83c184142e8d1e6a048a:

  Merge remote-tracking branch 'awilliam/tags/vfio-pci-for-qemu-1.3.0-rc0' into staging (2012-11-14 08:53:40 -0600)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master

Jan Kiszka (1):
      kvm: Actually remove software breakpoints from list on cleanup

Marcelo Tosatti (1):
      acpi_piix4: fix migration of gpe fields

 hw/acpi_piix4.c | 50 ++
 kvm-all.c       |  2 ++
 2 files changed, 48 insertions(+), 4 deletions(-)
[patch 11/18] x86: vdso: pvclock gettime support
Improve performance of time system calls when using Linux pvclock, by reading time info from fixmap visible copy of pvclock data. Originally from Jeremy Fitzhardinge. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/vdso/vclock_gettime.c === --- vsyscall.orig/arch/x86/vdso/vclock_gettime.c +++ vsyscall/arch/x86/vdso/vclock_gettime.c @@ -22,6 +22,7 @@ #include asm/hpet.h #include asm/unistd.h #include asm/io.h +#include asm/pvclock.h #define gtod (VVAR(vsyscall_gtod_data)) @@ -62,6 +63,76 @@ static notrace cycle_t vread_hpet(void) return readl((const void __iomem *)fix_to_virt(VSYSCALL_HPET) + 0xf0); } +#ifdef CONFIG_PARAVIRT_CLOCK + +static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) +{ + const struct pvclock_vsyscall_time_info *pvti_base; + int idx = cpu / (PAGE_SIZE/PVTI_SIZE); + int offset = cpu % (PAGE_SIZE/PVTI_SIZE); + + BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx PVCLOCK_FIXMAP_END); + + pvti_base = (struct pvclock_vsyscall_time_info *) + __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx); + + return pvti_base[offset]; +} + +static notrace cycle_t vread_pvclock(int *mode) +{ + const struct pvclock_vsyscall_time_info *pvti; + cycle_t ret; + u64 last; + u32 version; + u32 migrate_count; + u8 flags; + unsigned cpu, cpu1; + + + /* +* When looping to get a consistent (time-info, tsc) pair, we +* also need to deal with the possibility we can switch vcpus, +* so make sure we always re-fetch time-info for the current vcpu. +*/ + do { + cpu = __getcpu() VGETCPU_CPU_MASK; + /* TODO: We can put vcpu id into higher bits of pvti.version. +* This will save a couple of cycles by getting rid of +* __getcpu() calls (Gleb). +*/ + + pvti = get_pvti(cpu); + + migrate_count = pvti-migrate_count; + + version = __pvclock_read_cycles(pvti-pvti, ret, flags); + + /* +* Test we're still on the cpu as well as the version. 
+* We could have been migrated just after the first +* vgetcpu but before fetching the version, so we +* wouldn't notice a version change. +*/ + cpu1 = __getcpu() VGETCPU_CPU_MASK; + } while (unlikely(cpu != cpu1 || + (pvti-pvti.version 1) || + pvti-pvti.version != version || + pvti-migrate_count != migrate_count)); + + if (unlikely(!(flags PVCLOCK_TSC_STABLE_BIT))) + *mode = VCLOCK_NONE; + + /* refer to tsc.c read_tsc() comment for rationale */ + last = VVAR(vsyscall_gtod_data).clock.cycle_last; + + if (likely(ret = last)) + return ret; + + return last; +} +#endif + notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) { long ret; @@ -80,7 +151,7 @@ notrace static long vdso_fallback_gtod(s } -notrace static inline u64 vgetsns(void) +notrace static inline u64 vgetsns(int *mode) { long v; cycles_t cycles; @@ -88,6 +159,8 @@ notrace static inline u64 vgetsns(void) cycles = vread_tsc(); else if (gtod-clock.vclock_mode == VCLOCK_HPET) cycles = vread_hpet(); + else if (gtod-clock.vclock_mode == VCLOCK_PVCLOCK) + cycles = vread_pvclock(mode); else return 0; v = (cycles - gtod-clock.cycle_last) gtod-clock.mask; @@ -107,7 +180,7 @@ notrace static int __always_inline do_re mode = gtod-clock.vclock_mode; ts-tv_sec = gtod-wall_time_sec; ns = gtod-wall_time_snsec; - ns += vgetsns(); + ns += vgetsns(mode); ns = gtod-clock.shift; } while (unlikely(read_seqcount_retry(gtod-seq, seq))); @@ -127,7 +200,7 @@ notrace static int do_monotonic(struct t mode = gtod-clock.vclock_mode; ts-tv_sec = gtod-monotonic_time_sec; ns = gtod-monotonic_time_snsec; - ns += vgetsns(); + ns += vgetsns(mode); ns = gtod-clock.shift; } while (unlikely(read_seqcount_retry(gtod-seq, seq))); timespec_add_ns(ts, ns); Index: vsyscall/arch/x86/include/asm/vsyscall.h === --- vsyscall.orig/arch/x86/include/asm/vsyscall.h +++ vsyscall/arch/x86/include/asm/vsyscall.h @@ -33,6 +33,23 @@ extern void map_vsyscall(void); */ extern bool emulate_vsyscall(struct pt_regs *regs, unsigned long address); 
+#define VGETCPU_CPU_MASK 0xfff + +static inline unsigned int __getcpu(void) +{ + unsigned int p; + + if (VVAR(vgetcpu_mode) ==
[patch 07/18] x86: pvclock: add note about rdtsc barriers
As noted by Gleb, not advertising SSE2 support implies no RDTSC barriers.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -74,6 +74,9 @@ unsigned __pvclock_read_cycles(const str
 	u8 ret_flags;
 
 	version = src->version;
+	/* Note: emulated platforms which do not advertise SSE2 support
+	 * result in kvmclock not using the necessary RDTSC barriers.
+	 */
 	rdtsc_barrier();
 	offset = pvclock_get_nsec_offset(src);
 	ret = src->system_time + offset;
[patch 09/18] x86: pvclock: generic pvclock vsyscall initialization
Originally from Jeremy Fitzhardinge. Introduce generic, non hypervisor specific, pvclock initialization routines. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -17,6 +17,10 @@ #include linux/kernel.h #include linux/percpu.h +#include linux/notifier.h +#include linux/sched.h +#include linux/gfp.h +#include linux/bootmem.h #include asm/pvclock.h static u8 valid_flags __read_mostly = 0; @@ -122,3 +126,68 @@ void pvclock_read_wallclock(struct pvclo set_normalized_timespec(ts, now.tv_sec, now.tv_nsec); } + +static struct pvclock_vsyscall_time_info *pvclock_vdso_info; + +static struct pvclock_vsyscall_time_info * +pvclock_get_vsyscall_user_time_info(int cpu) +{ + if (!pvclock_vdso_info) { + BUG(); + return NULL; + } + + return pvclock_vdso_info[cpu]; +} + +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu) +{ + return pvclock_get_vsyscall_user_time_info(cpu)-pvti; +} + +int pvclock_task_migrate(struct notifier_block *nb, unsigned long l, void *v) +{ + struct task_migration_notifier *mn = v; + struct pvclock_vsyscall_time_info *pvti; + + pvti = pvclock_get_vsyscall_user_time_info(mn-from_cpu); + + /* this is NULL when pvclock vsyscall is not initialized */ + if (unlikely(pvti == NULL)) + return NOTIFY_DONE; + + pvti-migrate_count++; + + return NOTIFY_DONE; +} + +static struct notifier_block pvclock_migrate = { + .notifier_call = pvclock_task_migrate, +}; + +/* + * Initialize the generic pvclock vsyscall state. 
This will allocate + * a/some page(s) for the per-vcpu pvclock information, set up a + * fixmap mapping for the page(s) + */ + +int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i, +int size) +{ + int idx; + + WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE); + + pvclock_vdso_info = i; + + for (idx = 0; idx = (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) { + __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx, +__pa_symbol(i) + (idx*PAGE_SIZE), +PAGE_KERNEL_VVAR); + } + + + register_task_migration_notifier(pvclock_migrate); + + return 0; +} Index: vsyscall/arch/x86/include/asm/fixmap.h === --- vsyscall.orig/arch/x86/include/asm/fixmap.h +++ vsyscall/arch/x86/include/asm/fixmap.h @@ -19,6 +19,7 @@ #include asm/acpi.h #include asm/apicdef.h #include asm/page.h +#include asm/pvclock.h #ifdef CONFIG_X86_32 #include linux/threads.h #include asm/kmap_types.h @@ -81,6 +82,10 @@ enum fixed_addresses { VVAR_PAGE, VSYSCALL_HPET, #endif +#ifdef CONFIG_PARAVIRT_CLOCK + PVCLOCK_FIXMAP_BEGIN, + PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1, +#endif FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT Index: vsyscall/arch/x86/include/asm/pvclock.h === --- vsyscall.orig/arch/x86/include/asm/pvclock.h +++ vsyscall/arch/x86/include/asm/pvclock.h @@ -85,4 +85,16 @@ unsigned __pvclock_read_cycles(const str return version; } +struct pvclock_vsyscall_time_info { + struct pvclock_vcpu_time_info pvti; + u32 migrate_count; +} __attribute__((__aligned__(SMP_CACHE_BYTES))); + +#define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info) +#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1) + +int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i, +int size); +struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu); + #endif /* _ASM_X86_PVCLOCK_H */ Index: vsyscall/arch/x86/include/asm/clocksource.h === --- vsyscall.orig/arch/x86/include/asm/clocksource.h +++ 
vsyscall/arch/x86/include/asm/clocksource.h
@@ -8,6 +8,7 @@
 #define VCLOCK_NONE 0    /* No vDSO clock available.      */
 #define VCLOCK_TSC  1    /* vDSO should use vread_tsc.    */
 #define VCLOCK_HPET 2    /* vDSO should use vread_hpet.   */
+#define VCLOCK_PVCLOCK 3 /* vDSO should use vread_pvclock. */
 
 struct arch_clocksource_data {
 	int vclock_mode;
[PATCH 2/2] kvm: Actually remove software breakpoints from list on cleanup
From: Jan Kiszka jan.kis...@siemens.com

So far we only removed them from the guest, leaving their state in the list. This made it impossible for gdb to re-enable breakpoints on the same address after re-attaching.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
---
 kvm-all.c | 2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index b6d0483..3bc3347 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -1905,6 +1905,8 @@ void kvm_remove_all_breakpoints(CPUArchState *current_env)
                 }
             }
         }
+        QTAILQ_REMOVE(&s->kvm_sw_breakpoints, bp, entry);
+        g_free(bp);
     }
 
     kvm_arch_remove_all_hw_breakpoints();
-- 
1.7.6.4
[patch 00/18] pvclock vsyscall support + KVM hypervisor support (v4)
This patchset, based on earlier work by Jeremy Fitzhardinge, implements paravirtual clock vsyscall support. It should be possible to implement Xen support relatively easily. It reduces clock_gettime from 500 cycles to 200 cycles on my testbox. Please review.

v4:
- remove aligned_pvti structure, align directly (Glauber)
- add comments to migration notifier (Glauber)
- mark migration notifier condition as unlikely (Glauber)
- add comment about rdtsc barrier dependency on sse2 (Gleb)
- add idea to improve vdso gettime call (Gleb)
- remove new msr interface, reuse kernel copy of pvclock data (Glauber)
- move copying of timekeeping data from generic timekeeping code to kvm code (John)

v3:
- fix PVCLOCK_VSYSCALL_NR_PAGES definition (glommer)
- fold flags race fix into pvclock refactoring (avi)
- remove CONFIG_PARAVIRT_CLOCK_VSYSCALL (glommer)
- add reference to tsc.c from vclock_gettime.c about cycle_last rationale (glommer)
- fix whitespace damage (glommer)

v2:
- Do not allow visibility of different system_timestamp, tsc_timestamp tuples.
- Add option to disable vsyscall.
[patch 08/18] sched: add notifier for cross-cpu migrations
Originally from Jeremy Fitzhardinge.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/include/linux/sched.h
===
--- vsyscall.orig/include/linux/sched.h
+++ vsyscall/include/linux/sched.h
@@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void)
 extern void calc_global_load(unsigned long ticks);
 extern void update_cpu_load_nohz(void);
 
+/* Notifier for when a task gets migrated to a new CPU */
+struct task_migration_notifier {
+	struct task_struct *task;
+	int from_cpu;
+	int to_cpu;
+};
+extern void register_task_migration_notifier(struct notifier_block *n);
+
 extern unsigned long get_parent_ip(unsigned long addr);
 
 struct seq_file;

Index: vsyscall/kernel/sched/core.c
===
--- vsyscall.orig/kernel/sched/core.c
+++ vsyscall/kernel/sched/core.c
@@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s
 		rq->skip_clock_update = 1;
 }
 
+static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
+
+void register_task_migration_notifier(struct notifier_block *n)
+{
+	atomic_notifier_chain_register(&task_migration_notifier, n);
+}
+
 #ifdef CONFIG_SMP
 void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 {
@@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p,
 	trace_sched_migrate_task(p, new_cpu);
 
 	if (task_cpu(p) != new_cpu) {
+		struct task_migration_notifier tmn;
+
 		p->se.nr_migrations++;
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
+
+		tmn.task = p;
+		tmn.from_cpu = task_cpu(p);
+		tmn.to_cpu = new_cpu;
+
+		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
 	}
 
 	__set_task_cpu(p, new_cpu);
[patch 15/18] KVM: x86: implement PVCLOCK_TSC_STABLE_BIT pvclock flag
KVM added a global variable to guarantee monotonicity in the guest. One of the reasons for that is that the time between

1. ktime_get_ts(&timespec);
2. rdtscll(tsc);

is variable. That is, given a host with a stable TSC, suppose that two VCPUs read the same time via ktime_get_ts() above. The time required to execute 2. is not the same on those two instances executing on different VCPUs (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when calculating the monotonic time is the same value used to calculate the tsc_timestamp value stored in the pvclock data structure, and a single system_timestamp, tsc_timestamp tuple is visible to all vcpus simultaneously, this problem disappears. See the comment on top of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by the synchronicity of the host TSCs and guest TSCs.

Set the TSC stable pvclock flag in that case, allowing the guest to read the clock from userspace.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kvm/x86.c
===
--- vsyscall.orig/arch/x86/kvm/x86.c
+++ vsyscall/arch/x86/kvm/x86.c
@@ -1186,21 +1186,166 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu
 EXPORT_SYMBOL_GPL(kvm_write_tsc);
 
+static cycle_t read_tsc(void)
+{
+	cycle_t ret;
+	u64 last;
+
+	/*
+	 * Empirically, a fence (of type that depends on the CPU)
+	 * before rdtsc is enough to ensure that rdtsc is ordered
+	 * with respect to loads. The various CPU manuals are unclear
+	 * as to whether rdtsc can be reordered with later loads,
+	 * but no one has ever seen it happen.
+	 */
+	rdtsc_barrier();
+	ret = (cycle_t)vget_cycles();
+
+	last = pvclock_gtod_data.clock.cycle_last;
+
+	if (likely(ret >= last))
+		return ret;
+
+	/*
+	 * GCC likes to generate cmov here, but this branch is extremely
+	 * predictable (it's just a function of time and the likely is
+	 * very likely) and there's a data dependence, so force GCC
+	 * to generate a branch instead. I don't barrier() because
+	 * we don't actually need a barrier, and if this function
+	 * ever gets inlined it will generate worse code.
+	 */
+	asm volatile ("");
+	return last;
+}
+
+static inline u64 vgettsc(cycle_t *cycle_now)
+{
+	long v;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	*cycle_now = read_tsc();
+
+	v = (*cycle_now - gtod->clock.cycle_last) & gtod->clock.mask;
+	return v * gtod->clock.mult;
+}
+
+static int do_monotonic(struct timespec *ts, cycle_t *cycle_now)
+{
+	unsigned long seq;
+	u64 ns;
+	int mode;
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+
+	ts->tv_nsec = 0;
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		mode = gtod->clock.vclock_mode;
+		ts->tv_sec = gtod->monotonic_time_sec;
+		ns = gtod->monotonic_time_snsec;
+		ns += vgettsc(cycle_now);
+		ns >>= gtod->clock.shift;
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	timespec_add_ns(ts, ns);
+
+	return mode;
+}
+
+/* returns true if host is using tsc clocksource */
+static bool kvm_get_time_and_clockread(s64 *kernel_ns, cycle_t *cycle_now)
+{
+	struct timespec ts;
+
+	/* checked again under seqlock below */
+	if (pvclock_gtod_data.clock.vclock_mode != VCLOCK_TSC)
+		return false;
+
+	if (do_monotonic(&ts, cycle_now) != VCLOCK_TSC)
+		return false;
+
+	monotonic_to_bootbased(&ts);
+	*kernel_ns = timespec_to_ns(&ts);
+
+	return true;
+}
+
+/*
+ * Assuming a stable TSC across physical CPUs, the following condition
+ * is possible. Each numbered line represents an event visible to both
+ * CPUs at the next numbered event.
+ *
+ * "timespecX" represents host monotonic time. "tscX" represents
+ * RDTSC value.
+ *
+ *		VCPU0 on CPU0		|	VCPU1 on CPU1
+ *
+ * 1. read timespec0,tsc0
+ * 2.					| timespec1 = timespec0 + N
+ *					| tsc1 = tsc0 + M
+ * 3. transition to guest		| transition to guest
+ * 4. ret0 = timespec0 + (rdtsc - tsc0) |
+ * 5.					| ret1 = timespec1 + (rdtsc - tsc1)
+ *					| ret1 = timespec0 + N + (rdtsc - (tsc0 + M))
+ *
+ * Since ret0 update is visible to VCPU1 at time 5, to obey monotonicity:
+ *
+ *	- ret0 <= ret1
+ *	- timespec0 + (rdtsc - tsc0) <= timespec0 + N + (rdtsc - (tsc0 + M))
+ *		...
+ *	- 0 <= N - M  =>  M <= N
+ *
 * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
 * always the case (the difference between two distinct xtime instances
 * might be smaller than the difference between corresponding
[patch 13/18] time: export time information for KVM pvclock
As suggested by John, export time data similarly to how its done by vsyscall support. This allows KVM to retrieve necessary information to implement vsyscall support in KVM guests. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/include/linux/pvclock_gtod.h === --- /dev/null +++ vsyscall/include/linux/pvclock_gtod.h @@ -0,0 +1,9 @@ +#ifndef _PVCLOCK_GTOD_H +#define _PVCLOCK_GTOD_H + +#include linux/notifier.h + +extern int pvclock_gtod_register_notifier(struct notifier_block *nb); +extern int pvclock_gtod_unregister_notifier(struct notifier_block *nb); + +#endif /* _PVCLOCK_GTOD_H */ Index: vsyscall/kernel/time/timekeeping.c === --- vsyscall.orig/kernel/time/timekeeping.c +++ vsyscall/kernel/time/timekeeping.c @@ -21,6 +21,7 @@ #include linux/time.h #include linux/tick.h #include linux/stop_machine.h +#include linux/pvclock_gtod.h static struct timekeeper timekeeper; @@ -180,6 +181,54 @@ static inline s64 timekeeping_get_ns_raw return nsec + arch_gettimeoffset(); } +static RAW_NOTIFIER_HEAD(pvclock_gtod_chain); + +static void update_pvclock_gtod(struct timekeeper *tk) +{ + raw_notifier_call_chain(pvclock_gtod_chain, 0, tk); +} + +/** + * pvclock_gtod_register_notifier - register a pvclock timedata update listener + * + * Must hold write on timekeeper.lock + */ +int pvclock_gtod_register_notifier(struct notifier_block *nb) +{ + struct timekeeper *tk = timekeeper; + unsigned long flags; + int ret; + + write_seqlock_irqsave(tk-lock, flags); + ret = raw_notifier_chain_register(pvclock_gtod_chain, nb); + /* update timekeeping data */ + update_pvclock_gtod(tk); + write_sequnlock_irqrestore(tk-lock, flags); + + return ret; +} +EXPORT_SYMBOL_GPL(pvclock_gtod_register_notifier); + +/** + * pvclock_gtod_unregister_notifier - unregister a pvclock + * timedata update listener + * + * Must hold write on timekeeper.lock + */ +int pvclock_gtod_unregister_notifier(struct notifier_block *nb) +{ + struct timekeeper *tk = timekeeper; + unsigned long flags; + int 
ret;
+
+	write_seqlock_irqsave(&tk->lock, flags);
+	ret = raw_notifier_chain_unregister(&pvclock_gtod_chain, nb);
+	write_sequnlock_irqrestore(&tk->lock, flags);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pvclock_gtod_unregister_notifier);
+
 /* must hold write on timekeeper.lock */
 static void timekeeping_update(struct timekeeper *tk, bool clearntp)
 {
@@ -188,6 +237,7 @@ static void timekeeping_update(struct ti
 		ntp_clear();
 	}
 	update_vsyscall(tk);
+	update_pvclock_gtod(tk);
 }
[patch 06/18] x86: pvclock: introduce helper to read flags
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

Index: vsyscall/arch/x86/kernel/pvclock.c
===
--- vsyscall.orig/arch/x86/kernel/pvclock.c
+++ vsyscall/arch/x86/kernel/pvclock.c
@@ -45,6 +45,19 @@ void pvclock_resume(void)
 	atomic64_set(&last_value, 0);
 }
 
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src)
+{
+	unsigned version;
+	cycle_t ret;
+	u8 flags;
+
+	do {
+		version = __pvclock_read_cycles(src, &ret, &flags);
+	} while ((src->version & 1) || version != src->version);
+
+	return flags & valid_flags;
+}
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	unsigned version;

Index: vsyscall/arch/x86/include/asm/pvclock.h
===
--- vsyscall.orig/arch/x86/include/asm/pvclock.h
+++ vsyscall/arch/x86/include/asm/pvclock.h
@@ -6,6 +6,7 @@
 /* some helper functions for xen and kvm pv clock sources */
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src);
+u8 pvclock_read_flags(struct pvclock_vcpu_time_info *src);
 void pvclock_set_flags(u8 flags);
 unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
[patch 02/18] x86: kvmclock: allocate pvclock shared memory area
We want to expose the pvclock shared memory areas, which the hypervisor periodically updates, to userspace. For a linear mapping from userspace, it is necessary that entire page sized regions are used for array of pvclock structures. There is no such guarantee with per cpu areas, therefore move to memblock_alloc based allocation. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/kvmclock.c === --- vsyscall.orig/arch/x86/kernel/kvmclock.c +++ vsyscall/arch/x86/kernel/kvmclock.c @@ -23,6 +23,7 @@ #include asm/apic.h #include linux/percpu.h #include linux/hardirq.h +#include linux/memblock.h #include asm/x86_init.h #include asm/reboot.h @@ -39,7 +40,11 @@ static int parse_no_kvmclock(char *arg) early_param(no-kvmclock, parse_no_kvmclock); /* The hypervisor will put information about time periodically here */ -static DEFINE_PER_CPU_SHARED_ALIGNED(struct pvclock_vcpu_time_info, hv_clock); +struct pvclock_aligned_vcpu_time_info { + struct pvclock_vcpu_time_info clock; +} __attribute__((__aligned__(SMP_CACHE_BYTES))); + +static struct pvclock_aligned_vcpu_time_info *hv_clock; static struct pvclock_wall_clock wall_clock; /* @@ -52,15 +57,20 @@ static unsigned long kvm_get_wallclock(v struct pvclock_vcpu_time_info *vcpu_time; struct timespec ts; int low, high; + int cpu; + + preempt_disable(); + cpu = smp_processor_id(); low = (int)__pa_symbol(wall_clock); high = ((u64)__pa_symbol(wall_clock) 32); native_write_msr(msr_kvm_wall_clock, low, high); - vcpu_time = get_cpu_var(hv_clock); + vcpu_time = hv_clock[cpu].clock; pvclock_read_wallclock(wall_clock, vcpu_time, ts); - put_cpu_var(hv_clock); + + preempt_enable(); return ts.tv_sec; } @@ -74,9 +84,11 @@ static cycle_t kvm_clock_read(void) { struct pvclock_vcpu_time_info *src; cycle_t ret; + int cpu; preempt_disable_notrace(); - src = __get_cpu_var(hv_clock); + cpu = smp_processor_id(); + src = hv_clock[cpu].clock; ret = pvclock_clocksource_read(src); preempt_enable_notrace(); return ret; @@ 
-99,8 +111,15 @@ static cycle_t kvm_clock_get_cycles(stru
 static unsigned long kvm_get_tsc_khz(void)
 {
 	struct pvclock_vcpu_time_info *src;
-	src = &per_cpu(hv_clock, 0);
-	return pvclock_tsc_khz(src);
+	int cpu;
+	unsigned long tsc_khz;
+
+	preempt_disable();
+	cpu = smp_processor_id();
+	src = &hv_clock[cpu].clock;
+	tsc_khz = pvclock_tsc_khz(src);
+	preempt_enable();
+	return tsc_khz;
 }
 
 static void kvm_get_preset_lpj(void)
@@ -119,10 +138,14 @@ bool kvm_check_and_clear_guest_paused(vo
 {
 	bool ret = false;
 	struct pvclock_vcpu_time_info *src;
+	int cpu = smp_processor_id();
 
-	src = &__get_cpu_var(hv_clock);
+	if (!hv_clock)
+		return ret;
+
+	src = &hv_clock[cpu].clock;
 	if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) {
-		__this_cpu_and(hv_clock.flags, ~PVCLOCK_GUEST_STOPPED);
+		src->flags &= ~PVCLOCK_GUEST_STOPPED;
 		ret = true;
 	}
 
@@ -141,9 +164,10 @@ int kvm_register_clock(char *txt)
 {
 	int cpu = smp_processor_id();
 	int low, high, ret;
+	struct pvclock_vcpu_time_info *src = &hv_clock[cpu].clock;
 
-	low = (int)__pa(&per_cpu(hv_clock, cpu)) | 1;
-	high = ((u64)__pa(&per_cpu(hv_clock, cpu)) >> 32);
+	low = (int)__pa(src) | 1;
+	high = ((u64)__pa(src) >> 32);
 	ret = native_write_msr_safe(msr_kvm_system_time, low, high);
 	printk(KERN_INFO "kvm-clock: cpu %d, msr %x:%x, %s\n",
 	       cpu, high, low, txt);
@@ -197,9 +221,17 @@ static void kvm_shutdown(void)
 
 void __init kvmclock_init(void)
 {
+	unsigned long mem;
+
 	if (!kvm_para_available())
 		return;
 
+	mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS,
+			     PAGE_SIZE);
+	if (!mem)
+		return;
+	hv_clock = __va(mem);
+
 	if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
 		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
 		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
[patch 17/18] KVM: x86: require matched TSC offsets for master clock
With master clock, a pvclock clock read calculates: ret = system_timestamp + [ (rdtsc + tsc_offset) - tsc_timestamp ] Where 'rdtsc' is the host TSC. system_timestamp and tsc_timestamp are unique, one tuple per VM: the master clock. Given a host with synchronized TSCs, its obvious that guest TSC must be matched for the above to guarantee monotonicity. Allow master clock usage only if guest TSCs are synchronized. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/include/asm/kvm_host.h === --- vsyscall.orig/arch/x86/include/asm/kvm_host.h +++ vsyscall/arch/x86/include/asm/kvm_host.h @@ -560,6 +560,7 @@ struct kvm_arch { u64 cur_tsc_write; u64 cur_tsc_offset; u8 cur_tsc_generation; + int nr_vcpus_matched_tsc; spinlock_t pvclock_gtod_sync_lock; bool use_master_clock; Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ vsyscall/arch/x86/kvm/x86.c @@ -1097,12 +1097,38 @@ static u64 compute_guest_tsc(struct kvm_ return tsc; } +void kvm_track_tsc_matching(struct kvm_vcpu *vcpu) +{ + bool vcpus_matched; + bool do_request = false; + struct kvm_arch *ka = vcpu-kvm-arch; + struct pvclock_gtod_data *gtod = pvclock_gtod_data; + + vcpus_matched = (ka-nr_vcpus_matched_tsc + 1 == +atomic_read(vcpu-kvm-online_vcpus)); + + if (vcpus_matched gtod-clock.vclock_mode == VCLOCK_TSC) + if (!ka-use_master_clock) + do_request = 1; + + if (!vcpus_matched ka-use_master_clock) + do_request = 1; + + if (do_request) + kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu); + + trace_kvm_track_tsc(vcpu-vcpu_id, ka-nr_vcpus_matched_tsc, + atomic_read(vcpu-kvm-online_vcpus), + ka-use_master_clock, gtod-clock.vclock_mode); +} + void kvm_write_tsc(struct kvm_vcpu *vcpu, u64 data) { struct kvm *kvm = vcpu-kvm; u64 offset, ns, elapsed; unsigned long flags; s64 usdiff; + bool matched; raw_spin_lock_irqsave(kvm-arch.tsc_write_lock, flags); offset = kvm_x86_ops-compute_tsc_offset(vcpu, data); @@ -1145,6 +1171,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu 
offset = kvm_x86_ops-compute_tsc_offset(vcpu, data); pr_debug(kvm: adjusted tsc offset by %llu\n, delta); } + matched = true; } else { /* * We split periods of matched TSC writes into generations. @@ -1159,6 +1186,7 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu kvm-arch.cur_tsc_nsec = ns; kvm-arch.cur_tsc_write = data; kvm-arch.cur_tsc_offset = offset; + matched = false; pr_debug(kvm: new tsc generation %u, clock %llu\n, kvm-arch.cur_tsc_generation, data); } @@ -1182,6 +1210,15 @@ void kvm_write_tsc(struct kvm_vcpu *vcpu kvm_x86_ops-write_tsc_offset(vcpu, offset); raw_spin_unlock_irqrestore(kvm-arch.tsc_write_lock, flags); + + spin_lock(kvm-arch.pvclock_gtod_sync_lock); + if (matched) + kvm-arch.nr_vcpus_matched_tsc++; + else + kvm-arch.nr_vcpus_matched_tsc = 0; + + kvm_track_tsc_matching(vcpu); + spin_unlock(kvm-arch.pvclock_gtod_sync_lock); } EXPORT_SYMBOL_GPL(kvm_write_tsc); @@ -1271,8 +1308,9 @@ static bool kvm_get_time_and_clockread(s /* * - * Assuming a stable TSC across physical CPUS, the following condition - * is possible. Each numbered line represents an event visible to both + * Assuming a stable TSC across physical CPUS, and a stable TSC + * across virtual CPUs, the following condition is possible. + * Each numbered line represents an event visible to both * CPUs at the next numbered event. * * timespecX represents host monotonic time. tscX represents @@ -1305,7 +1343,7 @@ static bool kvm_get_time_and_clockread(s * copy of host monotonic time values. Update that master copy * in lockstep. * - * Rely on synchronization of host TSCs for monotonicity. + * Rely on synchronization of host TSCs and guest TSCs for monotonicity. * */ @@ -1313,20 +1351,27 @@ static void pvclock_update_vm_gtod_copy( { struct kvm_arch *ka = kvm-arch; int vclock_mode; + bool host_tsc_clocksource, vcpus_matched; + + vcpus_matched = (ka-nr_vcpus_matched_tsc + 1 == + atomic_read(kvm-online_vcpus)); /* * If the host uses TSC clock, then passthrough TSC as stable * to the guest. 
*/ - ka->use_master_clock = kvm_get_time_and_clockread( + host_tsc_clocksource = kvm_get_time_and_clockread(
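The arithmetic in the patch above is compact enough to model outside the kernel. The following is an illustrative userspace sketch — not KVM code; the type names and the 1:1 tick-to-nanosecond scale are invented for the example — of why a single per-VM (system_timestamp, tsc_timestamp) tuple requires matched guest TSC offsets for monotonicity:

```c
#include <assert.h>
#include <stdint.h>

/* One (system_timestamp, tsc_timestamp) tuple per VM: the master clock. */
typedef struct {
    uint64_t system_timestamp;  /* ns at the last master-clock update */
    uint64_t tsc_timestamp;     /* host TSC at the last master-clock update */
} master_clock;

/* ret = system_timestamp + ((host_tsc + tsc_offset) - tsc_timestamp).
 * Illustrative only: treats TSC ticks as nanoseconds (1:1), whereas the
 * real code scales the delta via tsc_to_system_mul/tsc_shift. */
static uint64_t pvclock_read_model(const master_clock *mc, uint64_t host_tsc,
                                   uint64_t tsc_offset)
{
    return mc->system_timestamp + (host_tsc + tsc_offset) - mc->tsc_timestamp;
}
```

With equal offsets, later host TSC reads always yield later guest times; a vcpu with a larger tsc_offset can report a later time than another vcpu reading *after* it, which is the non-monotonicity the patch guards against.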
[patch 16/18] KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization
TSC initialization will soon make use of online_vcpus. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/ia64/kvm/kvm-ia64.c === --- vsyscall.orig/arch/ia64/kvm/kvm-ia64.c +++ vsyscall/arch/ia64/kvm/kvm-ia64.c @@ -1330,6 +1330,11 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu return 0; } +int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) +{ + return 0; +} + int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) { return -EINVAL; Index: vsyscall/arch/powerpc/kvm/powerpc.c === --- vsyscall.orig/arch/powerpc/kvm/powerpc.c +++ vsyscall/arch/powerpc/kvm/powerpc.c @@ -354,6 +354,11 @@ struct kvm_vcpu *kvm_arch_vcpu_create(st return vcpu; } +void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) +{ + return 0; +} + void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu) { /* Make sure we're not using the vcpu anymore */ Index: vsyscall/arch/s390/kvm/kvm-s390.c === --- vsyscall.orig/arch/s390/kvm/kvm-s390.c +++ vsyscall/arch/s390/kvm/kvm-s390.c @@ -355,6 +355,11 @@ static void kvm_s390_vcpu_initial_reset( atomic_set_mask(CPUSTAT_STOPPED, vcpu-arch.sie_block-cpuflags); } +void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) +{ + return 0; +} + int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu) { atomic_set(vcpu-arch.sie_block-cpuflags, CPUSTAT_ZARCH | Index: vsyscall/arch/x86/kvm/svm.c === --- vsyscall.orig/arch/x86/kvm/svm.c +++ vsyscall/arch/x86/kvm/svm.c @@ -1254,7 +1254,6 @@ static struct kvm_vcpu *svm_create_vcpu( svm-vmcb_pa = page_to_pfn(page) PAGE_SHIFT; svm-asid_generation = 0; init_vmcb(svm); - kvm_write_tsc(svm-vcpu, 0); err = fx_init(svm-vcpu); if (err) Index: vsyscall/arch/x86/kvm/vmx.c === --- vsyscall.orig/arch/x86/kvm/vmx.c +++ vsyscall/arch/x86/kvm/vmx.c @@ -3896,8 +3896,6 @@ static int vmx_vcpu_setup(struct vcpu_vm vmcs_writel(CR0_GUEST_HOST_MASK, ~0UL); set_cr4_guest_host_mask(vmx); - kvm_write_tsc(vmx-vcpu, 0); - return 0; } Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ 
vsyscall/arch/x86/kvm/x86.c @@ -6289,6 +6289,19 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu return r; } +int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu) +{ + int r; + + r = vcpu_load(vcpu); + if (r) + return r; + kvm_write_tsc(vcpu, 0); + vcpu_put(vcpu); + + return r; +} + void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu) { int r; Index: vsyscall/include/linux/kvm_host.h === --- vsyscall.orig/include/linux/kvm_host.h +++ vsyscall/include/linux/kvm_host.h @@ -583,6 +583,7 @@ void kvm_arch_vcpu_load(struct kvm_vcpu void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu); struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm, unsigned int id); int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu); +int kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu); void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu); int kvm_arch_vcpu_reset(struct kvm_vcpu *vcpu); Index: vsyscall/virt/kvm/kvm_main.c === --- vsyscall.orig/virt/kvm/kvm_main.c +++ vsyscall/virt/kvm/kvm_main.c @@ -1855,6 +1855,7 @@ static int kvm_vm_ioctl_create_vcpu(stru atomic_inc(kvm-online_vcpus); mutex_unlock(kvm-lock); + kvm_arch_vcpu_postcreate(vcpu); return r; unlock_vcpu_destroy: -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[patch 04/18] x86: pvclock: remove pvclock_shadow_time
Originally from Jeremy Fitzhardinge. We can copy the information directly from struct pvclock_vcpu_time_info, remove pvclock_shadow_time. Reviewed-by: Glauber Costa glom...@parallels.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -19,21 +19,6 @@ #include linux/percpu.h #include asm/pvclock.h -/* - * These are perodically updated - *xen: magic shared_info page - *kvm: gpa registered via msr - * and then copied here. - */ -struct pvclock_shadow_time { - u64 tsc_timestamp; /* TSC at last update of time vals. */ - u64 system_timestamp; /* Time, in nanosecs, since boot.*/ - u32 tsc_to_nsec_mul; - int tsc_shift; - u32 version; - u8 flags; -}; - static u8 valid_flags __read_mostly = 0; void pvclock_set_flags(u8 flags) @@ -41,32 +26,11 @@ void pvclock_set_flags(u8 flags) valid_flags = flags; } -static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) +static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src) { - u64 delta = native_read_tsc() - shadow-tsc_timestamp; - return pvclock_scale_delta(delta, shadow-tsc_to_nsec_mul, - shadow-tsc_shift); -} - -/* - * Reads a consistent set of time-base values from hypervisor, - * into a shadow data area. 
- */ -static unsigned pvclock_get_time_values(struct pvclock_shadow_time *dst, - struct pvclock_vcpu_time_info *src) -{ - do { - dst->version = src->version; - rmb(); /* fetch version before data */ - dst->tsc_timestamp = src->tsc_timestamp; - dst->system_timestamp = src->system_time; - dst->tsc_to_nsec_mul = src->tsc_to_system_mul; - dst->tsc_shift = src->tsc_shift; - dst->flags = src->flags; - rmb(); /* test version after fetching data */ - } while ((src->version & 1) || (dst->version != src->version)); - - return dst->version; + u64 delta = native_read_tsc() - src->tsc_timestamp; + return pvclock_scale_delta(delta, src->tsc_to_system_mul, + src->tsc_shift); } unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src) @@ -90,21 +54,22 @@ void pvclock_resume(void) cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) { - struct pvclock_shadow_time shadow; unsigned version; cycle_t ret, offset; u64 last; + u8 flags; do { - version = pvclock_get_time_values(&shadow, src); + version = src->version; rdtsc_barrier(); - offset = pvclock_get_nsec_offset(&shadow); - ret = shadow.system_timestamp + offset; + offset = pvclock_get_nsec_offset(src); + ret = src->system_time + offset; + flags = src->flags; rdtsc_barrier(); - } while (version != src->version); + } while ((src->version & 1) || version != src->version); if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) && - (shadow.flags & PVCLOCK_TSC_STABLE_BIT)) + (flags & PVCLOCK_TSC_STABLE_BIT)) return ret; /*
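For reference, the even/odd version protocol that pvclock_clocksource_read() relies on can be sketched as a small single-threaded userspace model (the type and field names here are hypothetical; the real code inserts rdtsc_barrier()/rmb()/wmb() where the comments note):

```c
#include <assert.h>
#include <stdint.h>

/* Model of a pvclock_vcpu_time_info-style shared area: the writer bumps
 * version to odd before updating and back to even afterwards; the reader
 * retries while version is odd (update in progress) or changed mid-read. */
typedef struct {
    volatile uint32_t version;
    volatile uint64_t value;    /* stand-in for the time fields */
} pvti_model;

static uint64_t pvti_read(const pvti_model *src)
{
    uint32_t version;
    uint64_t ret;

    do {
        version = src->version;
        /* real code: rdtsc_barrier() between the version and data loads */
        ret = src->value;
        /* real code: rdtsc_barrier() before re-checking the version */
    } while ((src->version & 1) || version != src->version);

    return ret;
}

static void pvti_write(pvti_model *dst, uint64_t value)
{
    dst->version++;     /* now odd: update in progress */
    /* real code: wmb() */
    dst->value = value;
    /* real code: wmb() */
    dst->version++;     /* even again: update complete */
}
```

The `(src->version & 1)` check is exactly what the patch above adds to the read loop once the intermediate shadow copy is removed.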
[patch 10/18] x86: kvm guest: pvclock vsyscall support
Hook into generic pvclock vsyscall code, with the aim to allow userspace to have visibility into pvclock data. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/kvmclock.c === --- vsyscall.orig/arch/x86/kernel/kvmclock.c +++ vsyscall/arch/x86/kernel/kvmclock.c @@ -40,11 +40,7 @@ static int parse_no_kvmclock(char *arg) early_param(no-kvmclock, parse_no_kvmclock); /* The hypervisor will put information about time periodically here */ -struct pvclock_aligned_vcpu_time_info { - struct pvclock_vcpu_time_info clock; -} __attribute__((__aligned__(SMP_CACHE_BYTES))); - -static struct pvclock_aligned_vcpu_time_info *hv_clock; +static struct pvclock_vsyscall_time_info *hv_clock; static struct pvclock_wall_clock wall_clock; /* @@ -67,7 +63,7 @@ static unsigned long kvm_get_wallclock(v native_write_msr(msr_kvm_wall_clock, low, high); - vcpu_time = hv_clock[cpu].clock; + vcpu_time = hv_clock[cpu].pvti; pvclock_read_wallclock(wall_clock, vcpu_time, ts); preempt_enable(); @@ -88,7 +84,7 @@ static cycle_t kvm_clock_read(void) preempt_disable_notrace(); cpu = smp_processor_id(); - src = hv_clock[cpu].clock; + src = hv_clock[cpu].pvti; ret = pvclock_clocksource_read(src); preempt_enable_notrace(); return ret; @@ -116,7 +112,7 @@ static unsigned long kvm_get_tsc_khz(voi preempt_disable(); cpu = smp_processor_id(); - src = hv_clock[cpu].clock; + src = hv_clock[cpu].pvti; tsc_khz = pvclock_tsc_khz(src); preempt_enable(); return tsc_khz; @@ -143,7 +139,7 @@ bool kvm_check_and_clear_guest_paused(vo if (!hv_clock) return ret; - src = hv_clock[cpu].clock; + src = hv_clock[cpu].pvti; if ((src-flags PVCLOCK_GUEST_STOPPED) != 0) { src-flags = ~PVCLOCK_GUEST_STOPPED; ret = true; @@ -164,7 +160,7 @@ int kvm_register_clock(char *txt) { int cpu = smp_processor_id(); int low, high, ret; - struct pvclock_vcpu_time_info *src = hv_clock[cpu].clock; + struct pvclock_vcpu_time_info *src = hv_clock[cpu].pvti; low = (int)__pa(src) | 1; high = ((u64)__pa(src) 32); @@ 
-226,7 +222,7 @@ void __init kvmclock_init(void) if (!kvm_para_available()) return; - mem = memblock_alloc(sizeof(struct pvclock_aligned_vcpu_time_info) * NR_CPUS, + mem = memblock_alloc(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS, PAGE_SIZE); if (!mem) return; @@ -265,3 +261,36 @@ void __init kvmclock_init(void) if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE_STABLE_BIT)) pvclock_set_flags(PVCLOCK_TSC_STABLE_BIT); } + +int kvm_setup_vsyscall_timeinfo(void) +{ + int cpu; + int ret; + u8 flags; + struct pvclock_vcpu_time_info *vcpu_time; + unsigned int size; + + size = sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS; + + preempt_disable(); + cpu = smp_processor_id(); + + vcpu_time = hv_clock[cpu].pvti; + flags = pvclock_read_flags(vcpu_time); + + if (!(flags PVCLOCK_TSC_STABLE_BIT)) { + preempt_enable(); + return 1; + } + + if ((ret = pvclock_init_vsyscall(hv_clock, size))) { + preempt_enable(); + return ret; + } + + preempt_enable(); + + kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK; + return 0; +} + Index: vsyscall/arch/x86/kernel/kvm.c === --- vsyscall.orig/arch/x86/kernel/kvm.c +++ vsyscall/arch/x86/kernel/kvm.c @@ -42,6 +42,7 @@ #include asm/apic.h #include asm/apicdef.h #include asm/hypervisor.h +#include asm/kvm_guest.h static int kvmapf = 1; @@ -62,6 +63,15 @@ static int parse_no_stealacc(char *arg) early_param(no-steal-acc, parse_no_stealacc); +static int kvmclock_vsyscall = 1; +static int parse_no_kvmclock_vsyscall(char *arg) +{ +kvmclock_vsyscall = 0; +return 0; +} + +early_param(no-kvmclock-vsyscall, parse_no_kvmclock_vsyscall); + static DEFINE_PER_CPU(struct kvm_vcpu_pv_apf_data, apf_reason) __aligned(64); static DEFINE_PER_CPU(struct kvm_steal_time, steal_time) __aligned(64); static int has_steal_clock = 0; @@ -468,6 +478,9 @@ void __init kvm_guest_init(void) if (kvm_para_has_feature(KVM_FEATURE_PV_EOI)) apic_set_eoi_write(kvm_guest_apic_eoi_write); + if (kvmclock_vsyscall) + kvm_setup_vsyscall_timeinfo(); + #ifdef CONFIG_SMP 
smp_ops.smp_prepare_boot_cpu = kvm_smp_prepare_boot_cpu; register_cpu_notifier(kvm_cpu_notifier); Index: vsyscall/arch/x86/include/asm/kvm_guest.h === ---
[patch 18/18] KVM: x86: update pvclock area conditionally, on cpu migration
As requested by Glauber, do not update kvmclock area on vcpu->pcpu migration, in case the host has stable TSC. This is to reduce cacheline bouncing. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ vsyscall/arch/x86/kvm/x86.c @@ -2615,7 +2615,12 @@ void kvm_arch_vcpu_load(struct kvm_vcpu kvm_x86_ops->write_tsc_offset(vcpu, offset); vcpu->arch.tsc_catchup = 1; } - kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); + /* + * On a host with synchronized TSC, there is no need to update + * kvmclock on vcpu->cpu migration + */ + if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1) + kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); if (vcpu->cpu != cpu) kvm_migrate_timers(vcpu); vcpu->cpu = cpu;
[patch 01/18] KVM: x86: retain pvclock guest stopped bit in guest memory
Otherwise it's possible for an unrelated KVM_REQ_UPDATE_CLOCK (such as due to CPU migration) to clear the bit. Noticed by Paolo Bonzini. Reviewed-by: Gleb Natapov g...@redhat.com Reviewed-by: Glauber Costa glom...@parallels.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ vsyscall/arch/x86/kvm/x86.c @@ -1143,6 +1143,7 @@ static int kvm_guest_time_update(struct unsigned long this_tsc_khz; s64 kernel_ns, max_kernel_ns; u64 tsc_timestamp; + struct pvclock_vcpu_time_info *guest_hv_clock; u8 pvclock_flags; /* Keep irq disabled to prevent changes to the clock */ @@ -1226,13 +1227,6 @@ static int kvm_guest_time_update(struct vcpu->last_kernel_ns = kernel_ns; vcpu->last_guest_tsc = tsc_timestamp; - pvclock_flags = 0; - if (vcpu->pvclock_set_guest_stopped_request) { - pvclock_flags |= PVCLOCK_GUEST_STOPPED; - vcpu->pvclock_set_guest_stopped_request = false; - } - - vcpu->hv_clock.flags = pvclock_flags; /* * The interface expects us to write an even number signaling that the @@ -1243,6 +1237,18 @@ static int kvm_guest_time_update(struct shared_kaddr = kmap_atomic(vcpu->time_page); + guest_hv_clock = shared_kaddr + vcpu->time_offset; + + /* retain PVCLOCK_GUEST_STOPPED if set in guest copy */ + pvclock_flags = (guest_hv_clock->flags & PVCLOCK_GUEST_STOPPED); + + if (vcpu->pvclock_set_guest_stopped_request) { + pvclock_flags |= PVCLOCK_GUEST_STOPPED; + vcpu->pvclock_set_guest_stopped_request = false; + } + + vcpu->hv_clock.flags = pvclock_flags; + memcpy(shared_kaddr + vcpu->time_offset, &vcpu->hv_clock, sizeof(vcpu->hv_clock));
[patch 12/18] KVM: x86: pass host_tsc to read_l1_tsc
Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc(). Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/include/asm/kvm_host.h === --- vsyscall.orig/arch/x86/include/asm/kvm_host.h +++ vsyscall/arch/x86/include/asm/kvm_host.h @@ -703,7 +703,7 @@ struct kvm_x86_ops { void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); u64 (*compute_tsc_offset)(struct kvm_vcpu *vcpu, u64 target_tsc); - u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu); + u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc); void (*get_exit_info)(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2); Index: vsyscall/arch/x86/kvm/lapic.c === --- vsyscall.orig/arch/x86/kvm/lapic.c +++ vsyscall/arch/x86/kvm/lapic.c @@ -1011,7 +1011,7 @@ static void start_apic_timer(struct kvm_ local_irq_save(flags); now = apic->lapic_timer.timer.base->get_time(); - guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu); + guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, native_read_tsc()); if (likely(tscdeadline > guest_tsc)) { ns = (tscdeadline - guest_tsc) * 1000000ULL; do_div(ns, this_tsc_khz); Index: vsyscall/arch/x86/kvm/svm.c === --- vsyscall.orig/arch/x86/kvm/svm.c +++ vsyscall/arch/x86/kvm/svm.c @@ -3008,11 +3008,11 @@ static int cr8_write_interception(struct return 0; } -u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu) +u64 svm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) { struct vmcb *vmcb = get_host_vmcb(to_svm(vcpu)); return vmcb->control.tsc_offset + - svm_scale_tsc(vcpu, native_read_tsc()); + svm_scale_tsc(vcpu, host_tsc); } static int svm_get_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 *data) Index: vsyscall/arch/x86/kvm/vmx.c === --- vsyscall.orig/arch/x86/kvm/vmx.c +++ vsyscall/arch/x86/kvm/vmx.c @@ -1839,11 +1839,10 @@ static u64 guest_read_tsc(void) * Like guest_read_tsc, but always returns L1's notion of the timestamp * counter, even if a nested guest (L2) is currently running.
*/ -u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu) +u64 vmx_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) { - u64 host_tsc, tsc_offset; + u64 tsc_offset; - rdtscll(host_tsc); tsc_offset = is_guest_mode(vcpu) ? to_vmx(vcpu)->nested.vmcs01_tsc_offset : vmcs_read64(TSC_OFFSET); Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ vsyscall/arch/x86/kvm/x86.c @@ -1175,7 +1175,7 @@ static int kvm_guest_time_update(struct /* Keep irq disabled to prevent changes to the clock */ local_irq_save(flags); - tsc_timestamp = kvm_x86_ops->read_l1_tsc(v); + tsc_timestamp = kvm_x86_ops->read_l1_tsc(v, native_read_tsc()); kernel_ns = get_kernel_ns(); this_tsc_khz = __get_cpu_var(cpu_tsc_khz); if (unlikely(this_tsc_khz == 0)) { @@ -5429,7 +5429,8 @@ static int vcpu_enter_guest(struct kvm_v if (hw_breakpoint_active()) hw_breakpoint_restore(); - vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu); + vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu, + native_read_tsc()); vcpu->mode = OUTSIDE_GUEST_MODE; smp_wmb();
[patch 14/18] KVM: x86: notifier for clocksource changes
Register a notifier for clocksource change event. In case the host switches to clock other than TSC, disable master clock usage. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kvm/x86.c === --- vsyscall.orig/arch/x86/kvm/x86.c +++ vsyscall/arch/x86/kvm/x86.c @@ -46,6 +46,8 @@ #include linux/uaccess.h #include linux/hash.h #include linux/pci.h +#include linux/timekeeper_internal.h +#include linux/pvclock_gtod.h #include trace/events/kvm.h #define CREATE_TRACE_POINTS @@ -899,6 +901,53 @@ static int do_set_msr(struct kvm_vcpu *v return kvm_set_msr(vcpu, index, *data); } +struct pvclock_gtod_data { + seqcount_t seq; + + struct { /* extract of a clocksource struct */ + int vclock_mode; + cycle_t cycle_last; + cycle_t mask; + u32 mult; + u32 shift; + } clock; + + /* open coded 'struct timespec' */ + u64 monotonic_time_snsec; + time_t monotonic_time_sec; +}; + +static struct pvclock_gtod_data pvclock_gtod_data; + +static void update_pvclock_gtod(struct timekeeper *tk) +{ + struct pvclock_gtod_data *vdata = pvclock_gtod_data; + + write_seqcount_begin(vdata-seq); + + /* copy pvclock gtod data */ + vdata-clock.vclock_mode= tk-clock-archdata.vclock_mode; + vdata-clock.cycle_last = tk-clock-cycle_last; + vdata-clock.mask = tk-clock-mask; + vdata-clock.mult = tk-mult; + vdata-clock.shift = tk-shift; + + vdata-monotonic_time_sec = tk-xtime_sec + + tk-wall_to_monotonic.tv_sec; + vdata-monotonic_time_snsec = tk-xtime_nsec + + (tk-wall_to_monotonic.tv_nsec +tk-shift); + while (vdata-monotonic_time_snsec = + (((u64)NSEC_PER_SEC) tk-shift)) { + vdata-monotonic_time_snsec -= + ((u64)NSEC_PER_SEC) tk-shift; + vdata-monotonic_time_sec++; + } + + write_seqcount_end(vdata-seq); +} + + static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock) { int version; @@ -995,6 +1044,8 @@ static inline u64 get_kernel_ns(void) return timespec_to_ns(ts); } +static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0); + static DEFINE_PER_CPU(unsigned long, 
cpu_tsc_khz); unsigned long max_tsc_khz; @@ -1227,7 +1278,6 @@ static int kvm_guest_time_update(struct vcpu-last_kernel_ns = kernel_ns; vcpu-last_guest_tsc = tsc_timestamp; - /* * The interface expects us to write an even number signaling that the * update is finished. Since the guest won't see the intermediate @@ -4894,6 +4944,37 @@ static void kvm_set_mmio_spte_mask(void) kvm_mmu_set_mmio_spte_mask(mask); } +static void pvclock_gtod_update_fn(struct work_struct *work) +{ +} + +static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn); + +/* + * Notification about pvclock gtod data update. + */ +static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused, + void *priv) +{ + struct pvclock_gtod_data *gtod = pvclock_gtod_data; + struct timekeeper *tk = priv; + + update_pvclock_gtod(tk); + + /* disable master clock if host does not trust, or does not +* use, TSC clocksource +*/ + if (gtod-clock.vclock_mode != VCLOCK_TSC + atomic_read(kvm_guest_has_master_clock) != 0) + queue_work(system_long_wq, pvclock_gtod_work); + + return 0; +} + +static struct notifier_block pvclock_gtod_notifier = { + .notifier_call = pvclock_gtod_notify, +}; + int kvm_arch_init(void *opaque) { int r; @@ -4935,6 +5016,8 @@ int kvm_arch_init(void *opaque) host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK); kvm_lapic_init(); + pvclock_gtod_register_notifier(pvclock_gtod_notifier); + return 0; out: @@ -4949,6 +5032,7 @@ void kvm_arch_exit(void) cpufreq_unregister_notifier(kvmclock_cpufreq_notifier_block, CPUFREQ_TRANSITION_NOTIFIER); unregister_hotcpu_notifier(kvmclock_cpu_notifier_block); + pvclock_gtod_unregister_notifier(pvclock_gtod_notifier); kvm_x86_ops = NULL; kvm_mmu_module_exit(); } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
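The normalization loop in update_pvclock_gtod() above keeps monotonic time as (sec, snsec), where snsec is in shifted-nanosecond units (ns << shift), carrying whole seconds into the sec field. A standalone sketch of just that loop (hypothetical function name):

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Carry whole seconds out of a shifted-nanosecond accumulator, as
 * update_pvclock_gtod() does for monotonic_time_sec/monotonic_time_snsec. */
static void normalize_snsec(uint64_t *sec, uint64_t *snsec, unsigned shift)
{
    while (*snsec >= (NSEC_PER_SEC << shift)) {
        *snsec -= NSEC_PER_SEC << shift;
        (*sec)++;
    }
}
```

Keeping the fractional part shifted avoids a divide on every timekeeper update; the comparison against `NSEC_PER_SEC << shift` is the shifted-units equivalent of `ns >= NSEC_PER_SEC`.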
Re: [patch 13/18] time: export time information for KVM pvclock
On 11/14/2012 04:08 PM, Marcelo Tosatti wrote: As suggested by John, export time data similarly to how it's done by vsyscall support. This allows KVM to retrieve necessary information to implement vsyscall support in KVM guests. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Thanks for the updates here. The notifier method is interesting, and if it works well, we may want to extend it later to cover the vsyscall code too, but that can be done in a later iteration. Acked-by: John Stultz johns...@us.ibm.com thanks -john
[PATCH 1/2] acpi_piix4: fix migration of gpe fields
Migrate 16 bytes for en/sts fields (which is the correct size), increase version to 3, and document how to support incoming migration from qemu-kvm 1.2. Acked-by: Paolo Bonzini pbonz...@redhat.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com --- hw/acpi_piix4.c | 50 ++ 1 files changed, 46 insertions(+), 4 deletions(-) diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c index 15275cf..519269a 100644 --- a/hw/acpi_piix4.c +++ b/hw/acpi_piix4.c @@ -235,10 +235,9 @@ static int vmstate_acpi_post_load(void *opaque, int version_id) { \ .name = (stringify(_field)), \ .version_id = 0,\ - .num= GPE_LEN, \ .info = vmstate_info_uint16, \ .size = sizeof(uint16_t), \ - .flags = VMS_ARRAY | VMS_POINTER, \ + .flags = VMS_SINGLE | VMS_POINTER, \ .offset = vmstate_offset_pointer(_state, _field, uint8_t), \ } @@ -267,11 +266,54 @@ static const VMStateDescription vmstate_pci_status = { } }; +static int acpi_load_old(QEMUFile *f, void *opaque, int version_id) +{ +PIIX4PMState *s = opaque; +int ret, i; +uint16_t temp; + +ret = pci_device_load(s-dev, f); +if (ret 0) { +return ret; +} +qemu_get_be16s(f, s-ar.pm1.evt.sts); +qemu_get_be16s(f, s-ar.pm1.evt.en); +qemu_get_be16s(f, s-ar.pm1.cnt.cnt); + +ret = vmstate_load_state(f, vmstate_apm, opaque, 1); +if (ret) { +return ret; +} + +qemu_get_timer(f, s-ar.tmr.timer); +qemu_get_sbe64s(f, s-ar.tmr.overflow_time); + +qemu_get_be16s(f, (uint16_t *)s-ar.gpe.sts); +for (i = 0; i 3; i++) { +qemu_get_be16s(f, temp); +} + +qemu_get_be16s(f, (uint16_t *)s-ar.gpe.en); +for (i = 0; i 3; i++) { +qemu_get_be16s(f, temp); +} + +ret = vmstate_load_state(f, vmstate_pci_status, opaque, 1); +return ret; +} + +/* qemu-kvm 1.2 uses version 3 but advertised as 2 + * To support incoming qemu-kvm 1.2 migration, change version_id + * and minimum_version_id to 2 below (which breaks migration from + * qemu 1.2). 
+ * + */ static const VMStateDescription vmstate_acpi = { .name = "piix4_pm", -.version_id = 2, -.minimum_version_id = 1, +.version_id = 3, +.minimum_version_id = 3, .minimum_version_id_old = 1, +.load_state_old = acpi_load_old, .post_load = vmstate_acpi_post_load, .fields = (VMStateField []) { VMSTATE_PCI_DEVICE(dev, PIIX4PMState), -- 1.7.6.4
[patch 03/18] x86: pvclock: make sure rdtsc doesn't speculate out of region
Originally from Jeremy Fitzhardinge. pvclock_get_time_values, which contains the memory barriers, will be removed by the next patch. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -97,10 +97,10 @@ cycle_t pvclock_clocksource_read(struct do { version = pvclock_get_time_values(&shadow, src); - barrier(); + rdtsc_barrier(); offset = pvclock_get_nsec_offset(&shadow); ret = shadow.system_timestamp + offset; - barrier(); + rdtsc_barrier(); } while (version != src->version); if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) &&
[patch 05/18] x86: pvclock: create helper for pvclock data retrieval
Originally from Jeremy Fitzhardinge. So code can be reused. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/arch/x86/kernel/pvclock.c === --- vsyscall.orig/arch/x86/kernel/pvclock.c +++ vsyscall/arch/x86/kernel/pvclock.c @@ -26,13 +26,6 @@ void pvclock_set_flags(u8 flags) valid_flags = flags; } -static u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src) -{ - u64 delta = native_read_tsc() - src-tsc_timestamp; - return pvclock_scale_delta(delta, src-tsc_to_system_mul, - src-tsc_shift); -} - unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src) { u64 pv_tsc_khz = 100ULL 32; @@ -55,17 +48,12 @@ void pvclock_resume(void) cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) { unsigned version; - cycle_t ret, offset; + cycle_t ret; u64 last; u8 flags; do { - version = src-version; - rdtsc_barrier(); - offset = pvclock_get_nsec_offset(src); - ret = src-system_time + offset; - flags = src-flags; - rdtsc_barrier(); + version = __pvclock_read_cycles(src, ret, flags); } while ((src-version 1) || version != src-version); if ((valid_flags PVCLOCK_TSC_STABLE_BIT) Index: vsyscall/arch/x86/include/asm/pvclock.h === --- vsyscall.orig/arch/x86/include/asm/pvclock.h +++ vsyscall/arch/x86/include/asm/pvclock.h @@ -56,4 +56,32 @@ static inline u64 pvclock_scale_delta(u6 return product; } +static __always_inline +u64 pvclock_get_nsec_offset(const struct pvclock_vcpu_time_info *src) +{ + u64 delta = __native_read_tsc() - src-tsc_timestamp; + return pvclock_scale_delta(delta, src-tsc_to_system_mul, + src-tsc_shift); +} + +static __always_inline +unsigned __pvclock_read_cycles(const struct pvclock_vcpu_time_info *src, + cycle_t *cycles, u8 *flags) +{ + unsigned version; + cycle_t ret, offset; + u8 ret_flags; + + version = src-version; + rdtsc_barrier(); + offset = pvclock_get_nsec_offset(src); + ret = src-system_time + offset; + ret_flags = src-flags; + rdtsc_barrier(); + + *cycles = ret; + *flags = ret_flags; + return 
version; +} + #endif /* _ASM_X86_PVCLOCK_H */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 08/18] sched: add notifier for cross-cpu migrations
CCing Peter and Ingo. On Wed, Nov 14, 2012 at 10:08:31PM -0200, Marcelo Tosatti wrote: Originally from Jeremy Fitzhardinge. Signed-off-by: Marcelo Tosatti mtosa...@redhat.com Index: vsyscall/include/linux/sched.h === --- vsyscall.orig/include/linux/sched.h +++ vsyscall/include/linux/sched.h @@ -107,6 +107,14 @@ extern unsigned long this_cpu_load(void) extern void calc_global_load(unsigned long ticks); extern void update_cpu_load_nohz(void); +/* Notifier for when a task gets migrated to a new CPU */ +struct task_migration_notifier { + struct task_struct *task; + int from_cpu; + int to_cpu; +}; +extern void register_task_migration_notifier(struct notifier_block *n); + extern unsigned long get_parent_ip(unsigned long addr); struct seq_file; Index: vsyscall/kernel/sched/core.c === --- vsyscall.orig/kernel/sched/core.c +++ vsyscall/kernel/sched/core.c @@ -922,6 +922,13 @@ void check_preempt_curr(struct rq *rq, s rq->skip_clock_update = 1; } +static ATOMIC_NOTIFIER_HEAD(task_migration_notifier); + +void register_task_migration_notifier(struct notifier_block *n) +{ + atomic_notifier_chain_register(&task_migration_notifier, n); +} + #ifdef CONFIG_SMP void set_task_cpu(struct task_struct *p, unsigned int new_cpu) { @@ -952,8 +959,16 @@ void set_task_cpu(struct task_struct *p, trace_sched_migrate_task(p, new_cpu); if (task_cpu(p) != new_cpu) { + struct task_migration_notifier tmn; + p->se.nr_migrations++; perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0); + + tmn.task = p; + tmn.from_cpu = task_cpu(p); + tmn.to_cpu = new_cpu; + + atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn); } __set_task_cpu(p, new_cpu); -- Gleb.
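The notifier-chain pattern this patch uses — subscribers register callbacks, and the publisher walks the chain with an event payload — can be modeled in userspace in a few lines. This is a hypothetical single-threaded model, not the kernel's atomic notifier API (which adds RCU-protected traversal and priorities):

```c
#include <assert.h>
#include <stddef.h>

/* Payload modeled on struct task_migration_notifier (task pointer omitted). */
struct migration_event {
    int from_cpu;
    int to_cpu;
};

struct notifier {
    int (*call)(struct notifier *nb, struct migration_event *ev);
    struct notifier *next;
};

static struct notifier *chain_head;

/* Register at the head, like atomic_notifier_chain_register (sans priority). */
static void chain_register(struct notifier *nb)
{
    nb->next = chain_head;
    chain_head = nb;
}

/* Publisher side: walk the chain, like atomic_notifier_call_chain. */
static void chain_call(struct migration_event *ev)
{
    struct notifier *nb;

    for (nb = chain_head; nb; nb = nb->next)
        nb->call(nb, ev);
}

/* Example subscriber counting real migrations, as a kvmclock-style
 * consumer of this notifier might. */
static int migrations_seen;

static int count_cb(struct notifier *nb, struct migration_event *ev)
{
    (void)nb;
    if (ev->from_cpu != ev->to_cpu)
        migrations_seen++;
    return 0;
}
```

In the patch, set_task_cpu() plays the publisher role, filling in from_cpu/to_cpu before invoking the chain.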