Re: KVM with hugepages generate huge load with two guests
21.11.2010 03:24, Dmitry Golubev wrote: Hi, Seems that nobody is interested in this bug :( Anyway I wanted to add a bit more to this investigation. Once I put nohz=off highres=off clocksource=acpi_pm in guest kernel options, the guests started to behave better - they do not stay in the slow state, but rather get there for some seconds (usually up to minute, but sometimes 2-3 minutes) and then get out of it (this cycle Just out of curiocity: did you try updating the BIOS on your motherboard? The issus you're facing seems to be quite unique, and I've seen more than once how various different weird issues were fixed just by updating the BIOS. Provided they actually did they own homework and fixed something and released the fixes too... ;) P.S. I'm Not A Guru (tm) :) /mjt -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM with hugepages generate huge load with two guests
Just out of curiocity: did you try updating the BIOS on your motherboard? The issus you're facing seems to be quite unique, and I've seen more than once how various different weird issues were fixed just by updating the BIOS. Provided they actually did they own homework and fixed something and released the fixes too... ;) Thank you for reply, I really appreciate that somebody found time to answer. Unfortunately for this investigation I managed to upgrade BIOS version few months ago. I just checked - there are no newer versions. I do see, however, that many people advise to change to acpi_pm ckocksource (and, thus, disable nohz option) in case similar problems are experienced - I did not invent this workaround (got the idea here: http://forum.proxmox.com/threads/5144-100-CPU-on-host-VM-hang-every-night?p=29143#post29143 ). Looks like an ancient bug. I even upgraded my qemu-kvm to version 0.13 without any significant changes to this behavior. It is really weird, however how one guest can work fine, but two start messing with each other. Shouldn't there be some kind of isolation between them? As they both start to behave exactly the same at exactly the same time. And it does not happen once a month or a year, but pretty frequently. Thanks, Dmitry -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM with hugepages generate huge load with two guests
On 11/21/2010 02:24 AM, Dmitry Golubev wrote: Hi, Seems that nobody is interested in this bug :( It's because the information is somewhat confused. There's a way to prepare bug reports that gets developers competing to see who solves it first. Anyway I wanted to add a bit more to this investigation. Once I put nohz=off highres=off clocksource=acpi_pm in guest kernel options, the guests started to behave better - they do not stay in the slow state, but rather get there for some seconds (usually up to minute, but sometimes 2-3 minutes) and then get out of it (this cycle repeats once in a while - every approx 3-6 minutes). Once the situation became stable, so that I am able to leave the guests without very much worries, I also noticed that sometimes the predicted swapping occurs, although rarely (I waited about half an hour to catch the first swapping on the host). Here is a fragment of vmstat. Note that when the first column shows 8-9 - the slowness and huge load happens. You can also see how is appears and disappears (with nohz and kvm-clock it did not go out of slowness period, but with tsc clock the probability of getting out is significantly lower): Are you sure it is hugepages related? Can you post kvm_stat output while slowness is happening? 'perf top' on the host? and on the guest? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
VMLAUNCH hangs when enabling EPT
Hi all, I just implement the EPT support in my hypervisor (Very similar to Newbluepill). My problem is irrelevant with KVM. It's just about how to debug VTx implementation in drivers. Here is the story. First, I implement a driver to support partial VTx, and it works very well. Then I implement EPT to identically map gfn to mfn from 0x0 to 0xf. I suppose it should be OK. But the result is the Windows OS hangs (No reboot, No BSOD) when executing VMLAUNCH instruction. And my problem is that, the windbg just shows debuggee is running when the debuggee Windows OS hangs, even if I insert ud2 instruction before the next statement, #VMEXIT handler and the first instruction in non-root mode. VMLAUNCH should not make this happen according to Intel's manual 2B. Everything is OK if I set enable ept to be 0 or clear the EPT pointer field in VMCS. Can someone explain why this happens and what should I do to continue debugging? Both the hypervisor and the Windows OS is on x86_32 platform. I use windbg to debug the target machine via serial port. Some debug information: EPT pointer is 0x9ba801e, (pfn:0x9ba8, flag:0x1e, I have double checked this) PML4[0] = 0x_09cd8007, PDPT[0] = 0x_09cf3007, PD[0] = 0x_09cf2007, PT[0] = 0x_0077. Other entries are of the same scheme with different values. The debuggee platform is on Intel i5 650, multi-core disabled. Since KVM also concentrates on building hypervisor via loading driver, I am really looking forward for your help. Thanks, Miao -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] A little trace_kvm_exit improvement
Record the ISA (vmx or svm) and the exit qualification in trace_kvm_exit. Avi Kivity (2): KVM: Record instruction set in kvm_exit tracepoint KVM: Add instruction-set-specific exit qualifications to kvm_exit trace arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/svm.c | 12 +++- arch/x86/kvm/trace.h| 17 + arch/x86/kvm/vmx.c | 10 +- 4 files changed, 34 insertions(+), 6 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] KVM: Record instruction set in kvm_exit tracepoint
exit_reason's meaning depend on the instruction set; record it so a trace taken on one machine can be interpreted on another. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/kvm/svm.c |2 +- arch/x86/kvm/trace.h |9 +++-- arch/x86/kvm/vmx.c |2 +- 3 files changed, 9 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index c6a7798..b83954e 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -2980,7 +2980,7 @@ static int handle_exit(struct kvm_vcpu *vcpu) struct kvm_run *kvm_run = vcpu-run; u32 exit_code = svm-vmcb-control.exit_code; - trace_kvm_exit(exit_code, vcpu); + trace_kvm_exit(exit_code, vcpu, KVM_ISA_SVM); if (!(svm-vmcb-control.intercept_cr_write INTERCEPT_CR0_MASK)) vcpu-arch.cr0 = svm-vmcb-save.cr0; diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index a6544b8..1061022 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -178,21 +178,26 @@ TRACE_EVENT(kvm_apic, #define trace_kvm_apic_read(reg, val) trace_kvm_apic(0, reg, val) #define trace_kvm_apic_write(reg, val) trace_kvm_apic(1, reg, val) +#define KVM_ISA_VMX 1 +#define KVM_ISA_SVM 2 + /* * Tracepoint for kvm guest exit: */ TRACE_EVENT(kvm_exit, - TP_PROTO(unsigned int exit_reason, struct kvm_vcpu *vcpu), - TP_ARGS(exit_reason, vcpu), + TP_PROTO(unsigned int exit_reason, struct kvm_vcpu *vcpu, u32 isa), + TP_ARGS(exit_reason, vcpu, isa), TP_STRUCT__entry( __field(unsigned int, exit_reason ) __field(unsigned long, guest_rip ) + __field(u32,isa ) ), TP_fast_assign( __entry-exit_reason= exit_reason; __entry-guest_rip = kvm_rip_read(vcpu); + __entry-isa= isa; ), TP_printk(reason %s rip 0x%lx, diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index a9ad174..1d075de 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3700,7 +3700,7 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu) u32 exit_reason = vmx-exit_reason; u32 vectoring_info = vmx-idt_vectoring_info; - trace_kvm_exit(exit_reason, vcpu); + trace_kvm_exit(exit_reason, vcpu, KVM_ISA_VMX); /* If guest state is invalid, start emulating */ if (vmx-emulation_required emulate_invalid_guest_state) -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] KVM: Add instruction-set-specific exit qualifications to kvm_exit trace
The exit reason alone is insufficient to understand exactly why an exit occured; add ISA-specific trace parameters for additional information. Because fetching these parameters is expensive on vmx, and because these parameters are fetched even if tracing is disabled, we fetch the parameters via a callback instead of as traditional trace arguments. Signed-off-by: Avi Kivity a...@redhat.com --- arch/x86/include/asm/kvm_host.h |1 + arch/x86/kvm/svm.c | 10 ++ arch/x86/kvm/trace.h|8 ++-- arch/x86/kvm/vmx.c |8 4 files changed, 25 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b04c0fa..d815d9d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -594,6 +594,7 @@ struct kvm_x86_ops { void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset); + void (*get_exit_info)(struct kvm_vcpu *vcpu, u64* info1, u64* info2); const struct trace_print_flags *exit_reasons_str; }; diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index b83954e..2fd2f4d 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -2974,6 +2974,14 @@ void dump_vmcb(struct kvm_vcpu *vcpu) } +static void svm_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) +{ + struct vmcb_control_area *control = to_svm(vcpu)-vmcb-control; + + *info1 = control-exit_info_1; + *info2 = control-exit_info_2; +} + static int handle_exit(struct kvm_vcpu *vcpu) { struct vcpu_svm *svm = to_svm(vcpu); @@ -3678,7 +3686,9 @@ static struct kvm_x86_ops svm_x86_ops = { .get_tdp_level = get_npt_level, .get_mt_mask = svm_get_mt_mask, + .get_exit_info = svm_get_exit_info, .exit_reasons_str = svm_exit_reasons_str, + .get_lpage_level = svm_get_lpage_level, .cpuid_update = svm_cpuid_update, diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 1061022..1357d7c 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -192,18 +192,22 @@ TRACE_EVENT(kvm_exit, __field(unsigned int, exit_reason ) __field(unsigned long, guest_rip ) __field(u32,isa ) + __field(u64,info1 ) + __field(u64,info2 ) ), TP_fast_assign( __entry-exit_reason= exit_reason; __entry-guest_rip = kvm_rip_read(vcpu); __entry-isa= isa; + kvm_x86_ops-get_exit_info(vcpu, __entry-info1, + __entry-info2); ), - TP_printk(reason %s rip 0x%lx, + TP_printk(reason %s rip 0x%lx info %llx %llx, ftrace_print_symbols_seq(p, __entry-exit_reason, kvm_x86_ops-exit_reasons_str), -__entry-guest_rip) +__entry-guest_rip, __entry-info1, __entry-info2) ); /* diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 1d075de..8dbf095 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -3690,6 +3690,12 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { static const int kvm_vmx_max_exit_handlers = ARRAY_SIZE(kvm_vmx_exit_handlers); +static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u64 *info1, u64 *info2) +{ + *info1 = vmcs_readl(EXIT_QUALIFICATION); + *info2 = vmcs_read32(VM_EXIT_INTR_INFO); +} + /* * The guest has exited. See if we can fix it or if we need userspace * assistance. @@ -4352,7 +4358,9 @@ static struct kvm_x86_ops vmx_x86_ops = { .get_tdp_level = get_ept_level, .get_mt_mask = vmx_get_mt_mask, + .get_exit_info = vmx_get_exit_info, .exit_reasons_str = vmx_exit_reasons_str, + .get_lpage_level = vmx_get_lpage_level, .cpuid_update = vmx_cpuid_update, -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM with hugepages generate huge load with two guests
Thanks for the answer. Are you sure it is hugepages related? Well, empirically it looked like either hugepages-related, or regression of qemu-kvm 0.12.3 - 0.12.5, as this did not happen until I upgraded (needed to avoid disk corruption caused by a bug in 0.12.3) and put hugepages. However as frequency of problem does seem related to memory each guest consumes (more memory = faster the problem appears) and in the beginning it might have been that the memory consumption of the guests did not hit some kind of threshold, maybe it is not really hugepages related. Can you post kvm_stat output while slowness is happening? 'perf top' on the host? and on the guest? OK, I will test this and write back. Thanks, Dmitry -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v3 0/2] Type-safe ioport callbacks
On 11/17/2010 03:50 AM, Avi Kivity wrote: A not-so-recent qemu - qemu-kvm merge broke cpu hotplug without the compiler complaining because of the type-unsafeness of the ioport callbacks. This patchset adds a type-safe variant of ioport callbacks and coverts a sample ioport. Converting the other 300-odd registrations is left as an excercise to the community. Applied all. Thanks. Regards, Anthony Liguori v3: - define a common IORange that can also be used for mmio - move start/length into IORange - make access width a parameter of the access functions instead of having a callback per access size v2: - const correctness - avoid return void Avi Kivity (2): Type-safe ioport callbacks piix4 acpi: convert io BAR to type-safe ioport callbacks hw/acpi_piix4.c | 55 +++ ioport.c| 64 +++ ioport.h|2 + iorange.h | 30 + 4 files changed, 118 insertions(+), 33 deletions(-) create mode 100644 iorange.h -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH] optionrom: fix bugs in signrom.sh
On 11/16/2010 08:33 AM, Avi Kivity wrote: signrom.sh has multiple bugs: - the last byte is considered when calculating the existing checksum, but not when computing the correction - apprently the 'expr' expression overflows and produces incorrect results with larger roms - if the checksum happened to be zero, we calculated the correction byte to be 256 Instead of rewriting this in half a line of python, this patch fixes the bugs. Signed-off-by: Avi Kivitya...@redhat.com Applied. Thanks. Regards, Anthony Liguori --- pc-bios/optionrom/signrom.sh |7 +++ 1 files changed, 3 insertions(+), 4 deletions(-) diff --git a/pc-bios/optionrom/signrom.sh b/pc-bios/optionrom/signrom.sh index 975b27d..9dc5c63 100755 --- a/pc-bios/optionrom/signrom.sh +++ b/pc-bios/optionrom/signrom.sh @@ -31,14 +31,13 @@ x=`dd if=$1 bs=1 count=1 skip=2 2/dev/null | od -t u1 -A n` size=$(( $x * 512 - 1 )) # now get the checksum -nums=`od -A n -t u1 -v $1` +nums=`od -A n -t u1 -v -N $size $1` for i in ${nums}; do # add each byte's value to sum -sum=`expr $sum + $i` +sum=`expr \( $sum + $i \) % 256` done -sum=$(( $sum % 256 )) -sum=$(( 256 - $sum )) +sum=$(( (256 - $sum) % 256 )) sum_octal=$( printf %o $sum ) # and write the output file -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH v3] virtio-9p: fix build on !CONFIG_UTIMENSAT
On 11/14/2010 08:15 PM, Hidetoshi Seto wrote: This patch introduce a fallback mechanism for old systems that do not support utimensat(). This fix build failure with following warnings: hw/virtio-9p-local.c: In function 'local_utimensat': hw/virtio-9p-local.c:479: warning: implicit declaration of function 'utimensat' hw/virtio-9p-local.c:479: warning: nested extern declaration of 'utimensat' and: hw/virtio-9p.c: In function 'v9fs_setattr_post_chmod': hw/virtio-9p.c:1410: error: 'UTIME_NOW' undeclared (first use in this function) hw/virtio-9p.c:1410: error: (Each undeclared identifier is reported only once hw/virtio-9p.c:1410: error: for each function it appears in.) hw/virtio-9p.c:1413: error: 'UTIME_OMIT' undeclared (first use in this function) hw/virtio-9p.c: In function 'v9fs_wstat_post_chmod': hw/virtio-9p.c:2905: error: 'UTIME_OMIT' undeclared (first use in this function) v3: - Use better alternative handling for UTIME_NOW/OMIT - Move qemu_utimensat() to cutils.c V2: - Introduce qemu_utimensat() Signed-off-by: Hidetoshi Setoseto.hideto...@jp.fujitsu.com Applied. Thanks. Regards, Anthony Liguori --- cutils.c | 43 +++ hw/virtio-9p-local.c |4 ++-- qemu-common.h| 10 ++ 3 files changed, 55 insertions(+), 2 deletions(-) diff --git a/cutils.c b/cutils.c index 536ee93..3c18941 100644 --- a/cutils.c +++ b/cutils.c @@ -288,3 +288,46 @@ int fcntl_setfl(int fd, int flag) } #endif +int qemu_utimensat(int dirfd, const char *path, const struct timespec *times, + int flags) +{ +#ifdef CONFIG_UTIMENSAT +return utimensat(dirfd, path, times, flags); +#else +/* Fallback: use utimes() instead of utimensat() */ +struct timeval tv[2], tv_now; +struct stat st; +int i; + +/* happy if special cases */ +if (times[0].tv_nsec == UTIME_OMIT times[1].tv_nsec == UTIME_OMIT) { +return 0; +} +if (times[0].tv_nsec == UTIME_NOW times[1].tv_nsec == UTIME_NOW) { +return utimes(path, NULL); +} + +/* prepare for hard cases */ +if (times[0].tv_nsec == UTIME_NOW || times[1].tv_nsec == UTIME_NOW) { +gettimeofday(tv_now, NULL); +} +if (times[0].tv_nsec == UTIME_OMIT || times[1].tv_nsec == UTIME_OMIT) { +stat(path,st); +} + +for (i = 0; i 2; i++) { +if (times[i].tv_nsec == UTIME_NOW) { +tv[i].tv_sec = tv_now.tv_sec; +tv[i].tv_usec = 0; +} else if (times[i].tv_nsec == UTIME_OMIT) { +tv[i].tv_sec = (i == 0) ? st.st_atime : st.st_mtime; +tv[i].tv_usec = 0; +} else { +tv[i].tv_sec = times[i].tv_sec; +tv[i].tv_usec = times[i].tv_nsec / 1000; +} +} + +return utimes(path,tv[0]); +#endif +} diff --git a/hw/virtio-9p-local.c b/hw/virtio-9p-local.c index 0d52020..41603ea 100644 --- a/hw/virtio-9p-local.c +++ b/hw/virtio-9p-local.c @@ -480,9 +480,9 @@ static int local_chown(FsContext *fs_ctx, const char *path, FsCred *credp) } static int local_utimensat(FsContext *s, const char *path, - const struct timespec *buf) + const struct timespec *buf) { -return utimensat(AT_FDCWD, rpath(s, path), buf, AT_SYMLINK_NOFOLLOW); +return qemu_utimensat(AT_FDCWD, rpath(s, path), buf, AT_SYMLINK_NOFOLLOW); } static int local_remove(FsContext *ctx, const char *path) diff --git a/qemu-common.h b/qemu-common.h index 2fbc27f..7fe4c16 100644 --- a/qemu-common.h +++ b/qemu-common.h @@ -146,6 +146,16 @@ time_t mktimegm(struct tm *tm); int qemu_fls(int i); int qemu_fdatasync(int fd); int fcntl_setfl(int fd, int flag); +#ifndef CONFIG_UTIMENSAT +#ifndef UTIME_NOW +# define UTIME_NOW ((1l 30) - 1l) +#endif +#ifndef UTIME_OMIT +# define UTIME_OMIT((1l 30) - 2l) +#endif +#endif +int qemu_utimensat(int dirfd, const char *path, const struct timespec *times, +int flags); /* path.c */ void init_paths(const char *prefix); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH vgabios] Add 1280x768 mode
Signed-off-by: Avi Kivity a...@redhat.com --- vbetables-gen.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/vbetables-gen.c b/vbetables-gen.c index 550935a..76b8842 100644 --- a/vbetables-gen.c +++ b/vbetables-gen.c @@ -55,6 +55,9 @@ ModeInfo modes[] = { { 1152, 864, 16 , 0x14a}, { 1152, 864, 24 , 0x14b}, { 1152, 864, 32 , 0x14c}, +{ 1280, 768, 16 , 0x175}, +{ 1280, 768, 24 , 0x176}, +{ 1280, 768, 32 , 0x177}, { 1280, 800, 16 , 0x178}, { 1280, 800, 24 , 0x179}, { 1280, 800, 32 , 0x17a}, -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH vgabios] Add 1280x768 mode
On 11/21/2010 05:33 PM, Avi Kivity wrote: Signed-off-by: Avi Kivitya...@redhat.com --- vbetables-gen.c |3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/vbetables-gen.c b/vbetables-gen.c index 550935a..76b8842 100644 --- a/vbetables-gen.c +++ b/vbetables-gen.c @@ -55,6 +55,9 @@ ModeInfo modes[] = { { 1152, 864, 16 , 0x14a}, { 1152, 864, 24 , 0x14b}, { 1152, 864, 32 , 0x14c}, +{ 1280, 768, 16 , 0x175}, +{ 1280, 768, 24 , 0x176}, +{ 1280, 768, 32 , 0x177}, This is of course from qemu-kvm, and was added at a user's request. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 6/6] KVM: MMU: delay flush all tlbs on sync_page path
On 11/20/2010 12:11 AM, Marcelo Tosatti wrote: void kvm_flush_remote_tlbs(struct kvm *kvm) { +int dirty_count = atomic_read(kvm-tlbs_dirty); + +smp_mb(); if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) ++kvm-stat.remote_tlb_flush; +atomic_sub(dirty_count, kvm-tlbs_dirty); } This is racy because kvm_flush_remote_tlbs might be called without mmu_lock protection. Sorry for my carelessness, it should be 'cmpxchg' here. You could decrease the counter on invalidate_page/invalidate_range_start only, I want to avoid a unnecessary tlbs flush, if tlbs have been flushed after sync_page, then we don't need flush tlbs on invalidate_page/ invalidate_range_start path. these are not fast paths anyway. How about below patch? it just needs one atomic operation. --- arch/x86/kvm/paging_tmpl.h |4 ++-- include/linux/kvm_host.h |2 ++ virt/kvm/kvm_main.c|7 ++- 3 files changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index dfb906f..e64192f 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -781,14 +781,14 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) gfn = gpte_to_gfn(gpte); if (FNAME(map_invalid_gpte)(vcpu, sp, sp-spt[i], gpte)) { - kvm_flush_remote_tlbs(vcpu-kvm); + vcpu-kvm-tlbs_dirty++; continue; } if (gfn != sp-gfns[i]) { drop_spte(vcpu-kvm, sp-spt[i], shadow_trap_nonpresent_pte); - kvm_flush_remote_tlbs(vcpu-kvm); + vcpu-kvm-tlbs_dirty++; continue; } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 4bd663d..dafd90e 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -249,6 +249,7 @@ struct kvm { struct mmu_notifier mmu_notifier; unsigned long mmu_notifier_seq; long mmu_notifier_count; + long tlbs_dirty; #endif }; @@ -377,6 +378,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu); void kvm_resched(struct kvm_vcpu *vcpu); void kvm_load_guest_fpu(struct kvm_vcpu *vcpu); void kvm_put_guest_fpu(struct kvm_vcpu *vcpu); + void kvm_flush_remote_tlbs(struct kvm *kvm); void kvm_reload_remote_mmus(struct kvm *kvm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index fb93ff9..fe0a1a7 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -168,8 +168,12 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req) void kvm_flush_remote_tlbs(struct kvm *kvm) { + long dirty_count = kvm-tlbs_dirty; + + smp_mb(); if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) ++kvm-stat.remote_tlb_flush; + cmpxchg(kvm-tlbs_dirty, dirty_count, 0); } void kvm_reload_remote_mmus(struct kvm *kvm) @@ -249,7 +253,7 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn, idx = srcu_read_lock(kvm-srcu); spin_lock(kvm-mmu_lock); kvm-mmu_notifier_seq++; - need_tlb_flush = kvm_unmap_hva(kvm, address); + need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm-tlbs_dirty; spin_unlock(kvm-mmu_lock); srcu_read_unlock(kvm-srcu, idx); @@ -293,6 +297,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, kvm-mmu_notifier_count++; for (; start end; start += PAGE_SIZE) need_tlb_flush |= kvm_unmap_hva(kvm, start); + need_tlb_flush |= kvm-tlbs_dirty; spin_unlock(kvm-mmu_lock); srcu_read_unlock(kvm-srcu, idx); -- 1.7.0.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] KVM: fix race condition in kvm_mmu_slot_remove_write_access()
Suggested by Avi. Dropping PT_WRITABLE_MASK is under race condition with the hardware. Use update_spte() to avoid this. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- arch/x86/kvm/mmu.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index bdb9fa9..e716def 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3449,7 +3449,7 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot) for (i = 0; i PT64_ENT_PER_PAGE; ++i) /* avoid RMW */ if (is_writable_pte(pt[i])) - pt[i] = ~PT_WRITABLE_MASK; + update_spte(pt[i], pt[i] ~PT_WRITABLE_MASK); } kvm_flush_remote_tlbs(kvm); } -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] KVM: introduce helper to access lpage_info
The index calculation for accessing lpage_info appears a few times. This patch replaces these with a new lpage_idx() helper. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- arch/x86/kvm/mmu.c | 20 +++- 1 files changed, 11 insertions(+), 9 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index e716def..2139309 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -476,6 +476,12 @@ static void kvm_mmu_page_set_gfn(struct kvm_mmu_page *sp, int index, gfn_t gfn) sp-gfns[index] = gfn; } +static unsigned long lpage_idx(gfn_t gfn, gfn_t base_gfn, int level) +{ + return (gfn KVM_HPAGE_GFN_SHIFT(level)) - + (base_gfn KVM_HPAGE_GFN_SHIFT(level)); +} + /* * Return the pointer to the largepage write count for a given * gfn, handling slots that are not large page aligned. @@ -484,10 +490,8 @@ static int *slot_largepage_idx(gfn_t gfn, struct kvm_memory_slot *slot, int level) { - unsigned long idx; + unsigned long idx = lpage_idx(gfn, slot-base_gfn, level); - idx = (gfn KVM_HPAGE_GFN_SHIFT(level)) - - (slot-base_gfn KVM_HPAGE_GFN_SHIFT(level)); return slot-lpage_info[level - 2][idx].write_count; } @@ -591,8 +595,7 @@ static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level) if (likely(level == PT_PAGE_TABLE_LEVEL)) return slot-rmap[gfn - slot-base_gfn]; - idx = (gfn KVM_HPAGE_GFN_SHIFT(level)) - - (slot-base_gfn KVM_HPAGE_GFN_SHIFT(level)); + idx = lpage_idx(gfn, slot-base_gfn, level); return slot-lpage_info[level - 2][idx].rmap_pde; } @@ -887,11 +890,10 @@ static int kvm_handle_hva(struct kvm *kvm, unsigned long hva, for (j = 0; j KVM_NR_PAGE_SIZES - 1; ++j) { unsigned long idx; - int sh; - sh = KVM_HPAGE_GFN_SHIFT(PT_DIRECTORY_LEVEL+j); - idx = ((memslot-base_gfn+gfn_offset) sh) - - (memslot-base_gfn sh); + idx = lpage_idx(memslot-base_gfn + gfn_offset, + memslot-base_gfn, + PT_DIRECTORY_LEVEL + j); ret |= handler(kvm, memslot-lpage_info[j][idx].rmap_pde, data); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] KVM: count the number of dirty pages for each memslot
This patch introduces the counter to hold the number of dirty pages in each memslot. This eliminate the need to scan dirty bitmap for get_dirty_log. We will also use this to optimize dirty logging later. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp --- arch/x86/kvm/x86.c |9 +++-- include/linux/kvm_host.h |1 + virt/kvm/kvm_main.c |6 +- 3 files changed, 9 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 410d2d1..e9cf381 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3199,10 +3199,9 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm, int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log) { - int r, i; + int r; struct kvm_memory_slot *memslot; unsigned long n; - unsigned long is_dirty = 0; mutex_lock(kvm-slots_lock); @@ -3217,11 +3216,8 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, n = kvm_dirty_bitmap_bytes(memslot); - for (i = 0; !is_dirty i n/sizeof(long); i++) - is_dirty = memslot-dirty_bitmap[i]; - /* If nothing is dirty, don't bother messing with page tables. */ - if (is_dirty) { + if (memslot-nr_dirty_pages) { struct kvm_memslots *slots, *old_slots; unsigned long *dirty_bitmap; @@ -3236,6 +3232,7 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, goto out; memcpy(slots, kvm-memslots, sizeof(struct kvm_memslots)); slots-memslots[log-slot].dirty_bitmap = dirty_bitmap; + slots-memslots[log-slot].nr_dirty_pages = 0; slots-generation++; old_slots = kvm-memslots; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 4bd663d..ce101b1 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -152,6 +152,7 @@ struct kvm_memory_slot { unsigned long *rmap; unsigned long *dirty_bitmap; unsigned long *dirty_bitmap_head; + unsigned long nr_dirty_pages; struct { unsigned long rmap_pde; int write_count; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index fb93ff9..e1022ad 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -565,6 +565,7 @@ static int kvm_create_dirty_bitmap(struct kvm_memory_slot *memslot) return -ENOMEM; memslot-dirty_bitmap_head = memslot-dirty_bitmap; + memslot-nr_dirty_pages = 0; return 0; } @@ -1388,7 +1389,10 @@ void mark_page_dirty_in_slot(struct kvm *kvm, struct kvm_memory_slot *memslot, if (memslot memslot-dirty_bitmap) { unsigned long rel_gfn = gfn - memslot-base_gfn; - generic___set_le_bit(rel_gfn, memslot-dirty_bitmap); + if (!generic_test_le_bit(rel_gfn, memslot-dirty_bitmap)) { + generic___set_le_bit(rel_gfn, memslot-dirty_bitmap); + memslot-nr_dirty_pages++; + } } } -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 4/4] KVM: selective write protection using dirty bitmap
Lai Jiangshan once tried to rewrite kvm_mmu_slot_remove_write_access() using rmap: kvm: rework remove-write-access for a slot http://www.spinics.net/lists/kvm/msg35871.html One problem pointed out there was that this approach might hurt cache locality and make things slow down. But if we restrict the story to dirty logging, we notice that only small portion of pages are actually needed to be write protected. So this patch uses his approach with small modification to use switched out dirty bitmap as a hint to restrict the rmap travel. We can also use this to selectively write protect pages to reduce unwanted page faults in the future. Conditions at which this hack has advantage: - few dirty pages compared to shadow pages - under reasonable, not pathological, workloads Note that slots for frame buffers are also affected by the total number of shadow pages in the original implmentation. Actually, I observed that the larger RAM became the more time kvm_mmu_slot_remove_write_access() took for frame buffer updates. This problem can be solved by our approach. Also, in the usual workloads, live-migration does not see so many dirty pages which makes this approach not effective. Performance gain: During x11perf, in the condition of nr_dirty_pages/npages = 375/4096, our method was 20 times faster than the original. Live-migration also got similar improvement, and the effect for low workloads was more siginificant. Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp Cc: Lai Jiangshan la...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |2 + arch/x86/kvm/mmu.c | 42 +++ arch/x86/kvm/x86.c | 37 - 3 files changed, 79 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index b04c0fa..bc72c0d 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -617,6 +617,8 @@ void kvm_mmu_set_mask_ptes(u64 user_mask, u64 accessed_mask, int kvm_mmu_reset_context(struct kvm_vcpu *vcpu); void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot); +void kvm_mmu_slot_remove_write_access_mask(struct kvm *kvm, + struct kvm_memory_slot *slot, unsigned long *dirty_bitmap); void kvm_mmu_zap_all(struct kvm *kvm); unsigned int kvm_mmu_calculate_mmu_pages(struct kvm *kvm); void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int kvm_nr_mmu_pages); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 2139309..978e806 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3456,6 +3456,48 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot) kvm_flush_remote_tlbs(kvm); } +static void remove_write_access_rmapp(struct kvm *kvm, unsigned long *rmapp) +{ + u64 *spte = rmap_next(kvm, rmapp, NULL); + + while (spte) { + update_spte(spte, *spte ~PT_WRITABLE_MASK); + spte = rmap_next(kvm, rmapp, spte); + } +} + +/* + * Write protect the pages marked dirty in a given bitmap. + */ +void kvm_mmu_slot_remove_write_access_mask(struct kvm *kvm, + struct kvm_memory_slot *slot, + unsigned long *dirty_bitmap) +{ + int i; + gfn_t gfn_offset; + unsigned long idx; + long last_idx[KVM_NR_PAGE_SIZES - 1]; + + for (i = 0; i (KVM_NR_PAGE_SIZES - 1); ++i) + last_idx[i] = -1; + + for_each_set_bit(gfn_offset, dirty_bitmap, slot-npages) { + remove_write_access_rmapp(kvm, slot-rmap[gfn_offset]); + + for (i = 0; i (KVM_NR_PAGE_SIZES - 1); ++i) { + idx = lpage_idx(slot-base_gfn + gfn_offset, + slot-base_gfn, PT_DIRECTORY_LEVEL + i); + if (idx == last_idx[i]) + continue; + + remove_write_access_rmapp(kvm, + slot-lpage_info[i][idx].rmap_pde); + last_idx[i] = idx; + } + } + kvm_flush_remote_tlbs(kvm); +} + void kvm_mmu_zap_all(struct kvm *kvm) { struct kvm_mmu_page *sp, *node; diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e9cf381..222af5e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -3193,6 +3193,32 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm, return 0; } +enum kvm_dirty_level { + KVM_DIRTY_LEVEL_NOTHING, + KVM_DIRTY_LEVEL_LIGHT, + KVM_DIRTY_LEVEL_HEAVY +}; + +/* + * Decide which write protection functions we should use. + * + * Key factors: + * - number of dirty pages + * - number of shadow pages + * - direct mode or shadow mode + */ +static enum kvm_dirty_level dirty_level_memslot(struct kvm_memory_slot *memslot) +{ + if (!memslot-nr_dirty_pages) + return
Re: [PATCH-v2 0/4] KVM: dirty logging optimization
(2010/11/22 14:40), Takuya Yoshikawa wrote: Hi, I updated my patch set with your comments reflected. Changelog: - used enum - lpage_idx issues - function names - measurements - [PATCH 1/4] KVM: fix race condition in kvm_mmu_slot_remove_write_access() - [PATCH 2/4] KVM: introduce helper to access lpage_info - [PATCH 3/4] KVM: count the number of dirty pages for each memslot - [RFC PATCH 4/4] KVM: selective write protection using dirty bitmap The first one is a small fix and maybe worth applying seperately. The second and third are preparation for the last one, but I believe each has its own value. I measured the time consumed by kvm_mmu_slot_remove_write_access[_mask]() to decide at which conditions we choose *_mask(). Please check below! * I implemented dirty_level_memslot() based on these results. See [PATCH 4] Is it possible to apply [PATCH 1 and 2] first? It will help me a lot to concentrate on the main part. Takuya === Test - On core 2 duo machine (without EPT) = just because there was already test environment. - 1GB RAM for the guest - Workloads: x11perf for frame buffer test find / loop for live-migration test - Time was measured by rdtscll kvm mask: our method kvm orig: original method a) Frame buffer --- kernel: [ 5584.954209] kvm: nr_dirty_pages/npages = 8/4096 kernel: [ 5584.954240] kvm mask: time(tsc) = 9485 kernel: [ 5992.185125] kvm: nr_dirty_pages/npages = 7/4096 kernel: [ 5992.185389] kvm orig: time(tsc) = 444724 45 times faster --- kernel: [ 5588.330585] kvm: nr_dirty_pages/npages = 80/4096 kernel: [ 5588.330623] kvm mask: time(tsc) = 16688 kernel: [ 5996.183982] kvm: nr_dirty_pages/npages = 80/4096 kernel: [ 5996.184275] kvm orig: time(tsc) = 495551 30 times faster --- kernel: [ 5578.156641] kvm: nr_dirty_pages/npages = 195/4096 kernel: [ 5578.156690] kvm mask: time(tsc) = 27335 kernel: [ 5985.109688] kvm: nr_dirty_pages/npages = 195/4096 kernel: [ 5985.109820] kvm orig: time(tsc) = 194530 7 times faster --- kernel: [ 5605.821017] kvm: nr_dirty_pages/npages = 375/4096 kernel: [ 5605.821077] kvm mask: time(tsc) = 35049 kernel: [ 6014.627767] kvm: nr_dirty_pages/npages = 375/4096 kernel: [ 6014.628190] kvm orig: time(tsc) = 702737 20 times faster --- kernel: [ 5573.551459] kvm: nr_dirty_pages/npages = 576/4096 kernel: [ 5573.551532] kvm mask: time(tsc) = 46718 kernel: [ 5981.486862] kvm: nr_dirty_pages/npages = 576/4096 kernel: [ 5981.487036] kvm orig: time(tsc) = 210385 5 times faster --- b) Live-migration (1GB RAM) --- kernel: [ 5011.599573] kvm: nr_dirty_pages/npages = 52/261888 kernel: [ 5011.599624] kvm mask: time(tsc) = 33712 kernel: [ 4297.465401] kvm: nr_dirty_pages/npages = 50/261888 kernel: [ 4297.465918] kvm orig: time(tsc) = 894593 25 times faster --- kernel: [ 5009.045842] kvm: nr_dirty_pages/npages = 120/261888 kernel: [ 5009.045940] kvm mask: time(tsc) = 50575 kernel: [ 4295.438092] kvm: nr_dirty_pages/npages = 106/261888 kernel: [ 4295.438427] kvm orig: time(tsc) = 525140 10 times faster --- kernel: [ 5016.914031] kvm: nr_dirty_pages/npages = 251/261888 kernel: [ 5016.914160] kvm mask: time(tsc) = 78750 kernel: [ 4297.265937] kvm: nr_dirty_pages/npages = 260/261888 kernel: [ 4297.266411] kvm orig: time(tsc) = 782446 10 times faster --- kernel: [ 5015.717535] kvm: nr_dirty_pages/npages = 526/261888 kernel: [ 5015.717697] kvm mask: time(tsc) = 130137 kernel: [ 4295.370101] kvm: nr_dirty_pages/npages = 596/261888 kernel: [ 4295.370589] kvm orig: time(tsc) = 644805 5 times faster --- kernel: [ 5011.693010] kvm: nr_dirty_pages/npages = 1029/261888 kernel: [ 5011.693177] kvm mask: time(tsc) = 219863 kernel: [ 4295.760635] kvm: nr_dirty_pages/npages = 1025/261888 kernel: [ 4295.761078] kvm orig: time(tsc) = 767333 3 times faster --- kernel: [ 5014.805403] kvm: nr_dirty_pages/npages = 2219/261888 kernel: [ 5014.805676] kvm mask: time(tsc) = 454314 kernel: [ 4302.186712] kvm: nr_dirty_pages/npages = 1982/261888 kernel: [ 4302.191418] kvm orig: time(tsc) = 2294187 5 times faster --- kernel: [ 5015.305794] kvm: nr_dirty_pages/npages = 4143/261888 kernel: [ 5015.306291] kvm mask: time(tsc) =
Re: [Qemu-devel] [PATCH v3] virtio-9p: fix build on !CONFIG_UTIMENSAT
On 11/21/10 16:22, Anthony Liguori wrote: On 11/14/2010 08:15 PM, Hidetoshi Seto wrote: This patch introduce a fallback mechanism for old systems that do not support utimensat(). This fix build failure with following warnings: hw/virtio-9p-local.c: In function 'local_utimensat': hw/virtio-9p-local.c:479: warning: implicit declaration of function 'utimensat' hw/virtio-9p-local.c:479: warning: nested extern declaration of 'utimensat' and: hw/virtio-9p.c: In function 'v9fs_setattr_post_chmod': hw/virtio-9p.c:1410: error: 'UTIME_NOW' undeclared (first use in this function) hw/virtio-9p.c:1410: error: (Each undeclared identifier is reported only once hw/virtio-9p.c:1410: error: for each function it appears in.) hw/virtio-9p.c:1413: error: 'UTIME_OMIT' undeclared (first use in this function) hw/virtio-9p.c: In function 'v9fs_wstat_post_chmod': hw/virtio-9p.c:2905: error: 'UTIME_OMIT' undeclared (first use in this function) v3: - Use better alternative handling for UTIME_NOW/OMIT - Move qemu_utimensat() to cutils.c V2: - Introduce qemu_utimensat() Signed-off-by: Hidetoshi Setoseto.hideto...@jp.fujitsu.com Applied. Thanks. Regards, Anthony Liguori Anthony, Did you actually apply this one? I don't see it in the git tree. However if you did, that was a mistake, the qemu_utimensat() should not have gone into cutils.c as I pointed out earlier, it is inconsistent. Cheers, Jes -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html