[PATCH] KVM: APIC: avoid instruction emulation for EOI writes
Hi, Avi, Here comes the patch:

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since a sane guest
simply uses MOV instead of string operations. This is a nice improvement
when the guest doesn't support x2apic or Hyper-V EOI. A ~8% bandwidth
improvement is observed for a single VM (7.4Gbps -> 8Gbps), by saving
~5% of cycles from EOI emulation.

Signed-off-by: Kevin Tian kevin.t...@intel.com

Based on earlier work from:
Signed-off-by: Eddie Dong eddie.d...@intel.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
 	return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic)
+		apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->arch.apic)

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..35e4af7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4540,6 +4540,22 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int access_type, offset;
+
+	access_type = (exit_qualification >> 12) & 0xf;
+	offset = exit_qualification & 0xfff;
+	/*
+	 * A sane guest uses MOV instead of string operations to
+	 * write EOI, and the written value is not cared about. So
+	 * short-circuit here by avoiding heavy instruction
+	 * emulation.
+	 */
+	if ((access_type == 1) && (offset == APIC_EOI)) {
+		kvm_lapic_set_eoi(vcpu);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }

Thanks
Kevin

-----Original Message-----
From: Avi Kivity [mailto:a...@redhat.com]
Sent: Thursday, August 25, 2011 12:39 PM
To: Tian, Kevin
Cc: kvm@vger.kernel.org; Nakajima, Jun
Subject: Re: about vEOI optimization

On 08/25/2011 05:24 AM, Tian, Kevin wrote:
> > Another option is the hyper-V EOI support, which can also eliminate
> > the EOI exit when no additional interrupt is pending. This can
> > improve EOI performance even more.
>
> yes, and this is an orthogonal option. So if you agree, I'll send out
> an updated patch atop their work.

Thanks.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
RE: [PATCH] KVM: emulate lapic tsc deadline timer for hvm
From: Marcelo Tosatti
Sent: Tuesday, August 23, 2011 6:47 PM

> > +	if (!apic->lapic_timer.tscdeadline)
> > +		return;
> > +
> > +	tsc_target = kvm_x86_ops->
> > +		guest_to_host_tsc(apic->lapic_timer.tscdeadline);
> > +	rdtscll(tsc_now);
> > +	tsc_delta = tsc_target - tsc_now;
>
> This only works if we have a constant tsc; that's not true for large
> multiboard machines.

that type of machine exposes tricky issues to time virtualization in
general.

> Need to do this with irqs disabled as well (reading both 'now' and
> 'tsc_now' in the same critical section). Should look like this:
>
> 	local_irq_disable();
> 	u64 guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu);
> 	if (guest_tsc >= tscdeadline)
> 		hrtimer_start(now);
> 	else {
> 		ns = convert_to_ns(tscdeadline - guest_tsc);
> 		hrtimer_start(now + ns);
> 	}
> 	local_irq_enable();

Above is overkill. Only the calculation of the guest tsc delta needs to
be protected. On the other hand, I don't think masking irqs here adds an
obvious benefit. Virtualization doesn't ensure micro-level time
accuracy, since there are many events delaying virtual timer interrupt
injection even when the timer emulation logic triggers it timely. That's
even the case on bare metal, though with smaller scope, and thus most
guests are able to handle such a situation. Masking irqs in non-critical
regions simply slows down the system response to other events.

> Note the vcpu's tsc can have a different frequency than the host's, so
> vcpu_tsc_khz() should be used to convert to nanoseconds, not tsc_khz.

yes, that's a good suggestion. btw, Jinsong is on vacation this week. He
will update the patch according to your comment when back.

Thanks
Kevin

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
On 08/29/2011 09:09 AM, Tian, Kevin wrote:
> static int handle_apic_access(struct kvm_vcpu *vcpu)
> {
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int access_type, offset;
> +
> +	access_type = (exit_qualification >> 12) & 0xf;
> +	offset = exit_qualification & 0xfff;
> +	/*
> +	 * A sane guest uses MOV instead of string operations to
> +	 * write EOI, and the written value is not cared about. So
> +	 * short-circuit here by avoiding heavy instruction
> +	 * emulation.
> +	 */
> +	if ((access_type == 1) && (offset == APIC_EOI)) {

Please replace this 1 with a #define.

> +		kvm_lapic_set_eoi(vcpu);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
> 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
> }

Looks good otherwise.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
RE: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
From: Avi Kivity [mailto:a...@redhat.com]
Sent: Monday, August 29, 2011 3:24 PM

> On 08/29/2011 09:09 AM, Tian, Kevin wrote:
> > [...]
> > +	if ((access_type == 1) && (offset == APIC_EOI)) {
>
> Please replace this 1 with a #define.

updated.

--

KVM: APIC: avoid instruction emulation for EOI writes

Instruction emulation for EOI writes can be skipped, since a sane guest
simply uses MOV instead of string operations. This is a nice improvement
when the guest doesn't support x2apic or Hyper-V EOI. A ~8% bandwidth
improvement is observed for a single VM (7.4Gbps -> 8Gbps), by saving
~5% of cycles from EOI emulation.

Signed-off-by: Kevin Tian kevin.t...@intel.com

Based on earlier work from:
Signed-off-by: Eddie Dong eddie.d...@intel.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
 	return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic)
+		apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->arch.apic)

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..34c094f 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4538,8 +4538,25 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+#define APIC_ACCESS_TYPE_W	1 /* Linear write access during inst execution */
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int access_type, offset;
+
+	access_type = (exit_qualification >> 12) & 0xf;
+	offset = exit_qualification & 0xfff;
+	/*
+	 * A sane guest uses MOV instead of string operations to
+	 * write EOI, and the written value is not cared about. So
+	 * short-circuit here by avoiding heavy instruction
+	 * emulation.
+	 */
+	if ((access_type == APIC_ACCESS_TYPE_W) && (offset == APIC_EOI)) {
+		kvm_lapic_set_eoi(vcpu);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }

Thanks
Kevin
RE: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
On Mon, 2011-08-29 at 15:35 +0800, Tian, Kevin wrote:
> [...]
> +#define APIC_ACCESS_TYPE_W	1 /* Linear write access during inst execution */
> static int handle_apic_access(struct kvm_vcpu *vcpu)
> {
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int access_type, offset;
> +
> +	access_type = (exit_qualification >> 12) & 0xf;
> +	offset = exit_qualification & 0xfff;

Maybe it's worth moving that #define above, and '#define'ing all the
magic that happens in these 2 lines in vmx.h along with all the other
exit qualification parameters?

> +	/*
> +	 * A sane guest uses MOV instead of string operations to
> +	 * write EOI, and the written value is not cared about. So
> +	 * short-circuit here by avoiding heavy instruction
> +	 * emulation.
> +	 */
> +	if ((access_type == APIC_ACCESS_TYPE_W) && (offset == APIC_EOI)) {
> +		kvm_lapic_set_eoi(vcpu);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
> 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
> }
>
> Thanks
> Kevin

-- 
Sasha.
[Bug 39412] Win Vista and Win2K8 guests' network breaks down
https://bugzilla.kernel.org/show_bug.cgi?id=39412

Jay Ren yongjie@intel.com changed:

           What        |Removed   |Added
   --------------------|----------|----------
           Status      |NEEDINFO  |CLOSED
           Resolution  |          |CODE_FIX

--- Comment #8 from Jay Ren yongjie@intel.com 2011-08-29 08:48:53 ---
I tried this on Linus' linux 3.1.0-rc4 tree and found this bug got
fixed. The commit I tried is below.

commit c6a389f123b9f68d605bb7e0f9b32ec1e3e14132
Author: Linus Torvalds torva...@linux-foundation.org
Date:   Sun Aug 28 21:16:01 2011 -0700

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
You are receiving this mail because:
You are on the CC list for the bug.
You are watching the assignee of the bug.
Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
On 08/29/2011 11:15 AM, Sasha Levin wrote:
> > +#define APIC_ACCESS_TYPE_W	1 /* Linear write access during inst execution */
> > static int handle_apic_access(struct kvm_vcpu *vcpu)
> > {
> > +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> > +	int access_type, offset;
> > +
> > +	access_type = (exit_qualification >> 12) & 0xf;
> > +	offset = exit_qualification & 0xfff;
>
> Maybe it's worth moving that #define above, and '#define'ing all the
> magic that happens in these 2 lines in vmx.h along with all the other
> exit qualification parameters?

Sure, that's what vmx.h is for. And 0xfff is ~PAGE_MASK.

-- 
error compiling committee.c: too many arguments to function
[Bug 39412] Win Vista and Win2K8 guests' network breaks down
https://bugzilla.kernel.org/show_bug.cgi?id=39412

Florian Mickler flor...@mickler.org changed:

           What        |Removed   |Added
   --------------------|----------|---------------
           Resolution  |CODE_FIX  |UNREPRODUCIBLE

--- Comment #9 from Florian Mickler flor...@mickler.org 2011-08-29 08:53:53 ---
Thank you. [Because we don't exactly know what fixed it, our rule is to
close this as unreproducible.]

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
You are receiving this mail because:
You are on the CC list for the bug.
You are watching the assignee of the bug.
Re: [PATCH 3/5] Migration thread mutex
On Sat, Aug 27, 2011 at 7:09 PM, Umesh Deshpande udesh...@redhat.com wrote:
> This patch implements the migrate_ram mutex, which protects the
> RAMBlock list traversal in the migration thread during the transfer of
> RAM against additions/removals from the iothread.
>
> Note: The combination of the iothread mutex and the migration thread
> mutex works as a rw-lock. Both mutexes are acquired while modifying
> the ram_list members or the RAM block list.
>
> Signed-off-by: Umesh Deshpande udesh...@redhat.com
> ---
>  arch_init.c   | 21 +
>  cpu-all.h     |  3 +++
>  exec.c        | 23 +++
>  qemu-common.h |  2 ++
>  4 files changed, 49 insertions(+), 0 deletions(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index 484b39d..9d02270 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -109,6 +109,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
>  static RAMBlock *last_block;
>  static ram_addr_t last_offset;
> +static uint64_t last_version;
> [...]
>  typedef struct RAMList {
> +    QemuMutex mutex;  /* Protects RAM block list */
>      uint8_t *phys_dirty;
> +    uint32_t version; /* To detect ram block addition/removal */

Is there a reason why RAMList.version is uint32_t but last_version is
uint64_t?

Stefan
Re: [PATCH 5/5] Separate migration thread
On Sat, Aug 27, 2011 at 7:09 PM, Umesh Deshpande udesh...@redhat.com wrote:
> This patch creates a separate thread for the guest migration on the
> source side. All exits (on completion/error) from the migration thread
> are handled by a bottom half handler, which is called from the
> iothread.
>
> Signed-off-by: Umesh Deshpande udesh...@redhat.com
> ---
>  buffered_file.c     |  76 
>  migration.c         | 105 ++
>  migration.h         |   8 
>  qemu-thread-posix.c |  10 +
>  qemu-thread.h       |   1 +

Will this patch break Windows builds by adding a function to
qemu-thread-posix.c which is not implemented in qemu-thread-win32.c?

Stefan
Re: [PATCH 0/5] Separate thread for VM migration
On 08/27/2011 08:09 PM, Umesh Deshpande wrote:
> The following patch series deals with VCPU and iothread starvation
> during the migration of a guest. Currently the iothread is responsible
> for performing the guest migration. It holds qemu_mutex during the
> migration, doesn't allow a VCPU to enter qemu mode, and delays its
> return to the guest. The guest migration, executed as an iohandler,
> also delays the execution of other iohandlers. In the following patch
> series, the migration has been moved to a separate thread to reduce
> qemu_mutex contention and iohandler starvation.
>
> Umesh Deshpande (5):
>   vm_stop from non-io threads
>   MRU ram block list
>   migration thread mutex
>   separate migration bitmap
>   separate migration thread
>
>  arch_init.c         |  38 +
>  buffered_file.c     |  76 ++---
>  cpu-all.h           |  42 ++
>  cpus.c              |   4 +-
>  exec.c              |  97 --
>  migration.c         | 117 ---
>  migration.h         |   9 
>  qemu-common.h       |   2 +
>  qemu-thread-posix.c |  10 
>  qemu-thread.h       |   1 +
>  10 files changed, 302 insertions(+), 94 deletions(-)

I think this patchset is quite good. These are the problems I found:

1) The locking rules in patch 3 are a bit too clever, and the
cleverness will become obsolete once RCU is in place. The advantage of
the clever stuff is that rwlock code looks more like RCU code; the
disadvantage is that it is harder to read and easier to bikeshed about.

2) It breaks the Windows build, but that's easy to fix.

3) There are a _lot_ of cleanups possible on top of patch 5 (I would
not be too surprised if the final version has an almost-neutral
diffstat), and whether to prefer good or perfect is another very
popular topic.

4) I'm not sure block migration has been tested in all scenarios (I'm
curious about coroutine-heavy ones). This may mean that the migration
thread is blocked on Marcelo's live streaming work, which would provide
the ingredients to remove block migration altogether. A round of
Autotest testing is the minimum required to include this, and I'm not
sure if this was done either.

That said, I find the code to be quite good overall, and I wouldn't
oppose inclusion with only (2) fixed---may even take care of it
myself---and more testing results apart from the impressive performance
numbers.

About performance, I'm curious how you measured it. Was the buffer
cache empty? That is, how many compressible pages were found? I toyed
with vectorizing is_dup_page, but I need to get some numbers before
posting.

Paolo
qemu-kvm 0.15.0 boot order not working
Hi,

when I specify something like

  qemu -boot order=dc -cdrom image.iso -drive file=img.raw,if=virtio,boot=yes

or

  qemu -boot order=n -cdrom image.iso -drive file=img.raw,if=virtio,boot=yes

with qemu-kvm 0.15.0, it will always boot directly from the hard drive
and not from cdrom or network. Is this on purpose? The behaviour was
different in earlier versions. If I omit the boot=yes in the drive
specification, the -boot order parameter works as expected.

Peter
Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
On 2011-08-29 08:09, Tian, Kevin wrote:
> Hi, Avi, Here comes the patch:
>
> KVM: APIC: avoid instruction emulation for EOI writes
> [...]
> static int handle_apic_access(struct kvm_vcpu *vcpu)
> {
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int access_type, offset;
> +
> +	access_type = (exit_qualification >> 12) & 0xf;
> +	offset = exit_qualification & 0xfff;
> +	/*
> +	 * A sane guest uses MOV instead of string operations to
> +	 * write EOI, and the written value is not cared about. So
> +	 * short-circuit here by avoiding heavy instruction
> +	 * emulation.
> +	 */

Is there no cheap way to validate this assumption and fall back to the
slow path in case it doesn't apply? E.g. reading the first instruction
byte and matching it against a whitelist? Even if the ignored scenarios
are highly unlikely, I think we so far tried hard to provide both fast
and accurate results to the guest in all cases.

> +	if ((access_type == 1) && (offset == APIC_EOI)) {
> +		kvm_lapic_set_eoi(vcpu);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
> 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
> }
>
> Thanks
> Kevin

Nice optimization otherwise!

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: HPET configuration in Seabios
On 2011-08-29 07:32, Avi Kivity wrote:
> On 08/29/2011 01:14 AM, Kevin O'Connor wrote:
> > On Sun, Aug 28, 2011 at 10:42:49PM +0200, Jan Kiszka wrote:
> > > On 2011-08-28 20:54, Alexander Graf wrote:
> > > > On 28.08.2011, at 02:42, Avi Kivity wrote:
> > > > > On 08/26/2011 08:32 AM, ya su wrote:
> > > > > > hi, Avi:
> > > > > > I met the same problem: tons of hpet vm_exits (vector 209,
> > > > > > fault address in the guest VM's hpet mmio range), even after
> > > > > > I disabled the hpet device in the win7 guest VM; it still
> > > > > > produced a large amount of vm_exits under trace-cmd. I added
> > > > > > -no-hpet to start the VM, but it still has an HPET device
> > > > > > inside the VM. Does that mean the HPET device in the VM does
> > > > > > not depend on the emulated hpet device in qemu-kvm? Is there
> > > > > > any way to disable the VM HPET device to prevent so many
> > > > > > vm_exits? Thanks.
> > > > >
> > > > > Looks like a bug to me.
> > > >
> > > > IIRC disabling the HPET device doesn't remove the entry from the
> > > > DSDT, no? So the guest OS might still think it's there while
> > > > nothing responds (read returns -1).
> > >
> > > Exactly. We have had a fw_cfg interface in place for quite a while
> > > now (though I wonder how the firmware is supposed to tell -no-hpet
> > > apart from QEMU versions that don't provide this data - both
> > > return count = 255), but SeaBIOS still exposes one HPET block at a
> > > hard-coded address unconditionally.
> > >
> > > There was quite some discussion about the corresponding SeaBIOS
> > > patches back then but apparently no consensus was found.
> > > Re-reading it, I think Kevin asked for passing the necessary DSDT
> > > fragments from QEMU to the firmware instead of using a new,
> > > proprietary fw_cfg format. Is that still the key requirement for
> > > any patch finally fixing this bug?
> >
> > My preference would be to use the existing ACPI table passing
> > interface (fw_cfg slot 0x8000) to pass different ACPI tables to
> > SeaBIOS. SeaBIOS doesn't currently allow that interface to override
> > tables SeaBIOS builds itself, but it's a simple change to rectify
> > that. When this was last proposed, it was raised that the header
> > information in the ACPI table may then not match the tables that
> > SeaBIOS builds.
> >
> > I think I proposed at that time that SeaBIOS could use the header of
> > the first fw_cfg table (or some other fw_cfg interface) to populate
> > the headers of its table headers. However, there was no consensus.
> >
> > Note - the above is in regard to the HPET table. If the HPET entry
> > in the DSDT needs to be removed then that's a bigger change.
>
> Can't seabios just poke at the hpet itself and see if it exists or
> not?

Would be hard for the BIOS to guess the locations of the blocks unless
we define the addresses used by QEMU as something like
base + hpet_no * block_size in all cases.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
Re: [AUTOTEST][PATCH][KVM] Add test for problem with killing guest when network is under load.
Thanks, this patch works well.

Acked-by: Lukas Doktor ldok...@redhat.com

Dne 26.8.2011 10:31, Jiří Župka napsal(a):
> This patch contains two tests:
>  1) Try to kill the guest when the guest network is under load.
>  2) Try to kill the guest after multiple additions and removals of
>     network drivers.
>
> Signed-off-by: Jiří Župka jzu...@redhat.com
> ---
>  client/tests/kvm/tests_base.cfg.sample    |  18 
>  client/virt/tests/netstress_kill_guest.py | 147 +
>  2 files changed, 165 insertions(+), 0 deletions(-)
>  create mode 100644 client/virt/tests/netstress_kill_guest.py
>
> diff --git a/client/tests/kvm/tests_base.cfg.sample b/client/tests/kvm/tests_base.cfg.sample
> index ec1b48d..a7ff29f 100644
> --- a/client/tests/kvm/tests_base.cfg.sample
> +++ b/client/tests/kvm/tests_base.cfg.sample
> @@ -845,6 +845,24 @@ variants:
>              restart_vm = yes
>              kill_vm_on_error = yes
>
> +    - netstress_kill_guest: install setup unattended_install.cdrom
> +        only Linux
> +        type = netstress_kill_guest
> +        image_snapshot = yes
> +        nic_mode = tap
> +        # There should be enough vms to build the topology.
> +        variants:
> +            - driver:
> +                mode = driver
> +            - load:
> +                mode = load
> +                netperf_files = netperf-2.4.5.tar.bz2 wait_before_data.patch
> +                packet_size = 1500
> +                setup_cmd = cd %s && tar xvfj netperf-2.4.5.tar.bz2 && cd netperf-2.4.5 && patch -p0 < ../wait_before_data.patch && ./configure && make
> +                clean_cmd = while killall -9 netserver; do True test; done;
> +                netserver_cmd = %s/netperf-2.4.5/src/netserver
> +                netperf_cmd = %s/netperf-2.4.5/src/netperf -t %s -H %s -l 60 -- -m %s
> +
>      - set_link: install setup image_copy unattended_install.cdrom
>          type = set_link
>          test_timeout = 1000
>
> diff --git a/client/virt/tests/netstress_kill_guest.py b/client/virt/tests/netstress_kill_guest.py
> new file mode 100644
> index 000..7452e09
> --- /dev/null
> +++ b/client/virt/tests/netstress_kill_guest.py
> @@ -0,0 +1,147 @@
> +import logging, os, signal, re, time
> +from autotest_lib.client.common_lib import error
> +from autotest_lib.client.bin import utils
> +from autotest_lib.client.virt import aexpect, virt_utils
> +
> +
> +def run_netstress_kill_guest(test, params, env):
> +    """
> +    Try to stop a network interface in a VM while another VM tries to
> +    communicate with it.
> +
> +    @param test: kvm test object
> +    @param params: Dictionary with the test parameters
> +    @param env: Dictionary with test environment.
> +    """
> +    def get_corespond_ip(ip):
> +        """
> +        Get the local ip address which is used to contact ip.
> +
> +        @param ip: Remote ip
> +        @return: Local corresponding IP.
> +        """
> +        result = utils.run("ip route get %s" % (ip)).stdout
> +        ip = re.search("src (.+)", result)
> +        if ip is not None:
> +            ip = ip.groups()[0]
> +        return ip
> +
> +    def get_ethernet_driver(session):
> +        """
> +        Get the driver of the network cards.
> +
> +        @param session: session to machine
> +        """
> +        modules = []
> +        out = session.cmd("ls -l /sys/class/net/*/device/driver/module")
> +        for module in out.split("\n"):
> +            modules.append(module.split("/")[-1])
> +        modules.remove("")
> +        return set(modules)
> +
> +    def kill_and_check(vm):
> +        vm_pid = vm.get_pid()
> +        vm.destroy(gracefully=False)
> +        time.sleep(2)
> +        try:
> +            os.kill(vm_pid, 0)
> +            logging.error("VM is not dead.")
> +            raise error.TestFail("Problem with killing guest.")
> +        except OSError:
> +            logging.info("VM is dead.")
> +
> +    def netload_kill_problem(session_serial):
> +        netperf_dir = os.path.join(os.environ['AUTODIR'], "tests/netperf2")
> +        setup_cmd = params.get("setup_cmd")
> +        clean_cmd = params.get("clean_cmd")
> +        firewall_flush = "iptables -F"
> +
> +        try:
> +            utils.run(firewall_flush)
> +        except:
> +            logging.warning("Could not flush firewall rules on host")
> +
> +        try:
> +            session_serial.cmd(firewall_flush)
> +        except aexpect.ShellError:
> +            logging.warning("Could not flush firewall rules on guest")
> +
> +        for i in params.get("netperf_files").split():
> +            vm.copy_files_to(os.path.join(netperf_dir, i), "/tmp")
> +
> +        guest_ip = vm.get_address(0)
> +        server_ip = get_corespond_ip(guest_ip)
> +
> +        logging.info("Setup and run netperf on host and guest")
> +        session_serial.cmd(setup_cmd % "/tmp", timeout=200)
> +        utils.run(setup_cmd % netperf_dir)
> +
> +        try:
> +            session_serial.cmd(clean_cmd)
> +        except:
> +            pass
> +        session_serial.cmd(params.get("netserver_cmd") % "/tmp")
> +
> +        utils.run(clean_cmd, ignore_status=True)
> +        utils.run(params.get("netserver_cmd") % netperf_dir)
> +
> +        server_netperf_cmd = params.get("netperf_cmd") % (netperf_dir,
> +                                                          "TCP_STREAM",
> +                                                          guest_ip,
Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
On 08/29/2011 01:24 PM, Jan Kiszka wrote:
> > static int handle_apic_access(struct kvm_vcpu *vcpu)
> > {
> > +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> > +	int access_type, offset;
> > +
> > +	access_type = (exit_qualification >> 12) & 0xf;
> > +	offset = exit_qualification & 0xfff;
> > [...]
>
> Is there no cheap way to validate this assumption and fall back to the
> slow path in case it doesn't apply? E.g. reading the first instruction
> byte and matching it against a whitelist? Even if the ignored
> scenarios are highly unlikely, I think we so far tried hard to provide
> both fast and accurate results to the guest in all cases.

Just reading the first byte requires a guest page table walk. This is
probably the highest cost in emulation (which also requires a walk for
the data access).

-- 
error compiling committee.c: too many arguments to function
Re: HPET configuration in Seabios
On 08/29/2011 01:25 PM, Jan Kiszka wrote: Can't seabios just poke at the hpet itself and see if it exists or not? Would be hard for the BIOS to guess the locations of the blocks unless we define the addresses used by QEMU as something like base + hpet_no * block_size in all cases. Currently we have a fixed address. We could do:

    if available in fw_cfg:
        use that (may indicate no hpet)
    elif fixed address works:
        use that
    else:
        no hpet
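The probe order sketched above reduces to a small decision function. A hypothetical sketch of just the fallback logic (names are invented, this is not SeaBIOS code):

```c
#include <assert.h>

enum hpet_source { HPET_FROM_FW_CFG, HPET_FIXED_ADDR, HPET_NONE };

/* fw_cfg information wins when present (and may itself report "no HPET");
 * otherwise fall back to probing the fixed legacy address. */
static enum hpet_source pick_hpet(int fw_cfg_has_entry, int fixed_addr_works)
{
    if (fw_cfg_has_entry)
        return HPET_FROM_FW_CFG;
    if (fixed_addr_works)
        return HPET_FIXED_ADDR;
    return HPET_NONE;
}
```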
Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
On 2011-08-29 12:59, Avi Kivity wrote: On 08/29/2011 01:24 PM, Jan Kiszka wrote: static int handle_apic_access(struct kvm_vcpu *vcpu) { + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + int access_type, offset; + + access_type = (exit_qualification >> 12) & 0xf; + offset = exit_qualification & 0xfff; + /* + * Sane guest uses MOV instead of string operations to + * write EOI, with written value not cared. So make a + * short-circuit here by avoiding heavy instruction + * emulation. + */ Is there no cheap way to validate this assumption and fall back to the slow path in case it doesn't apply? E.g. reading the first instruction byte and matching it against a whitelist? Even if the ignored scenarios are highly unlikely, I think we so far tried hard to provide both fast and accurate results to the guest in all cases. Just reading the first byte requires a guest page table walk. This is probably the highest cost in emulation (which also requires a walk for the data access). And what about caching the result of the first walk? Usually, a sane guest won't have many code pages that issue the EOI. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: HPET configuration in Seabios
On 2011-08-29 13:00, Avi Kivity wrote: On 08/29/2011 01:25 PM, Jan Kiszka wrote: Can't seabios just poke at the hpet itself and see if it exists or not? Would be hard for the BIOS to guess the locations of the blocks unless we define the addresses used by QEMU as something like base + hpet_no * block_size in all cases. Currently we have a fixed address. We could do:

    if available in fw_cfg:
        use that (may indicate no hpet)
    elif fixed address works:
        use that
    else:
        no hpet

Currently, we also only have a single HPET block, but that's just because of some QEMU limitations that will vanish sooner or later. Then nothing will prevent multiple -device hpet,base=XXX. Jan
Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
On 08/29/2011 02:03 PM, Jan Kiszka wrote: Just reading the first byte requires a guest page table walk. This is probably the highest cost in emulation (which also requires a walk for the data access). And what about caching the result of the first walk? Usually, a sane guest won't have many code pages that issue the EOI. There's no way to know when to invalidate the cache. We could go a bit further, and cache the whole thing. On the first exit, do the entire emulation, and remember %rip. On the second exit, if %rip matches, skip directly to kvm_lapic_eoi(). But I don't think it's worth it. This also has failure modes, and really, no guest will ever write to EOI with stosl.
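The "remember %rip" scheme Avi describes (and dismisses) would look roughly like the following hypothetical sketch. It is not proposed code, and it exhibits exactly the failure mode discussed: nothing invalidates the cached %rip if the guest rewrites that code page.

```c
#include <assert.h>

struct eoi_rip_cache {
    unsigned long rip;
    int valid;
};

/* Return 1 when the cached %rip matches and full emulation can be skipped
 * in favor of going straight to the EOI handler; otherwise remember this
 * %rip (after emulating the instruction once) and return 0. */
static int eoi_fast_path(struct eoi_rip_cache *c, unsigned long rip)
{
    if (c->valid && c->rip == rip)
        return 1;
    c->rip = rip;
    c->valid = 1;
    return 0;
}
```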
Re: HPET configuration in Seabios
On 08/29/2011 02:05 PM, Jan Kiszka wrote: On 2011-08-29 13:00, Avi Kivity wrote: On 08/29/2011 01:25 PM, Jan Kiszka wrote: Can't seabios just poke at the hpet itself and see if it exists or not? Would be hard for the BIOS to guess the locations of the blocks unless we define the addresses used by QEMU as something like base + hpet_no * block_size in all cases. Currently we have a fixed address. We could do: if available in fw_cfg: use that (may indicate no hpet) elif fixed address works: use that else no hpet Currently, we also only have a single HPET block, but that's just because of some QEMU limitations that will vanish sooner or later. Then nothing will prevent multiple -device hpet,base=XXX. Yes, so we should enable the fw_cfg interface before that happens.
Re: HPET configuration in Seabios
On 2011-08-29 13:05, Jan Kiszka wrote: On 2011-08-29 13:00, Avi Kivity wrote: On 08/29/2011 01:25 PM, Jan Kiszka wrote: Can't seabios just poke at the hpet itself and see if it exists or not? Would be hard for the BIOS to guess the locations of the blocks unless we define the addresses used by QEMU as something like base + hpet_no * block_size in all cases. Currently we have a fixed address. We could do: if available in fw_cfg: use that (may indicate no hpet) elif fixed address works: use that else no hpet Currently, we also only have a single HPET block, but that's just because of some QEMU limitations that will vanish sooner or later. Then nothing will prevent multiple -device hpet,base=XXX. That said, some HPET probing (without any fw_cfg) may be a short-term workaround to fix Seabios until we define the solution for communicating HPET block configurations. Jan
Re: [PATCH 3/3] KVM: x86 emulator: fuzz tester
On 08/26/2011 01:17 AM, Lucas Meneghel Rodrigues wrote: I still haven't gone through all the code, but it's a good idea to put a MODULE_LICENSE("GPL") macro around here, so the build system doesn't complain about it: WARNING: modpost: missing MODULE_LICENSE() in arch/x86/kvm/test-emulator.o see include/linux/module.h for more information Thanks; fixed.
Re: [PATCH 0/3] Emulator fuzz tester
On 08/26/2011 03:11 AM, Lucas Meneghel Rodrigues wrote: Lucas, how would we go about integrating this into kvm-autotest? I have applied the 3 patches on your latest tree, compiled the kernel but I'm having trouble in running the test the way you described. One thing I've noticed here: I can only compile the test as a kernel module, not in the kernel image (menuconfig only gives me N/m/?). So I believe there's no way to test it the way you have described... In any case I did try what you have suggested, then the kernel panics due to the lack of a filesystem/init. After some reading, I learned to create a bogus fs with a bogus init in it, but still, the test does not run (I guess it's because the test module is not compiled into the bzImage). I assume there are some details you forgot to mention to get this done... Would you mind posting a more detailed procedure? The module depends on KVM, so if that is a module, then you can only build the test as a module. If you set CONFIG_KVM=y then you'll be able to build the fuzzer in. The simplest kernel you can build is probably:

    make defconfig
    set CONFIG_KVM=y
    set CONFIG_KVM_EMULATOR_TEST=y

Note there is no need to build kvm-intel or kvm-amd for that.
[PATCH] kvm tools: Add ivshmem device
The patch adds an ivshmem device which can be used to share memory between guests on the same host. This implementation is lacking inter-guest communication which should be implemented later once information regarding the client-server protocol is gathered, though infrastructure used to add and remove clients already exists in the patch (but isn't used anywhere). Patch is based on David Evensky's shmem device.

Signed-off-by: David Evensky even...@sandia.gov
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 tools/kvm/Makefile                     |    1 +
 tools/kvm/builtin-run.c                |  134
 tools/kvm/hw/pci-shmem.c               |  266
 tools/kvm/include/kvm/pci-shmem.h      |   28
 tools/kvm/include/kvm/pci.h            |    3 +-
 tools/kvm/include/kvm/virtio-pci-dev.h |    3 +
 tools/kvm/pci.c                        |    5 +-
 tools/kvm/virtio/pci.c                 |    4 +-
 8 files changed, 438 insertions(+), 6 deletions(-)
 create mode 100644 tools/kvm/hw/pci-shmem.c
 create mode 100644 tools/kvm/include/kvm/pci-shmem.h

diff --git a/tools/kvm/Makefile b/tools/kvm/Makefile
index 25cbd7e..efa032d 100644
--- a/tools/kvm/Makefile
+++ b/tools/kvm/Makefile
@@ -81,6 +81,7 @@ OBJS += virtio/9p.o
 OBJS += virtio/9p-pdu.o
 OBJS += hw/vesa.o
 OBJS += hw/i8042.o
+OBJS += hw/pci-shmem.o

 FLAGS_BFD := $(CFLAGS) -lbfd
 has_bfd := $(call try-cc,$(SOURCE_BFD),$(FLAGS_BFD))
diff --git a/tools/kvm/builtin-run.c b/tools/kvm/builtin-run.c
index 38612b6..b9efde2 100644
--- a/tools/kvm/builtin-run.c
+++ b/tools/kvm/builtin-run.c
@@ -28,6 +28,7 @@
 #include "kvm/sdl.h"
 #include "kvm/vnc.h"
 #include "kvm/guest_compat.h"
+#include "kvm/pci-shmem.h"

 #include <linux/types.h>
@@ -52,6 +53,8 @@
 #define DEFAULT_SCRIPT "none"

 #define MB_SHIFT (20)
+#define KB_SHIFT (10)
+#define GB_SHIFT (30)
 #define MIN_RAM_SIZE_MB (64ULL)
 #define MIN_RAM_SIZE_BYTE (MIN_RAM_SIZE_MB << MB_SHIFT)
@@ -151,6 +154,131 @@ static int virtio_9p_rootdir_parser(const struct option *opt, const char *arg, i
 	return 0;
 }

+static int shmem_parser(const struct option *opt, const char *arg, int unset)
+{
+	const uint64_t default_size = SHMEM_DEFAULT_SIZE;
+	const uint64_t default_phys_addr = SHMEM_DEFAULT_ADDR;
+	const char *default_handle = SHMEM_DEFAULT_HANDLE;
+	struct shmem_info *si = malloc(sizeof(struct shmem_info));
+	enum { PCI, UNK } addr_type = PCI;
+	uint64_t phys_addr;
+	uint64_t size;
+	char *handle = NULL;
+	int create = 0;
+	const char *p = arg;
+	char *next;
+	int base = 10;
+	int verbose = 0;
+
+	const int skip_pci = strlen("pci:");
+	if (verbose)
+		pr_info("shmem_parser(%p,%s,%d)", opt, arg, unset);
+	/* parse out optional addr family */
+	if (strcasestr(p, "pci:")) {
+		p += skip_pci;
+		addr_type = PCI;
+	} else if (strcasestr(p, "mem:")) {
+		die("I can't add to E820 map yet.\n");
+	}
+	/* parse out physical addr */
+	base = 10;
+	if (strcasestr(p, "0x"))
+		base = 16;
+	phys_addr = strtoll(p, &next, base);
+	if (next == p && phys_addr == 0) {
+		pr_info("shmem: no physical addr specified, using default.");
+		phys_addr = default_phys_addr;
+	}
+	if (*next != ':' && *next != '\0')
+		die("shmem: unexpected chars after phys addr.\n");
+	if (*next == '\0')
+		p = next;
+	else
+		p = next + 1;
+	/* parse out size */
+	base = 10;
+	if (strcasestr(p, "0x"))
+		base = 16;
+	size = strtoll(p, &next, base);
+	if (next == p && size == 0) {
+		pr_info("shmem: no size specified, using default.");
+		size = default_size;
+	}
+	/* look for [KMGkmg][Bb]*; uses base 2. */
+	int skip_B = 0;
+	if (strspn(next, "KMGkmg")) {	/* might have a prefix */
+		if (*(next + 1) == 'B' || *(next + 1) == 'b')
+			skip_B = 1;
+		switch (*next) {
+		case 'K':
+		case 'k':
+			size = size << KB_SHIFT;
+			break;
+		case 'M':
+		case 'm':
+			size = size << MB_SHIFT;
+			break;
+		case 'G':
+		case 'g':
+			size = size << GB_SHIFT;
+			break;
+		default:
+			die("shmem: bug in detecting size prefix.");
+			break;
+		}
+		next += 1 + skip_B;
+	}
+	if (*next != ':' && *next != '\0') {
+		die("shmem: unexpected chars after phys size. %c%c\n",
+		    *next, *p);
+	}
+	if (*next == '\0')
+		p
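The suffix handling inside shmem_parser reduces to one shift per K/M/G prefix (base 2). Isolated for clarity — the shift constants match the patch, but the helper name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define KB_SHIFT (10)
#define MB_SHIFT (20)
#define GB_SHIFT (30)

/* Scale a parsed size by its K/M/G suffix, as the parser above does;
 * any other character leaves the size unchanged. */
static uint64_t apply_size_suffix(uint64_t size, char suffix)
{
    switch (suffix) {
    case 'K': case 'k': return size << KB_SHIFT;
    case 'M': case 'm': return size << MB_SHIFT;
    case 'G': case 'g': return size << GB_SHIFT;
    default:            return size;
    }
}
```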
[PATCH] KVM: Document KVM_IRQFD
Avi Kivity a...@redhat.com Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Sasha Levin levinsasha...@gmail.com --- Documentation/virtual/kvm/api.txt | 27 +++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 2d510b6..d1150b6 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1450,6 +1450,33 @@ is supported; 2 if the processor requires all virtual machines to have an RMA, or 1 if the processor can use an RMA but doesn't require it, because it supports the Virtual RMA (VRMA) facility. +4.64 KVM_IRQFD + +Capability: KVM_CAP_IRQFD +Architectures: all +Type: vm ioctl +Parameters: struct kvm_irqfd (in) +Returns: 0 on success, !0 on error + +This ioctl attaches or detaches an eventfd to a GSI within the guest. +While the eventfd is assigned to the guest, any write to the eventfd +would trigger the GSI within the guest. + +struct kvm_irqfd { + __u32 fd; + __u32 gsi; + __u32 flags; + __u8 pad[20]; +}; + +The following flags are defined: + +#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0) + +If deassign flag is set, the eventfd will be deassigned from the GSI and +further writes to the eventfd won't trigger the GSI. + + 5. The kvm_run structure Application code obtains a pointer to the kvm_run structure by -- 1.7.6.1
[PATCH resend] KVM: Document KVM_IRQFD
Cc: Avi Kivity a...@redhat.com Cc: Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Sasha Levin levinsasha...@gmail.com --- Documentation/virtual/kvm/api.txt | 27 +++ 1 files changed, 27 insertions(+), 0 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 2d510b6..d1150b6 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -1450,6 +1450,33 @@ is supported; 2 if the processor requires all virtual machines to have an RMA, or 1 if the processor can use an RMA but doesn't require it, because it supports the Virtual RMA (VRMA) facility. +4.64 KVM_IRQFD + +Capability: KVM_CAP_IRQFD +Architectures: all +Type: vm ioctl +Parameters: struct kvm_irqfd (in) +Returns: 0 on success, !0 on error + +This ioctl attaches or detaches an eventfd to a GSI within the guest. +While the eventfd is assigned to the guest, any write to the eventfd +would trigger the GSI within the guest. + +struct kvm_irqfd { + __u32 fd; + __u32 gsi; + __u32 flags; + __u8 pad[20]; +}; + +The following flags are defined: + +#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0) + +If deassign flag is set, the eventfd will be deassigned from the GSI and +further writes to the eventfd won't trigger the GSI. + + 5. The kvm_run structure Application code obtains a pointer to the kvm_run structure by -- 1.7.6.1
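A userspace caller would fill struct kvm_irqfd as documented and hand it to the VM ioctl. A hypothetical sketch — the struct layout and flag are copied from the documentation above, the helper is invented, and the actual ioctl(vm_fd, KVM_IRQFD, &irqfd) call on /dev/kvm is left out:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Struct layout as documented in the KVM API text above. */
struct kvm_irqfd {
    uint32_t fd;
    uint32_t gsi;
    uint32_t flags;
    uint8_t  pad[20];
};

#define KVM_IRQFD_FLAG_DEASSIGN (1 << 0)

/* Prepare the argument for attaching (or, with deassign set, detaching)
 * eventfd `efd` to guest GSI `gsi`; pad must be zeroed. */
static struct kvm_irqfd make_irqfd(int efd, uint32_t gsi, int deassign)
{
    struct kvm_irqfd irqfd;

    memset(&irqfd, 0, sizeof(irqfd));
    irqfd.fd = efd;
    irqfd.gsi = gsi;
    irqfd.flags = deassign ? KVM_IRQFD_FLAG_DEASSIGN : 0;
    return irqfd;
}
```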
Re: [PATCH resend] KVM: Document KVM_IRQFD
On 08/29/2011 03:34 PM, Sasha Levin wrote: Cc: Avi Kivity a...@redhat.com Cc: Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Sasha Levin levinsasha...@gmail.com Thanks, applied. +Returns: 0 on success, !0 on error + -1 (I fixed this).
Re: [PATCH 3/3] KVM: x86 emulator: fuzz tester
On 08/25/2011 10:04 PM, Avi Kivity wrote: Also fuzzing from an actual guest is useful to test the real backend functions. What problem did you encounter? The new testsuite scheme seems a good fit for that (with the exception of being locked to 32-bit mode). Mostly that I forgot it exists. Other issues are that it's harder to force random values through it - though I could allocate a couple GB and fill it with random values. We also lose the ability to test inputs to callbacks (not that I do much of that here). Further issues would be:

- much slower - heavyweight exit on every insn, KVM_SET_SREGS, etc.
- need to set up GDT/LDT, I guess we can do this once and fill it with random entries
- much more care in setting up registers so we can get a context that runs
- need to figure out where %rip EA is so we can put insn there, hope it doesn't conflict with other code

So it will probably work, but the result will be of lower quality.
Re: [PATCH resend] KVM: Document KVM_IRQFD
On Mon, 2011-08-29 at 15:37 +0300, Avi Kivity wrote: On 08/29/2011 03:34 PM, Sasha Levin wrote: Cc: Avi Kivity a...@redhat.com Cc: Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Sasha Levin levinsasha...@gmail.com Thanks, applied. +Returns: 0 on success, !0 on error + -1 (I fixed this). Why -1? It could be anything != 0, no? -- Sasha.
Re: [PATCH resend] KVM: Document KVM_IRQFD
On 08/29/2011 03:43 PM, Sasha Levin wrote: On Mon, 2011-08-29 at 15:37 +0300, Avi Kivity wrote: On 08/29/2011 03:34 PM, Sasha Levin wrote: Cc: Avi Kivity a...@redhat.com Cc: Marcelo Tosatti mtosa...@redhat.com Signed-off-by: Sasha Levin levinsasha...@gmail.com Thanks, applied. +Returns: 0 on success, !0 on error + -1 (I fixed this). Why -1? It could be anything != 0, no? No. ioctl()s (and most system calls) return -1 on error and set errno.
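The convention Avi is pointing at, in miniature — fake_ioctl is a stand-in for illustration, not a real system call:

```c
#include <assert.h>
#include <errno.h>

/* System calls and ioctls report failure as -1 with the specific error
 * code placed in errno, rather than encoding the error in the return
 * value itself. */
static int fake_ioctl(int fd)
{
    if (fd < 0) {
        errno = EBADF;
        return -1;
    }
    return 0;
}
```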
Re: [PATCH 3/5] Migration thread mutex
On 08/29/2011 05:04 AM, Stefan Hajnoczi wrote: On Sat, Aug 27, 2011 at 7:09 PM, Umesh Deshpande udesh...@redhat.com wrote: This patch implements migrate_ram mutex, which protects the RAMBlock list traversal in the migration thread during the transfer of a ram from their addition/removal from the iothread. Note: Combination of iothread mutex and migration thread mutex works as a rw-lock. Both mutexes are acquired while modifying the ram_list members or RAM block list. Signed-off-by: Umesh Deshpande udesh...@redhat.com --- arch_init.c | 21 + cpu-all.h | 3 +++ exec.c | 23 +++ qemu-common.h | 2 ++ 4 files changed, 49 insertions(+), 0 deletions(-) diff --git a/arch_init.c b/arch_init.c index 484b39d..9d02270 100644 --- a/arch_init.c +++ b/arch_init.c @@ -109,6 +109,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch) static RAMBlock *last_block; static ram_addr_t last_offset; +static uint64_t last_version; [...] typedef struct RAMList { +QemuMutex mutex; /* Protects RAM block list */ uint8_t *phys_dirty; +uint32_t version; /* To detect ram block addition/removal */ Is there a reason why RAMList.version is uint32_t but last_version is uint64_t? No, my bad, they both should be consistent.
Re: [PATCH 5/5] Separate migration thread
On 08/29/2011 05:09 AM, Stefan Hajnoczi wrote: On Sat, Aug 27, 2011 at 7:09 PM, Umesh Deshpande udesh...@redhat.com wrote: This patch creates a separate thread for the guest migration on the source side. All exits (on completion/error) from the migration thread are handled by a bottom handler, which is called from the iothread. Signed-off-by: Umesh Deshpande udesh...@redhat.com --- buffered_file.c | 76 migration.c | 105 ++ migration.h | 8 qemu-thread-posix.c | 10 + qemu-thread.h | 1 + Will this patch break Windows builds by adding a function to qemu-thread-posix.c which is not implemented in qemu-thread-win32.c? Yes, equivalent function needs to be added in qemu-thread-win32.c
Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
On 2011-08-29 13:11, Avi Kivity wrote: On 08/29/2011 02:03 PM, Jan Kiszka wrote: Just reading the first byte requires a guest page table walk. This is probably the highest cost in emulation (which also requires a walk for the data access). And what about caching the result of the first walk? Usually, a sane guest won't have many code pages that issue the EOI. There's no way to know when to invalidate the cache. Set the affected code page read-only? We could go a bit further, and cache the whole thing. On the first exit, do the entire emulation, and remember %rip. On the second exit, if %rip matches, skip directly to kvm_lapic_eoi(). But I don't think it's worth it. This also has failure modes, and really, no guest will ever write to EOI with stosl. ...or add/sub/and/or etc. Well, we've done other crazy things in the past just to keep even the unlikely case correct. I was just wondering if that policy changed. However, I just realized that user space is able to avoid this inaccuracy for potentially insane guests by not using in-kernel irqchips. So we have at least a knob. Jan
Re: [PATCH] KVM: APIC: avoid instruction emulation for EOI writes
On 08/29/2011 04:55 PM, Jan Kiszka wrote: On 2011-08-29 13:11, Avi Kivity wrote: On 08/29/2011 02:03 PM, Jan Kiszka wrote: Just reading the first byte requires a guest page table walk. This is probably the highest cost in emulation (which also requires a walk for the data access). And what about caching the result of the first walk? Usually, a sane guest won't have many code pages that issue the EOI. There's no way to know when to invalidate the cache. Set the affected code page read-only? The virt-phys mapping could change too. And please, don't think of new reasons to write protect pages, they break up my lovely 2M maps. We could go a bit further, and cache the whole thing. On the first exit, do the entire emulation, and remember %rip. On the second exit, if %rip matches, skip directly to kvm_lapic_eoi(). But I don't think it's worth it. This also has failure modes, and really, no guest will ever write to EOI with stosl. ...or add/sub/and/or etc. Argh, yes, flags can be updated. Actually, this might work - if we get a read access first as part of the RMW, we'll emulate the instruction. No idea what the hardware does in this case. Well, we've done other crazy things in the past just to keep even the unlikely case correct. I was just wondering if that policy changed. I can't answer yes to that question. But I see no way to make it work both fast and correct. However, I just realized that user space is able to avoid this inaccuracy for potentially insane guests by not using in-kernel irqchips. So we have at least a knob. Could/should have a flag to disable this in the kernel as well.
Re: Broken pci_block_user_cfg_access interface
So how about something like the following? Compile tested only, and I'm not sure I got the ipr and iov error handling right. Comments?

Subject: [PATCH] pci: fail block usercfg access on reset

Anyone who wants to block usercfg access will also want to block reset from userspace. On the other hand, reset from userspace should block any other access from userspace. Finally, make it possible to detect reset in progress by returning an error status from pci_block_user_cfg_access.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 drivers/pci/access.c          |   45
 drivers/pci/iov.c             |   19
 drivers/pci/pci.c             |    4 +-
 drivers/scsi/ipr.c            |   22 ++-
 drivers/uio/uio_pci_generic.c |    7 +-
 include/linux/pci.h           |    6 -
 6 files changed, 87 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index fdaa42a..2492365 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -139,7 +139,7 @@ static noinline void pci_wait_ucfg(struct pci_dev *dev)
 		raw_spin_unlock_irq(&pci_lock);
 		schedule();
 		raw_spin_lock_irq(&pci_lock);
-	} while (dev->block_ucfg_access);
+	} while (dev->block_ucfg_access || dev->reset_in_progress);
 	__remove_wait_queue(&pci_ucfg_wait, &wait);
 }

@@ -153,7 +153,8 @@ int pci_user_read_config_##size \
 	if (PCI_##size##_BAD) \
 		return -EINVAL; \
 	raw_spin_lock_irq(&pci_lock); \
-	if (unlikely(dev->block_ucfg_access)) pci_wait_ucfg(dev); \
+	if (unlikely(dev->block_ucfg_access || dev->reset_in_progress)) \
+		pci_wait_ucfg(dev); \
 	ret = dev->bus->ops->read(dev->bus, dev->devfn, \
 					pos, sizeof(type), &data); \
 	raw_spin_unlock_irq(&pci_lock); \
@@ -171,8 +172,9 @@ int pci_user_write_config_##size \
 	int ret = -EIO; \
 	if (PCI_##size##_BAD) \
 		return -EINVAL; \
-	raw_spin_lock_irq(&pci_lock); \
-	if (unlikely(dev->block_ucfg_access)) pci_wait_ucfg(dev); \
+	raw_spin_lock_irq(&pci_lock); \
+	if (unlikely(dev->block_ucfg_access || dev->reset_in_progress)) \
+		pci_wait_ucfg(dev); \
 	ret = dev->bus->ops->write(dev->bus, dev->devfn, \
 					pos, sizeof(type), val); \
 	raw_spin_unlock_irq(&pci_lock); \
@@ -408,19 +410,24 @@ EXPORT_SYMBOL(pci_vpd_truncate);
  * sleep until access is unblocked again. We don't allow nesting of
  * block/unblock calls.
  */
-void pci_block_user_cfg_access(struct pci_dev *dev)
+int pci_block_user_cfg_access(struct pci_dev *dev)
 {
 	unsigned long flags;
 	int was_blocked;
+	int in_progress;

 	raw_spin_lock_irqsave(&pci_lock, flags);
 	was_blocked = dev->block_ucfg_access;
 	dev->block_ucfg_access = 1;
+	in_progress = dev->reset_in_progress;
 	raw_spin_unlock_irqrestore(&pci_lock, flags);

+	if (in_progress)
+		return -EIO;
 	/* If we BUG() inside the pci_lock, we're guaranteed to hose
 	 * the machine */
 	BUG_ON(was_blocked);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(pci_block_user_cfg_access);

@@ -445,3 +452,31 @@ void pci_unblock_user_cfg_access(struct pci_dev *dev)
 	raw_spin_unlock_irqrestore(&pci_lock, flags);
 }
 EXPORT_SYMBOL_GPL(pci_unblock_user_cfg_access);
+
+void pci_reset_start(struct pci_dev *dev)
+{
+	int was_started;
+
+	raw_spin_lock_irq(&pci_lock);
+	if (unlikely(dev->block_ucfg_access || dev->reset_in_progress))
+		pci_wait_ucfg(dev);
+	was_started = dev->reset_in_progress;
+	dev->reset_in_progress = 1;
+	raw_spin_unlock_irq(&pci_lock);
+
+	/* If we BUG() inside the pci_lock, we're guaranteed to hose
+	 * the machine */
+	BUG_ON(was_started);
+}
+
+void pci_reset_end(struct pci_dev *dev)
+{
+	raw_spin_lock_irq(&pci_lock);
+
+	/* This indicates a problem in the caller, but we don't need
+	 * to kill them, unlike a double-reset above. */
+	WARN_ON(!dev->reset_in_progress);
+
+	dev->reset_in_progress = 0;
+	wake_up_all(&pci_ucfg_wait);
+	raw_spin_unlock_irq(&pci_lock);
+}
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 42fae47..464d9b5 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@
Re: Questions regarding ivshmem spec
On Thu, 2011-08-25 at 16:29 +0300, Sasha Levin wrote: Hello, I am looking to implement an ivshmem device for KVM tools, the purpose is to provide same functionality as QEMU and interoperability with QEMU. [snip] 1. File handles and guest IDs are passed between the server and the peers using sockets, is the protocol itself documented anywhere? I would like to be able to work alongside QEMU servers/peers. I'm still wondering if someone could do a quick sketch of the client-server protocol or possibly point me to something that documents it? -- Sasha.
Re: Broken pci_block_user_cfg_access interface
On 2011-08-29 17:05, Michael S. Tsirkin wrote: So how about something like the following? Compile tested only, and I'm not sure I got the ipr and iov error handling right. Comments? This still doesn't synchronize two callers of pci_block_user_cfg_access unless one was reset. We may not have such a scenario yet, but that's how the old code started as well... And it makes the interface more convoluted (from 1 meter, why should pci_block_user_cfg_access return an error if reset is running?). Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux
Re: Broken pci_block_user_cfg_access interface
On Mon, Aug 29, 2011 at 05:42:16PM +0200, Jan Kiszka wrote: On 2011-08-29 17:05, Michael S. Tsirkin wrote:

So how about something like the following? Compile tested only, and I'm not sure I got the ipr and iov error handling right. Comments?

This still doesn't synchronize two callers of pci_block_user_cfg_access unless one was reset. We may not have such a scenario yet, but that's how the old code started as well... And it makes the interface more convoluted (from 1 meter, why should pci_block_user_cfg_access return an error if reset is running?).

Well I made the error EIO but it really is something like EAGAIN which means 'I would block if I could, but I was asked not to'.

[snip]
Re: Broken pci_block_user_cfg_access interface
On 2011-08-29 17:58, Michael S. Tsirkin wrote: On Mon, Aug 29, 2011 at 05:42:16PM +0200, Jan Kiszka wrote:

I still don't get what prevents converting ipr to allow plain mutex synchronization. My vision is:
- push reset-on-error of ipr into workqueue (or threaded IRQ?)
- require mutex synchronization for common config space access

Meaning pci_user_ read/write config?

And pci_dev_reset, yes.

and the full reset cycle
- only exception: INTx status/masking access => use pci_lock + test for reset_in_progress, skip operation if that is the case

That would allow to drop the whole block_user_cfg infrastructure.

Jan

We still need to block userspace access while INTx does the status/masking access, right?

Yes, pci_lock would do that for us. We should consider making the related bits for INTx test mask/unmask generic PCI services so that no user (uio_pci_generic, kvm, vfio) needs to worry about the locking details.

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Broken pci_block_user_cfg_access interface
On Mon, Aug 29, 2011 at 06:14:39PM +0200, Jan Kiszka wrote: On 2011-08-29 17:58, Michael S. Tsirkin wrote: On Mon, Aug 29, 2011 at 05:42:16PM +0200, Jan Kiszka wrote:

I still don't get what prevents converting ipr to allow plain mutex synchronization. My vision is:
- push reset-on-error of ipr into workqueue (or threaded IRQ?)
- require mutex synchronization for common config space access

Meaning pci_user_ read/write config?

And pci_dev_reset, yes.

and the full reset cycle
- only exception: INTx status/masking access => use pci_lock + test for reset_in_progress, skip operation if that is the case

That would allow to drop the whole block_user_cfg infrastructure.

Jan

We still need to block userspace access while INTx does the status/masking access, right?

Yes, pci_lock would do that for us.

Well this means block_user_cfg is not going away, this is what it really is: pci_lock + a bit to lock out userspace.

We should consider making the related bits for INTx test mask/unmask generic PCI services so that no user (uio_pci_generic, kvm, vfio) needs to worry about the locking details.

Jan
[Bug 39412] Win Vista and Win2K8 guests' network breaks down
https://bugzilla.kernel.org/show_bug.cgi?id=39412

--- Comment #10 from Rafael J. Wysocki r...@sisk.pl 2011-08-29 16:23:57 ---

On Monday, August 29, 2011, Ren, Yongjie wrote: I've verified this bug in Linus' tree 3.1.0-rc4. It got fixed, so I closed the bug. The commit I tried is below.

commit c6a389f123b9f68d605bb7e0f9b32ec1e3e14132
Author: Linus Torvalds torva...@linux-foundation.org
Date: Sun Aug 28 21:16:01 2011 -0700

Best Regards, Yongjie Ren (Jay)

-----Original Message-----
From: Rafael J. Wysocki [mailto:r...@sisk.pl]
Sent: Monday, August 29, 2011 3:01 AM
To: Linux Kernel Mailing List
Cc: Kernel Testers List; Maciej Rutecki; Florian Mickler; Avi Kivity; Ren, Yongjie; Marcelo Tosatti; Stefan Hajnoczi
Subject: [Bug #39412] Win Vista and Win2K8 guests' network breaks down

This message has been generated automatically as a part of a report of regressions introduced between 2.6.39 and 3.0. The following bug entry is on the current list of known regressions introduced between 2.6.39 and 3.0. Please verify if it still should be listed and let the tracking team know (either way).

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=39412
Subject : Win Vista and Win2K8 guests' network breaks down
Submitter : Jay Ren yongjie@intel.com
Date : 2011-07-15 15:31 (45 days old)

--
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
You are watching the assignee of the bug.
Re: Broken pci_block_user_cfg_access interface
On 2011-08-29 18:23, Michael S. Tsirkin wrote: On Mon, Aug 29, 2011 at 06:14:39PM +0200, Jan Kiszka wrote: On 2011-08-29 17:58, Michael S. Tsirkin wrote: On Mon, Aug 29, 2011 at 05:42:16PM +0200, Jan Kiszka wrote:

[snip]

We still need to block userspace access while INTx does the status/masking access, right?

Yes, pci_lock would do that for us.

Well this means block_user_cfg is not going away, this is what it really is: pci_lock + a bit to lock out userspace.

It does, as we only end up with a mutex and pci_lock. No more hand-crafted queuing/blocking/waking.

INTx masking is a bit special as it's the only thing that truly requires atomic context. But that's something we should address generically anyway.

Jan
RFC: vfio / device assignment -- layout of device fd files
Alex Graf, Scott Wood, and I met last week to try to flesh out some details as to how vfio could work for non-PCI devices, like we have in embedded systems. This most likely will require a different kernel driver than vfio -- for now we are calling it dtio (for device tree I/O), as there is no way to discover these devices except from the device tree. But the dtio driver would use the same architecture and interfaces as vfio.

For devices on a system bus and represented in a device tree we have some different requirements than PCI for what is exposed in the device fd file. A device may have multiple address regions, multiple interrupts, a variable length device tree path, whether a region is mmapable, etc.

With existing vfio, the device fd file layout is something like:

0xF  Config space offset
...
0x6  ROM offset
0x5  BAR 5 offset
0x4  BAR 4 offset
0x3  BAR 3 offset
0x2  BAR 2 offset
0x1  BAR 1 offset
0x0  BAR 0 offset

We have an alternate proposal that we think is more flexible, extensible, and will accommodate both PCI and system bus type devices (and others). Instead of config space fixed at 0xf, we would propose a header and multiple 'device info' records at offset 0x0 that would encode everything that user space needs to know about the device:

0x0 +-----------+-----------+
    |   magic   | version   |  u64 // magic u64 identifies the type of
    |    vfio   |           |      //   passthru I/O, plus version #
    |    dtio   |           |      // vfio - PCI devices
    +-----------+-----------+      // dtio - device tree devices
    |   flags               |  u32 // encodes any flags (TBD)
    +-----------------------+
    | dev info record N     |
    |   type                |  u32 // type of record
    |   rec_len             |  u32 // length in bytes of record
    |                       |      //   (including record header)
    |   flags               |  u32 // type specific flags
    |   ...content...       |      // record content, which could
    +-----------------------+      //   include sub-records
    | dev info record N+1   |
    +-----------------------+
    | dev info record N+2   |
    +-----------------------+
    ...
The device info records following the file header have the following record types, each with content encoded in a record specific way:

REGION - describes an addressable address range for the device
DTPATH - describes the device tree path for the device
DTINDEX - describes the index into the related device tree property (reg, ranges, interrupts, interrupt-map)
INTERRUPT - describes an interrupt for the device
PCI_CONFIG_SPACE - describes config space for the device
PCI_INFO - domain:bus:device:func
PCI_BAR_INFO - information about the BAR for a device

For a device tree type device the file may look like:

0x0 +-------------------+
    | header            |
    +-------------------+
    | type = REGION     |
    | rec_len           |
    | flags =           |  u32 // region specific flags
    |   is_mmapable     |
    | offset            |  u64 // seek offset to region from beginning
    | len               |  u64 // length of region
    | addr              |  u64 // phys addr of region
    +-------------------+
    \ type = DTPATH     \      // a sub-region
    | rec_len           |
    | flags             |
    | dev tree path     |  char[] // device tree path
    +-------------------+
    \ type = DTINDEX    \      // a sub-region
    | rec_len           |
    | flags             |
    | prop_type         |  u32 // REG, RANGES
    | prop_index        |  u32 // index into resource list
    +-------------------+
    | type = INTERRUPT  |
    | rec_len           |
    | flags             |  u32
    | ioctl_handle      |  u32 // argument to ioctl to get interrupts
    +-------------------+
    \ type = DTPATH     \      // a sub-region
    | rec_len           |
    | flags             |
    | dev tree path     |  char[] // device tree path
    +-------------------+
    \ type = DTINDEX    \      // a sub-region
    | rec_len           |
    | flags             |
    | prop_type         |  u32 // INTERRUPT, INTERRUPT_MAP
    | prop_index        |  u32 // index
    +-------------------+

PCI devices would have a PCI specific encoding. Instead of config space and the mappable BAR regions being at specific predetermined offsets, the device info records
Re: Questions regarding ivshmem spec
On Mon, Aug 29, 2011 at 9:25 AM, Sasha Levin levinsasha...@gmail.com wrote: On Thu, 2011-08-25 at 16:29 +0300, Sasha Levin wrote:

Hello, I am looking to implement an ivshmem device for KVM tools; the purpose is to provide the same functionality as QEMU and interoperability with QEMU.

[snip]

1. File handles and guest IDs are passed between the server and the peers using sockets, is the protocol itself documented anywhere? I would like to be able to work alongside QEMU servers/peers.

I'm still wondering if someone could do a quick sketch of the client-server protocol or possibly point me to something that documents it?

Hi Sasha, I have something like that. I'll be in touch when I find it.

Cam

--
Sasha.
Re: Broken pci_block_user_cfg_access interface
On 2011-08-29 17:42, Jan Kiszka wrote:

I still don't get what prevents converting ipr to allow plain mutex synchronization. My vision is:
- push reset-on-error of ipr into workqueue (or threaded IRQ?)

I'm starting to like your proposal: I had a look at ipr, but it turned out to be anything but trivial to convert that driver. It runs its complete state machine under spin_lock_irq, and the functions calling pci_block/unblock_user_cfg_access are deep inside this thing. I have no hardware to test whatever change, and I feel a bit uncomfortable asking Brian to redesign his driver that massively.

So back to your idea: I would generalize pci_block_user_cfg_access to pci_block_cfg_access. It should fail when some other site already holds the access lock, but it should remain non-blocking - for the sake of ipr.

We should still provide generic pci-2.3 IRQ masking services, but that could be done in a second step. I could have a look at this.

Jan
Re: [PATCH 1/5] Support for vm_stop from the migration thread
On Sat, Aug 27, 2011 at 02:09:44PM -0400, Umesh Deshpande wrote:

Currently, when any thread other than the iothread calls vm_stop, it is scheduled to be executed later by the iothread. This patch allows the execution of vm_stop from threads other than the iothread. This is especially helpful when the migration is moved into a separate thread.

Signed-off-by: Umesh Deshpande udesh...@redhat.com
---
 cpus.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/cpus.c b/cpus.c
index de70e02..f35f683 100644
--- a/cpus.c
+++ b/cpus.c
@@ -122,8 +122,8 @@ static void do_vm_stop(int reason)
 {
     if (vm_running) {
         cpu_disable_ticks();
-        vm_running = 0;
         pause_all_vcpus();
+        vm_running = 0;
         vm_state_notify(0, reason);
         qemu_aio_flush();
         bdrv_flush_all();

Why this change?
Re: [PATCH 3/5] Migration thread mutex
On Sat, Aug 27, 2011 at 02:09:46PM -0400, Umesh Deshpande wrote:

This patch implements the migrate_ram mutex, which protects the RAMBlock list traversal in the migration thread against the addition/removal of blocks from the iothread.

Note: The combination of the iothread mutex and the migration thread mutex works as a rw-lock. Both mutexes are acquired while modifying the ram_list members or the RAM block list.

Signed-off-by: Umesh Deshpande udesh...@redhat.com
---
 arch_init.c   | 21 +
 cpu-all.h     |  3 +++
 exec.c        | 23 +++
 qemu-common.h |  2 ++
 4 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 484b39d..9d02270 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -109,6 +109,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
 
 static RAMBlock *last_block;
 static ram_addr_t last_offset;
+static uint64_t last_version;
 
 static int ram_save_block(QEMUFile *f)
 {
@@ -170,6 +171,7 @@ static int ram_save_block(QEMUFile *f)
 
     last_block = block;
     last_offset = offset;
+    last_version = ram_list.version;
 
     return bytes_sent;
 }
@@ -270,6 +272,7 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         bytes_transferred = 0;
         last_block = NULL;
         last_offset = 0;
+        last_version = ram_list.version = 0;
         sort_ram_list();
 
         /* Make sure all dirty bits are set */
@@ -298,6 +301,17 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     bytes_transferred_last = bytes_transferred;
     bwidth = qemu_get_clock_ns(rt_clock);
 
+    if (stage != 3) {
+        qemu_mutex_lock_migrate_ram();
+        qemu_mutex_unlock_iothread();
+    }

Dropping the iothread lock from within a timer handler is not safe.

This change to ram_save_live should be moved to the patch where the migration thread is introduced.
Re: [PATCH 5/5] Separate migration thread
On Sat, Aug 27, 2011 at 02:09:48PM -0400, Umesh Deshpande wrote:

This patch creates a separate thread for the guest migration on the source side. All exits (on completion/error) from the migration thread are handled by a bottom-half handler, which is called from the iothread.

Signed-off-by: Umesh Deshpande udesh...@redhat.com
---
 buffered_file.c     |  76
 migration.c         | 105
 migration.h         |   8
 qemu-thread-posix.c |  10
 qemu-thread.h       |   1 +
 5 files changed, 124 insertions(+), 76 deletions(-)

diff --git a/buffered_file.c b/buffered_file.c
index 41b42c3..c31852e 100644
--- a/buffered_file.c
+++ b/buffered_file.c
@@ -16,6 +16,8 @@
 #include "qemu-timer.h"
 #include "qemu-char.h"
 #include "buffered_file.h"
+#include "migration.h"
+#include "qemu-thread.h"
 
 //#define DEBUG_BUFFERED_FILE
 
@@ -28,13 +30,14 @@ typedef struct QEMUFileBuffered
     void *opaque;
     QEMUFile *file;
     int has_error;
+    int closed;
     int freeze_output;
     size_t bytes_xfer;
     size_t xfer_limit;
     uint8_t *buffer;
     size_t buffer_size;
     size_t buffer_capacity;
-    QEMUTimer *timer;
+    QemuThread thread;
 } QEMUFileBuffered;
 
 #ifdef DEBUG_BUFFERED_FILE
@@ -155,14 +158,6 @@ static int buffered_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
         offset = size;
     }
 
-    if (pos == 0 && size == 0) {
-        DPRINTF("file is ready\n");
-        if (s->bytes_xfer >= s->xfer_limit) {
-            DPRINTF("notifying client\n");
-            s->put_ready(s->opaque);
-        }
-    }
-
     return offset;
 }
 
@@ -173,22 +168,25 @@ static int buffered_close(void *opaque)
 
     DPRINTF("closing\n");
 
-    while (!s->has_error && s->buffer_size) {
-        buffered_flush(s);
-        if (s->freeze_output)
-            s->wait_for_unfreeze(s);
-    }
+    s->closed = 1;
 
-    ret = s->close(s->opaque);
+    qemu_mutex_unlock_migrate_ram();
+    qemu_mutex_unlock_iothread();

This is using the ram mutex to protect migration thread specific data. A new lock should be introduced for that purpose.

-    qemu_del_timer(s->timer);
-    qemu_free_timer(s->timer);
+    qemu_thread_join(&s->thread);
+    /* Waits for the completion of the migration thread */
+
+    qemu_mutex_lock_iothread();
+    qemu_mutex_lock_migrate_ram();
+
+    ret = s->close(s->opaque);
 
     qemu_free(s->buffer);
     qemu_free(s);
 
     return ret;
 }
 
+
 static int buffered_rate_limit(void *opaque)
 {
     QEMUFileBuffered *s = opaque;
@@ -228,26 +226,37 @@ static int64_t buffered_get_rate_limit(void *opaque)
     return s->xfer_limit;
 }
 
-static void buffered_rate_tick(void *opaque)
+static void *migrate_vm(void *opaque)
 {

buffered_file.c was generic code that has now become migration specific (although migration was the only user). So it should either stop pretending to be generic code, by a rename to migration_thread.c along with un-exporting interfaces, or it should remain generic and therefore all migration specific knowledge moved somewhere else. Anthony?

+    int64_t current_time, expire_time = qemu_get_clock_ms(rt_clock) + 100;
+    struct timeval tv = { .tv_sec = 0, .tv_usec = 10};

qemu_get_clock_ms should happen under the iothread lock.

-    if (s->freeze_output)
-        return;
+    current_time = qemu_get_clock_ms(rt_clock);
+    if (!s->closed && (expire_time > current_time)) {
+        tv.tv_usec = 1000 * (expire_time - current_time);
+        select(0, NULL, NULL, NULL, &tv);
+        continue;
+    }
 
-    s->bytes_xfer = 0;
+    s->bytes_xfer = 0;
 
-    buffered_flush(s);
+    expire_time = qemu_get_clock_ms(rt_clock) + 100;
+    if (!s->closed) {
+        s->put_ready(s->opaque);
+    } else {
+        buffered_flush(s);
+    }
+    }
 
-    /* Add some checks around this */
-    s->put_ready(s->opaque);
+    return NULL;
 }
 
 QEMUFile *qemu_fopen_ops_buffered(void *opaque,
@@ -267,15 +276,14 @@ QEMUFile *qemu_fopen_ops_buffered(void *opaque,
     s->put_ready = put_ready;
     s->wait_for_unfreeze = wait_for_unfreeze;
     s->close = close;
+    s->closed = 0;
 
     s->file = qemu_fopen_ops(s, buffered_put_buffer, NULL,
                              buffered_close, buffered_rate_limit,
                              buffered_set_rate_limit,
-                             buffered_get_rate_limit);
-
-    s->timer = qemu_new_timer_ms(rt_clock, buffered_rate_tick, s);
+                             buffered_get_rate_limit);
 
-    qemu_mod_timer(s->timer, qemu_get_clock_ms(rt_clock) + 100);
+    qemu_thread_create(&s->thread, migrate_vm, s);
 
     return s->file;
 }

diff --git a/migration.c b/migration.c
index af3a1f2..5df186d 100644
--- a/migration.c
+++ b/migration.c
@@ -149,10 +149,12 @@ int do_migrate_set_speed(Monitor *mon,
Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
On 08/29/2011 11:51 AM, Yoder Stuart-B08248 wrote:

[snip]

We have an alternate proposal that we think is more flexible, extensible, and will accommodate both PCI and system bus type devices (and others). Instead of config space fixed at 0xf, we would propose a header and multiple 'device info' records at offset 0x0 that would encode everything that user space needs to know about the device:

Why not just use an ioctl with a proper struct?

The config space is weird for PCI access because it's mirroring a well known binary blob. It's not something to replicate if you're inventing something new.

Regards,

Anthony Liguori
Re: Broken pci_block_user_cfg_access interface
On Mon, Aug 29, 2011 at 08:47:07PM +0200, Jan Kiszka wrote: On 2011-08-29 17:42, Jan Kiszka wrote:

[snip]

So back to your idea: I would generalize pci_block_user_cfg_access to pci_block_cfg_access. It should fail when some other site already holds the access lock, but it should remain non-blocking - for the sake of ipr.

It would be easy to have blocking and non-blocking variants. But
- I have no idea whether supporting sysfs config/reset access while ipr is active makes any sense - I know we need it for uio.
- reset while uio handles interrupt needs to block, not fail I think

We should still provide generic pci-2.3 IRQ masking services, but that could be done in a second step. I could have a look at this.

Jan
Re: [Qemu-devel] RFC: vfio / device assignment -- layout of device fd files
On 08/29/2011 02:04 PM, Anthony Liguori wrote: On 08/29/2011 11:51 AM, Yoder Stuart-B08248 wrote:

Instead of config space fixed at 0xf, we would propose a header and multiple 'device info' records at offset 0x0 that would encode everything that user space needs to know about the device:

Why not just use an ioctl with a proper struct?

This is more extensible than a struct -- both in features, and in the number of each type of resource that you can have, length of strings you can have, etc.

The config space is weird for PCI access because it's mirroring a well known binary blob. It's not something to replicate if you're inventing something new.

There's no intent to replicate config space in general -- config space is provided as-is. There's little overlap between config space and the extra information provided. Length can be had from config space, but only by modifying it. Physical address sort-of overlaps, though bus addresses could be different from CPU physical addresses[1]. In both cases, it'd be nice to stay consistent with device-tree regions. BAR type is overlap, but doesn't seem too unreasonable to me.

-Scott

[1] The user is probably less likely to care about the physical address at all in the PCI case, but consistency is nice.
Re: RFC: vfio / device assignment -- layout of device fd files
On Mon, 2011-08-29 at 16:51 +0000, Yoder Stuart-B08248 wrote:

Alex Graf, Scott Wood, and I met last week to try to flesh out some details as to how vfio could work for non-PCI devices, like we have in embedded systems. This most likely will require a different kernel driver than vfio -- for now we are calling it dtio (for device tree I/O) as there is no way to discover these devices except from the device tree. But the dtio driver would use the same architecture and interfaces as vfio.

Why is this a different kernel driver? The difference will primarily be in what bus types vfio registers drivers and the set of device types the device fds support. The group and iommu interfaces will be shared. This sounds more like vfio .config options (CONFIG_VFIO_PCI, CONFIG_VFIO_DT).

[snip]

We have an alternate proposal that we think is more flexible, extensible, and will accommodate both PCI and system bus type devices (and others). Instead of config space fixed at 0xf, we would propose a header and multiple 'device info' records at offset 0x0 that would encode everything that user space needs to know about the device:

[snip]

Maybe magic = pci, dt, ...

The device info records following the file header have the following record types, each with content encoded in a record specific way:

REGION - describes an addressable address range for the device
DTPATH - describes the device tree path for the device
DTINDEX - describes the index into the related device tree property (reg, ranges, interrupts, interrupt-map)

I don't quite understand if these are physical or virtual.

INTERRUPT - describes an interrupt for the device
PCI_CONFIG_SPACE - describes config space for the device

I would have expected this to be a REGION with a property of PCI_CONFIG_SPACE.

PCI_INFO - domain:bus:device:func

Not entirely sure we need this. How are you imagining we get from a group fd to a device fd (wondering if you're only including this for enumeration)? I'm currently coding it as a VFIO_GROUP_GET_DEVICE_FD ioctl that takes a char* parameter that contains the dev_name() for the device requested. The list of devices under each group can be found by read()ing the group fd. If we keep this, we should make the interfaces similar, in fact, maybe this is how we describe the capabilities of the iommu too, reading a table from the iommu fd.

PCI_BAR_INFO - information about the BAR for a device

For a device tree type device the file may look like:

[snip]

| addr              |  u64 // phys addr of region

Would we only need to expose phys addr for 1:1 mapping requirements? I'm not sure why we'd care to expose this otherwise.

[snip]
Re: [PATCH resend] KVM: Document KVM_IRQFD
On 08/29/2011 07:34 AM, Sasha Levin wrote:

Cc: Avi Kivity a...@redhat.com
Cc: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Sasha Levin levinsasha...@gmail.com
---
 Documentation/virtual/kvm/api.txt | 27 +++
 1 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 2d510b6..d1150b6 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1450,6 +1450,33 @@ is supported; 2 if the processor requires all virtual machines to have an RMA,
 or 1 if the processor can use an RMA but doesn't require it, because it
 supports the Virtual RMA (VRMA) facility.
 
+4.64 KVM_IRQFD
+
+Capability: KVM_CAP_IRQFD
+Architectures: all
+Type: vm ioctl
+Parameters: struct kvm_irqfd (in)
+Returns: 0 on success, !0 on error
+
+This ioctl attaches or detaches an eventfd to a GSI within the guest.
+While the eventfd is assigned to the guest, any write to the eventfd
+would trigger the GSI within the guest.
+
+struct kvm_irqfd {
+	__u32 fd;
+	__u32 gsi;
+	__u32 flags;
+	__u8  pad[20];
+};

Should define gsi (and how it's used by KVM) somewhere. AFAICT it's an ACPI-ism.

-Scott
Re: RFC: vfio / device assignment -- layout of device fd files
On 08/29/2011 02:51 PM, Alex Williamson wrote: On Mon, 2011-08-29 at 16:51 +, Yoder Stuart-B08248 wrote: The device info records following the file header have the following record types each with content encoded in a record specific way: REGION - describes an addressable address range for the device DTPATH - describes the device tree path for the device DTINDEX - describes the index into the related device tree property (reg,ranges,interrupts,interrupt-map) I don't quite understand if these are physical or virtual. If what are physical or virtual? INTERRUPT - describes an interrupt for the device PCI_CONFIG_SPACE - describes config space for the device I would have expected this to be a REGION with a property of PCI_CONFIG_SPACE. Could be, if physical address is made optional. PCI_INFO - domain:bus:device:func Not entirely sure we need this. How are you imagining we get from a group fd to a device fd (wondering if you're only including this for enumeration)? I'm currently coding it as a VFIO_GROUP_GET_DEVICE_FD ioctl that takes a char* parameter that contains the dev_name() for the device requested. The list of devices under each group can be found by read()ing the group fd. Seems like it would be nice to keep this information around, rather than require the user to pass it around separately. Shouldn't cost much. If we keep this, we should make the interfaces similar, Yes. PCI_BAR_INFO - information about the BAR for a device For a device tree type device the file may look like: 0x0+---+ |header | +---+ | type = REGION | | rec_len | | flags = | u32 // region specific flags | is_mmapable | | offset | u64 // seek offset to region from | |from beginning | len | u64 // length of region | addr| u64 // phys addr of region Would we only need to expose phys addr for 1:1 mapping requirements? I'm not sure why we'd care to expose this otherwise. 
It's more important for non-PCI, where it avoids the need for userspace to parse the device tree to find the guest address (we'll usually want 1:1), or to consolidate pages shared by multiple regions. It could be nice for debugging, as well. Seems like something that's better to have and not need, than the other way around. It would need to be optional, though, if we want to be able to describe things without a physical address like config space. | | +---+ \ type = DTPATH \ // a sub-region | rec_len| | flags | | dev tree path | char[] // device tree path +---+ \ type = DTINDEX \ // a sub-region | rec_len| | flags | | prop_type | u32 // REG, RANGES | prop_index | u32 // index into resource list +---+ | type = INTERRUPT | | rec_len | | flags| u32 | ioctl_handle | u32 // argument to ioctl to get interrupts Is this a dynamic ioctl number or just a u32 parameter to an ioctl like VFIO_DEVICE_GET_IRQ_FD (ie. an instance number)? It's a parameter to VFIO_DEVICE_GET_IRQ_FD. +---+ | type = PCI_INFO | | rec_len | | flags = 0x0 | | dom:bus:dev:func| u32 // pci device info +---+ | type = REGION | | rec_len | | flags = | | is_mmapable | | offset | u64 // seek offset to region from | |from beginning | len | u64 // length of region | addr| u64 // physical addr of region +---+ \ type = PCI_BAR_INFO\ | rec_len| | flags | | bar_type | // pio | | // prefetable mmio | | // non-prefetchable mmmio | bar_index | // index of bar in device Aren't a lot of these typical region attributes? Wondering if we should just make them part of the REGION flags or we'll have a growing number of sub-regions describing common things. It'd be nice to keep the base region record common among PCI/DT/whatever.
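The header/record framing sketched in this thread can be exercised with a toy parser. The struct name and field order below are guesses reconstructed from the ASCII layout (type, rec_len in bytes including the record header, flags), not a settled ABI:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Guessed framing from the RFC layout: every record starts with
 * type, total length in bytes (header included), and flags. */
struct dev_info_rec {
    uint32_t type;
    uint32_t rec_len;
    uint32_t flags;
    /* type-specific content follows */
};

/* Walk a buffer of back-to-back records, counting those of one type. */
int count_records(const uint8_t *buf, size_t len, uint32_t want_type)
{
    size_t off = 0;
    int n = 0;

    while (off + sizeof(struct dev_info_rec) <= len) {
        struct dev_info_rec rec;

        memcpy(&rec, buf + off, sizeof(rec));
        if (rec.rec_len < sizeof(rec) || off + rec.rec_len > len)
            break;          /* malformed record, stop */
        if (rec.type == want_type)
            n++;
        off += rec.rec_len; /* rec_len includes the header */
    }
    return n;
}
```

Because rec_len covers the whole record, userspace can skip record types (and sub-records) it does not understand, which is the extensibility property being argued for in this thread.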
KVM call agenda for August 30
Hi, Please send in any agenda items you are interested in covering. Later, Juan.
Re: RFC: vfio / device assignment -- layout of device fd files
On Mon, 2011-08-29 at 16:58 -0500, Scott Wood wrote: On 08/29/2011 02:51 PM, Alex Williamson wrote: On Mon, 2011-08-29 at 16:51 +, Yoder Stuart-B08248 wrote: The device info records following the file header have the following record types each with content encoded in a record specific way: REGION - describes an addressable address range for the device DTPATH - describes the device tree path for the device DTINDEX - describes the index into the related device tree property (reg,ranges,interrupts,interrupt-map) I don't quite understand if these are physical or virtual. If what are physical or virtual? Can you give an example of a path vs an index? I don't understand enough about these to ask a useful question about what they're describing. INTERRUPT - describes an interrupt for the device PCI_CONFIG_SPACE - describes config space for the device I would have expected this to be a REGION with a property of PCI_CONFIG_SPACE. Could be, if physical address is made optional. Or physical address is also a property, aka sub-region. PCI_INFO - domain:bus:device:func Not entirely sure we need this. How are you imagining we get from a group fd to a device fd (wondering if you're only including this for enumeration)? I'm currently coding it as a VFIO_GROUP_GET_DEVICE_FD ioctl that takes a char* parameter that contains the dev_name() for the device requested. The list of devices under each group can be found by read()ing the group fd. Seems like it would be nice to keep this information around, rather than require the user to pass it around separately. Shouldn't cost much. Seems redundant. The user had to know the d:b:d.f to get the device fd in the first place. At best it's a sanity check. If we keep this, we should make the interfaces similar, Yes. 
PCI_BAR_INFO - information about the BAR for a device For a device tree type device the file may look like: 0x0+---+ |header | +---+ | type = REGION | | rec_len | | flags = | u32 // region specific flags | is_mmapable | | offset | u64 // seek offset to region from | |from beginning | len | u64 // length of region | addr| u64 // phys addr of region Would we only need to expose phys addr for 1:1 mapping requirements? I'm not sure why we'd care to expose this otherwise. It's more important for non-PCI, where it avoids the need for userspace to parse the device tree to find the guest address (we'll usually want 1:1), or to consolidate pages shared by multiple regions. It could be nice for debugging, as well. So the device tree path is ripped straight from the system, so it's the actual 1:1, matching physical hardware, path. Seems like something that's better to have and not need, than the other way around. It would need to be optional, though, if we want to be able to describe things without a physical address like config space. sub-region? | | +---+ \ type = DTPATH \ // a sub-region | rec_len| | flags | | dev tree path | char[] // device tree path +---+ \ type = DTINDEX \ // a sub-region | rec_len| | flags | | prop_type | u32 // REG, RANGES | prop_index | u32 // index into resource list +---+ | type = INTERRUPT | | rec_len | | flags| u32 | ioctl_handle | u32 // argument to ioctl to get interrupts Is this a dynamic ioctl number or just a u32 parameter to an ioctl like VFIO_DEVICE_GET_IRQ_FD (ie. an instance number)? It's a parameter to VFIO_DEVICE_GET_IRQ_FD. +---+ | type = PCI_INFO | | rec_len | | flags = 0x0 | | dom:bus:dev:func| u32 // pci device info +---+ | type = REGION | | rec_len | | flags = | | is_mmapable | | offset | u64 // seek offset to region from | |from beginning | len | u64 // length of region | addr| u64 // physical addr of region +---+ \
Re: RFC: vfio / device assignment -- layout of device fd files
On 08/29/2011 05:46 PM, Alex Williamson wrote: On Mon, 2011-08-29 at 16:58 -0500, Scott Wood wrote: On 08/29/2011 02:51 PM, Alex Williamson wrote: On Mon, 2011-08-29 at 16:51 +, Yoder Stuart-B08248 wrote: The device info records following the file header have the following record types each with content encoded in a record specific way: REGION - describes an addressable address range for the device DTPATH - describes the device tree path for the device DTINDEX - describes the index into the related device tree property (reg,ranges,interrupts,interrupt-map) I don't quite understand if these are physical or virtual. If what are physical or virtual? Can you give an example of a path vs an index? I don't understand enough about these to ask a useful question about what they're describing. You'd have both path and index. Example, for this tree: / { ... foo { ... bar { reg = <0x1000 64 0x1800 64>; ranges = <0 0x2 0x1>; ... child { reg = <0x100 0x100>; ... }; }; }; }; There would be 4 regions if you bind to /foo/bar: // this is 64 bytes at 0x1000 DTPATH /foo/bar DTINDEX prop_type=REG prop_index=0 // this is 64 bytes at 0x1800 DTPATH /foo/bar DTINDEX prop_type=REG prop_index=1 // this is 16K at 0x2 DTPATH /foo/bar DTINDEX prop_type=RANGES prop_index=0 // this is 256 bytes at 0x20100 DTPATH /foo/bar/child DTINDEX prop_type=REG prop_index=0 Both ranges and the child reg are needed, since ranges could be a simple ranges; that passes everything with no translation, and child nodes could be absent-but-implied in some other cases (such as when they represent PCI devices which can be probed -- we still need to map the ranges that correspond to PCI controller windows). INTERRUPT - describes an interrupt for the device PCI_CONFIG_SPACE - describes config space for the device I would have expected this to be a REGION with a property of PCI_CONFIG_SPACE. Could be, if physical address is made optional. Or physical address is also a property, aka sub-region.
A subrecord of REGION is fine with me. Would we only need to expose phys addr for 1:1 mapping requirements? I'm not sure why we'd care to expose this otherwise. It's more important for non-PCI, where it avoids the need for userspace to parse the device tree to find the guest address (we'll usually want 1:1), or to consolidate pages shared by multiple regions. It could be nice for debugging, as well. So the device tree path is ripped straight from the system, so it's the actual 1:1, matching physical hardware, path. Yes. Even for non-PCI we need to know if the region is pio/mmio32/mmio64/prefetchable/etc. Outside of PCI, what standardized form would you put such information in? Where would the kernel get this information? What does mmio32/mmio64 mean in this context? I could imagine a platform device described by ACPI that might want to differentiate. The physical device doesn't get moved of course, but guest drivers might care how the device is described if we need to rebuild those ACPI tables. ACPI might even be a good place to leverage these data structures... /me ducks. ACPI info could be another subrecord type, but in the device tree system-bus case we generally don't have this information at the generic infrastructure level. Drivers are expected to know how their devices' regions should be mapped. BAR index could really just translate to a REGION instance number. How would that work if you make non-BAR things (such as config space) into regions? Put their instance numbers outside of the BAR region? We have a fixed REGION space on PCI, so we could just define BAR0 == instance 0, BAR1 == instance 1... ROM == instance 6, CONFIG == instance 0xF (or 7). Seems more awkward than just having each region say what it is. What do you do to fill in the gaps? -Scott
[PATCH 3/4] Add generic stubs for kvm stop check functions
This function is called from the watchdog code when a soft lockup is detected. If this is an arch that does not support pvclock, this function is used. Signed-off-by: Eric B Munson emun...@mgebm.net --- include/asm-generic/pvclock.h | 14 ++ 1 files changed, 14 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/pvclock.h diff --git a/include/asm-generic/pvclock.h b/include/asm-generic/pvclock.h new file mode 100644 index 000..ff046b6 --- /dev/null +++ b/include/asm-generic/pvclock.h @@ -0,0 +1,14 @@ +#ifndef _ASM_GENERIC_PVCLOCK_H +#define _ASM_GENERIC_PVCLOCK_H + + +/* + * These functions are used by architectures that support kvm to avoid issuing + * false soft lockup messages. + */ +static inline bool kvm_check_and_clear_host_stopped(int cpu) +{ + return false; +} + +#endif -- 1.7.4.1
[PATCH 2/4] Add functions to check if the host has stopped the vm
When a host stops or suspends a VM it will set a flag to show this. The watchdog will use these functions to determine if a softlockup is real, or the result of a suspended VM. Signed-off-by: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/pvclock.h |2 ++ arch/x86/kernel/kvmclock.c | 14 ++ 2 files changed, 16 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index a518c0a..dd59ad0 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -13,6 +13,8 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall, struct timespec *ts); void pvclock_resume(void); +bool kvm_check_and_clear_host_stopped(int cpu); + /* * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, * yielding a 64-bit result. diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index c1a0188..5f60d2b 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -113,6 +113,20 @@ static void kvm_get_preset_lpj(void) preset_lpj = lpj; } +bool kvm_check_and_clear_host_stopped(int cpu) +{ + bool ret = false; + struct pvclock_vcpu_time_info *src; + + src = &per_cpu(hv_clock, cpu); + if ((src->flags & PVCLOCK_GUEST_STOPPED) != 0) { + src->flags = src->flags & (~PVCLOCK_GUEST_STOPPED); + ret = true; + } + + return ret; +} + static struct clocksource kvm_clock = { .name = "kvm-clock", .read = kvm_clock_get_cycles, -- 1.7.4.1
[PATCH 0/4] Avoid soft lockup message when KVM is stopped by host
Currently, when qemu stops a guest kernel that guest will issue a soft lockup message when it resumes. This set provides the ability for qemu to communicate to the guest that it has been stopped. When the guest hits the watchdog on resume it will check if it was suspended before issuing the warning. Eric B Munson (4): Add flag to indicate that a vm was stopped by the host Add functions to check if the host has stopped the vm Add generic stubs for kvm stop check functions Add check for suspended vm in softlockup detector arch/x86/include/asm/pvclock-abi.h |1 + arch/x86/include/asm/pvclock.h |2 ++ arch/x86/kernel/kvmclock.c | 14 ++ include/asm-generic/pvclock.h | 14 ++ kernel/watchdog.c | 12 5 files changed, 43 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/pvclock.h -- 1.7.4.1
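The handshake this series relies on is a simple test-and-clear of one flag bit in the per-cpu pvclock page: the host sets it when it stops the guest, and the guest's watchdog clears it the first time it checks. A userspace-testable sketch of that bit manipulation (the flag value is from patch 1 of the series; the flags variable here is an illustrative stand-in for the pvclock structure field):

```c
#include <stdint.h>

#define PVCLOCK_GUEST_STOPPED (1 << 1)  /* from patch 1 of this series */

/* Illustrative stand-in for the flags field of the shared
 * pvclock_vcpu_time_info page. */
static uint8_t clock_flags;

/* What the host side would do when it stops the guest. */
void host_sets_stopped(void)
{
    clock_flags |= PVCLOCK_GUEST_STOPPED;
}

/* Mirror of kvm_check_and_clear_host_stopped(): report whether the
 * host stopped us, clearing the flag so a stop is reported once. */
int check_and_clear_host_stopped(void)
{
    if (clock_flags & PVCLOCK_GUEST_STOPPED) {
        clock_flags &= ~PVCLOCK_GUEST_STOPPED;
        return 1;
    }
    return 0;
}
```

The clear-on-read behavior matters: a single host-side stop should suppress exactly one watchdog warning, not all future ones.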
[PATCH 4/4] Add check for suspended vm in softlockup detector
A suspended VM can cause spurious soft lockup warnings. To avoid these, the watchdog now checks if the kernel knows it was stopped by the host and skips the warning if so. Signed-off-by: Eric B Munson emun...@mgebm.net --- kernel/watchdog.c | 12 1 files changed, 12 insertions(+), 0 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 36491cd..4cbb69f 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -25,6 +25,7 @@ #include <linux/sysctl.h> #include <asm/irq_regs.h> +#include <asm/pvclock.h> #include <linux/perf_event.h> int watchdog_enabled = 1; @@ -292,6 +293,17 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) */ duration = is_softlockup(touch_ts); if (unlikely(duration)) { + /* +* If a virtual machine is stopped by the host it can look to +* the watchdog like a soft lockup, check to see if the host +* stopped the vm before we issue the warning +*/ + if (kvm_check_and_clear_host_stopped(get_cpu())) { + put_cpu(); + return HRTIMER_RESTART; + } + put_cpu(); + /* only warn once */ if (__this_cpu_read(soft_watchdog_warn) == true) return HRTIMER_RESTART; -- 1.7.4.1
[PATCH 1/4] Add flag to indicate that a vm was stopped by the host
This flag will be used to check if the vm was stopped by the host when a soft lockup was detected. Signed-off-by: Eric B Munson emun...@mgebm.net --- arch/x86/include/asm/pvclock-abi.h |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h index 35f2d19..6167fd7 100644 --- a/arch/x86/include/asm/pvclock-abi.h +++ b/arch/x86/include/asm/pvclock-abi.h @@ -40,5 +40,6 @@ struct pvclock_wall_clock { } __attribute__((__packed__)); #define PVCLOCK_TSC_STABLE_BIT (1 << 0) +#define PVCLOCK_GUEST_STOPPED (1 << 1) #endif /* __ASSEMBLY__ */ #endif /* _ASM_X86_PVCLOCK_ABI_H */ -- 1.7.4.1
[PATCH v2] KVM: APIC: avoid instruction emulation for EOI writes
v2 changes: define exit qualification fields for APIC-Access in vmx.h use apic_reg_write instead of apic_set_eoi, to avoid breaking tracing add fasteoi option to allow disabling this acceleration commit 2a66a12cb6928c962d24907e6d39b6eb9ac94b4b Author: Kevin Tian kevin.t...@intel.com Date: Mon Aug 29 13:08:28 2011 +0800 KVM: APIC: avoid instruction emulation for EOI writes Instruction emulation for EOI writes can be skipped, since sane guest simply uses MOV instead of string operations. This is a nice improvement when guest doesn't support x2apic or hyper-V EOI support. a single VM bandwidth is observed with ~8% bandwidth improvement (7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation. Signed-off-by: Kevin Tian kevin.t...@intel.com Based on earlier work from: Signed-off-by: Eddie Dong eddie.d...@intel.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h index 2caf290..31f180c 100644 --- a/arch/x86/include/asm/vmx.h +++ b/arch/x86/include/asm/vmx.h @@ -350,6 +350,18 @@ enum vmcs_field { #define DEBUG_REG_ACCESS_REG(eq) (((eq) >> 8) & 0xf) /* 11:8, general purpose reg.
*/ +/* + * Exit Qualifications for APIC-Access + */ +#define APIC_ACCESS_OFFSET 0xfff /* 11:0, offset within the APIC page */ +#define APIC_ACCESS_TYPE 0xf000 /* 15:12, access type */ +#define TYPE_LINEAR_APIC_INST_READ (0 << 12) +#define TYPE_LINEAR_APIC_INST_WRITE (1 << 12) +#define TYPE_LINEAR_APIC_INST_FETCH (2 << 12) +#define TYPE_LINEAR_APIC_EVENT (3 << 12) +#define TYPE_PHYSICAL_APIC_EVENT (10 << 12) +#define TYPE_PHYSICAL_APIC_INST (15 << 12) + /* segment AR */ #define SEGMENT_AR_L_MASK (1 << 13) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 57dcbd4..52645f2 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this, return 0; } +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu) +{ + struct kvm_lapic *apic = vcpu->arch.apic; + + if (apic) + apic_reg_write(vcpu->arch.apic, APIC_EOI, 0); +} +EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi); + void kvm_free_lapic(struct kvm_vcpu *vcpu) { if (!vcpu->arch.apic) diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 52c9e6b..8287243 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu); void kvm_lapic_reset(struct kvm_vcpu *vcpu); u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu); void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8); +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu); void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value); u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu); void kvm_apic_set_version(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 5e8d411..9d0364b 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -71,6 +71,9 @@ module_param(vmm_exclusive, bool, S_IRUGO); static int __read_mostly yield_on_hlt = 1; module_param(yield_on_hlt, bool, S_IRUGO); +static int __read_mostly fasteoi = 1; +module_param(fasteoi, bool, S_IRUGO); + /* * If nested=1, nested virtualization is supported, i.e., guests may use * VMX and be
a hypervisor for its own guests. If nested=0, guests may not @@ -4540,6 +4543,24 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu) static int handle_apic_access(struct kvm_vcpu *vcpu) { + if (likely(fasteoi)) { + unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + int access_type, offset; + + access_type = exit_qualification & APIC_ACCESS_TYPE; + offset = exit_qualification & APIC_ACCESS_OFFSET; + /* +* Sane guest uses MOV to write EOI, with written value +* not cared. So make a short-circuit here by avoiding +* heavy instruction emulation. +*/ + if ((access_type == TYPE_LINEAR_APIC_INST_WRITE) && + (offset == APIC_EOI)) { + kvm_lapic_set_eoi(vcpu); + skip_emulated_instruction(vcpu); + return 1; + } + } return emulate_instruction(vcpu, 0) == EMULATE_DONE; } Thanks Kevin
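The fast path above hinges on two masks applied to the APIC-access exit qualification; how they decode a raw qualification value can be checked in isolation. The constants are copied from the vmx.h hunk in this patch (APIC_EOI is the standard 0xB0 register offset); the sample values in the usage are made up:

```c
/* Constants from the vmx.h hunk in this patch. */
#define APIC_ACCESS_OFFSET          0xfff     /* bits 11:0, offset in page */
#define APIC_ACCESS_TYPE            0xf000    /* bits 15:12, access type */
#define TYPE_LINEAR_APIC_INST_WRITE (1 << 12)
#define APIC_EOI                    0xB0      /* EOI register offset */

/* Decode an exit qualification the way handle_apic_access() does and
 * report whether this exit is a linear write to the EOI register. */
int is_eoi_write(unsigned long exit_qualification)
{
    int access_type = exit_qualification & APIC_ACCESS_TYPE;
    int offset      = exit_qualification & APIC_ACCESS_OFFSET;

    return access_type == TYPE_LINEAR_APIC_INST_WRITE && offset == APIC_EOI;
}
```

Only when both conditions hold does the patch skip the emulator; any other access type or register offset still falls through to full instruction emulation.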
Re: kvm PCI assignment VFIO ramblings
On Fri, Aug 26, 2011 at 01:17:05PM -0700, Aaron Fabbri wrote: [snip] Yes. In essence, I'd rather not have to run any other admin processes. Doing things programmatically, on the fly, from each process, is the cleanest model right now. The persistent group model doesn't necessarily prevent that. There's no reason your program can't use the administrative interface as well as the use interface, and I don't see that making the admin interface separate and persistent makes this any harder. -- David Gibson | I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson
[PATCH v3 01/11] KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write
kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the function when spte is prefetched; unfortunately, we can not know how many sptes need to be prefetched on this path, which means we can run out of free pte_list_desc objects in the cache, and BUG_ON() is triggered. Also some paths do not fill the cache, such as emulation of an INS instruction that does not trigger a page fault. Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 25 - 1 files changed, 20 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 5d7fbf0..b01afee 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -592,6 +592,11 @@ static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache, return 0; } +static int mmu_memory_cache_free_objects(struct kvm_mmu_memory_cache *cache) +{ + return cache->nobjs; +} + static void mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc, struct kmem_cache *cache) { @@ -969,6 +974,14 @@ static unsigned long *gfn_to_rmap(struct kvm *kvm, gfn_t gfn, int level) return linfo->rmap_pde; } +static bool rmap_can_add(struct kvm_vcpu *vcpu) +{ + struct kvm_mmu_memory_cache *cache; + + cache = &vcpu->arch.mmu_pte_list_desc_cache; + return mmu_memory_cache_free_objects(cache); +} + static int rmap_add(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn) { struct kvm_mmu_page *sp; @@ -3585,6 +3598,12 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, break; } + /* +* No need to care whether allocation memory is successful +* or not since pte prefetch is skipped if it does not have +* enough objects in the cache.
+*/ + mmu_topup_memory_caches(vcpu); spin_lock(&vcpu->kvm->mmu_lock); if (atomic_read(&vcpu->kvm->arch.invlpg_counter) != invlpg_counter) gentry = 0; @@ -3655,7 +3674,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, mmu_page_zap_pte(vcpu->kvm, sp, spte); if (gentry && !((sp->role.word ^ vcpu->arch.mmu.base_role.word) - & mask.word)) + & mask.word) && rmap_can_add(vcpu)) mmu_pte_write_new_pte(vcpu, sp, spte, gentry); if (!remote_flush && need_remote_flush(entry, *spte)) remote_flush = true; @@ -3716,10 +3735,6 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, goto out; } - r = mmu_topup_memory_caches(vcpu); - if (r) - goto out; - er = x86_emulate_instruction(vcpu, cr2, 0, insn, insn_len); switch (er) { -- 1.7.5.4
[PATCH v3 02/11] KVM: x86: tag the instructions which are used to write page table
The idea is from Avi: | tag instructions that are typically used to modify the page tables, and | drop shadow if any other instruction is used. | The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg, | and cmpxchg8b. This patch is used to tag the instructions and in the later path, shadow page is dropped if it is written by other instructions Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/emulate.c | 35 --- 1 files changed, 20 insertions(+), 15 deletions(-) diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index 0453c07..e24c269 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -82,6 +82,7 @@ #define RMExt (4 << 15) /* Opcode extension in ModRM r/m if mod == 3 */ #define Sse (1 << 18) /* SSE Vector instruction */ /* Misc flags */ +#define PageTable (1 << 19) /* instruction used to write page table */ #define Prot (1 << 21) /* instruction generates #UD if not in prot-mode */ #define VendorSpecific (1 << 22) /* Vendor specific instruction */ #define NoAccess (1 << 23) /* Don't access memory (lea/invlpg/verr etc) */ @@ -3018,10 +3019,10 @@ static struct opcode group7_rm7[] = { static struct opcode group1[] = { I(Lock, em_add), - I(Lock, em_or), + I(Lock | PageTable, em_or), I(Lock, em_adc), I(Lock, em_sbb), - I(Lock, em_and), + I(Lock | PageTable, em_and), I(Lock, em_sub), I(Lock, em_xor), I(0, em_cmp), @@ -3076,18 +3077,21 @@ static struct group_dual group7 = { { static struct opcode group8[] = { N, N, N, N, - D(DstMem | SrcImmByte | ModRM), D(DstMem | SrcImmByte | ModRM | Lock), - D(DstMem | SrcImmByte | ModRM | Lock), D(DstMem | SrcImmByte | ModRM | Lock), + D(DstMem | SrcImmByte | ModRM), + D(DstMem | SrcImmByte | ModRM | Lock | PageTable), + D(DstMem | SrcImmByte | ModRM | Lock), + D(DstMem | SrcImmByte | ModRM | Lock | PageTable), }; static struct group_dual group9 = { { - N, D(DstMem64 | ModRM | Lock), N, N, N, N, N, N, + N, D(DstMem64 | ModRM | Lock | PageTable), N, N, N, N, N, N, }, { N, N, N, N, N, N, N,
N, } }; static struct opcode group11[] = { - I(DstMem | SrcImm | ModRM | Mov, em_mov), X7(D(Undefined)), + I(DstMem | SrcImm | ModRM | Mov | PageTable, em_mov), + X7(D(Undefined)), }; static struct gprefix pfx_0f_6f_0f_7f = { @@ -3099,7 +3103,7 @@ static struct opcode opcode_table[256] = { I6ALU(Lock, em_add), D(ImplicitOps | Stack | No64), D(ImplicitOps | Stack | No64), /* 0x08 - 0x0F */ - I6ALU(Lock, em_or), + I6ALU(Lock | PageTable, em_or), D(ImplicitOps | Stack | No64), N, /* 0x10 - 0x17 */ I6ALU(Lock, em_adc), @@ -3108,7 +3112,7 @@ static struct opcode opcode_table[256] = { I6ALU(Lock, em_sbb), D(ImplicitOps | Stack | No64), D(ImplicitOps | Stack | No64), /* 0x20 - 0x27 */ - I6ALU(Lock, em_and), N, N, + I6ALU(Lock | PageTable, em_and), N, N, /* 0x28 - 0x2F */ I6ALU(Lock, em_sub), N, I(ByteOp | DstAcc | No64, em_das), /* 0x30 - 0x37 */ @@ -3141,11 +3145,11 @@ static struct opcode opcode_table[256] = { G(ByteOp | DstMem | SrcImm | ModRM | No64 | Group, group1), G(DstMem | SrcImmByte | ModRM | Group, group1), I2bv(DstMem | SrcReg | ModRM, em_test), - I2bv(DstMem | SrcReg | ModRM | Lock, em_xchg), + I2bv(DstMem | SrcReg | ModRM | Lock | PageTable, em_xchg), /* 0x88 - 0x8F */ - I2bv(DstMem | SrcReg | ModRM | Mov, em_mov), + I2bv(DstMem | SrcReg | ModRM | Mov | PageTable, em_mov), I2bv(DstReg | SrcMem | ModRM | Mov, em_mov), - I(DstMem | SrcNone | ModRM | Mov, em_mov_rm_sreg), + I(DstMem | SrcNone | ModRM | Mov | PageTable, em_mov_rm_sreg), D(ModRM | SrcMem | NoAccess | DstReg), I(ImplicitOps | SrcMem16 | ModRM, em_mov_sreg_rm), G(0, group1A), @@ -3158,7 +3162,7 @@ static struct opcode opcode_table[256] = { II(ImplicitOps | Stack, em_popf, popf), N, N, /* 0xA0 - 0xA7 */ I2bv(DstAcc | SrcMem | Mov | MemAbs, em_mov), - I2bv(DstMem | SrcAcc | Mov | MemAbs, em_mov), + I2bv(DstMem | SrcAcc | Mov | MemAbs | PageTable, em_mov), I2bv(SrcSI | DstDI | Mov | String, em_mov), I2bv(SrcSI | DstDI | String, em_cmp), /* 0xA8 - 0xAF */ @@ -3255,18 +3259,19 @@ static struct opcode 
twobyte_table[256] = { D(DstMem | SrcReg | Src2CL | ModRM), N, N, /* 0xA8 - 0xAF */ D(ImplicitOps | Stack), D(ImplicitOps | Stack), - DI(ImplicitOps, rsm), D(DstMem | SrcReg | ModRM | BitOp | Lock), + DI(ImplicitOps, rsm), + D(DstMem | SrcReg | ModRM | BitOp | Lock | PageTable), D(DstMem | SrcReg | Src2ImmByte | ModRM), D(DstMem | SrcReg | Src2CL | ModRM), D(ModRM), I(DstReg | SrcMem | ModRM,
[PATCH v3 03/11] KVM: x86: retry non-page-table writing instruction
If the emulation is caused by #PF and it is non-page_table writing instruction, it means the VM-EXIT is caused by shadow page protected, we can zap the shadow page and retry this instruction directly The idea is from Avi Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_emulate.h |1 + arch/x86/include/asm/kvm_host.h|5 arch/x86/kvm/emulate.c |5 arch/x86/kvm/mmu.c | 22 + arch/x86/kvm/x86.c | 47 5 files changed, 75 insertions(+), 5 deletions(-) diff --git a/arch/x86/include/asm/kvm_emulate.h b/arch/x86/include/asm/kvm_emulate.h index 6040d11..fa87b63 100644 --- a/arch/x86/include/asm/kvm_emulate.h +++ b/arch/x86/include/asm/kvm_emulate.h @@ -362,6 +362,7 @@ enum x86_intercept { #endif int x86_decode_insn(struct x86_emulate_ctxt *ctxt, void *insn, int insn_len); +bool page_table_writing_insn(struct x86_emulate_ctxt *ctxt); #define EMULATION_FAILED -1 #define EMULATION_OK 0 #define EMULATION_RESTART 1 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 6ab4241..27a25df 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -443,6 +443,9 @@ struct kvm_vcpu_arch { cpumask_var_t wbinvd_dirty_mask; + unsigned long last_retry_eip; + unsigned long last_retry_addr; + struct { bool halted; gfn_t gfns[roundup_pow_of_two(ASYNC_PF_PER_VCPU)]; @@ -689,6 +692,7 @@ enum emulation_result { #define EMULTYPE_NO_DECODE (1 0) #define EMULTYPE_TRAP_UD (1 1) #define EMULTYPE_SKIP (1 2) +#define EMULTYPE_RETRY (1 3) int x86_emulate_instruction(struct kvm_vcpu *vcpu, unsigned long cr2, int emulation_type, void *insn, int insn_len); @@ -753,6 +757,7 @@ void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu); void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes, bool guest_initiated); +int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn); int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva); void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); int 
kvm_mmu_load(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c index e24c269..c62424e 100644 --- a/arch/x86/kvm/emulate.c +++ b/arch/x86/kvm/emulate.c @@ -3691,6 +3691,11 @@ done: return (rc != X86EMUL_CONTINUE) ? EMULATION_FAILED : EMULATION_OK; } +bool page_table_writing_insn(struct x86_emulate_ctxt *ctxt) +{ + return ctxt-d PageTable; +} + static bool string_insn_completed(struct x86_emulate_ctxt *ctxt) { /* The second termination condition only applies for REPE diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index b01afee..26aae11 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1997,7 +1997,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int goal_nr_mmu_pages) kvm-arch.n_max_mmu_pages = goal_nr_mmu_pages; } -static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) +int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) { struct kvm_mmu_page *sp; struct hlist_node *node; @@ -2007,6 +2007,7 @@ static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) pgprintk(%s: looking for gfn %llx\n, __func__, gfn); r = 0; + spin_lock(kvm-mmu_lock); for_each_gfn_indirect_valid_sp(kvm, sp, gfn, node) { pgprintk(%s: gfn %llx role %x\n, __func__, gfn, sp-role.word); @@ -2014,8 +2015,10 @@ static int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn) kvm_mmu_prepare_zap_page(kvm, sp, invalid_list); } kvm_mmu_commit_zap_page(kvm, invalid_list); + spin_unlock(kvm-mmu_lock); return r; } +EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page); static void mmu_unshadow(struct kvm *kvm, gfn_t gfn) { @@ -3697,9 +3700,7 @@ int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva) gpa = kvm_mmu_gva_to_gpa_read(vcpu, gva, NULL); - spin_lock(vcpu-kvm-mmu_lock); r = kvm_mmu_unprotect_page(vcpu-kvm, gpa PAGE_SHIFT); - spin_unlock(vcpu-kvm-mmu_lock); return r; } EXPORT_SYMBOL_GPL(kvm_mmu_unprotect_page_virt); @@ -3720,10 +3721,18 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu) kvm_mmu_commit_zap_page(vcpu-kvm, 
invalid_list); } +static bool is_mmio_page_fault(struct kvm_vcpu *vcpu, gva_t addr) +{ + if (vcpu-arch.mmu.direct_map || mmu_is_nested(vcpu)) + return vcpu_match_mmio_gpa(vcpu, addr); + + return vcpu_match_mmio_gva(vcpu, addr); +} + int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code, void *insn, int insn_len) { - int r; + int r, emulation_type =
[PATCH v3 04/11] KVM: x86: clean up emulated port-in/port-out
Remove the same code between emulator_pio_in_emulated and emulator_pio_out_emulated Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/x86.c | 59 ++- 1 files changed, 26 insertions(+), 33 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1afe59e..85e6b4e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4327,32 +4327,24 @@ static int kernel_pio(struct kvm_vcpu *vcpu, void *pd) return r; } - -static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt, - int size, unsigned short port, void *val, - unsigned int count) +static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size, + unsigned short port, void *val, + unsigned int count, bool in) { - struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); - - if (vcpu-arch.pio.count) - goto data_avail; - - trace_kvm_pio(0, port, size, count); + trace_kvm_pio(!in, port, size, count); vcpu-arch.pio.port = port; - vcpu-arch.pio.in = 1; + vcpu-arch.pio.in = in; vcpu-arch.pio.count = count; vcpu-arch.pio.size = size; if (!kernel_pio(vcpu, vcpu-arch.pio_data)) { - data_avail: - memcpy(val, vcpu-arch.pio_data, size * count); vcpu-arch.pio.count = 0; return 1; } vcpu-run-exit_reason = KVM_EXIT_IO; - vcpu-run-io.direction = KVM_EXIT_IO_IN; + vcpu-run-io.direction = in ? 
KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT; vcpu-run-io.size = size; vcpu-run-io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE; vcpu-run-io.count = count; @@ -4361,36 +4353,37 @@ static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt, return 0; } -static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt, -int size, unsigned short port, -const void *val, unsigned int count) +static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt, + int size, unsigned short port, void *val, + unsigned int count) { struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); + int ret; - trace_kvm_pio(1, port, size, count); - - vcpu-arch.pio.port = port; - vcpu-arch.pio.in = 0; - vcpu-arch.pio.count = count; - vcpu-arch.pio.size = size; - - memcpy(vcpu-arch.pio_data, val, size * count); + if (vcpu-arch.pio.count) + goto data_avail; - if (!kernel_pio(vcpu, vcpu-arch.pio_data)) { + ret = emulator_pio_in_out(vcpu, size, port, val, count, true); + if (ret) { +data_avail: + memcpy(val, vcpu-arch.pio_data, size * count); vcpu-arch.pio.count = 0; return 1; } - vcpu-run-exit_reason = KVM_EXIT_IO; - vcpu-run-io.direction = KVM_EXIT_IO_OUT; - vcpu-run-io.size = size; - vcpu-run-io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE; - vcpu-run-io.count = count; - vcpu-run-io.port = port; - return 0; } +static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt, +int size, unsigned short port, +const void *val, unsigned int count) +{ + struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt); + + memcpy(vcpu-arch.pio_data, val, size * count); + return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false); +} + static unsigned long get_segment_base(struct kvm_vcpu *vcpu, int seg) { return kvm_x86_ops-get_segment_base(vcpu, seg); -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
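The shape of this refactoring can be sketched outside the kernel: one helper carries the state common to both directions, and only the memcpy direction differs in two thin wrappers. Names mirror the patch (pio_in_out, the in flag) but the structure and buffers are simplified stand-ins, assuming kernel_pio() always completes in-kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for the PIO state in vcpu->arch. */
struct pio_state {
    unsigned short port;
    bool in;
    unsigned int count;
    int size;
    unsigned char data[64];   /* stands in for vcpu->arch.pio_data */
};

/* Common path: record direction and geometry once, for both in and out. */
static int pio_in_out(struct pio_state *pio, int size, unsigned short port,
                      unsigned int count, bool in)
{
    pio->port = port;
    pio->in = in;
    pio->count = count;
    pio->size = size;
    return 1;                 /* pretend kernel_pio() handled it */
}

static int pio_in(struct pio_state *pio, int size, unsigned short port,
                  void *val, unsigned int count)
{
    int ret = pio_in_out(pio, size, port, count, true);
    if (ret)
        memcpy(val, pio->data, (size_t)size * count);  /* data -> caller */
    return ret;
}

static int pio_out(struct pio_state *pio, int size, unsigned short port,
                   const void *val, unsigned int count)
{
    memcpy(pio->data, val, (size_t)size * count);      /* caller -> data */
    return pio_in_out(pio, size, port, count, false);
}
```

The asymmetry is exactly the one the patch exploits: out copies into the data buffer before the common call, in copies out of it after, so all other bookkeeping is shared.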
[PATCH v3 05/11] KVM: MMU: do not mark accessed bit on pte write path
In current code, the accessed bit is always set when page fault occurred, do not need to set it on pte write path Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |1 - arch/x86/kvm/mmu.c | 22 +- 2 files changed, 1 insertions(+), 22 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 27a25df..58ea3a7 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -356,7 +356,6 @@ struct kvm_vcpu_arch { gfn_t last_pt_write_gfn; int last_pt_write_count; u64 *last_pte_updated; - gfn_t last_pte_gfn; struct fpu guest_fpu; u64 xcr0; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 26aae11..7ec2a6a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -2206,11 +2206,6 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep, if (set_mmio_spte(sptep, gfn, pfn, pte_access)) return 0; - /* -* We don't set the accessed bit, since we sometimes want to see -* whether the guest actually used the pte (in order to detect -* demand paging). 
-*/ spte = PT_PRESENT_MASK; if (!speculative) spte |= shadow_accessed_mask; @@ -2361,10 +2356,8 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, } } kvm_release_pfn_clean(pfn); - if (speculative) { + if (speculative) vcpu-arch.last_pte_updated = sptep; - vcpu-arch.last_pte_gfn = gfn; - } } static void nonpaging_new_cr3(struct kvm_vcpu *vcpu) @@ -3532,18 +3525,6 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu) return !!(spte (*spte shadow_accessed_mask)); } -static void kvm_mmu_access_page(struct kvm_vcpu *vcpu, gfn_t gfn) -{ - u64 *spte = vcpu-arch.last_pte_updated; - - if (spte -vcpu-arch.last_pte_gfn == gfn -shadow_accessed_mask -!(*spte shadow_accessed_mask) -is_shadow_present_pte(*spte)) - set_bit(PT_ACCESSED_SHIFT, (unsigned long *)spte); -} - void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new, int bytes, bool guest_initiated) @@ -3614,7 +3595,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, ++vcpu-kvm-stat.mmu_pte_write; trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE); if (guest_initiated) { - kvm_mmu_access_page(vcpu, gfn); if (gfn == vcpu-arch.last_pt_write_gfn !last_updated_pte_accessed(vcpu)) { ++vcpu-arch.last_pt_write_count; -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 06/11] KVM: MMU: cleanup FNAME(invlpg)
Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the same code between FNAME(invlpg) and FNAME(sync_page) Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 16 ++-- arch/x86/kvm/paging_tmpl.h | 42 +++--- 2 files changed, 25 insertions(+), 33 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 7ec2a6a..ed3e778 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1808,7 +1808,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, } } -static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, +static bool mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, u64 *spte) { u64 pte; @@ -1816,17 +1816,21 @@ static void mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp, pte = *spte; if (is_shadow_present_pte(pte)) { - if (is_last_spte(pte, sp-role.level)) + if (is_last_spte(pte, sp-role.level)) { drop_spte(kvm, spte); - else { + if (is_large_pte(pte)) + --kvm-stat.lpages; + } else { child = page_header(pte PT64_BASE_ADDR_MASK); drop_parent_pte(child, spte); } - } else if (is_mmio_spte(pte)) + return true; + } + + if (is_mmio_spte(pte)) mmu_spte_clear_no_track(spte); - if (is_large_pte(pte)) - --kvm-stat.lpages; + return false; } static void kvm_mmu_page_unlink_children(struct kvm *kvm, diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 9299410..7862c05 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -656,6 +656,16 @@ out_unlock: return 0; } +static gpa_t FNAME(get_first_pte_gpa)(struct kvm_mmu_page *sp) +{ + int offset = 0; + + if (PTTYPE == 32) + offset = sp-role.quadrant PT64_LEVEL_BITS; + + return gfn_to_gpa(sp-gfn) + offset * sizeof(pt_element_t); +} + static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva) { struct kvm_shadow_walk_iterator iterator; @@ -663,7 +673,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva) gpa_t pte_gpa = -1; int level; u64 *sptep; - int need_flush 
= 0; vcpu_clear_mmio_info(vcpu, gva); @@ -675,36 +684,20 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva) sp = page_header(__pa(sptep)); if (is_last_spte(*sptep, level)) { - int offset, shift; - if (!sp-unsync) break; - shift = PAGE_SHIFT - - (PT_LEVEL_BITS - PT64_LEVEL_BITS) * level; - offset = sp-role.quadrant shift; - - pte_gpa = (sp-gfn PAGE_SHIFT) + offset; + pte_gpa = FNAME(get_first_pte_gpa)(sp); pte_gpa += (sptep - sp-spt) * sizeof(pt_element_t); - if (is_shadow_present_pte(*sptep)) { - if (is_large_pte(*sptep)) - --vcpu-kvm-stat.lpages; - drop_spte(vcpu-kvm, sptep); - need_flush = 1; - } else if (is_mmio_spte(*sptep)) - mmu_spte_clear_no_track(sptep); - - break; + if (mmu_page_zap_pte(vcpu-kvm, sp, sptep)) + kvm_flush_remote_tlbs(vcpu-kvm); } if (!is_shadow_present_pte(*sptep) || !sp-unsync_children) break; } - if (need_flush) - kvm_flush_remote_tlbs(vcpu-kvm); - atomic_inc(vcpu-kvm-arch.invlpg_counter); spin_unlock(vcpu-kvm-mmu_lock); @@ -769,19 +762,14 @@ static gpa_t FNAME(gva_to_gpa_nested)(struct kvm_vcpu *vcpu, gva_t vaddr, */ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp) { - int i, offset, nr_present; + int i, nr_present = 0; bool host_writable; gpa_t first_pte_gpa; - offset = nr_present = 0; - /* direct kvm_mmu_page can not be unsync. */ BUG_ON(sp-role.direct); - if (PTTYPE == 32) - offset = sp-role.quadrant PT64_LEVEL_BITS; - - first_pte_gpa = gfn_to_gpa(sp-gfn) + offset * sizeof(pt_element_t); + first_pte_gpa = FNAME(get_first_pte_gpa)(sp); for (i = 0; i PT64_ENT_PER_PAGE; i++) { unsigned pte_access; -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More
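The FNAME(get_first_pte_gpa) helper factored out above can be illustrated standalone. For PTTYPE == 32, a 1024-entry guest page table is shadowed in two 512-entry halves and role.quadrant selects the half, so the start of the guest ptes covered by a shadow page is offset into the guest page. This is a sketch with the usual x86 constants, not the kernel code; pte_bytes stands in for sizeof(pt_element_t).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT      12
#define PT64_LEVEL_BITS 9

/* Guest-physical address of the first pte covered by a shadow page. */
static uint64_t first_pte_gpa(uint64_t gfn, unsigned quadrant,
                              int pttype, size_t pte_bytes)
{
    unsigned offset = 0;

    if (pttype == 32)
        offset = quadrant << PT64_LEVEL_BITS;   /* 0 or 512 entries in */

    return (gfn << PAGE_SHIFT) + (uint64_t)offset * pte_bytes;
}
```

With 4-byte guest ptes, quadrant 1 lands 512 * 4 = 2048 bytes into the page, i.e. its second half, which is why both FNAME(invlpg) and FNAME(sync_page) can now share this one computation.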
[PATCH v3 07/11] KVM: MMU: fast prefetch spte on invlpg path
Fast prefetch spte for the unsync shadow page on invlpg path Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |4 +--- arch/x86/kvm/mmu.c | 38 +++--- arch/x86/kvm/paging_tmpl.h | 30 ++ arch/x86/kvm/x86.c |4 ++-- 4 files changed, 36 insertions(+), 40 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 58ea3a7..927ba73 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -460,7 +460,6 @@ struct kvm_arch { unsigned int n_requested_mmu_pages; unsigned int n_max_mmu_pages; unsigned int indirect_shadow_pages; - atomic_t invlpg_counter; struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; /* * Hash table of struct kvm_mmu_page. @@ -754,8 +753,7 @@ int fx_init(struct kvm_vcpu *vcpu); void kvm_mmu_flush_tlb(struct kvm_vcpu *vcpu); void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, - const u8 *new, int bytes, - bool guest_initiated); + const u8 *new, int bytes); int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn); int kvm_mmu_unprotect_page_virt(struct kvm_vcpu *vcpu, gva_t gva); void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index ed3e778..f6de2fc 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3530,8 +3530,7 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu) } void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, - const u8 *new, int bytes, - bool guest_initiated) + const u8 *new, int bytes) { gfn_t gfn = gpa PAGE_SHIFT; union kvm_mmu_page_role mask = { .word = 0 }; @@ -3540,7 +3539,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, LIST_HEAD(invalid_list); u64 entry, gentry, *spte; unsigned pte_size, page_offset, misaligned, quadrant, offset; - int level, npte, invlpg_counter, r, flooded = 0; + int level, npte, r, flooded = 0; bool remote_flush, local_flush, zap_page; /* @@ -3555,19 +3554,16 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, 
gpa_t gpa, pgprintk(%s: gpa %llx bytes %d\n, __func__, gpa, bytes); - invlpg_counter = atomic_read(vcpu-kvm-arch.invlpg_counter); - /* * Assume that the pte write on a page table of the same type * as the current vcpu paging mode since we update the sptes only * when they have the same mode. */ - if ((is_pae(vcpu) bytes == 4) || !new) { + if (is_pae(vcpu) bytes == 4) { /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ - if (is_pae(vcpu)) { - gpa = ~(gpa_t)7; - bytes = 8; - } + gpa = ~(gpa_t)7; + bytes = 8; + r = kvm_read_guest(vcpu-kvm, gpa, gentry, min(bytes, 8)); if (r) gentry = 0; @@ -3593,22 +3589,18 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, */ mmu_topup_memory_caches(vcpu); spin_lock(vcpu-kvm-mmu_lock); - if (atomic_read(vcpu-kvm-arch.invlpg_counter) != invlpg_counter) - gentry = 0; kvm_mmu_free_some_pages(vcpu); ++vcpu-kvm-stat.mmu_pte_write; trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE); - if (guest_initiated) { - if (gfn == vcpu-arch.last_pt_write_gfn -!last_updated_pte_accessed(vcpu)) { - ++vcpu-arch.last_pt_write_count; - if (vcpu-arch.last_pt_write_count = 3) - flooded = 1; - } else { - vcpu-arch.last_pt_write_gfn = gfn; - vcpu-arch.last_pt_write_count = 1; - vcpu-arch.last_pte_updated = NULL; - } + if (gfn == vcpu-arch.last_pt_write_gfn +!last_updated_pte_accessed(vcpu)) { + ++vcpu-arch.last_pt_write_count; + if (vcpu-arch.last_pt_write_count = 3) + flooded = 1; + } else { + vcpu-arch.last_pt_write_gfn = gfn; + vcpu-arch.last_pt_write_count = 1; + vcpu-arch.last_pte_updated = NULL; } mask.cr0_wp = mask.cr4_pae = mask.nxe = 1; diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 7862c05..3395ab2 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -670,20 +670,27 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva) { struct kvm_shadow_walk_iterator iterator; struct kvm_mmu_page *sp; - gpa_t pte_gpa = -1;
[PATCH v3 08/11] KVM: MMU: remove unnecessary kvm_mmu_free_some_pages
In kvm_mmu_pte_write we do not need to allocate a shadow page, so calling kvm_mmu_free_some_pages there is unnecessary.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 1 -
 1 file changed, 0 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index f6de2fc..9ac0dc8 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3589,7 +3589,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
 	 */
 	mmu_topup_memory_caches(vcpu);
 	spin_lock(&vcpu->kvm->mmu_lock);
-	kvm_mmu_free_some_pages(vcpu);
 	++vcpu->kvm->stat.mmu_pte_write;
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE);
 	if (gfn == vcpu->arch.last_pt_write_gfn
-- 
1.7.5.4
[PATCH v3 09/11] KVM: MMU: split kvm_mmu_pte_write function
kvm_mmu_pte_write is too long, we split it for better readable Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/kvm/mmu.c | 187 +++- 1 files changed, 112 insertions(+), 75 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 9ac0dc8..cfe24fe 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3529,48 +3529,28 @@ static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu) return !!(spte (*spte shadow_accessed_mask)); } -void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, - const u8 *new, int bytes) +static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa, + const u8 *new, int *bytes) { - gfn_t gfn = gpa PAGE_SHIFT; - union kvm_mmu_page_role mask = { .word = 0 }; - struct kvm_mmu_page *sp; - struct hlist_node *node; - LIST_HEAD(invalid_list); - u64 entry, gentry, *spte; - unsigned pte_size, page_offset, misaligned, quadrant, offset; - int level, npte, r, flooded = 0; - bool remote_flush, local_flush, zap_page; - - /* -* If we don't have indirect shadow pages, it means no page is -* write-protected, so we can exit simply. -*/ - if (!ACCESS_ONCE(vcpu-kvm-arch.indirect_shadow_pages)) - return; - - zap_page = remote_flush = local_flush = false; - offset = offset_in_page(gpa); - - pgprintk(%s: gpa %llx bytes %d\n, __func__, gpa, bytes); + u64 gentry; + int r; /* * Assume that the pte write on a page table of the same type * as the current vcpu paging mode since we update the sptes only * when they have the same mode. 
*/ - if (is_pae(vcpu) bytes == 4) { + if (is_pae(vcpu) *bytes == 4) { /* Handle a 32-bit guest writing two halves of a 64-bit gpte */ - gpa = ~(gpa_t)7; - bytes = 8; - - r = kvm_read_guest(vcpu-kvm, gpa, gentry, min(bytes, 8)); + *gpa = ~(gpa_t)7; + *bytes = 8; + r = kvm_read_guest(vcpu-kvm, *gpa, gentry, min(*bytes, 8)); if (r) gentry = 0; new = (const u8 *)gentry; } - switch (bytes) { + switch (*bytes) { case 4: gentry = *(const u32 *)new; break; @@ -3582,71 +3562,128 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, break; } - /* -* No need to care whether allocation memory is successful -* or not since pte prefetch is skiped if it does not have -* enough objects in the cache. -*/ - mmu_topup_memory_caches(vcpu); - spin_lock(vcpu-kvm-mmu_lock); - ++vcpu-kvm-stat.mmu_pte_write; - trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PTE_WRITE); + return gentry; +} + +/* + * If we're seeing too many writes to a page, it may no longer be a page table, + * or we may be forking, in which case it is better to unmap the page. + */ +static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn) +{ + bool flooded = false; + if (gfn == vcpu-arch.last_pt_write_gfn !last_updated_pte_accessed(vcpu)) { ++vcpu-arch.last_pt_write_count; if (vcpu-arch.last_pt_write_count = 3) - flooded = 1; + flooded = true; } else { vcpu-arch.last_pt_write_gfn = gfn; vcpu-arch.last_pt_write_count = 1; vcpu-arch.last_pte_updated = NULL; } + return flooded; +} + +/* + * Misaligned accesses are too much trouble to fix up; also, they usually + * indicate a page is not used as a page table. + */ +static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa, + int bytes) +{ + unsigned offset, pte_size, misaligned; + + pgprintk(misaligned: gpa %llx bytes %d role %x\n, +gpa, bytes, sp-role.word); + + offset = offset_in_page(gpa); + pte_size = sp-role.cr4_pae ? 
8 : 4; + misaligned = (offset ^ (offset + bytes - 1)) ~(pte_size - 1); + misaligned |= bytes 4; + + return misaligned; +} + +static u64 *get_written_sptes(struct kvm_mmu_page *sp, gpa_t gpa, int *nspte) +{ + unsigned page_offset, quadrant; + u64 *spte; + int level; + + page_offset = offset_in_page(gpa); + level = sp-role.level; + *nspte = 1; + if (!sp-role.cr4_pae) { + page_offset = 1; /* 32-64 */ + /* +* A 32-bit pde maps 4MB while the shadow pdes map +* only 2MB. So we need to double the offset again +* and zap two pdes instead of one. +*/ + if (level ==
[PATCH v3 10/11] KVM: MMU: fix detection of misaligned accesses
Sometimes we modify only a single byte of a pte to update a status bit; for example, clear_bit() is used to clear the r/w bit in the Linux kernel, and that function uses the 'andb' instruction. In this case kvm_mmu_pte_write treats the write as a misaligned access and zaps the shadow page.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c | 8 ++++++++
 1 file changed, 8 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cfe24fe..adaa160 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3601,6 +3601,14 @@ static bool detect_write_misaligned(struct kvm_mmu_page *sp, gpa_t gpa,
 	offset = offset_in_page(gpa);
 	pte_size = sp->role.cr4_pae ? 8 : 4;
+
+	/*
+	 * Sometimes, the OS only writes one byte to update status
+	 * bits, for example, in linux, andb instruction is used in clear_bit().
+	 */
+	if (!(offset & (pte_size - 1)) && bytes == 1)
+		return false;
+
 	misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
 	misaligned |= bytes < 4;
-- 
1.7.5.4
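The complete check, including the one-byte exemption this patch adds, is small enough to exercise standalone. This is a self-contained model with the sp/gpa parameters flattened into plain arguments; the bit manipulation is taken from the patch. A write counts as misaligned if it straddles a pte boundary or is shorter than 4 bytes, except for a single-byte write to the first byte of a pte, which on little-endian x86 is the byte holding the low status bits that clear_bit()'s 'andb' touches.

```c
#include <assert.h>
#include <stdbool.h>

static bool detect_write_misaligned(unsigned offset, int bytes, bool cr4_pae)
{
    unsigned pte_size = cr4_pae ? 8 : 4;   /* PAE ptes are 8 bytes */
    unsigned misaligned;

    /* Byte-sized status-bit update at the start of a pte: not misaligned. */
    if (!(offset & (pte_size - 1)) && bytes == 1)
        return false;

    /* Any differing bits above the pte-size mask mean the write crosses
     * a pte boundary; writes under 4 bytes are suspicious regardless. */
    misaligned = (offset ^ (offset + bytes - 1)) & ~(pte_size - 1);
    misaligned |= bytes < 4;
    return misaligned != 0;
}
```

The xor trick compares the first and last byte addresses of the write: if they differ in any bit outside the low log2(pte_size) bits, the write spans two ptes.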
[PATCH v3 11/11] KVM: MMU: improve write flooding detection
Detecting write-flooding does not work well, when we handle page written, if the last speculative spte is not accessed, we treat the page is write-flooding, however, we can speculative spte on many path, such as pte prefetch, page synced, that means the last speculative spte may be not point to the written page and the written page can be accessed via other sptes, so depends on the Accessed bit of the last speculative spte is not enough Instead of detected page accessed, we can detect whether the spte is accessed after it is written, if the spte is not accessed but it is written frequently, we treat is not a page table or it not used for a long time Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com --- arch/x86/include/asm/kvm_host.h |6 +--- arch/x86/kvm/mmu.c | 57 +-- arch/x86/kvm/paging_tmpl.h | 12 +++- 3 files changed, 26 insertions(+), 49 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 927ba73..9d17238 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -239,6 +239,8 @@ struct kvm_mmu_page { int clear_spte_count; #endif + int write_flooding_count; + struct rcu_head rcu; }; @@ -353,10 +355,6 @@ struct kvm_vcpu_arch { struct kvm_mmu_memory_cache mmu_page_cache; struct kvm_mmu_memory_cache mmu_page_header_cache; - gfn_t last_pt_write_gfn; - int last_pt_write_count; - u64 *last_pte_updated; - struct fpu guest_fpu; u64 xcr0; diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index adaa160..fd5b389 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1652,6 +1652,18 @@ static void init_shadow_page_table(struct kvm_mmu_page *sp) sp-spt[i] = 0ull; } +static void __clear_sp_write_flooding_count(struct kvm_mmu_page *sp) +{ + sp-write_flooding_count = 0; +} + +static void clear_sp_write_flooding_count(u64 *spte) +{ + struct kvm_mmu_page *sp = page_header(__pa(spte)); + + __clear_sp_write_flooding_count(sp); +} + static struct kvm_mmu_page *kvm_mmu_get_page(struct 
kvm_vcpu *vcpu, gfn_t gfn, gva_t gaddr, @@ -1695,6 +1707,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, } else if (sp-unsync) kvm_mmu_mark_parents_unsync(sp); + __clear_sp_write_flooding_count(sp); trace_kvm_mmu_get_page(sp, false); return sp; } @@ -1847,15 +1860,6 @@ static void kvm_mmu_put_page(struct kvm_mmu_page *sp, u64 *parent_pte) mmu_page_remove_parent_pte(sp, parent_pte); } -static void kvm_mmu_reset_last_pte_updated(struct kvm *kvm) -{ - int i; - struct kvm_vcpu *vcpu; - - kvm_for_each_vcpu(i, vcpu, kvm) - vcpu-arch.last_pte_updated = NULL; -} - static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp) { u64 *parent_pte; @@ -1915,7 +1919,6 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp, } sp-role.invalid = 1; - kvm_mmu_reset_last_pte_updated(kvm); return ret; } @@ -2360,8 +2363,6 @@ static void mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep, } } kvm_release_pfn_clean(pfn); - if (speculative) - vcpu-arch.last_pte_updated = sptep; } static void nonpaging_new_cr3(struct kvm_vcpu *vcpu) @@ -3522,13 +3523,6 @@ static void mmu_pte_write_flush_tlb(struct kvm_vcpu *vcpu, bool zap_page, kvm_mmu_flush_tlb(vcpu); } -static bool last_updated_pte_accessed(struct kvm_vcpu *vcpu) -{ - u64 *spte = vcpu-arch.last_pte_updated; - - return !!(spte (*spte shadow_accessed_mask)); -} - static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa, const u8 *new, int *bytes) { @@ -3569,22 +3563,9 @@ static u64 mmu_pte_write_fetch_gpte(struct kvm_vcpu *vcpu, gpa_t *gpa, * If we're seeing too many writes to a page, it may no longer be a page table, * or we may be forking, in which case it is better to unmap the page. 
*/ -static bool detect_write_flooding(struct kvm_vcpu *vcpu, gfn_t gfn) +static bool detect_write_flooding(struct kvm_mmu_page *sp, u64 *spte) { - bool flooded = false; - - if (gfn == vcpu-arch.last_pt_write_gfn -!last_updated_pte_accessed(vcpu)) { - ++vcpu-arch.last_pt_write_count; - if (vcpu-arch.last_pt_write_count = 3) - flooded = true; - } else { - vcpu-arch.last_pt_write_gfn = gfn; - vcpu-arch.last_pt_write_count = 1; - vcpu-arch.last_pte_updated = NULL; - } - - return flooded; + return ++sp-write_flooding_count = 3; } /*
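The new per-shadow-page heuristic is simple enough to model directly. This sketch is illustrative, not the kernel code: each shadow page carries a write_flooding_count that is bumped on every emulated write and reset whenever the page is actually used for translation again (kvm_mmu_get_page in the patch). Three writes with no intervening use mark the page as flooded so it can be zapped.

```c
#include <assert.h>

/* Stand-in for the write_flooding_count field added to kvm_mmu_page. */
struct mmu_page_model {
    int write_flooding_count;
};

/* Called when the page is looked up as a page table again. */
static void clear_write_flooding(struct mmu_page_model *sp)
{
    sp->write_flooding_count = 0;
}

/* Called on every emulated write that hits this shadow page. */
static int detect_write_flooding(struct mmu_page_model *sp)
{
    return ++sp->write_flooding_count >= 3;
}
```

Because the counter lives in the shadow page rather than in per-vcpu last_pt_write_* state, a page written through one spte but still used through others is no longer misclassified, which is the problem the commit message describes.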
Re: VFIO v2 design plan
On Fri, Aug 26, 2011 at 11:05:23AM -0600, Alex Williamson wrote: I don't think too much has changed since the previous email went out, but it seems like a good idea to post a summary in case there were suggestions or objections that I missed. VFIO v2 will rely on the platform iommu driver reporting grouping information. Again, a group is a set of devices for which the iommu cannot differentiate transactions. An example would be a set of devices behind a PCI-to-PCI bridge. All transactions appear to be from the bridge itself rather than devices behind the bridge. Platforms are free to have whatever constraints they need to for what constitutes a group. I posted a rough draft of patch to implement that for the base iommu driver and VT-d, adding an iommu_device_group callback on iommu ops. The iommu base driver also populates an iommu_group sysfs file for each device that's part of a group. Members of the same group return the same value via either the sysfs or iommu_device_group. The value returned is arbitrary, should not be assumed to be persistent across boots, and is left to the iommu driver to generate. There are some implementation details around how to do this without favoring one bus over another, but the interface should be bus/device type agnostic in the end. When the vfio module is loaded, character devices will be created for each group in /dev/vfio/$GROUP. Setting file permissions on these files should be sufficient for providing a user with complete access to the group. Opening this device file provides what we'll call the group fd. The group fd is restricted to only work with a single mm context. Concurrent opens will be denied if the opening process mm does not match. The group fd will provide interfaces for enumerating the devices in the group, returning a file descriptor for each device in the group (the device fd), binding groups together, and returning a file descriptor for iommu operations (the iommu fd). 
A group is viable when all member devices of the group are bound to the vfio driver. Until that point, the group fd only allows enumeration interfaces (ie. listing of group devices). I'm currently thinking enumeration will be done by a simple read() on the device file returning a list of dev_name()s. Ok. Are you envisaging this interface as a virtual file, or as a stream? That is, can you seek around the list of devices like a regular file - in which case, what are the precise semantics when the list is changed by a bind - or is there no meaningful notion of file pointer and read() just gives you the next device - in which case how to you rewind to enumerate the group again. Once the group is viable, the user may bind the group to another group, retrieve the iommu fd, or retrieve device fds. Internally, each of these operations will result in an iommu domain being allocated and all of the devices attached to the domain. The purpose of binding groups is to share the iommu domain. Groups making use of incompatible iommu domains will fail to bind. Groups making use of different mm's will fail to bind. The vfio driver may reject some binding based on domain capabilities, but final veto power is left to the iommu driver[1]. If a user makes use of a group independently and later wishes to bind it to another group, all the device fds and the iommu fd must first be closed. This prevents using a stale iommu fd or accessing devices while the iommu is being switched. Operations on any group fds of a merged group are performed globally on the group (ie. enumerating the devices lists all devices in the merged group, retrieving the iommu fd from any group fd results in the same fd, device fds from any group can be retrieved from any group fd[2]). Groups can be merged and unmerged dynamically. Unmerging a group requires the device fds for the outgoing group are closed. The iommu fd will remain persistent for the remaining merged group. 
As I've said, I prefer a persistent group model rather than this transient group model, but that is not a dealbreaker by itself. How are unmerges specified? I'm also assuming that in this model closing a (bound) group fd will unmerge everything down to atomic groups again. If a device within a group is unbound from the vfio driver while it's in use (iommu fd refcnt > 0 || device fd refcnt > 0), vfio will block the release and send netlink remove requests for every opened device in the group (or merged group). Hrm, I do dislike netlink being yet another aspect of an already complex interface. Would it be possible to do kernel-user notifications with a poll()/read() interface on one of the existing fds instead? If the device fds are not released and subsequently the iommu fd released as well, vfio will kill the user process after some delay. Ouch, this seems to me a problematic semantic. Whether the user process survives depends on whether it processes the remove requests
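The enumeration interface under discussion returns a snapshot list of dev_name()s from a single read() on the group fd. The separator format is not pinned down in the thread, so the parser below is purely hypothetical: it assumes one name per newline-terminated line, operating on a buffer that a real read(2) on /dev/vfio/$GROUP would have filled in (the PCI addresses in the usage example are made up).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count newline-terminated device names in an enumeration snapshot. */
static size_t count_devices(const char *buf, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            n++;
    return n;
}

/* Return a pointer to the start of the n-th name, or NULL if absent. */
static const char *nth_device(const char *buf, size_t n)
{
    while (n-- && (buf = strchr(buf, '\n')))
        buf++;
    return buf;
}
```

Reading the whole list in one call sidesteps the seek-semantics question raised above: the snapshot is consistent at read time, and a caller re-reads from offset zero to re-enumerate after a bind changes the group.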
Re: A non-responsive guest problem
Hi, After changing the clock source from kvm-clock to tsc, everything is OK. It is probably a bug in kvm-clock. These bugs may have been fixed in the latest qemu. Thanks, Paul On Wed, Aug 24, 2011 at 8:47 PM, Paul fly...@gmail.com wrote: Hi, Sometimes this problem happened within a day, but sometimes it was very difficult to reproduce. Previously the clock source of the guest was kvm-clock; now I have changed it to tsc, and the problem has not occurred since. Is it related to the clock source? I see that there are some recent bug fixes for kvm-clock (e.g., http://www.spinics.net/lists/stable-commits/msg11942.html). Anyway, I will update KVM later. Thanks, Paul On Wed, Aug 24, 2011 at 6:24 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Wed, Aug 24, 2011 at 10:02 AM, Xiao Guangrong xiaoguangr...@cn.fujitsu.com wrote: On 08/24/2011 04:40 PM, Paul wrote: Hi, I captured the output of pidstat when the problem was reproduced: bash-4.1# pidstat -p $PID 8966 Linux 2.6.32-71.el6.x86_64 (test) 07/24/11 _x86_64_ (4 CPU) 16:25:15 PID %usr %system %guest %CPU CPU Command 16:25:15 8966 0.14 55.04 115.41 170.59 1 qemu-kvm I tried to reproduce it but failed. I am using the current KVM code; I suggest you test with the new code if possible. Yes, that's a good idea. The issue might already be fixed. But if this is hard to reproduce then perhaps keep the spinning guest around a bit longer so we can poke at it and figure out what is happening. The pidstat output shows us that it's the guest that is spinning, not qemu-kvm userspace. The system time (time spent in the host kernel) is also quite high, so running kvm_stat might show some interesting KVM events happening. Stefan
Re: VFIO v2 design plan
On Tue, 2011-08-30 at 13:04 +1000, David Gibson wrote: On Fri, Aug 26, 2011 at 11:05:23AM -0600, Alex Williamson wrote: I don't think too much has changed since the previous email went out, but it seems like a good idea to post a summary in case there were suggestions or objections that I missed. VFIO v2 will rely on the platform iommu driver reporting grouping information. Again, a group is a set of devices for which the iommu cannot differentiate transactions. An example would be a set of devices behind a PCI-to-PCI bridge. All transactions appear to be from the bridge itself rather than devices behind the bridge. Platforms are free to have whatever constraints they need to for what constitutes a group. I posted a rough draft of patch to implement that for the base iommu driver and VT-d, adding an iommu_device_group callback on iommu ops. The iommu base driver also populates an iommu_group sysfs file for each device that's part of a group. Members of the same group return the same value via either the sysfs or iommu_device_group. The value returned is arbitrary, should not be assumed to be persistent across boots, and is left to the iommu driver to generate. There are some implementation details around how to do this without favoring one bus over another, but the interface should be bus/device type agnostic in the end. When the vfio module is loaded, character devices will be created for each group in /dev/vfio/$GROUP. Setting file permissions on these files should be sufficient for providing a user with complete access to the group. Opening this device file provides what we'll call the group fd. The group fd is restricted to only work with a single mm context. Concurrent opens will be denied if the opening process mm does not match. 
The group fd will provide interfaces for enumerating the devices in the group, returning a file descriptor for each device in the group (the device fd), binding groups together, and returning a file descriptor for iommu operations (the iommu fd). A group is viable when all member devices of the group are bound to the vfio driver. Until that point, the group fd only allows enumeration interfaces (ie. listing of group devices). I'm currently thinking enumeration will be done by a simple read() on the device file returning a list of dev_name()s.

Ok. Are you envisaging this interface as a virtual file, or as a stream? That is, can you seek around the list of devices like a regular file - in which case, what are the precise semantics when the list is changed by a bind - or is there no meaningful notion of a file pointer and read() just gives you the next device - in which case, how do you rewind to enumerate the group again?

I was implementing it as a virtual file that gets generated on read() (see the example in note[2] below). It is a bit clunky, as reading it a byte at a time could race with device add/remove; if it's read all at once, it's an accurate snapshot. Suggestions welcome - this just seemed easier than trying to stuff it into a struct for an ioctl. For a while I thought I could do a VFIO_GROUP_GET_NUM_DEVICES + VFIO_GROUP_GET_DEVICE_INDEX pair, but that assumes device stability, which I don't think we can guarantee.

Once the group is viable, the user may bind the group to another group, retrieve the iommu fd, or retrieve device fds. Internally, each of these operations will result in an iommu domain being allocated and all of the devices attached to the domain. The purpose of binding groups is to share the iommu domain. Groups making use of incompatible iommu domains will fail to bind. Groups making use of different mm's will fail to bind. The vfio driver may reject some bindings based on domain capabilities, but final veto power is left to the iommu driver[1].
If a user makes use of a group independently and later wishes to bind it to another group, all the device fds and the iommu fd must first be closed. This prevents using a stale iommu fd or accessing devices while the iommu is being switched. Operations on any group fd of a merged group are performed globally on the group (ie. enumerating the devices lists all devices in the merged group, retrieving the iommu fd from any group fd results in the same fd, and device fds from any group can be retrieved from any group fd[2]). Groups can be merged and unmerged dynamically. Unmerging a group requires that the device fds for the outgoing group be closed. The iommu fd will remain persistent for the remaining merged group.

As I've said, I prefer a persistent group model rather than this transient group model, but it's not a dealbreaker by itself. How are unmerges specified?

A VFIO_GROUP_UNMERGE ioctl taking a group fd parameter.

I'm also assuming that in this model closing a (bound) group fd will unmerge everything down to atomic groups again.

Yes, it will.
Re: RFC: vfio / device assignment -- layout of device fd files
On Mon, 2011-08-29 at 18:14 -0500, Scott Wood wrote:
On 08/29/2011 05:46 PM, Alex Williamson wrote:
On Mon, 2011-08-29 at 16:58 -0500, Scott Wood wrote:
On 08/29/2011 02:51 PM, Alex Williamson wrote:
On Mon, 2011-08-29 at 16:51 +0000, Yoder Stuart-B08248 wrote:

The device info records following the file header have the following record types, each with content encoded in a record-specific way:

REGION - describes an addressable address range for the device
DTPATH - describes the device tree path for the device
DTINDEX - describes the index into the related device tree property (reg, ranges, interrupts, interrupt-map)

I don't quite understand if these are physical or virtual.

If what are physical or virtual?

Can you give an example of a path vs an index? I don't understand enough about these to ask a useful question about what they're describing.

You'd have both path and index. Example, for this tree:

/ {
	...
	foo {
		...
		bar {
			reg = <0x1000 64 0x1800 64>;
			ranges = <0 0x20000 0x4000>;
			...
			child {
				reg = <0x100 0x100>;
				...
			};
		};
	};
};

There would be 4 regions if you bind to /foo/bar:

// this is 64 bytes at 0x1000
DTPATH /foo/bar
DTINDEX prop_type=REG prop_index=0

// this is 64 bytes at 0x1800
DTPATH /foo/bar
DTINDEX prop_type=REG prop_index=1

// this is 16K at 0x20000
DTPATH /foo/bar
DTINDEX prop_type=RANGES prop_index=0

// this is 256 bytes at 0x20100
DTPATH /foo/bar/child
DTINDEX prop_type=REG prop_index=0

Both ranges and the child reg are needed, since ranges could be a simple "ranges;" that passes everything through with no translation, and child nodes could be absent-but-implied in some other cases (such as when they represent PCI devices which can be probed -- we still need to map the ranges that correspond to PCI controller windows).

Thanks for the example. Is it always the case that you need a path and an index? If so, why are they separate sub-regions instead of combined into a DT_INFO sub-region?
INTERRUPT - describes an interrupt for the device
PCI_CONFIG_SPACE - describes config space for the device

I would have expected this to be a REGION with a property of PCI_CONFIG_SPACE.

Could be, if physical address is made optional. Or physical address is also a property, aka sub-region.

A subrecord of REGION is fine with me. Would we only need to expose phys addr for 1:1 mapping requirements? I'm not sure why we'd care to expose this otherwise.

It's more important for non-PCI, where it avoids the need for userspace to parse the device tree to find the guest address (we'll usually want 1:1), or to consolidate pages shared by multiple regions. It could be nice for debugging, as well.

So the device tree path is ripped straight from the system, so it's the actual 1:1, matching-physical-hardware path?

Yes.

Even for non-PCI we need to know if the region is pio/mmio32/mmio64/prefetchable/etc.

Outside of PCI, what standardized form would you put such information in? Where would the kernel get this information? What does mmio32/mmio64 mean in this context?

I could imagine a platform device described by ACPI that might want to differentiate. The physical device doesn't get moved of course, but guest drivers might care how the device is described if we need to rebuild those ACPI tables. ACPI might even be a good place to leverage these data structures... /me ducks.

ACPI info could be another subrecord type, but in the device tree system-bus case we generally don't have this information at the generic infrastructure level. Drivers are expected to know how their devices' regions should be mapped.

The device tree tells them how they're mapped, right? Or maybe more precisely, the device tree tells them where they're mapped, and it doesn't really matter whether they're 32-bit or 64-bit because they can't be moved. Maybe this is sub-region material. It just feels wrong to enumerate a region and not be able to include any basic properties beyond offset and size in a common field.
For PCI, we can also describe the properties via config space, so sub-regions could still be optional. BAR index could really just translate to a REGION instance number.

How would that work if you make non-BAR things (such as config space) into regions? Put their instance numbers outside of the BAR region?

We have a fixed REGION space on PCI, so we could just define BAR0 == instance 0, BAR1 == instance 1... ROM == instance 6, CONFIG == instance 0xF (or 7).

Seems more awkward than just having each region say what it is. What do you do to fill in the gaps?

You don't; instance numbers would just be non-contiguous.
KVM on IBM PowerEN processor
Hi, everyone,

This is Kun Wang from IBM Research China. My team and I have been working on the IBM PowerEN processor in recent years, including its simulation, library/runtime optimization, etc. Now we are starting the work to enable KVM on the PowerEN processor. Since the A2 core of PowerEN follows Power ISA v2.06 (more specifically, book3e and 64-bit), I believe 99% of our work will stick to the ISA, and hence can be leveraged by others.

As newcomers to the KVM world, my team and I will definitely need your help. Looking forward to talking and working with you in the future.

Best regards,
Kun Wang
Research Staff Member, Manager of Next-Generation Systems
IBM Research China
Tel: (86)10-58748491 FAX: (86)10-58748330
E-mail: wang...@cn.ibm.com
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] Add check for suspended vm in softlockup detector
A suspended VM can cause spurious soft lockup warnings. To avoid these, the watchdog now checks if the kernel knows it was stopped by the host and skips the warning if so.

Signed-off-by: Eric B Munson <emun...@mgebm.net>
---
 kernel/watchdog.c | 12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 36491cd..4cbb69f 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -25,6 +25,7 @@
 #include <linux/sysctl.h>
 
 #include <asm/irq_regs.h>
+#include <asm/pvclock.h>
 #include <linux/perf_event.h>
 
 int watchdog_enabled = 1;
@@ -292,6 +293,17 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	 */
 	duration = is_softlockup(touch_ts);
 	if (unlikely(duration)) {
+		/*
+		 * If a virtual machine is stopped by the host it can look to
+		 * the watchdog like a soft lockup, check to see if the host
+		 * stopped the vm before we issue the warning
+		 */
+		if (kvm_check_and_clear_host_stopped(get_cpu())) {
+			put_cpu();
+			return HRTIMER_RESTART;
+		}
+		put_cpu();
+
 		/* only warn once */
 		if (__this_cpu_read(soft_watchdog_warn) == true)
 			return HRTIMER_RESTART;
-- 
1.7.4.1
[PATCH 1/4] Add flag to indicate that a vm was stopped by the host
This flag will be used to check if the vm was stopped by the host when a soft lockup was detected.

Signed-off-by: Eric B Munson <emun...@mgebm.net>
---
 arch/x86/include/asm/pvclock-abi.h | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/pvclock-abi.h b/arch/x86/include/asm/pvclock-abi.h
index 35f2d19..6167fd7 100644
--- a/arch/x86/include/asm/pvclock-abi.h
+++ b/arch/x86/include/asm/pvclock-abi.h
@@ -40,5 +40,6 @@ struct pvclock_wall_clock {
 } __attribute__((__packed__));
 
 #define PVCLOCK_TSC_STABLE_BIT	(1 << 0)
+#define PVCLOCK_GUEST_STOPPED	(1 << 1)
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_X86_PVCLOCK_ABI_H */
-- 
1.7.4.1