[PATCH 2/2] KVM: ARM: Defer parts of the vgic init until first KVM_RUN
The vgic virtual cpu and emulated distributor interfaces must be mapped at a given physical address in the guest. This address is provided through the KVM_SET_DEVICE_ADDRESS ioctl, which is called after the KVM_CREATE_IRQCHIP ioctl but before the first VCPU is executed through KVM_RUN. We create the vgic on KVM_CREATE_IRQCHIP, but before we execute a VCPU we query kvm_vgic_ready(kvm), which checks whether the vgic.vctrl_base field has been set; if it has not, we call kvm_vgic_init, which takes care of the remaining setup. We use the IS_VGIC_ADDR_UNDEF() macro, which compares against the VGIC_ADDR_UNDEF constant, to check whether an address has been set; it is unlikely that a device will sit at address 0, but since this code is part of the main kernel boot procedure whenever this feature is enabled in the config, I am being paranoid.

The distributor and vcpu base addresses used to be a per-host setting, global for all VMs, but this is not a requirement. When we want to emulate several boards on a single host, we need the flexibility to store these guest addresses on a per-VM basis.
Signed-off-by: Christoffer Dall
---
 arch/arm/include/asm/kvm_vgic.h | 21 --
 arch/arm/kvm/arm.c              | 10 -
 arch/arm/kvm/vgic.c             | 82 +++
 3 files changed, 84 insertions(+), 29 deletions(-)

diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h
index a688132..2de167f 100644
--- a/arch/arm/include/asm/kvm_vgic.h
+++ b/arch/arm/include/asm/kvm_vgic.h
@@ -154,13 +154,14 @@ static inline void vgic_bytemap_set_irq_val(struct vgic_bytemap *x,
 struct vgic_dist {
 #ifdef CONFIG_KVM_ARM_VGIC
 	spinlock_t	lock;
+	bool		ready;
 
 	/* Virtual control interface mapping */
 	void __iomem	*vctrl_base;
 
-	/* Distributor mapping in the guest */
-	unsigned long	vgic_dist_base;
-	unsigned long	vgic_dist_size;
+	/* Distributor and vcpu interface mapping in the guest */
+	phys_addr_t	vgic_dist_base;
+	phys_addr_t	vgic_cpu_base;
 
 	/* Distributor enabled */
 	u32		enabled;
@@ -243,6 +244,7 @@ struct kvm_exit_mmio;
 #ifdef CONFIG_KVM_ARM_VGIC
 int kvm_vgic_hyp_init(void);
 int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr);
+int kvm_vgic_create(struct kvm *kvm);
 int kvm_vgic_init(struct kvm *kvm);
 void kvm_vgic_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_vgic_sync_to_cpu(struct kvm_vcpu *vcpu);
@@ -252,8 +254,9 @@ int kvm_vgic_inject_irq(struct kvm *kvm, int cpuid, unsigned int irq_num,
 int kvm_vgic_vcpu_pending_irq(struct kvm_vcpu *vcpu);
 bool vgic_handle_mmio(struct kvm_vcpu *vcpu, struct kvm_run *run,
 		      struct kvm_exit_mmio *mmio);
+bool irqchip_in_kernel(struct kvm *kvm);
 
-#define irqchip_in_kernel(k)	(!!((k)->arch.vgic.vctrl_base))
+#define vgic_initialized(k)	((k)->arch.vgic.ready)
 #define vgic_active_irq(v)	(atomic_read(&(v)->arch.vgic_cpu.irq_active_count) == 0)
 
 #else
@@ -267,6 +270,11 @@ static inline int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 add
 	return 0;
 }
 
+static inline int kvm_vgic_create(struct kvm *kvm)
+{
+	return 0;
+}
+
 static inline int kvm_vgic_init(struct kvm *kvm)
 {
 	return 0;
@@ -298,6 +306,11 @@ static inline int irqchip_in_kernel(struct kvm *kvm)
 {
 	return 0;
 }
 
+static inline bool kvm_vgic_initialized(struct kvm *kvm)
+{
+	return true;
+}
+
 static inline int vgic_active_irq(struct kvm_vcpu *vcpu)
 {
 	return 0;
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 282794e..d64783e 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -636,6 +636,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
 	if (unlikely(vcpu->arch.target < 0))
 		return -ENOEXEC;
 
+	/* Initialize the VGIC before running the vcpu */
+	if (unlikely(irqchip_in_kernel(vcpu->kvm) &&
+		     !vgic_initialized(vcpu->kvm))) {
+		ret = kvm_vgic_init(vcpu->kvm);
+		if (ret)
+			return ret;
+	}
+
 	if (run->exit_reason == KVM_EXIT_MMIO) {
 		ret = kvm_handle_mmio_return(vcpu, vcpu->run);
 		if (ret)
@@ -889,7 +897,7 @@ long kvm_arch_vm_ioctl(struct file *filp,
 #ifdef CONFIG_KVM_ARM_VGIC
 	case KVM_CREATE_IRQCHIP: {
 		if (vgic_present)
-			return kvm_vgic_init(kvm);
+			return kvm_vgic_create(kvm);
 		else
 			return -EINVAL;
 	}
diff --git a/arch/arm/kvm/vgic.c b/arch/arm/kvm/vgic.c
index d63b7f8..fa591db 100644
--- a/arch/arm/kvm/vgic.c
+++ b/arch/arm/kvm/vgic.c
@@ -65,12 +65,17 @@
  * interrupt line to be sampled again.
  */
 
-/* Temporary hacks, need to be provided by userspace emulation */
-#defin
[PATCH 1/2] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl
On ARM (and possibly other architectures) some bits are specific to the
model being emulated for the guest and user space needs a way to tell
the kernel about those bits. An example is mmio device base addresses,
where KVM must know the base address for a given device to properly
emulate mmio accesses within a certain address range or directly map a
device with virtualization extensions into the guest address space.

We try to make this API slightly more generic than for our specific use,
but so far only the VGIC uses this feature.

Signed-off-by: Christoffer Dall
---
 Documentation/virtual/kvm/api.txt | 37 +
 arch/arm/include/asm/kvm.h        | 13 +
 arch/arm/include/asm/kvm_mmu.h    |  2 ++
 arch/arm/include/asm/kvm_vgic.h   |  6 ++
 arch/arm/kvm/arm.c                | 31 ++-
 arch/arm/kvm/vgic.c               | 25 +
 include/linux/kvm.h               |  8
 7 files changed, 121 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 0aa4d83..dae4f05 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2102,6 +2102,43 @@ This ioctl returns the guest registers that are supported for the
 KVM_GET_ONE_REG/KVM_SET_ONE_REG calls.
 
 
+4.80 KVM_SET_DEVICE_ADDRESS
+
+Capability: KVM_CAP_SET_DEVICE_ADDRESS
+Architectures: arm
+Type: vm ioctl
+Parameters: struct kvm_device_address (in)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device id is unknown
+  ENXIO:  Device not supported on current system
+  EEXIST: Address already set
+  E2BIG:  Address outside guest physical address space
+
+struct kvm_device_address {
+	__u32 id;
+	__u64 addr;
+};
+
+Specify a device address in the guest's physical address space where guests
+can access emulated or directly exposed devices, which the host kernel needs
+to know about. The id field is an architecture specific identifier for a
+specific device.
+
+ARM divides the id field into two parts, a device id and an address type id
+specific to the individual device.
+
+  bits:  | 31...16 | 15...0 |
+  field: | device id | addr type id |
+
+ARM currently only requires this when using the in-kernel GIC support for the
+hardware vGIC features, using KVM_ARM_DEVICE_VGIC_V2 as the device id. When
+setting the base address for the guest's mapping of the vGIC virtual CPU
+and distributor interface, the ioctl must be called after calling
+KVM_CREATE_IRQCHIP, but before calling KVM_RUN on any of the VCPUs. Calling
+this ioctl twice for any of the base addresses will return -EEXIST.
+
+
 5. The kvm_run structure

diff --git a/arch/arm/include/asm/kvm.h b/arch/arm/include/asm/kvm.h
index fb41608..a7ae073 100644
--- a/arch/arm/include/asm/kvm.h
+++ b/arch/arm/include/asm/kvm.h
@@ -42,6 +42,19 @@ struct kvm_regs {
 #define KVM_ARM_TARGET_CORTEX_A15	0
 #define KVM_ARM_NUM_TARGETS	1
 
+/* KVM_SET_DEVICE_ADDRESS ioctl id encoding */
+#define KVM_DEVICE_TYPE_SHIFT	0
+#define KVM_DEVICE_TYPE_MASK	(0xffff << KVM_DEVICE_TYPE_SHIFT)
+#define KVM_DEVICE_ID_SHIFT	16
+#define KVM_DEVICE_ID_MASK	(0xffff << KVM_DEVICE_ID_SHIFT)
+
+/* Supported device IDs */
+#define KVM_ARM_DEVICE_VGIC_V2	0
+
+/* Supported VGIC address types */
+#define KVM_VGIC_V2_ADDR_TYPE_DIST	0
+#define KVM_VGIC_V2_ADDR_TYPE_CPU	1
+
 struct kvm_vcpu_init {
 	__u32 target;
 	__u32 features[7];
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 9bd0508..0800531 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -26,6 +26,8 @@
  * To save a bit of memory and to avoid alignment issues we assume 39-bit IPA
  * for now, but remember that the level-1 table must be aligned to its size.
  */
+#define KVM_PHYS_SHIFT	(38)
+#define KVM_PHYS_MASK	((1ULL << KVM_PHYS_SHIFT) - 1)
 #define PTRS_PER_PGD2	512
 #define PGD2_ORDER	get_order(PTRS_PER_PGD2 * sizeof(pgd_t))
 
diff --git a/arch/arm/include/asm/kvm_vgic.h b/arch/arm/include/asm/kvm_vgic.h
index 588c637..a688132 100644
--- a/arch/arm/include/asm/kvm_vgic.h
+++ b/arch/arm/include/asm/kvm_vgic.h
@@ -242,6 +242,7 @@ struct kvm_exit_mmio;
 #ifdef CONFIG_KVM_ARM_VGIC
 int kvm_vgic_hyp_init(void);
+int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr);
 int kvm_vgic_init(struct kvm *kvm);
 void kvm_vgic_vcpu_init(struct kvm_vcpu *vcpu);
 void kvm_vgic_sync_to_cpu(struct kvm_vcpu *vcpu);
@@ -261,6 +262,11 @@ static inline int kvm_vgic_hyp_init(void)
 {
 	return 0;
 }
 
+static inline int kvm_vgic_set_addr(struct kvm *kvm, unsigned long type, u64 addr)
+{
+	return 0;
+}
+
 static inline int kvm_vgic_init(struct kvm *kvm)
 {
 	return 0;
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index d552b94..282794
[PATCH 0/2] KVM: ARM: Get rid of hardcoded VGIC addresses
We need a way to specify the address at which we expect VMs to access the interrupt controller (both the emulated distributor and the hardware interface supporting virtualization). User space should decide on this address, as user space decides on an emulated board and loads a device tree describing these details directly to the guest.

We introduce a new ioctl, KVM_SET_DEVICE_ADDRESS, that lets user space provide a base address for a device based on exported device ids. For now, this is only supported for the ARM vgic. User space provides this address after creating the IRQ chip, and KVM performs the required mappings for a VM on the first execution of a VCPU.

Christoffer Dall (2):
  KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl
  KVM: ARM: Defer parts of the vgic init until first KVM_RUN

 Documentation/virtual/kvm/api.txt | 37 ++
 arch/arm/include/asm/kvm.h        | 13 +
 arch/arm/include/asm/kvm_mmu.h    |  2 +
 arch/arm/include/asm/kvm_vgic.h   | 27 --
 arch/arm/kvm/arm.c                | 41 ++-
 arch/arm/kvm/vgic.c               | 99 +
 include/linux/kvm.h               |  8 +++
 7 files changed, 201 insertions(+), 26 deletions(-)

--
1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
1.1.1 -> 1.1.2 migrate /managedsave issue
I'm using libvirt 0.10.2 and I had qemu-kvm 1.1.1 running all my VMs. I used libvirt's managedsave command to pause all the VMs and write them to disk and when I brought up the machine again I had upgraded to qemu-kvm 1.1.2 and attempted to resume the VMs from their state. It unfortunately fails. During the life of the VM I did not attempt to adjust the amount of memory it had via the balloon device unless of course libvirt did behind the scenes on me. Below is the command line invocation and the error: LC_ALL=C PATH=/bin:/sbin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/opt/bin HOME=/root USER=root QEMU_AUDIO_DRV=spice /usr/bin/qemu-kvm -name expo -S -M pc-1.0 -cpu Penryn,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm -m 1024 -smp 1,sockets=1,cores=1,threads=1 -uuid 19034754-aa3f-9671-d247-1bc53134e3f0 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/expo.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/lib/libvirt/images/expo.img,if=none,id=drive-ide0-0-0,format=raw,cache=none -device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -netdev tap,fd=23,id=hostnet0 -device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:0b:29:d9,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -spice port=5901,addr=127.0.0.1,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -incoming fd:20 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 char device redirected to /dev/pts/7 qemu: warning: error while loading state for instance 0x0 of device 
'ram'
load of migration failed

Let me know what specifics I can provide to make this easier to debug. Thanks.

--
Doug Goldstein
Re: How to do fast accesses to LAPIC TPR under kvm?
On Thursday 18 October 2012, Avi Kivity wrote:
> On 10/18/2012 11:35 AM, Gleb Natapov wrote:
> > You misunderstood the description. V_INTR_MASKING=1 means that
> > CR8 writes are not propagated to real HW APIC.
> >
> > But KVM does not trap access to CR8 unconditionally. It enables
> > CR8 intercept only when there is pending interrupt in IRR that
> > cannot be immediately delivered due to current TPR value. This
> > should eliminate 99% of CR8 intercepts.
>
> Right. You will need to expose the alternate encoding of cr8 (IIRC
> lock mov reg, cr0) on AMD via cpuid, but otherwise it should just
> work. Be aware that this will break cross-vendor migration.

I get an exception and I am not sure why:

kvm_entry: vcpu 0
kvm_exit: reason write_cr8 rip 0xd0203788 info 0 0
kvm_emulate_insn: 0:d0203788: f0 0f 22 c0 (prot32)
kvm_inj_exception: #UD (0x0)

This is qemu-kvm 1.1.2 on Linux 3.2.

When I look at arch/x86/kvm/emulate.c (both the current and the v3.2 version), I don't see any special case handling for "lock mov reg, cr0" to mean "mov reg, cr8". Before I spend lots of time on debugging my code, can you verify if the alternate encoding of cr8 is actually supported in kvm or if it is maybe missing?

Thanks in advance.

Cheers,
Stefan
Re: [PATCH] kvm, async_pf: exit idleness when handling KVM_PV_REASON_PAGE_NOT_PRESENT
On Fri, Oct 19, 2012 at 12:11:55PM -0400, Sasha Levin wrote: > KVM_PV_REASON_PAGE_NOT_PRESENT kicks cpu out of idleness, but we haven't > marked that spot as an exit from idleness. > > Not doing so can cause RCU warnings such as: > > [ 732.788386] === > [ 732.789803] [ INFO: suspicious RCU usage. ] > [ 732.790032] 3.7.0-rc1-next-20121019-sasha-2-g6d8d02d-dirty #63 > Tainted: GW > [ 732.790032] --- > [ 732.790032] include/linux/rcupdate.h:738 rcu_read_lock() used illegally > while idle! > [ 732.790032] > [ 732.790032] other info that might help us debug this: > [ 732.790032] > [ 732.790032] > [ 732.790032] RCU used illegally from idle CPU! > [ 732.790032] rcu_scheduler_active = 1, debug_locks = 1 > [ 732.790032] RCU used illegally from extended quiescent state! > [ 732.790032] 2 locks held by trinity-child31/8252: > [ 732.790032] #0: (&rq->lock){-.-.-.}, at: [] > __schedule+0x178/0x8f0 > [ 732.790032] #1: (rcu_read_lock){.+.+..}, at: [] > cpuacct_charge+0xe/0x200 > [ 732.790032] > [ 732.790032] stack backtrace: > [ 732.790032] Pid: 8252, comm: trinity-child31 Tainted: GW > 3.7.0-rc1-next-20121019-sasha-2-g6d8d02d-dirty #63 > [ 732.790032] Call Trace: > [ 732.790032] [] lockdep_rcu_suspicious+0x10b/0x120 > [ 732.790032] [] cpuacct_charge+0x90/0x200 > [ 732.790032] [] ? cpuacct_charge+0xe/0x200 > [ 732.790032] [] update_curr+0x1a3/0x270 > [ 732.790032] [] dequeue_entity+0x2a/0x210 > [ 732.790032] [] dequeue_task_fair+0x45/0x130 > [ 732.790032] [] dequeue_task+0x89/0xa0 > [ 732.790032] [] deactivate_task+0x1e/0x20 > [ 732.790032] [] __schedule+0x879/0x8f0 > [ 732.790032] [] ? trace_hardirqs_off+0xd/0x10 > [ 732.790032] [] ? kvm_async_pf_task_wait+0x1d5/0x2b0 > [ 732.790032] [] schedule+0x55/0x60 > [ 732.790032] [] kvm_async_pf_task_wait+0x1f4/0x2b0 > [ 732.790032] [] ? abort_exclusive_wait+0xb0/0xb0 > [ 732.790032] [] ? 
prepare_to_wait+0x25/0x90 > [ 732.790032] [] do_async_page_fault+0x56/0xa0 > [ 732.790032] [] async_page_fault+0x28/0x30 > > Signed-off-by: Sasha Levin Acked-by: Paul E. McKenney > --- > arch/x86/kernel/kvm.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c > index b3e5e51..4180a87 100644 > --- a/arch/x86/kernel/kvm.c > +++ b/arch/x86/kernel/kvm.c > @@ -247,7 +247,10 @@ do_async_page_fault(struct pt_regs *regs, unsigned long > error_code) > break; > case KVM_PV_REASON_PAGE_NOT_PRESENT: > /* page is swapped out by the host. */ > + rcu_irq_enter(); > + exit_idle(); > kvm_async_pf_task_wait((u32)read_cr2()); > + rcu_irq_exit(); > break; > case KVM_PV_REASON_PAGE_READY: > rcu_irq_enter(); > -- > 1.7.12.3 > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/3] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl
On Fri, Oct 19, 2012 at 4:27 PM, Christoffer Dall wrote: > On Fri, Oct 19, 2012 at 4:24 PM, Peter Maydell > wrote: >> On 19 October 2012 19:46, Christoffer Dall >> wrote: >>> On Wed, Oct 17, 2012 at 4:29 PM, Peter Maydell >>> wrote: This doesn't say whether userspace is allowed to make this ioctl multiple times for the same device. This could be any of: * undefined behaviour * second call fails with some errno * second call overrides first one >>> >>> I added an error condition EEXIST, but since this is trying to not be >>> arm-vgic specific this is really up to the individual device - maybe >>> we can have some polymorphic device that moves around later. >>> It also doesn't say that you're supposed to call this after CREATE and before INIT of the irqchip. (Nor does it say what happens if you call it at some other time.) >>> >>> same non-device specific argument as above. >> >> We could have a section in the docs that says "On ARM platforms >> there are devices X and Y and they have such-and-such properties >> and requirements" [and other devices later can have further docs >> as appropriate]. >> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 65aacc5..1380885 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2131,6 +2131,12 @@ specific to the individual device. bits: | 31...16 | 15...0 | field: | device id | addr type id | +ARM currently only require this when using the in-kernel GIC support for the +hardware vGIC features, using KVM_ARM_DEVICE_VGIC_V2 as the device id. When +setting the base address for the guest's mapping of the vGIC virtual CPU +and distributor interface, the ioctl must be called after calling +KVM_CREATE_IRQCHIP, but before calling KVM_RUN on any of the VCPUs. Calling +this ioctl twice for any of the base addresses will return -EEXIST. 5. 
The kvm_run structure
Re: [RFC PATCH 2/3] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl
On Fri, Oct 19, 2012 at 4:24 PM, Peter Maydell wrote: > On 19 October 2012 19:46, Christoffer Dall > wrote: >> On Wed, Oct 17, 2012 at 4:29 PM, Peter Maydell >> wrote: >>> This doesn't say whether userspace is allowed to make this ioctl >>> multiple times for the same device. This could be any of: >>> * undefined behaviour >>> * second call fails with some errno >>> * second call overrides first one >>> >> >> I added an error condition EEXIST, but since this is trying to not be >> arm-vgic specific this is really up to the individual device - maybe >> we can have some polymorphic device that moves around later. >> >>> It also doesn't say that you're supposed to call this after CREATE >>> and before INIT of the irqchip. (Nor does it say what happens if >>> you call it at some other time.) >>> >> >> same non-device specific argument as above. > > We could have a section in the docs that says "On ARM platforms > there are devices X and Y and they have such-and-such properties > and requirements" [and other devices later can have further docs > as appropriate]. > sure, I can add that. -Christoffer -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/3] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl
On 19 October 2012 19:46, Christoffer Dall wrote: > On Wed, Oct 17, 2012 at 4:29 PM, Peter Maydell > wrote: >> This doesn't say whether userspace is allowed to make this ioctl >> multiple times for the same device. This could be any of: >> * undefined behaviour >> * second call fails with some errno >> * second call overrides first one >> > > I added an error condition EEXIST, but since this is trying to not be > arm-vgic specific this is really up to the individual device - maybe > we can have some polymorphic device that moves around later. > >> It also doesn't say that you're supposed to call this after CREATE >> and before INIT of the irqchip. (Nor does it say what happens if >> you call it at some other time.) >> > > same non-device specific argument as above. We could have a section in the docs that says "On ARM platforms there are devices X and Y and they have such-and-such properties and requirements" [and other devices later can have further docs as appropriate]. -- PMM -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 2/3] KVM: ARM: Introduce KVM_SET_DEVICE_ADDRESS ioctl
On Wed, Oct 17, 2012 at 4:29 PM, Peter Maydell wrote: > On 14 October 2012 01:04, Christoffer Dall > wrote: >> On ARM (and possibly other architectures) some bits are specific to the >> model being emulated for the guest and user space needs a way to tell >> the kernel about those bits. An example is mmio device base addresses, >> where KVM must know the base address for a given device to properly >> emulate mmio accesses within a certain address range or directly map a >> device with virtualiation extensions into the guest address space. >> >> We try to make this API slightly more generic than for our specific use, >> but so far only the VGIC uses this feature. >> >> Signed-off-by: Christoffer Dall >> --- >> Documentation/virtual/kvm/api.txt | 30 ++ >> arch/arm/include/asm/kvm.h| 13 + >> arch/arm/include/asm/kvm_mmu.h|1 + >> arch/arm/include/asm/kvm_vgic.h |6 ++ >> arch/arm/kvm/arm.c| 31 ++- >> arch/arm/kvm/vgic.c | 34 +++--- >> include/linux/kvm.h |8 >> 7 files changed, 119 insertions(+), 4 deletions(-) >> >> diff --git a/Documentation/virtual/kvm/api.txt >> b/Documentation/virtual/kvm/api.txt >> index 26e953d..30ddcac 100644 >> --- a/Documentation/virtual/kvm/api.txt >> +++ b/Documentation/virtual/kvm/api.txt >> @@ -2118,6 +2118,36 @@ for the emulated platofrm (see >> KVM_SET_DEVICE_ADDRESS), but before the CPU is >> initally run. >> >> >> +4.80 KVM_SET_DEVICE_ADDRESS >> + >> +Capability: KVM_CAP_SET_DEVICE_ADDRESS >> +Architectures: arm >> +Type: vm ioctl >> +Parameters: struct kvm_device_address (in) >> +Returns: 0 on success, -1 on error >> +Errors: >> + ENODEV: The device id is unknwown > > "unknown" > >> + ENXIO: Device not supported in configuration > > "in this configuration" ? (I'm guessing this is for "you tried to > map a GIC when this CPU doesn't have a GIC" and similar errors?) > >> + E2BIG: Address outside of guest physical address space > > I would say "outside" rather than "outside of" here. 
> >> + >> +struct kvm_device_address { >> + __u32 id; >> + __u64 addr; >> +}; >> + >> +Specify a device address in the guest's physical address space where guests >> +can access emulated or directly exposed devices, which the host kernel needs >> +to know about. The id field is an architecture specific identifier for a >> +specific device. >> + >> +ARM divides the id field into two parts, a device ID and an address type id > > We should be consistent about whether ID is capitalised or not. > indeed >> +specific to the individual device. >> + >> + bits: | 31...16 | 15...0 | >> + field: | device id | addr type id | > > This doesn't say whether userspace is allowed to make this ioctl > multiple times for the same device. This could be any of: > * undefined behaviour > * second call fails with some errno > * second call overrides first one > I added an error condition EEXIST, but since this is trying to not be arm-vgic specific this is really up to the individual device - maybe we can have some polymorphic device that moves around later. > It also doesn't say that you're supposed to call this after CREATE > and before INIT of the irqchip. (Nor does it say what happens if > you call it at some other time.) > same non-device specific argument as above. Thanks, -Christoffer -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/3] KVM: ARM: Introduce KVM_INIT_IRQCHIP ioctl
On Thu, Oct 18, 2012 at 8:20 AM, Avi Kivity wrote: > On 10/14/2012 02:04 AM, Christoffer Dall wrote: >> Used to initialize the in-kernel interrupt controller. On ARM we need to >> map the virtual generic interrupt controller (vGIC) into Hyp the guest's >> physicall address space so the guest can access the virtual cpu >> interface. This must be done after the IRQ chips is create and after a >> base address has been provided for the emulated platform (patch is >> following), but before the CPU is initally run. >> >> >> +4.79 KVM_INIT_IRQCHIP >> + >> +Capability: KVM_CAP_INIT_IRQCHIP >> +Architectures: arm >> +Type: vm ioctl >> +Parameters: none >> +Returns: 0 on success, -1 on error >> + >> +Initialize the in-kernel interrupt controller. On ARM we need to map the >> +virtual generic interrupt controller (vGIC) into Hyp the guest's physicall >> +address space so the guest can access the virtual cpu interface. This must >> be >> +done after the IRQ chips is create and after a base address has been >> provided >> +for the emulated platofrm (see KVM_SET_DEVICE_ADDRESS), but before the CPU >> is >> +initally run. >> + > > What enforces this? > > Can it be done automatically? issue a > kvm_make_request(KVM_REQ_INIT_IRQCHIP) on vcpu creation, and you'll > automatically be notified before the first guest entry. > > Having an ioctl that must be called after point A but before point B > seems pointless, when A and B are both known. > I reworked this according to your comments, patches on the way. thanks for the input. >> + >> 5. The kvm_run structure >> >> >> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c >> index f8c377b..85c76e4 100644 >> --- a/arch/arm/kvm/arm.c >> +++ b/arch/arm/kvm/arm.c >> @@ -195,6 +195,7 @@ int kvm_dev_ioctl_check_extension(long ext) >> switch (ext) { >> #ifdef CONFIG_KVM_ARM_VGIC >> case KVM_CAP_IRQCHIP: >> + case KVM_CAP_INIT_IRQCHIP: > > This could be part of a baseline, if you don't envision ever taking it out. 
> -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v3 06/19] Implement "-dimm" command line option
On Thu, Oct 18, 2012 at 12:33 PM, Avi Kivity wrote: > On 10/18/2012 11:27 AM, Vasilis Liaskovitis wrote: >> On Wed, Oct 17, 2012 at 12:03:51PM +0200, Avi Kivity wrote: >>> On 10/17/2012 11:19 AM, Vasilis Liaskovitis wrote: >>> >> >>> >> I don't think so, but probably there's a limit of DIMMs that real >>> >> controllers have, something like 8 max. >>> > >>> > In the case of i440fx specifically, do you mean that we should model the >>> > DRB >>> > (Dram row boundary registers in section 3.2.19 of the i440fx spec) ? >>> > >>> > The i440fx DRB registers only supports up to 8 DRAM rows (let's say 1 row >>> > maps 1-1 to a DimmDevice for this discussion) and only supports up to 2GB >>> > of >>> > memory afaict (bit 31 and above is ignored). >>> > >>> > I 'd rather not model this part of the i440fx - having only 8 DIMMs seems >>> > too >>> > restrictive. The rest of the patchset supports up to 255 DIMMs so it >>> > would be a >>> > waste imho to model an old pc memory controller that only supports 8 >>> > DIMMs. >>> > >>> > There was also an old discussion about i440fx modeling here: >>> > https://lists.nongnu.org/archive/html/qemu-devel/2011-07/msg02705.html >>> > the general direction was that i440fx is too old and we don't want to >>> > precisely >>> > emulate the DRB registers, since they lack flexibility. >>> > >>> > Possible solutions: >>> > >>> > 1) is there a newer and more flexible chipset that we could model? >>> >>> Look for q35 on this list. >> >> thanks, I 'll take a look. It sounds like the other options below are more >> straightforward now, but let me know if you prefer q35 integration as a >> priority. > > At least validate that what you're doing fits with how q35 works. > >>> >>> > >>> > We could for example model: >>> > - an 8-bit non-cumulative register for each DIMM, denoting how many >>> > 128MB chunks it contains. This allowes 32GB for each DIMM, and with 255 >>> > DIMMs we >>> > describe a bit less than 8TB. These registers require 255 bytes. 
>>> > - a 16-bit cumulative register for each DIMM again for 128MB chunks. This >>> > allows >>> > us to describe 8TB of memory (but the registers take up double the space, >>> > because >>> > they describe cumulative memory amounts) >>> >>> There is no reason to save space. Why not have two 64-bit registers per >>> DIMM, one describing the size and the other the base address, both in >>> bytes? Use a few low order bits for control. >> >> Do we want this generic scheme above to be tied into the i440fx/pc machine? > > Yes. q35 should work according to its own specifications. > >> Or have it as a separate generic memory bus / pmc usable by others (e.g. in >> hw/dimm.c)? >> The 64-bit values you describe are already part of DimmDevice properties, but >> they are not hardware registers described as part of a chipset. >> >> In terms of control bits, did you want to mimic some other chipset >> registers? - >> any examples would be useful. > > I don't have any real requirements. Just make it simple and easily > accessible to ACPI code. > >> >>> >>> > >>> > 3) let everything be handled/abstracted by dimmbus - the chipset DRB >>> > modelling >>> > is not done (at least for i440fx, other machines could). This is the >>> > least precise >>> > in terms of emulation. On the other hand, if we are not really trying to >>> > emulate >>> > the real (too restrictive) hardware, does it matter? >>> >>> We could emulate base memory using the chipset, and extra memory using >>> the scheme above. This allows guests that are tied to the chipset to >>> work, and guests that have more awareness (seabios) to use the extra >>> features. >> >> But if we use the real i440fx pmc DRBs for base memory, this means base >> memory >> would be <= 2GB, right? >> >> Sounds like we 'd need to change the DRBs anyway to describe useful amounts >> of >> base memory (e.g. 
512MB chunks and check against address lines [36:29] can >> describe base memory up to 64GB, though that's still limiting for very large >> VMs). But we'd be diverting from the real hardware again. > > Then there's no point. Modelling real hardware allows guests written to > work against that hardware to function correctly. If you diverge, they > won't. The guest is also unlikely to want to reprogram the memory controller. > >> >> Then we can model base memory with "tweaked" i440fx pmc's DRB registers - we >> could only use DRB[0] (one DIMM describing all of base memory) or more. >> >> DIMMs would be allowed to be hotplugged in the generic mem-controller scheme >> only >> (unless it makes sense to allow hotplug in the remaining pmc DRBs and >> start using the generic scheme once we run out of emulated DRBs) >> > > 440fx seems a lost cause, so we can go wild and just implement pv dimms. Maybe. But what would be a PV DIMM? Do we need any DIMM-like granularity at all, instead the guest could be told to use a list of RAM regions with arbitrary start and end addresses? Isn't ballooning also related? >
[PATCH] kvm, async_pf: exit idleness when handling KVM_PV_REASON_PAGE_NOT_PRESENT
KVM_PV_REASON_PAGE_NOT_PRESENT kicks cpu out of idleness, but we haven't marked that spot as an exit from idleness. Not doing so can cause RCU warnings such as:

[  732.788386] ===
[  732.789803] [ INFO: suspicious RCU usage. ]
[  732.790032] 3.7.0-rc1-next-20121019-sasha-2-g6d8d02d-dirty #63 Tainted: G W
[  732.790032] ---
[  732.790032] include/linux/rcupdate.h:738 rcu_read_lock() used illegally while idle!
[  732.790032]
[  732.790032] other info that might help us debug this:
[  732.790032]
[  732.790032]
[  732.790032] RCU used illegally from idle CPU!
[  732.790032] rcu_scheduler_active = 1, debug_locks = 1
[  732.790032] RCU used illegally from extended quiescent state!
[  732.790032] 2 locks held by trinity-child31/8252:
[  732.790032]  #0:  (&rq->lock){-.-.-.}, at: [] __schedule+0x178/0x8f0
[  732.790032]  #1:  (rcu_read_lock){.+.+..}, at: [] cpuacct_charge+0xe/0x200
[  732.790032]
[  732.790032] stack backtrace:
[  732.790032] Pid: 8252, comm: trinity-child31 Tainted: G W 3.7.0-rc1-next-20121019-sasha-2-g6d8d02d-dirty #63
[  732.790032] Call Trace:
[  732.790032]  [] lockdep_rcu_suspicious+0x10b/0x120
[  732.790032]  [] cpuacct_charge+0x90/0x200
[  732.790032]  [] ? cpuacct_charge+0xe/0x200
[  732.790032]  [] update_curr+0x1a3/0x270
[  732.790032]  [] dequeue_entity+0x2a/0x210
[  732.790032]  [] dequeue_task_fair+0x45/0x130
[  732.790032]  [] dequeue_task+0x89/0xa0
[  732.790032]  [] deactivate_task+0x1e/0x20
[  732.790032]  [] __schedule+0x879/0x8f0
[  732.790032]  [] ? trace_hardirqs_off+0xd/0x10
[  732.790032]  [] ? kvm_async_pf_task_wait+0x1d5/0x2b0
[  732.790032]  [] schedule+0x55/0x60
[  732.790032]  [] kvm_async_pf_task_wait+0x1f4/0x2b0
[  732.790032]  [] ? abort_exclusive_wait+0xb0/0xb0
[  732.790032]  [] ? prepare_to_wait+0x25/0x90
[  732.790032]  [] do_async_page_fault+0x56/0xa0
[  732.790032]  [] async_page_fault+0x28/0x30

Signed-off-by: Sasha Levin
---
 arch/x86/kernel/kvm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index b3e5e51..4180a87 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -247,7 +247,10 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
 		break;
 	case KVM_PV_REASON_PAGE_NOT_PRESENT:
 		/* page is swapped out by the host. */
+		rcu_irq_enter();
+		exit_idle();
 		kvm_async_pf_task_wait((u32)read_cr2());
+		rcu_irq_exit();
 		break;
 	case KVM_PV_REASON_PAGE_READY:
 		rcu_irq_enter();
--
1.7.12.3
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] KVM_VCPU_GET_REG_LIST API
On Fri, Oct 19, 2012 at 2:19 AM, Rusty Russell wrote: > Rusty Russell writes: >> Avi Kivity writes: >>> On 09/05/2012 10:58 AM, Rusty Russell wrote: This is the generic part of the KVM_SET_ONE_REG/KVM_GET_ONE_REG enhancements which ARM wants, rebased onto kvm/next. >>> >>> This was stalled for so long it needs rebasing again, sorry. >>> >>> But otherwise I'm happy to apply. >> >> Ok, will rebase and re-test against kvm-next. > > Wait, what? kvm/arm isn't in kvm-next? > > This will produce a needless clash with that, which is more important > than this cleanup. I'll rebase this as soon as that is merged. > > Christoffer, is there anything I can help with? > There are some worries about duplicating functionality on the ARM side of things. Specifically there are worries about the instruction decoding for the mmio instructions. My cycles are unfortunately too limited to change this right now and I'm also not sure I agree things will turn out nicer by unifying all decoding into a large complicated space ship, but it would be great if you could take a look. This discussion seems to be a good place to start: https://lists.cs.columbia.edu/pipermail/kvmarm/2012-September/003447.html Thanks! -Christoffer
Re: KVM on NFS
MySQL might not perform well even on a local disk inside the VM, and it is definitely not a good idea to run it on an NFS filesystem. Startup will not be a problem; once the data grows, it will become one. Cache as much as possible; that will save you from a lot of problems. Banyan He Blog: http://www.rootong.com Email: ban...@rootong.com On 2012-10-17 6:46 PM, Avi Kivity wrote: On 10/17/2012 11:20 AM, Andrew Holway wrote: Hello, I am testing KVM on an Oracle NFS box that I have. Does the list have any advice on best practice? I remember reading that there is stuff you can do with I/O schedulers and stuff to make it more efficient. My VMs will primarily be running mysql databases. I am currently using o_direct. O_DIRECT is good. I/O schedulers don't affect NFS so no need to tune anything on the host. You might experiment with switching to the deadline scheduler in the guest.
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote: > On 10/15/2012 08:04 PM, Andrew Theurer wrote: > > On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote: > >> On 10/11/2012 01:06 AM, Andrew Theurer wrote: > >>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote: > On 10/10/2012 08:29 AM, Andrew Theurer wrote: > > On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote: > >> * Avi Kivity [2012-10-04 17:00:28]: > >> > >>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote: > On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote: > > > >> [...] > > A big concern I have (if this is 1x overcommit) for ebizzy is that it > > has just terrible scalability to begin with. I do not think we should > > try to optimize such a bad workload. > > > > I think my way of running dbench has some flaw, so I went to ebizzy. > Could you let me know how you generally run dbench? > >>> > >>> I mount a tmpfs and then specify that mount for dbench to run on. This > >>> eliminates all IO. I use a 300 second run time and number of threads is > >>> equal to number of vcpus. All of the VMs of course need to have a > >>> synchronized start. > >>> > >>> I would also make sure you are using a recent kernel for dbench, where > >>> the dcache scalability is much improved. Without any lock-holder > >>> preemption, the time in spin_lock should be very low: > >>> > >>> > 21.54% 78016 dbench [kernel.kallsyms] [k] > copy_user_generic_unrolled > 3.51% 12723 dbench libc-2.12.so[.] > __strchr_sse42 > 2.81% 10176 dbench dbench [.] child_run > 2.54% 9203 dbench [kernel.kallsyms] [k] > _raw_spin_lock > 2.33% 8423 dbench dbench [.] > next_token > 2.02% 7335 dbench [kernel.kallsyms] [k] > __d_lookup_rcu > 1.89% 6850 dbench libc-2.12.so[.] > __strstr_sse42 > 1.53% 5537 dbench libc-2.12.so[.] > __memset_sse2 > 1.47% 5337 dbench [kernel.kallsyms] [k] > link_path_walk > 1.40% 5084 dbench [kernel.kallsyms] [k] > kmem_cache_alloc > 1.38% 5009 dbench libc-2.12.so[.] 
memmove > 1.24% 4496 dbench libc-2.12.so[.] vfprintf > 1.15% 4169 dbench [kernel.kallsyms] [k] > __audit_syscall_exit > >>> > >> > >> Hi Andrew, > >> I ran the test with dbench with tmpfs. I do not see any improvements in > >> dbench for 16k ple window. > >> > >> So it seems apart from ebizzy no workload benefited by that. and I > >> agree that, it may not be good to optimize for ebizzy. > >> I shall drop changing to 16k default window and continue with other > >> original patch series. Need to experiment with latest kernel. > > > > Thanks for running this again. I do believe there are some workloads, > > when run at 1x overcommit, would benefit from a larger ple_window [with > > he current ple handling code], but I do not also want to potentially > > degrade >1x with a larger window. I do, however, think there may be a > > another option. I have not fully worked this out, but I think I am on > > to something. > > > > I decided to revert back to just a yield() instead of a yield_to(). My > > motivation was that yield_to() [for large VMs] is like a dog chasing its > > tail, round and round we go Just yield(), in particular a yield() > > which results in yielding to something -other- than the current VM's > > vcpus, helps synchronize the execution of sibling vcpus by deferring > > them until the lock holder vcpu is running again. The more we can do to > > get all vcpus running at the same time, the far less we deal with the > > preemption problem. The other benefit is that yield() is far, far lower > > overhead than yield_to() > > > > This does assume that vcpus from same VM do not share same runqueues. > > Yielding to a sibling vcpu with yield() is not productive for larger VMs > > in the same way that yield_to() is not. My recent results include > > restricting vcpu placement so that sibling vcpus do not get to run on > > the same runqueue. 
I do believe we could implement a initial placement > > and load balance policy to strive for this restriction (making it purely > > optional, but I bet could also help user apps which use spin locks). > > > > For 1x VMs which still vm_exit due to PLE, I believe we could probably > > just leave the ple_window alone, as long as we mostly use yield() > > instead of yield_to(). The problem with the unneeded exits in this case > > has been the overhead in routines leading up to yield_to() and the > > yield_to() itself. If we use yield() most of the time, this overhead > > will go away. > >
Re: I/O errors in guest OS after repeated migration
Am Donnerstag, 18. Oktober 2012, 18:05:39 schrieb Avi Kivity: > On 10/18/2012 05:50 PM, Guido Winkelmann wrote: > > Am Mittwoch, 17. Oktober 2012, 13:25:45 schrieb Brian Jackson: > >> On Wednesday, October 17, 2012 10:45:14 AM Guido Winkelmann wrote: > >> > vda1, logical block 1858771 > >> > Oct 17 17:12:04 localhost kernel: [ 212.070600] Buffer I/O error on > >> > device > >> > vda1, logical block 1858772 > >> > Oct 17 17:12:04 localhost kernel: [ 212.070602] Buffer I/O error on > >> > device > >> > vda1, logical block 1858773 > >> > Oct 17 17:12:04 localhost kernel: [ 212.070605] Buffer I/O error on > >> > device > >> > vda1, logical block 1858774 > >> > Oct 17 17:12:04 localhost kernel: [ 212.070607] Buffer I/O error on > >> > device > >> > vda1, logical block 1858775 > >> > Oct 17 17:12:04 localhost kernel: [ 212.070610] Buffer I/O error on > >> > device > >> > vda1, logical block 1858776 > >> > Oct 17 17:12:04 localhost kernel: [ 212.070612] Buffer I/O error on > >> > device > >> > vda1, logical block 1858777 > >> > Oct 17 17:12:04 localhost kernel: [ 212.070615] Buffer I/O error on > >> > device > >> > vda1, logical block 1858778 > >> > Oct 17 17:12:04 localhost kernel: [ 212.070617] Buffer I/O error on > >> > device > >> > vda1, logical block 1858779 > >> > > >> > (I was writing a large file at the time, to make sure I actually catch > >> > I/O > >> > errors as they happen) > >> > >> What about newer versions of qemu/kvm? But of course if those work, your > >> next task is going to be git bisect it or file a bug with your distro > >> that > >> is using an ancient version of qemu/kvm. > > > > I've just upgraded both hosts to qemu-kvm 1.2.0 > > (qemu-1.2.0-14.fc17.x86_64, > > built from spec files under http://pkgs.fedoraproject.org/cgit/qemu.git/). > > > > The bug is still there. > > If you let the guest go idle (no I/O), then migrate it, then restart the > I/O, do the errors show? Just tested - yes, they do. 
Guido
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On 10/15/2012 08:04 PM, Andrew Theurer wrote: On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote: On 10/11/2012 01:06 AM, Andrew Theurer wrote: On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote: On 10/10/2012 08:29 AM, Andrew Theurer wrote: On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote: * Avi Kivity [2012-10-04 17:00:28]: On 10/04/2012 03:07 PM, Peter Zijlstra wrote: On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote: [...] A big concern I have (if this is 1x overcommit) for ebizzy is that it has just terrible scalability to begin with. I do not think we should try to optimize such a bad workload. I think my way of running dbench has some flaw, so I went to ebizzy. Could you let me know how you generally run dbench? I mount a tmpfs and then specify that mount for dbench to run on. This eliminates all IO. I use a 300 second run time and number of threads is equal to number of vcpus. All of the VMs of course need to have a synchronized start. I would also make sure you are using a recent kernel for dbench, where the dcache scalability is much improved. Without any lock-holder preemption, the time in spin_lock should be very low: 21.54% 78016 dbench [kernel.kallsyms] [k] copy_user_generic_unrolled 3.51% 12723 dbench libc-2.12.so[.] __strchr_sse42 2.81% 10176 dbench dbench [.] child_run 2.54% 9203 dbench [kernel.kallsyms] [k] _raw_spin_lock 2.33% 8423 dbench dbench [.] next_token 2.02% 7335 dbench [kernel.kallsyms] [k] __d_lookup_rcu 1.89% 6850 dbench libc-2.12.so[.] __strstr_sse42 1.53% 5537 dbench libc-2.12.so[.] __memset_sse2 1.47% 5337 dbench [kernel.kallsyms] [k] link_path_walk 1.40% 5084 dbench [kernel.kallsyms] [k] kmem_cache_alloc 1.38% 5009 dbench libc-2.12.so[.] memmove 1.24% 4496 dbench libc-2.12.so[.] vfprintf 1.15% 4169 dbench [kernel.kallsyms] [k] __audit_syscall_exit Hi Andrew, I ran the test with dbench with tmpfs. I do not see any improvements in dbench for 16k ple window. 
So it seems apart from ebizzy no workload benefited by that, and I agree that it may not be good to optimize for ebizzy. I shall drop changing to the 16k default window and continue with the other original patch series. Need to experiment with the latest kernel. Thanks for running this again. I do believe there are some workloads that, when run at 1x overcommit, would benefit from a larger ple_window [with the current ple handling code], but I do not also want to potentially degrade >1x with a larger window. I do, however, think there may be another option. I have not fully worked this out, but I think I am on to something. I decided to revert back to just a yield() instead of a yield_to(). My motivation was that yield_to() [for large VMs] is like a dog chasing its tail, round and round we go. Just yield(), in particular a yield() which results in yielding to something -other- than the current VM's vcpus, helps synchronize the execution of sibling vcpus by deferring them until the lock holder vcpu is running again. The more we can do to get all vcpus running at the same time, the far less we deal with the preemption problem. The other benefit is that yield() is far, far lower overhead than yield_to(). This does assume that vcpus from the same VM do not share runqueues. Yielding to a sibling vcpu with yield() is not productive for larger VMs in the same way that yield_to() is not. My recent results include restricting vcpu placement so that sibling vcpus do not get to run on the same runqueue. I do believe we could implement an initial placement and load balance policy to strive for this restriction (making it purely optional, but I bet it could also help user apps which use spin locks). For 1x VMs which still vm_exit due to PLE, I believe we could probably just leave the ple_window alone, as long as we mostly use yield() instead of yield_to(). The problem with the unneeded exits in this case has been the overhead in routines leading up to yield_to() and the yield_to() itself.
If we use yield() most of the time, this overhead will go away. Here is a comparison of yield_to() and yield(), dbench with 20-way VMs, 8 of them on an 80-way host:

 no PLE                  426 +/- 11.03%
 no PLE w/ gangsched   32001 +/-   .37%
 PLE with yield()      29207 +/-   .28%
 PLE with yield_to()    8175 +/-  1.37%

Yield() is far and away better than yield_to() here and almost approaches the gang sched result. Here is a link for the perf sched map bitmap: https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU The thrashing is way down and sibling vcpus tend to run together, approximating the behavior of the gang scheduling with
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On 10/18/2012 06:09 PM, Avi Kivity wrote: On 10/09/2012 08:51 PM, Raghavendra K T wrote: Here is the summary: We do get good benefit by increasing the ple window. Though we don't see good benefit for kernbench and sysbench, for ebizzy we get a huge improvement for the 1x scenario (almost 2/3rd of the ple disabled case). Let me know if you think we can increase the default ple_window itself to 16k. I think so, there is no point running with untuned defaults. Okay. I can respin the whole series including this default ple_window change. It can come as a separate patch. Yes. Will spin it separately. I also have the perf kvm top result for both ebizzy and kernbench. I think they are in expected lines now.

Improvements, 16 core PLE machine with 16 vcpu guest:

base              = 3.6.0-rc5 + ple handler optimization patches
base_pleopt_16k   = base + ple_window = 16k
base_pleopt_32k   = base + ple_window = 32k
base_pleopt_nople = base + ple_gap = 0

kernbench, hackbench, sysbench (time in sec, lower is better); ebizzy (rec/sec, higher is better)

% improvements w.r.t. base (ple_window = 4k):

              | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople
 kernbench_1x |         0.42371 |         1.15164 |           0.09320
 kernbench_2x |        -1.40981 |       -17.48282 |        -570.77053
 sysbench_1x  |        -0.92367 |         0.24241 |          -0.27027
 sysbench_2x  |        -2.22706 |        -0.30896 |          -1.27573
 sysbench_3x  |        -0.75509 |         0.09444 |          -2.97756
 ebizzy_1x    |        54.99976 |        67.29460 |          74.14076
 ebizzy_2x    |        -8.83386 |       -27.38403 |         -96.22066

So it seems we want dynamic PLE windows. As soon as we enter overcommit we need to decrease the window. Okay. I have some rough idea on the implementation. I'll try that after these V2 experiments are over.
So in brief, I have this in my queue, priority-wise:

1) V2 version of this patch series (in progress)
2) default PLE window
3) preemption notifiers
4) PV spinlock
[PATCH] emulator test: add "rep ins" mmio access test
Add a test to trigger the bug where "rep ins" causes vcpu->mmio_fragments to overflow while moving large data from an ioport to MMIO.

Signed-off-by: Xiao Guangrong
---
 x86/emulator.c | 14 ++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/x86/emulator.c b/x86/emulator.c
index 24b33d1..0735405 100644
--- a/x86/emulator.c
+++ b/x86/emulator.c
@@ -731,6 +731,18 @@ static void test_crosspage_mmio(volatile uint8_t *mem)
 	report("cross-page mmio write", mem[4095] == 0xaa && mem[4096] == 0x88);
 }
 
+static void test_string_io_mmio(volatile uint8_t *mem)
+{
+	/* Cross MMIO pages. */
+	volatile uint8_t *mmio = mem + 4032;
+
+	asm volatile("outw %%ax, %%dx \n\t" : : "a"(0x), "d"(TESTDEV_IO_PORT));
+
+	asm volatile("cld; rep insb" : : "d"(TESTDEV_IO_PORT), "D"(mmio), "c"(1024));
+
+	report("string_io_mmio", mmio[1023] == 0x99);
+}
+
 static void test_lgdt_lidt(volatile uint8_t *mem)
 {
 	struct descriptor_table_ptr orig, fresh = {};
@@ -878,6 +890,8 @@ int main()
 
 	test_crosspage_mmio(mem);
 
+	test_string_io_mmio(mem);
+
 	printf("\nSUMMARY: %d tests, %d failures\n", tests, fails);
 	return fails ? 1 : 0;
 }
--
1.7.7.6
[PATCH] KVM: x86: fix vcpu->mmio_fragments overflow
After commit b3356bf0dbb349 (KVM: emulator: optimize "rep ins" handling), the pieces of io data can be collected and written to the guest memory or MMIO together. Unfortunately, kvm splits the mmio access into 8-byte pieces and stores them in vcpu->mmio_fragments. If the guest uses "rep ins" to move large data, it will cause vcpu->mmio_fragments to overflow.

The bug can be exposed by isapc (-M isapc):

[23154.818733] general protection fault: [#1] SMP DEBUG_PAGEALLOC
[ ..]
[23154.858083] Call Trace:
[23154.859874]  [] kvm_get_cr8+0x1d/0x28 [kvm]
[23154.861677]  [] kvm_arch_vcpu_ioctl_run+0xcda/0xe45 [kvm]
[23154.863604]  [] ? kvm_arch_vcpu_load+0x17b/0x180 [kvm]

Actually, we can use one mmio_fragment to store a large mmio access, since the mmio access is always contiguous, and split it when we pass the mmio-exit-info to userspace. After that, we only need two entries to store mmio info for an access that crosses mmio pages.

Signed-off-by: Xiao Guangrong
---
 arch/x86/kvm/x86.c       | 127 +-
 include/linux/kvm_host.h |  16 +-
 2 files changed, 84 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8b90dd5..41ceb51 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3779,9 +3779,6 @@ static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
 static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
 			   void *val, int bytes)
 {
-	struct kvm_mmio_fragment *frag = &vcpu->mmio_fragments[0];
-
-	memcpy(vcpu->run->mmio.data, frag->data, frag->len);
 	return X86EMUL_CONTINUE;
 }
 
@@ -3799,6 +3796,64 @@ static const struct read_write_emulator_ops write_emultor = {
 	.write = true,
 };
 
+static bool get_current_mmio_info(struct kvm_vcpu *vcpu, gpa_t *gpa,
+				  unsigned *len, void **data)
+{
+	struct kvm_mmio_fragment *frag;
+	int cur = vcpu->mmio_cur_fragment;
+
+	if (cur >= vcpu->mmio_nr_fragments)
+		return false;
+
+	frag = &vcpu->mmio_fragments[cur];
+	if (frag->pos >= frag->len) {
+		if (++vcpu->mmio_cur_fragment >= vcpu->mmio_nr_fragments)
+			return false;
+		frag++;
+	}
+
+	*gpa = frag->gpa + frag->pos;
+	*data = frag->data + frag->pos;
+	*len = min(8u, frag->len - frag->pos);
+	return true;
+}
+
+static void complete_current_mmio(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmio_fragment *frag;
+	gpa_t gpa;
+	unsigned len;
+	void *data;
+
+	get_current_mmio_info(vcpu, &gpa, &len, &data);
+
+	if (!vcpu->mmio_is_write)
+		memcpy(data, vcpu->run->mmio.data, len);
+
+	/* Increase frag->pos to switch to the next mmio. */
+	frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment];
+	frag->pos += len;
+}
+
+static bool vcpu_fill_mmio_exit_info(struct kvm_vcpu *vcpu)
+{
+	gpa_t gpa;
+	unsigned len;
+	void *data;
+
+	if (!get_current_mmio_info(vcpu, &gpa, &len, &data))
+		return false;
+
+	vcpu->run->mmio.len = len;
+	vcpu->run->mmio.is_write = vcpu->mmio_is_write;
+	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+	vcpu->run->mmio.phys_addr = gpa;
+
+	if (vcpu->mmio_is_write)
+		memcpy(vcpu->run->mmio.data, data, len);
+	return true;
+}
+
 static int emulator_read_write_onepage(unsigned long addr, void *val,
 				       unsigned int bytes,
 				       struct x86_exception *exception,
@@ -3834,18 +3889,12 @@ mmio:
 	bytes -= handled;
 	val += handled;
 
-	while (bytes) {
-		unsigned now = min(bytes, 8U);
-
-		frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
-		frag->gpa = gpa;
-		frag->data = val;
-		frag->len = now;
-
-		gpa += now;
-		val += now;
-		bytes -= now;
-	}
+	WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
+	frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
+	frag->pos = 0;
+	frag->gpa = gpa;
+	frag->data = val;
+	frag->len = bytes;
 
 	return X86EMUL_CONTINUE;
 }
@@ -3855,7 +3904,6 @@ int emulator_read_write(struct x86_emulate_ctxt *ctxt, unsigned long addr,
 			const struct read_write_emulator_ops *ops)
 {
 	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
-	gpa_t gpa;
 	int rc;
 
 	if (ops->read_write_prepare &&
@@ -3887,17 +3935,13 @@ int emulator_read_write(struct x86_emulate_ctxt *ctxt, unsigned long addr,
 	if (!vcpu->mmio_nr_fragments)
 		return rc;
 
-	gpa = vcpu->mmio_fragments[0].gpa;
-
 	vcpu->mmio_needed = 1;
 	vcpu->mmio_cur_fragment = 0;
+	vcpu->mmio_is_write = ops->write;
 
-	vcpu->run->mmio.len = vcpu->mmio_fragments[0].len;
-
RE: Shared IRQ with PCI Passthrough?
> To: kvm@vger.kernel.org > From: msch...@gmx.eu > Subject: Re: Shared IRQ with PCI Passthrough? > Date: Thu, 18 Oct 2012 20:09:56 + > > Jan Kiszka siemens.com> writes: > > > > > On 2012-10-15 11:07, Marco wrote: > > > Jan Kiszka siemens.com> writes: > > > > > > > > > > > >> > > >> Nope, there is no IRQ sharing support for assigned devices in any public > > >> version so far. I'm on it, but some issues remain to be solved. > > >> > > >> Jan > > >> > > > > > > > > > Hi, any news on this? I own an Intel DQ67OW that has the same issue. No > > > PCI > > > passthrough possible with KVM when USB is active. > > We encountered severe problems with the DQ67OW; it proved all but impossible to pass thru USB, as you have to pass both PCI bridges through also, in which case you get booting problems with actual PCI cards being bounced from host to guest and back several times. We worked around this(our ultimate kernel was pretty unstable in any case and IOMMU crashed it) using the KVM ehci script file to load USB 2 emulation. We don't use passthru at all on the current iteration. However, while our kernel is the main reason, the DQ67OW board has awkward IRQ's, particularly in 64 bit, which would have nixed IOMMU in any case for us. > > Supported by qemu-kvm-1.2 and Linux >= 3.4. But not all devices play > > well with it, so your mileage may vary. > > > > Jan > > > > > Unfortunately I had no luck trying on Ubuntu Quantal (qemu-kvm 1.2 and Linux > 3.5). Exactly the same error message than before: > > Failed to assign irq for "hostdev0": Input/output error > Perhaps you are assigning a device that shares an IRQ with another device? 
>
> Marco
Hardware for KVM host
Hey all, at the moment I am looking for hardware for a KVM host, but on a few points I really have problems putting together a good setup. My situation is the following: I need a server with minimal energy consumption, the ability to do PCI passthrough (so an IOMMU is needed) and good support for nested virtualization (KVM, Xen and VMware).

So if I understood everything right, I need a CPU and a chipset on the board which support an IOMMU. Which consumer boards do, and which of them work fine with KVM?

For the nested virtualization theme I read that EPT in the case of Intel and RVI in the case of AMD is needed. I also read that AMD's RVI would be better than Intel's EPT and bring a good speedup; Wikipedia mentions a VMware research whitepaper and Red Hat tests (http://en.wikipedia.org/wiki/Rapid_Virtualization_Indexing). It also seems that the implementation of nested virtualization is easier with AMD's RVI than with Intel's EPT. So which CPU with which chipset (board) should I prefer to get these features?

For the RAM I am thinking about 32GB, because my experience shows that virtualization needs a lot of RAM. Would that be a good value when the host runs a few VMs with simple server services (HTTP, SMTP etc.)?

I want to run VMs with a lot of I/O, so I'm thinking about the storage setup. I want a few TB of space, for example 4x 3TB hard disks in RAID5. All virtualization solutions (Xen, VMware) recommend hardware RAID. I would also prefer hardware RAID, but would software RAID decrease the I/O performance very much? And would it make a difference to use software RAID with mdadm (for example on an ext4 filesystem) or native RAID support like in btrfs? Which modern hardware RAID controllers are fully supported by Linux, and are there recommendations for a particular interface (SATA, SAS etc.)?

I don't have much experience with KVM-over-IP management, but I want to start using it.
Are there special conditions which must be fulfilled by the board or operating system? And does someone here have experience with KVM-over-IP management and can recommend adapters or cards?

About the price I am targeting: it would be great to keep the cost of the host system between 800 and 1400€, excluding the KVM-over-IP console. If it has to be more, that is OK as long as it stays below 2500€.

Best Regards