Re: qemu help documentation
On Thu, Feb 14, 2013 at 02:22:51PM +0100, Paolo Pedaletti wrote: I have trouble to get full list of the output of qemu help inside kvm when I switch to second console CTRL-ALT-2 I can't find the full list even inside source code (apt-get source qemu-kvm) and neither inside binary file (grep blockarg qemu-*) Is it possible to redirect the output of help on an external file? Or paging it? This because (the main problem is) I'm trying to get full kernel message at boot, but inside KVM window it's not possible to scroll up ( goal: http://pedalinux.blogspot.it/2013/02/physical-to-virtual-step-by-step.html ) or to dump outside terminal output. Try Ctrl+PageUp. If that doesn't work you can put the monitor on stdio like this: $ qemu-system-x86_64 -monitor stdio ... Then you can interact from your shell and scroll back up as usual. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 11/18] KVM/MIPS32: Routines to handle specific traps/exceptions while executing the guest.
On Feb 6, 2013, at 8:20 AM, Gleb Natapov wrote: On Wed, Nov 21, 2012 at 06:34:09PM -0800, Sanjay Lal wrote: +static gpa_t kvm_trap_emul_gva_to_gpa_cb(gva_t gva) +{ +gpa_t gpa; +uint32_t kseg = KSEGX(gva); + +if ((kseg == CKSEG0) || (kseg == CKSEG1)) You seems to be using KVM_GUEST_KSEGX variants on gva in all other places. Why not here? This function is invoked to handle 2 scenarios: (1) Parse the boot code config tables setup by QEMU's Malta emulation. The pointers in the tables are actual KSEG0 addresses (unmapped, cached) and not Guest KSEG0 addresses. (2) Handle I/O accesses by the guest. On MIPS platforms, I/O device registers are mapped into the KSEG1 address space (unmapped, uncached). Again like (1) these are actual KSEG1 addresses, which cause an exception and are passed onto QEMU for I/O emulation. Regards Sanjay -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH qom-cpu-next 0/6] QOM CPUState, part 8: CPU_COMMON continued
Am 14.02.2013 18:49, schrieb Andreas Färber: Am 01.02.2013 13:38, schrieb Andreas Färber: Hello, This series moves more fields from CPU_COMMON / CPU*State to CPUState, allowing access from target-independent code. The final patch in this series will help solve some issues (in particular avoid a dependency on CPU_COMMON TLB refactoring for now) but opens a can of worms: Since it is initialized in derived instance_init functions, functions cannot randomly be changed to operate on CPUState and be called from CPUState's instance_init or they will crash due to NULL env_ptr. The questionable patch in this series has been acked by rth, so if nobody objects, I'll queue it on qom-cpu-next tonight, [...] Applied: https://github.com/afaerber/qemu-cpu/commits/qom-cpu-next Andreas For those of you that may have been following the CPU refactorings closely, I have now split off part of former qom-cpu-8 branch into qom-cpu-9. This series thereby applies directly to qom-cpu-next, whereas qom-cpu-9 depends on the pending s390x pull, my m68k cleanups and may be changed for VMState changes cooking elsewhere to keep i386 v5 compat. Available for testing at: git://github.com/afaerber/qemu-cpu.git qom-cpu-8.v1 https://github.com/afaerber/qemu-cpu/commits/qom-cpu-8.v1 Regards, Andreas Changes from previews: * Drop #ifdefs for user-only CPUState fields. * Defer interrupt-related changes to part 9. Andreas Färber (6): cpu: Move host_tid field to CPUState cpu: Move running field to CPUState cpu: Move exit_request field to CPUState cpu: Move current_tb field to CPUState cputlb: Pass CPUState to cpu_unlink_tb() cpu: Add CPUArchState pointer to CPUState cpu-exec.c | 21 - cputlb.c|6 -- dump.c |8 ++-- exec.c |6 -- gdbstub.c | 14 +- hw/apic_common.c|2 +- hw/apic_internal.h |2 +- hw/kvmvapic.c | 13 - hw/spapr_hcall.c|5 +++-- include/exec/cpu-defs.h |5 - include/exec/exec-all.h |4 +++- include/exec/gdbstub.h |5 ++--- include/qom/cpu.h | 11 +++ kvm-all.c |6 +++--- linux-user/main.c | 37 ++--- linux-user/syscall.c|4 +++- qom/cpu.c |2 ++ target-alpha/cpu.c |2 ++ target-arm/cpu.c|2 ++ target-cris/cpu.c |2 ++ target-i386/cpu.c |1 + target-i386/kvm.c |4 ++-- target-lm32/cpu.c |2 ++ target-m68k/cpu.c |2 ++ target-microblaze/cpu.c |2 ++ target-mips/cpu.c |2 ++ target-openrisc/cpu.c |2 ++ target-ppc/translate_init.c |2 ++ target-s390x/cpu.c |2 ++ target-sh4/cpu.c|2 ++ target-sparc/cpu.c |2 ++ target-unicore32/cpu.c |2 ++ target-xtensa/cpu.c |2 ++ translate-all.c | 36 +++- translate-all.h |2 +- 35 Dateien geändert, 149 Zeilen hinzugefügt(+), 73 Zeilen entfernt(-) -- SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
PIO/MMIO access from guest
Hi all, I would like to access the configuration registers of my passthrough device from an unmodified guest under KVM via MMIO/PIO directly. Is there any option to configure the guest in this way? I know that you can do this with the ioports=[''] option in case of Xen, but I could not find any such option until now for KVM. I also read that [+] PIO can be passed directly via VMCS I/O bitmaps [+] MMIO can be passed directly via mapping device BARs to guest [+] Some PIO/MMIO accesses must be trapped (PCI config space) Can I do this with some configurations without recompiling the kernel? Thank you! -gabor -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/9] virtio: new API for addition of buffers, scatterlist changes
Il 14/02/2013 10:23, Paolo Bonzini ha scritto: How about this as a first step? virtio_ring: virtqueue_add_sgs, to add multiple sgs. virtio_scsi and virtio_blk can really use these, to avoid their current hack of copying the whole sg array. Signed-off-by: Ruty Russell ru...@rustcorp.com.au It's much better than the other prototype you had posted, but I still dislike this... You pay for additional counting of scatterlists when the caller knows the number of buffers; and the nested loops aren't free, either. Another problem is that you cannot pass truncated scatterlists. You must ensure there is an end marker on the last item. I'm not sure if the kernel ensures that, given that for_each_sg takes explicitly the number of scatterlist elements; and it is not as trivial as sg_mark_end(foo + nsg - 1); if the upper layers hand you a chained scatterlist. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/18] KVM/MIPS32: MMU/TLB operations for the Guest.
On Feb 6, 2013, at 7:08 AM, Gleb Natapov wrote: +static void kvm_mips_map_page(struct kvm *kvm, gfn_t gfn) +{ +pfn_t pfn; + +if (kvm-arch.guest_pmap[gfn] != KVM_INVALID_PAGE) +return; + +pfn =kvm_mips_gfn_to_pfn(kvm, gfn); This call should be in srcu read section since it access memory slots which are srcu protected. You should test with RCU debug enabled. kvm_mips_gfn_to_pfn just maps to gfn_to_pfn. I don't see an instance where gfn_to_pfn is in a scru read section? + +uint32_t kvm_get_inst(uint32_t *opc, struct kvm_vcpu *vcpu) +{ +uint32_t inst; +struct mips_coproc *cop0 __unused = vcpu-arch.cop0; +int index; +ulong paddr, flags; + +if (KVM_GUEST_KSEGX((ulong) opc) KVM_GUEST_KSEG0 || +KVM_GUEST_KSEGX((ulong) opc) == KVM_GUEST_KSEG23) { +local_irq_save(flags); +index = kvm_mips_host_tlb_lookup(vcpu, (ulong) opc); +if (index = 0) { +inst = *(opc); Here and in some more places below you access __user memory. Shouldn't you use get_user() to access it? What prevents the kernel crash by access fault here if userspace remaps the memory to be non-readable? Hmm, may be it uses guest translation here so it cannot happen, but still, sparse will not be happy and kvm_mips_translate_guest_kseg0_to_hpa() case below uses host translation anyway. Actually, I don't need the __user declaration in most cases, since KVM/MIPS handles mapping the page (if needed) and does not rely on the usual kernel mechanisms. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 09/18] KVM/MIPS32: COP0 accesses profiling.
On Feb 6, 2013, at 8:17 AM, Gleb Natapov wrote: On Wed, Nov 21, 2012 at 06:34:07PM -0800, Sanjay Lal wrote: +int kvm_mips_dump_stats(struct kvm_vcpu *vcpu) +{ +int i, j __unused; +#ifdef CONFIG_KVM_MIPS_DEBUG_COP0_COUNTERS +printk(\nKVM VCPU[%d] COP0 Access Profile:\n, vcpu-vcpu_id); +for (i = 0; i N_MIPS_COPROC_REGS; i++) { +for (j = 0; j N_MIPS_COPROC_SEL; j++) { +if (vcpu-arch.cop0-stat[i][j]) +printk(%s[%d]: %lu\n, kvm_cop0_str[i], j, + vcpu-arch.cop0-stat[i][j]); +} +} +#endif + +return 0; +} You need to use ftrace event for that. Much more flexible with perf integration and no need to recompile to enabled/disable. -- Gleb. Agreed, I'll start using trace for keeping track of COP0 accesses. Regards Sanjay -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/18] KVM/MIPS32: MMU/TLB operations for the Guest.
On Fri, Feb 15, 2013 at 01:19:29PM -0500, Sanjay Lal wrote: On Feb 6, 2013, at 7:08 AM, Gleb Natapov wrote: +static void kvm_mips_map_page(struct kvm *kvm, gfn_t gfn) +{ + pfn_t pfn; + + if (kvm-arch.guest_pmap[gfn] != KVM_INVALID_PAGE) + return; + + pfn =kvm_mips_gfn_to_pfn(kvm, gfn); This call should be in srcu read section since it access memory slots which are srcu protected. You should test with RCU debug enabled. kvm_mips_gfn_to_pfn just maps to gfn_to_pfn. I don't see an instance where gfn_to_pfn is in a scru read section? Where are you looking? On x86 if vcpu is not in a guest mode it is in srcu read section. The lock is taken immediately after exit and released before entry. This is because x86 access memory slots a lot. Other arches, that do not access memory slots as much, take the lock around each individual access. As far as I see this is the only place in MIPS kvm where this is needed. + +uint32_t kvm_get_inst(uint32_t *opc, struct kvm_vcpu *vcpu) +{ + uint32_t inst; + struct mips_coproc *cop0 __unused = vcpu-arch.cop0; + int index; + ulong paddr, flags; + + if (KVM_GUEST_KSEGX((ulong) opc) KVM_GUEST_KSEG0 || + KVM_GUEST_KSEGX((ulong) opc) == KVM_GUEST_KSEG23) { + local_irq_save(flags); + index = kvm_mips_host_tlb_lookup(vcpu, (ulong) opc); + if (index = 0) { + inst = *(opc); Here and in some more places below you access __user memory. Shouldn't you use get_user() to access it? What prevents the kernel crash by access fault here if userspace remaps the memory to be non-readable? Hmm, may be it uses guest translation here so it cannot happen, but still, sparse will not be happy and kvm_mips_translate_guest_kseg0_to_hpa() case below uses host translation anyway. Actually, I don't need the __user declaration in most cases, since KVM/MIPS handles mapping the page (if needed) and does not rely on the usual kernel mechanisms. Yes I see that you check/populate tlb for non KVM_GUEST_KSEG0 access, for kvm_mips_translate_guest_kseg0_to_hpa() you do not, but now I notice that you are not using the address directly but uses CKSEG0ADDR() on it, which, as far as I can tell, gives you kernel mapping for the page, correct? Why this is better than using get_user()? To save tlb entries? -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] vfio: Protect vfio_dev_present against device_del
vfio_dev_present is meant to give us a wait_event callback so that we can block removing a device from vfio until it becomes unused. The root of this check depends on being able to get the iommu group from the device. Unfortunately if the BUS_NOTIFY_DEL_DEVICE notifier has fired then the device-group reference is no longer searchable and we fail the lookup. We don't need to go to such extents for this though. We have a reference to the device, from which we can acquire a reference to the group. We can then use the group reference to search for the device and properly block removal. Signed-off-by: Alex Williamson alex.william...@redhat.com --- drivers/vfio/vfio.c | 33 - 1 file changed, 12 insertions(+), 21 deletions(-) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 12c264d..8e6dcec 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -642,33 +642,16 @@ int vfio_add_group_dev(struct device *dev, } EXPORT_SYMBOL_GPL(vfio_add_group_dev); -/* Test whether a struct device is present in our tracking */ -static bool vfio_dev_present(struct device *dev) +/* Given a referenced group, check if it contains the device */ +static bool vfio_dev_present(struct vfio_group *group, struct device *dev) { - struct iommu_group *iommu_group; - struct vfio_group *group; struct vfio_device *device; - iommu_group = iommu_group_get(dev); - if (!iommu_group) - return false; - - group = vfio_group_get_from_iommu(iommu_group); - if (!group) { - iommu_group_put(iommu_group); - return false; - } - device = vfio_group_get_device(group, dev); - if (!device) { - vfio_group_put(group); - iommu_group_put(iommu_group); + if (!device) return false; - } vfio_device_put(device); - vfio_group_put(group); - iommu_group_put(iommu_group); return true; } @@ -682,10 +665,18 @@ void *vfio_del_group_dev(struct device *dev) struct iommu_group *iommu_group = group-iommu_group; void *device_data = device-device_data; + /* +* The group exists so long as we have a device reference. Get +* a group reference and use it to scan for the device going away. +*/ + vfio_group_get(group); + vfio_device_put(device); /* TODO send a signal to encourage this to be released */ - wait_event(vfio.release_q, !vfio_dev_present(dev)); + wait_event(vfio.release_q, !vfio_dev_present(group, dev)); + + vfio_group_put(group); iommu_group_put(iommu_group); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] vfio: whitelist pcieport
pcieport does nice things like manage AER and we know it doesn't do DMA or expose any user accessible devices on the host. It also keeps the Memory, I/O, and Busmaster bits enabled, which is pretty handy when trying to use anyting below it. Devices owned by pcieport cannot be given to users via vfio, but we can tolerate them not being owned by vfio-pci. Signed-off-by: Alex Williamson alex.william...@redhat.com --- drivers/vfio/vfio.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index 8e6dcec..28e2d5b 100644 --- a/drivers/vfio/vfio.c +++ b/drivers/vfio/vfio.c @@ -442,7 +442,7 @@ static struct vfio_device *vfio_group_get_device(struct vfio_group *group, * a device. It's not always practical to leave a device within a group * driverless as it could get re-bound to something unsafe. */ -static const char * const vfio_driver_whitelist[] = { pci-stub }; +static const char * const vfio_driver_whitelist[] = { pci-stub, pcieport }; static bool vfio_whitelisted_driver(struct device_driver *drv) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] vfio-pci: Manage user power state transitions
We give the user access to change the power state of the device but certain transitions result in an uninitialized state which the user cannot resolve. To fix this we need to mark the PowerState field of the PMCSR register read-only and effect the requested change on behalf of the user. This has the added benefit that pdev-current_state remains accurate while controlled by the user. The primary example of this bug is a QEMU guest doing a reboot where the device it put into D3 on shutdown and becomes unusable on the next boot because the device did a soft reset on D3-D0 (NoSoftRst-). Signed-off-by: Alex Williamson alex.william...@redhat.com --- drivers/vfio/pci/vfio_pci_config.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index f1dde2c..81567f8 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -587,12 +587,30 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm) return 0; } +static int vfio_pm_config_write(struct vfio_pci_device *vdev, int pos, + int count, struct perm_bits *perm, + int offset, __le32 val) +{ + count = vfio_default_config_write(vdev, pos, count, perm, offset, val); + if (count 0) + return count; + + if (offset == PCI_PM_CTRL) { + uint8_t state = le32_to_cpu(val) PCI_PM_CTRL_STATE_MASK; + pci_set_power_state(vdev-pdev, state); + } + + return count; +} + /* Permissions for the Power Management capability */ static int __init init_pci_cap_pm_perm(struct perm_bits *perm) { if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_PM])) return -ENOMEM; + perm-writefn = vfio_pm_config_write; + /* * We always virtualize the next field so we can remove * capabilities from the chain if we want to. @@ -600,10 +618,11 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm) p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE); /* -* Power management is defined *per function*, -* so we let the user write this +* Power management is defined *per function*, so we can let +* the user change power state, but we trap and initiate the +* change ourselves, so the state bits are read-only. */ - p_setd(perm, PCI_PM_CTRL, NO_VIRT, ALL_WRITE); + p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK); return 0; } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.9-rc1 0/3] KVM/ARM core fixes for 3.9-rc1
Here's a number of patches to cope with the churn that occured in kvm-next, and breaks the compilation in next-20130215. No big deal, just regular breakage... Tested on TC2. Marc Zyngier (3): ARM: KVM: fix kvm_arch_{prepare,commit}_memory_region ARM: KVM: Rename KVM_MEMORY_SLOTS - KVM_USER_MEM_SLOTS ARM: KVM: fix compilation after removal of user_alloc from struct kvm_memory_slot arch/arm/include/asm/kvm_host.h | 2 +- arch/arm/kvm/arm.c | 4 ++-- arch/arm/kvm/mmu.c | 5 - 3 files changed, 3 insertions(+), 8 deletions(-) -- 1.8.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.9-rc1 3/3] ARM: KVM: fix compilation after removal of user_alloc from struct kvm_memory_slot
Commit 7a905b1 (KVM: Remove user_alloc from struct kvm_memory_slot) broke KVM/ARM by removing the user_alloc field from a public structure. As we only used this field to alert the user that we didn't support this operation mode, there is no harm in discarding this bit of code without any remorse. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm/kvm/mmu.c | 5 - 1 file changed, 5 deletions(-) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index f30e131..99e07c7 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -633,11 +633,6 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run) } memslot = gfn_to_memslot(vcpu-kvm, gfn); - if (!memslot-user_alloc) { - kvm_err(non user-alloc memslots not supported\n); - ret = -EINVAL; - goto out_unlock; - } ret = user_mem_abort(vcpu, fault_ipa, gfn, memslot, fault_status); if (ret == 0) -- 1.8.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.9-rc1 2/3] ARM: KVM: Rename KVM_MEMORY_SLOTS - KVM_USER_MEM_SLOTS
Commit bbacc0c (KVM: Rename KVM_MEMORY_SLOTS - KVM_USER_MEM_SLOTS) broke KVM/ARM by changing a global #define. Apply the same change to fix the compilation breakage. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm/include/asm/kvm_host.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index dfe9886..d1736a5 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -26,7 +26,7 @@ #include asm/kvm_arch_timer.h #define KVM_MAX_VCPUS CONFIG_KVM_ARM_MAX_VCPUS -#define KVM_MEMORY_SLOTS 32 +#define KVM_USER_MEM_SLOTS 32 #define KVM_PRIVATE_MEM_SLOTS 4 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 #define KVM_HAVE_ONE_REG -- 1.8.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3.9-rc1 1/3] ARM: KVM: fix kvm_arch_{prepare,commit}_memory_region
Commit f82a8cfe9 (KVM: struct kvm_memory_slot.user_alloc - bool) broke the ARM KVM port by changing the prototype of two global functions. Apply the same change to fix the compilation breakage. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm/kvm/arm.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c index 9ada554..5a93698 100644 --- a/arch/arm/kvm/arm.c +++ b/arch/arm/kvm/arm.c @@ -232,7 +232,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, struct kvm_memory_slot *memslot, struct kvm_memory_slot old, struct kvm_userspace_memory_region *mem, - int user_alloc) + bool user_alloc) { return 0; } @@ -240,7 +240,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm, void kvm_arch_commit_memory_region(struct kvm *kvm, struct kvm_userspace_memory_region *mem, struct kvm_memory_slot old, - int user_alloc) + bool user_alloc) { } -- 1.8.1.2 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: randconfig errors
On Fri, Feb 15, 2013 at 2:22 PM, Marc Zyngier marc.zyng...@arm.com wrote: On 15/02/13 19:16, Christoffer Dall wrote: On Fri, Feb 15, 2013 at 1:47 PM, Marc Zyngier marc.zyng...@arm.com wrote: On Fri, 15 Feb 2013 19:22:37 +0100, Marc Zyngier marc.zyng...@arm.com wrote: On Fri, 15 Feb 2013 11:25:56 -0600, Rob Herring robherri...@gmail.com wrote: Errors from randconfig builds restricted to multi-platform configs on the current linux-next. I know I've seen a fix on the first one from Arnd, but have not investigated any of these. /var/lib/jenkins/jobs/linux-randconfig/workspace/include/linux/kvm_host.h:333:34: error: 'KVM_USER_MEM_SLOTS' undeclared here (not in a function) This one comes from bbacc0c (KVM: Rename KVM_MEMORY_SLOTS - KVM_USER_MEM_SLOTS). I'll queue a patch to fix the breakage... Looks like there's a few more, and what was brewing in kvm-next had some side effects. Christoffer, I'll send the patches separately. I was kind of hoping that the KVM guys would have pulled the KVM/ARM stuff into kvm/next and it could have been fixed up there The patches are really trivial though iirc. Indeed they are. Probably an oversight from the KVM guys. Sent them anyway... I saw them, it all looks good. I'm cc'ing the kvm list here just for info. -Christoffer -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Helping with KVM development
Hi, I'm a computer engineering PhD student and I have the opportunity to contribute something to KVM as a class project. I found a few interesting items on the KVM TODO list (http://www.linux-kvm.org/page/TODO) that I could do. If I choose to do this I would start immediately and have about 2 months to finish. I wouldn't choose this project if I thought I would have to rely on the help of you guys, but I think my biggest challenge will be diving in and getting started. Hopefully you can help with that. My focus is on computer architecture so the x86 topics really interested me. Here are the specific ones that I think I could do. I would start by implementing one, and if that ends up not being substantial enough for a final project, I'd do another. 1. Improve mmu page eviction algorithm (currently FIFO, change to approximate LRU). 2. On-demand register access, really, copying all registers all the time is gross. 3. Implement mmx and sse memory move instructions; useful for guests that use multimedia extensions for accessing vga (partially done) 4. Implement an operation queue for the emulator. The emulator often calls userspace to perform a read or a write, but due to inversion of control it actually restarts instead of continuing. The queue would allow it to replay all previous operations until it reaches the point it last stopped. 5. convert more instructions to direct dispatch (function pointer in decode table) 6. move init_emulate_ctxt() into x86_decode_insn() and other emulator entry points Have any of these already been implemented? It seems number 2 and possibly 1 already have been. I think this list is quite outdated, so is there something else along these lines of these updates that I could help with? Thanks, Brandon -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/12] KVM: PPC: E500: Explicitly mark shadow maps invalid
On 02/14/2013 06:16:18 PM, Alexander Graf wrote: When we invalidate shadow TLB maps on the host, we don't mark them as not valid. But we should. Fix this by removing the E500_TLB_VALID from their flags when invalidating. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/e500_tlb.c | 13 ++--- 1 files changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c index d38ad63..8efb2ac 100644 --- a/arch/powerpc/kvm/e500_tlb.c +++ b/arch/powerpc/kvm/e500_tlb.c @@ -204,9 +204,13 @@ static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 *vcpu_e500, { struct kvm_book3e_206_tlb_entry *gtlbe = get_entry(vcpu_e500, tlbsel, esel); + struct tlbe_ref *ref = vcpu_e500-gtlb_priv[tlbsel][esel].ref; - if (tlbsel == 1 - vcpu_e500-gtlb_priv[1][esel].ref.flags E500_TLB_BITMAP) { + /* Don't bother with unmapped entries */ + if (!(ref-flags E500_TLB_VALID)) + return; This is broken as pointed out here: http://patchwork.ozlabs.org/patch/220356/ ...and note that I'm still seeing problems even after that fix, which I'll try to debug today. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. +4.80 KVM_CREATE_IRQCHIP_ARGS + +Capability: KVM_CAP_IRQCHIP_ARGS +Architectures: ppc Why just ppc? +struct kvm_irq_sources { + __u32 irq; + __u32 nr_irqs; + __u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT 0 +#define KVM_IRQ_SERVER_MASK0xULL +#define KVM_IRQ_PRIORITY_SHIFT 32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE(1ULL 40) +#define KVM_IRQ_MASKED (1ULL 41) +#define KVM_IRQ_PENDING(1ULL 42) What does server mean? Do you mean laid out like this for XICS? Let's please have a clean separation between what is generic and what is implementation-specific. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] vfio-pci: Manage user power state transitions
We give the user access to change the power state of the device but certain transitions result in an uninitialized state which the user cannot resolve. To fix this we need to mark the PowerState field of the PMCSR register read-only and effect the requested change on behalf of the user. This has the added benefit that pdev-current_state remains accurate while controlled by the user. The primary example of this bug is a QEMU guest doing a reboot where the device it put into D3 on shutdown and becomes unusable on the next boot because the device did a soft reset on D3-D0 (NoSoftRst-). Signed-off-by: Alex Williamson alex.william...@redhat.com --- v2: sparse found a type error when calling pci_set_power_state. Fixed here drivers/vfio/pci/vfio_pci_config.c | 25 ++--- 1 file changed, 22 insertions(+), 3 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index f1dde2c..1fd1895 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -587,12 +587,30 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm) return 0; } +static int vfio_pm_config_write(struct vfio_pci_device *vdev, int pos, + int count, struct perm_bits *perm, + int offset, __le32 val) +{ + count = vfio_default_config_write(vdev, pos, count, perm, offset, val); + if (count 0) + return count; + + if (offset == PCI_PM_CTRL) { + pci_power_t state = le32_to_cpu(val) PCI_PM_CTRL_STATE_MASK; + pci_set_power_state(vdev-pdev, state); + } + + return count; +} + /* Permissions for the Power Management capability */ static int __init init_pci_cap_pm_perm(struct perm_bits *perm) { if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_PM])) return -ENOMEM; + perm-writefn = vfio_pm_config_write; + /* * We always virtualize the next field so we can remove * capabilities from the chain if we want to. @@ -600,10 +618,11 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm) p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE); /* -* Power management is defined *per function*, -* so we let the user write this +* Power management is defined *per function*, so we can let +* the user change power state, but we trap and initiate the +* change ourselves, so the state bits are read-only. */ - p_setd(perm, PCI_PM_CTRL, NO_VIRT, ALL_WRITE); + p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK); return 0; } -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, the only time userspace needs to use these is at initialization, when it does a SET_SOURCES to create the interrupt sources it wants the VM to have. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. +4.80 KVM_CREATE_IRQCHIP_ARGS + +Capability: KVM_CAP_IRQCHIP_ARGS +Architectures: ppc Why just ppc? Just because only PPC has code to handle it at this point. I would hope that ARM and others could pick it up. Maybe I should make it: +Architectures: all (but so far only implemented on ppc) or something. +struct kvm_irq_sources { +__u32 irq; +__u32 nr_irqs; +__u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. Hmmm, you're right. I suppose it will have to be a fixed-size buffer, which is generally less efficient, but since it's mainly for migration that probably doesn't matter. In fact, I could probably use the existing KVM_GET_IRQCHIP and KVM_SET_IRQCHIP ioctls, if I use the `chip_id' field for `irq' and the `pad' field for `nr_irqs'. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT0 +#define KVM_IRQ_SERVER_MASK 0xULL +#define KVM_IRQ_PRIORITY_SHIFT 32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL 40) +#define KVM_IRQ_MASKED (1ULL 41) +#define KVM_IRQ_PENDING (1ULL 42) What does server mean? Do you mean laid out like this for XICS? Sorry, I should have made that destination rather than server. You're right, server is confusing, but it just means where the interrupt is sent to be handled. It has nothing particularly to do with server computers. Let's please have a clean separation between what is generic and what is implementation-specific. I believe that the interface is pretty cleanly generic - the model is a set of interrupt sources and some per-vcpu state, with priorities to decide which interrupts get delivered when. That describes the basics of just about any SMP-capable interrupt controller, including MPIC. MPIC would still need an extra interface for userspace to save and restore
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. I wasn't there either. It's fine to have discussions at such events, but it should not preclude others from having input, or better ideas from being considered afterward. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. In that case, the state it describes is insufficient for MPIC. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. How do MSIs get injected? BTW, do you have any plans regarding irqfd? +struct kvm_irq_sources { + __u32 irq; + __u32 nr_irqs; + __u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. Hmmm, you're right. I suppose it will have to be a fixed-size buffer, It doesn't need to be a fixed size buffer. You can still have pointers, but they need to be represented as a plain __u64 with users casting the pointer. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT 0 +#define KVM_IRQ_SERVER_MASK 0xULL +#define KVM_IRQ_PRIORITY_SHIFT32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL 40) +#define KVM_IRQ_MASKED(1ULL 41) +#define KVM_IRQ_PENDING (1ULL 42) What does server mean? Do you mean laid out like this for XICS? Sorry, I should have made that
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote: On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. Further, the magic means that you can only have one instance of the device, whereas you might want to model the interrupt controller architecture as several devices. You could do that using several device types, but then the interconnections between them would also be magic. Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. I wasn't there either. It's fine to have discussions at such events, but it should not preclude others from having input, or better ideas from being considered afterward. Sure - I was just trying to fit in with the expressed wish of the maintainer. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. In that case, the state it describes is insufficient for MPIC. Yes, MPIC has other random stuff in it besides interrupt control, so that's not surprising. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. How is the explicit request made in this patchset? Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Yes. I am unconvinced that such an abstraction is well-advised (especially after seeing existing generic interfaces that are clearly APIC-oriented). This isn't like normal driver interfaces where we're abstracting away hardware differences to let generic code use a device. Userspace knows what kind of device it wants, and how it wants it to integrate with the rest of the emulated system. We'd have to go out of our way to apply the abstraction on *both* ends. What do we get from that other than a chance that the abstraction leaks? What significant code actually becomes common? kvm_set_irq() is just a thin wrapper around the ioctl. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. Why? The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) BTW, do you have any plans regarding irqfd? I'm going to look at that next. Likewise... We should probably coordinate our efforts so that at least the de-APICization of the code only has to get done once. What about interrupt controllers that allow multiple destinations? The destination can be an identifier for a group of vcpus, or even a bitmap -- that's why I made it 32 bits. So you can have single delivery, or be limited to 32 vcpus, or have to implement some destination ID allocation scheme (which is more state that needs to be accessible somehow). More than 256 priorities? Different levels of output (normal, critical, machine check)? Programmable vector numbers? Active high/low control? There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. (BTW, I think having more than 256 priorities would be insane - do you know of any actual example that does?) No, but hardware designers have been known to do insane things. The per-vcpu state isn't even part of this AFAICT. It's an XICS-specific ONE_REG -- which is fine, but all that's left of the generic API is the get/set sources which is an imperfect match to our per-IRQ state and it's not clear how an
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote: On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. You can have cascaded MPICs; I once had a powerbook that had one MPIC cascaded into another. How is the explicit request made in this patchset? The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Yes. I am unconvinced that such an abstraction is well-advised (especially after seeing existing generic interfaces that are clearly APIC-oriented). This isn't like normal driver interfaces where we're abstracting away hardware differences to let generic code use a device. Userspace knows what kind of device it wants, and how it wants it to integrate with the rest of the emulated system. We'd have to go out of our way to apply the abstraction on *both* ends. What do we get from that other than a chance that the abstraction leaks? What significant code actually becomes common? kvm_set_irq() is just a thin wrapper around the ioctl. I'd think there could be some code reduction in the live migration code, but I'd need a qemu hacker to chime in here. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. Why? The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... With 24 bits of interrupt source number, you don't have to flatten it very far. IPIs are handled separately and don't appear in the interrupt source space. How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? If that's so, then why bother with in-kernel emulation at all? Once you're back in userspace, it's just as fast to do the emulation there as in the kernel. There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. There's enough space for MPIC to have 16 bits of vector and some flags. We don't need to overdesign this. Yes, the names of the bitfields in the ICP state word are XICS-specific, but the concepts are pretty generic - current processor priority, identifier for interrupt awaiting service, pending IPI request priority, pending interrupt request priority. We don't have separate concepts of pending IPI request priority and pending interrupt request priority. There can be multiple Sorry, I meant pending interrupt request. You do have that, it's what you read from the interrupt acknowledge register. interrupts awaiting service (or even in service, if different priorities). We have both current task priority (which is a user-set
Re: [PATCH 02/12] KVM: PPC: E500: Explicitly mark shadow maps invalid
On 02/14/2013 06:16:18 PM, Alexander Graf wrote: When we invalidate shadow TLB maps on the host, we don't mark them as not valid. But we should. Fix this by removing the E500_TLB_VALID from their flags when invalidating. Signed-off-by: Alexander Graf ag...@suse.de --- arch/powerpc/kvm/e500_tlb.c | 13 ++--- 1 files changed, 10 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c index d38ad63..8efb2ac 100644 --- a/arch/powerpc/kvm/e500_tlb.c +++ b/arch/powerpc/kvm/e500_tlb.c @@ -204,9 +204,13 @@ static void inval_gtlbe_on_host(struct kvmppc_vcpu_e500 *vcpu_e500, { struct kvm_book3e_206_tlb_entry *gtlbe = get_entry(vcpu_e500, tlbsel, esel); + struct tlbe_ref *ref = vcpu_e500-gtlb_priv[tlbsel][esel].ref; - if (tlbsel == 1 - vcpu_e500-gtlb_priv[1][esel].ref.flags E500_TLB_BITMAP) { + /* Don't bother with unmapped entries */ + if (!(ref-flags E500_TLB_VALID)) + return; This is broken as pointed out here: http://patchwork.ozlabs.org/patch/220356/ ...and note that I'm still seeing problems even after that fix, which I'll try to debug today. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. +4.80 KVM_CREATE_IRQCHIP_ARGS + +Capability: KVM_CAP_IRQCHIP_ARGS +Architectures: ppc Why just ppc? +struct kvm_irq_sources { + __u32 irq; + __u32 nr_irqs; + __u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT 0 +#define KVM_IRQ_SERVER_MASK0xULL +#define KVM_IRQ_PRIORITY_SHIFT 32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE(1ULL 40) +#define KVM_IRQ_MASKED (1ULL 41) +#define KVM_IRQ_PENDING(1ULL 42) What does server mean? Do you mean laid out like this for XICS? Let's please have a clean separation between what is generic and what is implementation-specific. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, the only time userspace needs to use these is at initialization, when it does a SET_SOURCES to create the interrupt sources it wants the VM to have. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. +4.80 KVM_CREATE_IRQCHIP_ARGS + +Capability: KVM_CAP_IRQCHIP_ARGS +Architectures: ppc Why just ppc? Just because only PPC has code to handle it at this point. I would hope that ARM and others could pick it up. Maybe I should make it: +Architectures: all (but so far only implemented on ppc) or something. +struct kvm_irq_sources { +__u32 irq; +__u32 nr_irqs; +__u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. Hmmm, you're right. I suppose it will have to be a fixed-size buffer, which is generally less efficient, but since it's mainly for migration that probably doesn't matter. In fact, I could probably use the existing KVM_GET_IRQCHIP and KVM_SET_IRQCHIP ioctls, if I use the `chip_id' field for `irq' and the `pad' field for `nr_irqs'. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT0 +#define KVM_IRQ_SERVER_MASK 0xULL +#define KVM_IRQ_PRIORITY_SHIFT 32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL 40) +#define KVM_IRQ_MASKED (1ULL 41) +#define KVM_IRQ_PENDING (1ULL 42) What does server mean? Do you mean laid out like this for XICS? Sorry, I should have made that destination rather than server. You're right, server is confusing, but it just means where the interrupt is sent to be handled. It has nothing particularly to do with server computers. Let's please have a clean separation between what is generic and what is implementation-specific. I believe that the interface is pretty cleanly generic - the model is a set of interrupt sources and some per-vcpu state, with priorities to decide which interrupts get delivered when. That describes the basics of just about any SMP-capable interrupt controller, including MPIC. MPIC would still need an extra interface for userspace to save and restore
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. I wasn't there either. It's fine to have discussions at such events, but it should not preclude others from having input, or better ideas from being considered afterward. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. In that case, the state it describes is insufficient for MPIC. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. How do MSIs get injected? BTW, do you have any plans regarding irqfd? +struct kvm_irq_sources { + __u32 irq; + __u32 nr_irqs; + __u64 __user *irqbuf; +}; Please no pointers in UAPI -- this would require a compat wrapper with 32-bit user and 64-bit kernel. Hmmm, you're right. I suppose it will have to be a fixed-size buffer, It doesn't need to be a fixed size buffer. You can still have pointers, but they need to be represented as a plain __u64 with users casting the pointer. +/* irqbuf entries are laid out like this: */ +#define KVM_IRQ_SERVER_SHIFT 0 +#define KVM_IRQ_SERVER_MASK 0xULL +#define KVM_IRQ_PRIORITY_SHIFT32 +#define KVM_IRQ_PRIORITY_MASK 0xff +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL 40) +#define KVM_IRQ_MASKED(1ULL 41) +#define KVM_IRQ_PENDING (1ULL 42) What does server mean? Do you mean laid out like this for XICS? Sorry, I should have made that
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote: On 02/15/2013 05:18:31 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote: On 02/14/2013 06:01:08 PM, Paul Mackerras wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like KVM_CREATE_IRQCHIP in that it indicates that the virtual machine should use in-kernel interrupt controller emulation, but also takes an argument struct that contains the type of interrupt controller architecture and an optional parameter. Currently only one type value is defined, that which indicates the XICS architecture. Would the device config API I posted a couple days ago work for you? I suppose it could be made to work. It doesn't feel like a natural fit though, because your API seems to assume (AFAICT) that a device is manipulated via registers at specific physical addresses, so I would have to invent an artificial set of registers with addresses and bit layouts, that aren't otherwise required. The XICS is operated from the guest side via hcalls, not via emulated MMIO. I don't think it makes such an assumption. The MPIC device has physical registers, so it exposes them, but it also exposes things that are not physical registers (e.g. the per-IRQ input state). The generic device control layer leaves interpretation of attributes up to the device. I think it would be easier to fit XICS into the device control api model than to fit MPIC into this model, not to mention what would happen if we later want to emulate some other type of device -- x86 already has at least one non-irqchip emulated device (i8254). I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. Further, the magic means that you can only have one instance of the device, whereas you might want to model the interrupt controller architecture as several devices. You could do that using several device types, but then the interconnections between them would also be magic. Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Part of the reason I went with the API that I did is that that was what was agreed on at KVM Forum (as far as I can tell, not having been at the meeting). Your device API seems to be quite different to that. I wasn't there either. It's fine to have discussions at such events, but it should not preclude others from having input, or better ideas from being considered afterward. Sure - I was just trying to fit in with the expressed wish of the maintainer. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Two other new ioctls allow userspace to control the interrupt sources. The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority, destination cpu, level/edge sensitivity and pending state of a range of interrupt sources, creating them if they don't already exist. The KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of interrupt sources (they are required to already exist). Why is it userspace's job to control these? If you use KVM_IRQ_PENDING These are primarily there to support live migration. For live migration, userspace needs to be able to extract the entire state of the virtual machine from the old guest, and then set the new guest to that exact same state. In that case, the state it describes is insufficient for MPIC. Yes, MPIC has other random stuff in it besides interrupt control, so that's not surprising. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. How is the explicit request made in this patchset? Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Yes. I am unconvinced that such an abstraction is well-advised (especially after seeing existing generic interfaces that are clearly APIC-oriented). This isn't like normal driver interfaces where we're abstracting away hardware differences to let generic code use a device. Userspace knows what kind of device it wants, and how it wants it to integrate with the rest of the emulated system. We'd have to go out of our way to apply the abstraction on *both* ends. What do we get from that other than a chance that the abstraction leaks? What significant code actually becomes common? kvm_set_irq() is just a thin wrapper around the ioctl. We have live migration working in qemu for pSeries guests with in-kernel XICS emulation using this interface. If you're not doing live migration, We don't yet, but would prefer not to assume that it'll never happen. for interrupt injection, what if there's a race with the user changing other flags via MMIO? Maybe this isn't an issue with XICS, but this is being presented as a generic API. They're not used while the guest is running, as I said, but even if they were, there is appropriate locking in there to handle any races. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. Why? The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) BTW, do you have any plans regarding irqfd? I'm going to look at that next. Likewise... We should probably coordinate our efforts so that at least the de-APICization of the code only has to get done once. What about interrupt controllers that allow multiple destinations? The destination can be an identifier for a group of vcpus, or even a bitmap -- that's why I made it 32 bits. So you can have single delivery, or be limited to 32 vcpus, or have to implement some destination ID allocation scheme (which is more state that needs to be accessible somehow). More than 256 priorities? Different levels of output (normal, critical, machine check)? Programmable vector numbers? Active high/low control? There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. (BTW, I think having more than 256 priorities would be insane - do you know of any actual example that does?) No, but hardware designers have been known to do insane things. The per-vcpu state isn't even part of this AFAICT. It's an XICS-specific ONE_REG -- which is fine, but all that's left of the generic API is the get/set sources which is an imperfect match to our per-IRQ state and it's not clear how an
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote: On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. You can have cascaded MPICs; I once had a powerbook that had one MPIC cascaded into another. How is the explicit request made in this patchset? The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Secondly, it means that we are completely abandoning any attempt to define an abstract or generic interface to in-kernel interrupt controller emulations. Each device will have its own unique set of attribute groups and its own unique userspace code to drive it, with no commonality between them. Yes. I am unconvinced that such an abstraction is well-advised (especially after seeing existing generic interfaces that are clearly APIC-oriented). This isn't like normal driver interfaces where we're abstracting away hardware differences to let generic code use a device. Userspace knows what kind of device it wants, and how it wants it to integrate with the rest of the emulated system. We'd have to go out of our way to apply the abstraction on *both* ends. What do we get from that other than a chance that the abstraction leaks? What significant code actually becomes common? kvm_set_irq() is just a thin wrapper around the ioctl. I'd think there could be some code reduction in the live migration code, but I'd need a qemu hacker to chime in here. OK, KVM_IRQ_LINE is still used for interrupt injection. I was hoping to avoid going through a standardized interface that forces a global interrupt numberspace. Why? The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... With 24 bits of interrupt source number, you don't have to flatten it very far. IPIs are handled separately and don't appear in the interrupt source space. How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? If that's so, then why bother with in-kernel emulation at all? Once you're back in userspace, it's just as fast to do the emulation there as in the kernel. There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. There's enough space for MPIC to have 16 bits of vector and some flags. We don't need to overdesign this. Yes, the names of the bitfields in the ICP state word are XICS-specific, but the concepts are pretty generic - current processor priority, identifier for interrupt awaiting service, pending IPI request priority, pending interrupt request priority. We don't have separate concepts of pending IPI request priority and pending interrupt request priority. There can be multiple Sorry, I meant pending interrupt request. You do have that, it's what you read from the interrupt acknowledge register. interrupts awaiting service (or even in service, if different priorities). We have both current task priority (which is a user-set