Re: [RFC PATCH v2 1/6] kvm: add device control API
On 04/02/2013 06:47 AM, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it. If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
>
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device. Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc. It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
>
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
>
> Signed-off-by: Scott Wood <scottw...@freescale.com>
> ---
>  Documentation/virtual/kvm/api.txt        | 70 ++
>  Documentation/virtual/kvm/devices/README |  1 +
>  arch/powerpc/include/asm/kvm_host.h      |  6 +++
>  arch/powerpc/include/asm/kvm_ppc.h       |  2 +
>  arch/powerpc/kvm/powerpc.c               |  7 +++
>  include/uapi/linux/kvm.h                 | 27
>  virt/kvm/kvm_main.c                      | 31 +
>  7 files changed, 144 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/devices/README
>
> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 976eb65..77328aa 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@
>  header; first `n_valid' valid entries with contents from the data
>  written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +  ENOSPC: Too many devices have been created
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Creates an emulated device in the kernel. The file descriptor returned
> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags. Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO: The group or attribute is unknown/unsupported for this device
> +  EPERM: The attribute cannot (currently) be accessed this way
> +         (e.g. read-only attribute, or attribute that only makes
> +         sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state. The
> +semantics are device-specific. See individual device documentation in
> +the devices directory. As with ONE_REG, the size of the data
> +transferred is defined by the particular attribute.
> +
> +struct kvm_device_attr {
> +	__u32	flags;	/* no flags currently defined */
> +	__u32	group;	/* device-defined */
> +	__u64	attr;	/* group-defined */
> +	__u64	addr;	/* userspace address of attr data */
> +};
> +
> +4.81 KVM_HAS_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO: The group or attribute is unknown/unsupported for this device
> +
> +Tests whether a device supports a particular attribute. A successful
> +return indicates the attribute is implemented. It does not necessarily
> +indicate that the attribute can be read or written in the device's
> +current state. addr is ignored.
>
>  4.77 KVM_ARM_VCPU_INIT
>
> diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
> new file mode 100644
> index 000..34a6983
> --- /dev/null
> +++ b/Documentation/virtual/kvm/devices/README
> @@ -0,0 +1 @@
> +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index e34f8fe..e0caae2 100644
> ---
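For readers who want to see how the three ioctls described in the quoted documentation fit together from userspace, here is a minimal sketch in C. It assumes the interface as exposed by `<linux/kvm.h>` on kernels that carry this API. The helper names `kvm_device_type_supported` and `kvm_device_set_u64` are made up for illustration, and treating the attribute payload as a single 64-bit value is an assumption, since the documentation says the size is defined by each attribute.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Ask whether a device type is supported without creating the device.
 * Returns 0 if supported, -1 with errno set (e.g. ENODEV) otherwise.
 * Hypothetical helper, not part of any KVM library.
 */
int kvm_device_type_supported(int vm_fd, uint32_t type)
{
	struct kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;
	cd.flags = KVM_CREATE_DEVICE_TEST;	/* test only, nothing is created */
	return ioctl(vm_fd, KVM_CREATE_DEVICE, &cd);
}

/*
 * Set one attribute that is assumed (for this sketch only) to be 64 bits
 * wide. dev_fd is the fd returned in kvm_create_device.fd by a real
 * KVM_CREATE_DEVICE call.
 */
int kvm_device_set_u64(int dev_fd, uint32_t group, uint64_t attr, uint64_t value)
{
	struct kvm_device_attr da;

	memset(&da, 0, sizeof(da));
	da.group = group;
	da.attr = attr;
	da.addr = (uint64_t)(uintptr_t)&value;	/* userspace address of the data */

	/* ENXIO here means the group/attribute is not implemented. */
	if (ioctl(dev_fd, KVM_HAS_DEVICE_ATTR, &da) < 0)
		return -1;
	return ioctl(dev_fd, KVM_SET_DEVICE_ATTR, &da);
}
```

Probing with KVM_HAS_DEVICE_ATTR before the set is optional; it only separates "not implemented" (ENXIO) from "implemented but not accessible in the current state" (EPERM on the set).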
Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Thursday, March 28, 2013 10:06 PM To: Bhushan Bharat-R65777 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421; Bhushan Bharat-R65777 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 21.03.2013, at 07:25, Bharat Bhushan wrote: From: Bharat Bhushan bharat.bhus...@freescale.com This patch adds the debug stub support on booke/bookehv. Now QEMU debug stub can use hw breakpoint, watchpoint and software breakpoint to debug guest. Debug registers are saved/restored on vcpu_put()/vcpu_get(). Also the debug registers are saved restored only if guest is using debug resources. Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com --- v2: - save/restore in vcpu_get()/vcpu_put() - some more minor cleanup based on review comments. arch/powerpc/include/asm/kvm_host.h | 10 ++ arch/powerpc/include/uapi/asm/kvm.h | 22 +++- arch/powerpc/kvm/booke.c| 252 --- arch/powerpc/kvm/e500_emulate.c | 10 ++ 4 files changed, 272 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index f4ba881..8571952 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -504,7 +504,17 @@ struct kvm_vcpu_arch { u32 mmucfg; u32 epr; u32 crit_save; + /* guest debug registers*/ struct kvmppc_booke_debug_reg dbg_reg; + /* shadow debug registers */ + struct kvmppc_booke_debug_reg shadow_dbg_reg; + /* host debug registers*/ + struct kvmppc_booke_debug_reg host_dbg_reg; + /* +* Flag indicating that debug registers are used by guest +* and requires save restore. 
+ */ + bool debug_save_restore; #endif gpa_t paddr_accessed; gva_t vaddr_accessed; diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 15f9a00..d7ce449 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -25,6 +25,7 @@ /* Select powerpc specific features in linux/kvm.h */ #define __KVM_HAVE_SPAPR_TCE #define __KVM_HAVE_PPC_SMT +#define __KVM_HAVE_GUEST_DEBUG struct kvm_regs { __u64 pc; @@ -267,7 +268,24 @@ struct kvm_fpu { __u64 fpr[32]; }; +/* + * Defines for h/w breakpoint, watchpoint (read, write or both) and + * software breakpoint. + * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status + * for KVM_DEBUG_EXIT. + */ +#define KVMPPC_DEBUG_NONE 0x0 +#define KVMPPC_DEBUG_BREAKPOINT(1UL 1) +#define KVMPPC_DEBUG_WATCH_WRITE (1UL 2) +#define KVMPPC_DEBUG_WATCH_READ(1UL 3) struct kvm_debug_exit_arch { + __u64 address; + /* +* exiting to userspace because of h/w breakpoint, watchpoint +* (read, write or both) and software breakpoint. +*/ + __u32 status; + __u32 reserved; }; /* for KVM_SET_GUEST_DEBUG */ @@ -279,10 +297,6 @@ struct kvm_guest_debug_arch { * Type denotes h/w breakpoint, read watchpoint, write * watchpoint or watchpoint (both read and write). 
*/ -#define KVMPPC_DEBUG_NOTYPE0x0 -#define KVMPPC_DEBUG_BREAKPOINT(1UL 1) -#define KVMPPC_DEBUG_WATCH_WRITE (1UL 2) -#define KVMPPC_DEBUG_WATCH_READ(1UL 3) __u32 type; __u32 reserved; } bp[16]; diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 1de93a8..bf20056 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu) #endif } +static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu) { + /* Synchronize guest's desire to get debug interrupts into shadow +MSR */ #ifndef CONFIG_KVM_BOOKE_HV + vcpu-arch.shadow_msr = ~MSR_DE; + vcpu-arch.shadow_msr |= vcpu-arch.shared-msr MSR_DE; #endif + + /* Force enable debug interrupts when user space wants to debug */ + if (vcpu-guest_debug) { +#ifdef CONFIG_KVM_BOOKE_HV + /* +* Since there is no shadow MSR, sync MSR_DE into the guest +* visible MSR. Do not allow guest to change MSR[DE]. +*/ + vcpu-arch.shared-msr |= MSR_DE; + mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP); This mtspr should really just be a bit or in shadow_mspr when guest_debug gets enabled. It should automatically get synchronized as soon as the next vpcu_load() happens. I think this is not required here as shadow_dbsr already have MSRP_DEP set. Will setup shadow_msrp when setting guest_debug and clear shadow_msrp when guest_debug is cleared. But that will also not be sufficient as it not sure when vcpu_load() will be called after the shadow_msrp is changed. So
Re: [PATCH 2/4 v2] KVM: PPC: debug stub interface parameter defined
On 29.03.2013, at 04:08, Bhushan Bharat-R65777 wrote:

> -----Original Message-----
> From: Alexander Graf [mailto:ag...@suse.de]
> Sent: Friday, March 29, 2013 7:26 AM
> To: Bhushan Bharat-R65777
> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421; Bhushan Bharat-R65777
> Subject: Re: [PATCH 2/4 v2] KVM: PPC: debug stub interface parameter defined
>
>> On 21.03.2013, at 07:24, Bharat Bhushan wrote:
>>
>>> From: Bharat Bhushan <bharat.bhus...@freescale.com>
>>>
>>> This patch defines the interface parameter for KVM_SET_GUEST_DEBUG
>>> ioctl support. Follow up patches will use this for setting up
>>> hardware breakpoints, watchpoints and software breakpoints.
>>>
>>> Also kvm_arch_vcpu_ioctl_set_guest_debug() is brought one level below.
>>> This is because I am not sure what is required for book3s. So this
>>> ioctl behaviour will not change for book3s.
>>>
>>> Signed-off-by: Bharat Bhushan <bharat.bhus...@freescale.com>
>>> ---
>>> v2:
>>>  - No Change
>>>
>>>  arch/powerpc/include/uapi/asm/kvm.h | 23 +++
>>>  arch/powerpc/kvm/book3s.c           |  6 ++
>>>  arch/powerpc/kvm/booke.c            |  6 ++
>>>  arch/powerpc/kvm/powerpc.c          |  6 --
>>>  4 files changed, 35 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>> index c2ff99c..15f9a00 100644
>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>> @@ -272,8 +272,31 @@ struct kvm_debug_exit_arch {
>>>
>>>  /* for KVM_SET_GUEST_DEBUG */
>>>  struct kvm_guest_debug_arch {
>>> +	struct {
>>> +		/* H/W breakpoint/watchpoint address */
>>> +		__u64	addr;
>>> +		/*
>>> +		 * Type denotes h/w breakpoint, read watchpoint, write
>>> +		 * watchpoint or watchpoint (both read and write).
>>> +		 */
>>> +#define KVMPPC_DEBUG_NOTYPE		0x0
>>> +#define KVMPPC_DEBUG_BREAKPOINT		(1UL << 1)
>>> +#define KVMPPC_DEBUG_WATCH_WRITE	(1UL << 2)
>>> +#define KVMPPC_DEBUG_WATCH_READ		(1UL << 3)
>>
>> Are you sure you want to introduce these here, just to remove them
>> again in a later patch?
>
> Up to this patch the scope was limited to this structure. So for
> clarity I defined here and later the scope expands so moved out of
> this structure.
>
> I do not think this really matters, let me know how you want to see ?

Well, at least I want to see the names be identical between the patches ;).

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM: kvm_set_slave_cpu: Invalid argument when trying direct interrupt delivery
Hi Tomoki,

I tried your smart patches for cpu isolation and direct interrupt delivery,
http://article.gmane.org/gmane.linux.kernel/1353803
and got this output when I ran qemu:

kvm_set_slave_cpu: Invalid argument

So I wonder:
* Did I misuse your patches?
* How is the offlined CPU assigned? Or will the guest OS automatically
  detect and use it?

Details of my trial:
- based on v3.6-rc4 and qemu-kvm-1.0, as you commented
- booted the kernel with intel_iommu=on
  BOOT_IMAGE=(hd0,1)/boot/vmlinuz-3.6.0-rc4+ root=/dev/sda1 rhgb quiet selinux=0 intel_iommu=on
- the offlined cpu
  # cat /sys/devices/system/cpu/offline
  23
- qemu command line
  qemu-kvm -enable-kvm -m 1024 -cpu qemu64,+x2apic -no-kvm-pit -serial pty -nographic -drive file=/mnt/sdb/vmtest/testfc.qcow2,if=virtio,index=0,format=qcow2 -spice port=12000,addr=186.100.8.171,disable-ticketing,plaintext-channel=main,plaintext-channel=playback,plaintext-channel=record,image-compression=auto_glz

Thanks,
Yang Minqiang
KVM guest memory mapping
Hi list,

I've just started doing some research into VM memory allocation, and
I've got a few questions about how KVM performs memory translations
from guest to host, using Intel-VT extensions. My questions relate to
the implementation of Intel EPTs.

I've put a few printk statements into the KVM source, specifically
mmu.c, to try to follow what is happening within the VM and hypervisor.
However, I'm a little bit lost at what I'm seeing.

The very first virtual memory access from within my guest triggers a
'handle_ept_violation'. This is to be expected as it's the very first,
and no pages will have been allocated as of yet. The value taken from
the guest's CR2 register is: 0xfff0 (which I am assuming to be a guest
physical address).

Upon this ept violation occurring, the function tdp_page_fault is
called, which then in turn calls __direct_map. I'm a little confused
about exactly what __direct_map is actually doing.

The input to __direct_map is:

gpa_t v: fff0
gfn_t gfn: f
pfn_t pfn: 35b649
level: 1

Firstly, I'm confused as to why the gpa_t type variable is called 'v'.
This would indicate to me that it's a virtual address, however it is
being stored as a guest physical type. Could anyone explain why this is
named as such?

After this I can see a lot of different memory addresses being passed
around the system, but I'd still like to better understand how KVM
allocates and finally translates guest addresses into host physical
addresses. If anyone could help explain how __direct_map functions, I
would appreciate it.

Thanks
Tony
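As background to the question above, the relationship between a guest physical address, its guest frame number, and the table index used at each paging level can be sketched in C. This is a minimal illustration that assumes x86-64 with 4 KiB pages and 4-level tables with 9 index bits per level; it is not the code of __direct_map, and all `demo_` names and constants are hypothetical.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT	12	/* assumption: 4 KiB pages */
#define DEMO_LEVEL_BITS	9	/* assumption: 512 entries per table */

/* Guest frame number: the guest physical address with the page offset dropped. */
uint64_t demo_gpa_to_gfn(uint64_t gpa)
{
	return gpa >> DEMO_PAGE_SHIFT;
}

/* Index into the table at the given level (1 = leaf table, 4 = root table). */
unsigned int demo_table_index(uint64_t gpa, int level)
{
	unsigned int shift = DEMO_PAGE_SHIFT + DEMO_LEVEL_BITS * (level - 1);

	return (unsigned int)((gpa >> shift) & ((1u << DEMO_LEVEL_BITS) - 1));
}
```

With an arbitrary example address of 0xfffffff0, `demo_gpa_to_gfn` yields 0xfffff, and the per-level indexes from level 4 down to level 1 are 0, 3, 0x1ff and 0x1ff. A walk that installs a mapping at level 1 would visit one table per level using these indexes and store the host page frame in the final entry.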
RE: [PATCH] ARM: EXYNOS5440: DTS: Add virtual GIC DT bindings
Giridhar Maruthy wrote:
> Exynos5440 has a GIC with virtualization support.
> This is used by KVM.
>
> Signed-off-by: Giridhar Maruthy <giridha...@samsung.com>
> ---
>  arch/arm/boot/dts/exynos5440.dtsi | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/boot/dts/exynos5440.dtsi b/arch/arm/boot/dts/exynos5440.dtsi
> index c374a31..25c6134 100644
> --- a/arch/arm/boot/dts/exynos5440.dtsi
> +++ b/arch/arm/boot/dts/exynos5440.dtsi
> @@ -26,7 +26,11 @@
>  		compatible = "arm,cortex-a15-gic";
>  		#interrupt-cells = <3>;
>  		interrupt-controller;
> -		reg = <0x2E1000 0x1000>, <0x2E2000 0x1000>;
> +		reg = <0x2E1000 0x1000>,
> +		      <0x2E2000 0x1000>,
> +		      <0x2E4000 0x2000>,
> +		      <0x2E6000 0x2000>;
> +		interrupts = <1 9 0xf04>;
>  	};
>
>  	cpus {
> --
> 1.7.9.5

Looks ok to me, applied.

Thanks.

- Kukjin
RE: [PATCH 3/3] ARM: EXYNOS5250: Register architected timers
Alexander Graf wrote:
> When running on an exynos 5250 SoC, we don't initialize the architected
> timers. The chip however supports architected timers.

Yes, exynos5250 supports them, though the MCT (multi core timer) is what is used.

> When we don't initialize them, KVM will try to access them and run into
> NULL pointer dereferences attempting to do so.

Yes, right.

> This patch is really more of a hack than a real fix, but does get me
> working with KVM on Arndale.

Hmm, if you think this is _really_ a hack, you need to add some comments
about that for clarity, and since the mct.c file has been moved into
drivers/clocksource/, this should be re-worked.

BTW, I discussed this with Thomas and Giridhar just now, and we concluded
that this 3rd patch could be dropped: the correct way is to add a dts
node for the arch timer, which the 2nd patch already does, and since
3.9-rc1 that is sufficient because of the CLOCKSOURCE_OF_DECLARE macro.

So if you're OK with the above, let me know so that I can take only the
1st and 2nd patches to support KVM on exynos5250.

Thanks.

- Kukjin

> Signed-off-by: Alexander Graf <ag...@suse.de>
> ---
>  arch/arm/mach-exynos/mct.c | 4
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/arm/mach-exynos/mct.c b/arch/arm/mach-exynos/mct.c
> index c9d6650..eefb8af 100644
> --- a/arch/arm/mach-exynos/mct.c
> +++ b/arch/arm/mach-exynos/mct.c
> @@ -482,4 +482,8 @@ void __init exynos4_timer_init(void)
>  	exynos4_timer_resources();
>  	exynos4_clocksource_init();
>  	exynos4_clockevent_init();
> +
> +	if (soc_is_exynos5250()) {
> +		arch_timer_of_register();
> +	}
>  }
> --
> 1.7.10.4
Re: [PATCH 3/3] ARM: EXYNOS5250: Register architected timers
On 04/02/2013 12:44 PM, Kukjin Kim wrote:
> Alexander Graf wrote:
>> When running on an exynos 5250 SoC, we don't initialize the architected
>> timers. The chip however supports architected timers.
>
> Yes, exynos5250 supports them, though the MCT (multi core timer) is what is used.
>
>> When we don't initialize them, KVM will try to access them and run into
>> NULL pointer dereferences attempting to do so.
>
> Yes, right.
>
>> This patch is really more of a hack than a real fix, but does get me
>> working with KVM on Arndale.
>
> Hmm, if you think this is _really_ a hack, you need to add some comments
> about that for clarity, and since the mct.c file has been moved into
> drivers/clocksource/, this should be re-worked.
>
> BTW, I discussed this with Thomas and Giridhar just now, and we concluded
> that this 3rd patch could be dropped: the correct way is to add a dts
> node for the arch timer, which the 2nd patch already does, and since
> 3.9-rc1 that is sufficient because of the CLOCKSOURCE_OF_DECLARE macro.
>
> So if you're OK with the above, let me know so that I can take only the
> 1st and 2nd patches to support KVM on exynos5250.

I'd say go ahead and take them and I'll verify whether things work on your tree :).
What's the git repo of your branch?

Alex
Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM
On Mon, Mar 25, 2013 at 02:14:20PM -0700, Kevin Hilman wrote:
> Gleb Natapov <g...@redhat.com> writes:
>
>> On Sun, Mar 24, 2013 at 02:44:26PM +0100, Frederic Weisbecker wrote:
>>> 2013/3/21 Gleb Natapov <g...@redhat.com>:
>>>> Isn't it simpler for kernel/context_tracking.c to define empty
>>>> __guest_enter()/__guest_exit() if !CONFIG_KVM.
>>>
>>> That doesn't look right. Off-cases are usually handled from the
>>> headers, right? So that we avoid ifdeffery ugliness in core code.
>>
>> Lets put it in linux/context_tracking.h header then.
>
> Here's a version to do that.
>
> Frederic, are you OK with this version?
>
> Kevin
>
> From d9d909394479dd7ff90b7bddb95a564945406719 Mon Sep 17 00:00:00 2001
> From: Kevin Hilman <khil...@linaro.org>
> Date: Mon, 25 Mar 2013 14:12:41 -0700
> Subject: [PATCH v2] context_tracking: fix !CONFIG_KVM compile: add stub guest
>  enter/exit
>
> When KVM is not enabled, or not available on a platform, the KVM
> headers should not be included. Instead, just define stub
> __guest_[enter|exit] functions.
>
> Cc: Frederic Weisbecker <fweis...@gmail.com>
> Signed-off-by: Kevin Hilman <khil...@linaro.org>
> ---
>  include/linux/context_tracking.h | 7 +++
>  kernel/context_tracking.c        | 1 -
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
> index 365f4a6..9d0f242 100644
> --- a/include/linux/context_tracking.h
> +++ b/include/linux/context_tracking.h
> @@ -3,6 +3,13 @@
>
>  #include <linux/sched.h>
>  #include <linux/percpu.h>
> +#if IS_ENABLED(CONFIG_KVM)
> +#include <linux/kvm_host.h>
> +#else
> +#define __guest_enter()
> +#define __guest_exit()
> +#endif
> +
>  #include <asm/ptrace.h>
>
>  struct context_tracking {
> diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
> index 65349f0..85bdde1 100644
> --- a/kernel/context_tracking.c
> +++ b/kernel/context_tracking.c
> @@ -15,7 +15,6 @@
>   */
>
>  #include <linux/context_tracking.h>
> -#include <linux/kvm_host.h>
>  #include <linux/rcupdate.h>
>  #include <linux/sched.h>
>  #include <linux/hardirq.h>
> --
> 1.8.2

--
			Gleb.
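The "handle the off-case in the header" idea discussed in this thread can be shown in isolation. The quoted patch uses empty macros; the sketch below swaps in static inline no-op functions, a common alternative that keeps the hooks as real functions the compiler still checks. Everything here is a stand-alone illustration: `DEMO_CONFIG_KVM` and the `demo_` names are hypothetical and do not exist in the kernel.

```c
/* Leave DEMO_CONFIG_KVM undefined to build the stub variant. */
#ifdef DEMO_CONFIG_KVM
void demo_guest_enter(void);	/* real implementations live elsewhere */
void demo_guest_exit(void);
#else
/* No-op stubs: callers compile unchanged when the feature is off. */
static inline void demo_guest_enter(void) { }
static inline void demo_guest_exit(void) { }
#endif

static int demo_transitions;

/* Core code calls the hooks unconditionally, with no #ifdef at the call site. */
int demo_run_guest_once(void)
{
	demo_guest_enter();
	demo_transitions++;
	demo_guest_exit();
	return demo_transitions;
}
```

The point of either variant is the same: the conditional lives in one header, and the .c file that calls the hooks stays free of configuration checks.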
Re: [PATCH] KVM: Call kvm_apic_match_dest() to check destination vcpu
On Mon, Apr 01, 2013 at 12:42:33AM +, Zhang, Yang Z wrote: Zhang, Yang Z wrote on 2013-03-21: From: Yang Zhang yang.z.zh...@intel.com For a given vcpu, kvm_apic_match_dest() will tell you whether the vcpu in the destination list quickly. Drop kvm_calculate_eoi_exitmap() and use kvm_apic_match_dest() instead. Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c | 47 --- arch/x86/kvm/lapic.h |4 virt/kvm/ioapic.c|9 - 3 files changed, 4 insertions(+), 56 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index a8e9369..e227474 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -145,53 +145,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic) return (kvm_apic_get_reg(apic, APIC_ID) 24) 0xff; } -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu, - struct kvm_lapic_irq *irq, - u64 *eoi_exit_bitmap) -{ - struct kvm_lapic **dst; - struct kvm_apic_map *map; - unsigned long bitmap = 1; - int i; - - rcu_read_lock(); - map = rcu_dereference(vcpu-kvm-arch.apic_map); - - if (unlikely(!map)) { - __set_bit(irq-vector, (unsigned long *)eoi_exit_bitmap); - goto out; - } - - if (irq-dest_mode == 0) { /* physical mode */ - if (irq-delivery_mode == APIC_DM_LOWEST || - irq-dest_id == 0xff) { - __set_bit(irq-vector, - (unsigned long *)eoi_exit_bitmap); - goto out; - } - dst = map-phys_map[irq-dest_id 0xff]; - } else { - u32 mda = irq-dest_id (32 - map-ldr_bits); - - dst = map-logical_map[apic_cluster_id(map, mda)]; - - bitmap = apic_logical_id(map, mda); - } - - for_each_set_bit(i, bitmap, 16) { - if (!dst[i]) - continue; - if (dst[i]-vcpu == vcpu) { - __set_bit(irq-vector, - (unsigned long *)eoi_exit_bitmap); - break; - } - } - -out: - rcu_read_unlock(); -} - static void recalculate_apic_map(struct kvm *kvm) { struct kvm_apic_map *new, *old = NULL; diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 2c721b9..baa20cf 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -160,10 +160,6 @@ static inline u16 
apic_logical_id(struct kvm_apic_map *map, u32 ldr) return ldr map-lid_mask; } -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu, - struct kvm_lapic_irq *irq, - u64 *eoi_bitmap); - static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu) { return vcpu-arch.apic-pending_events; diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index ce82b94..b54ddfa 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -132,11 +132,10 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu *vcpu, (e-fields.trig_mode == IOAPIC_LEVEL_TRIG || kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC, index))) { - irqe.dest_id = e-fields.dest_id; - irqe.vector = e-fields.vector; - irqe.dest_mode = e-fields.dest_mode; - irqe.delivery_mode = e-fields.delivery_mode 8; - kvm_calculate_eoi_exitmap(vcpu, irqe, eoi_exit_bitmap); + if (kvm_apic_match_dest(vcpu, NULL, 0, + e-fields.dest_id, e-fields.dest_mode)) + __set_bit(irqe.vector, +(unsigned long *)eoi_exit_bitmap); } } spin_unlock(ioapic-lock); -- 1.7.1 Any comments? You can drop irqe now since it was needed for kvm_calculate_eoi_exitmap() call. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-v2 0/3] virtio/vhost: Add checks for uninitialized VQs
On Mon, Apr 01, 2013 at 11:58:21PM +, Nicholas A. Bellinger wrote:
> From: Nicholas Bellinger <n...@linux-iscsi.org>
>
> Hi folks,
>
> This series adds a virtio_queue_valid() for use by virtio-pci code in
> order to prevent operations upon uninitialized VQs, which is currently
> expected to occur during seabios setup of virtio-scsi with in-flight
> vhost-scsi-pci device code.
>
> On the vhost side, it also adds virtio_queue_valid() sanity checks in
> vhost_virtqueue_[start,stop]() and vhost_verify_ring_mappings() in order
> to skip the same uninitialized VQs.
>
> Changes from v1:
>  - Remove now unnecessary virtio_queue_get_num() calls in virtio-pci.c
>  - Add virtio_queue_valid() calls in vhost_virtqueue_[start,stop]()
>
> Please review.
>
> --nab

Looks reasonable.

Acked-by: Michael S. Tsirkin <m...@redhat.com>

So - does this fix the issues you saw with vhost-scsi?

> Michael S. Tsirkin (1):
>   virtio: add API to check that ring is setup
>
> Nicholas Bellinger (2):
>   virtio-pci: Add virtio_queue_valid checks ahead of
>     virtio_queue_get_num
>   vhost: Skip uninitialized VQs in vhost_virtqueue_[start,stop]
>
>  hw/vhost.c      | 12
>  hw/virtio-pci.c | 34 +++---
>  hw/virtio.c     |  5 +
>  hw/virtio.h     |  1 +
>  4 files changed, 33 insertions(+), 19 deletions(-)
>
> --
> 1.7.2.5
Re: [PATCH V2 2/2] tcm_vhost: Use vq-private_data to indicate if the endpoint is setup
On Mon, Apr 01, 2013 at 10:13:47AM +0800, Asias He wrote: On Sun, Mar 31, 2013 at 11:20:24AM +0300, Michael S. Tsirkin wrote: On Fri, Mar 29, 2013 at 02:22:52PM +0800, Asias He wrote: On Thu, Mar 28, 2013 at 11:18:22AM +0200, Michael S. Tsirkin wrote: On Thu, Mar 28, 2013 at 04:10:02PM +0800, Asias He wrote: On Thu, Mar 28, 2013 at 08:16:59AM +0200, Michael S. Tsirkin wrote: On Thu, Mar 28, 2013 at 10:17:28AM +0800, Asias He wrote: Currently, vs-vs_endpoint is used indicate if the endpoint is setup or not. It is set or cleared in vhost_scsi_set_endpoint() or vhost_scsi_clear_endpoint() under the vs-dev.mutex lock. However, when we check it in vhost_scsi_handle_vq(), we ignored the lock. Instead of using the vs-vs_endpoint and the vs-dev.mutex lock to indicate the status of the endpoint, we use per virtqueue vq-private_data to indicate it. In this way, we can only take the vq-mutex lock which is per queue and make the concurrent multiqueue process having less lock contention. Further, in the read side of vq-private_data, we can even do not take only lock if it is accessed in the vhost worker thread, because it is protected by vhost rcu. 
Signed-off-by: Asias He as...@redhat.com --- drivers/vhost/tcm_vhost.c | 38 +- 1 file changed, 33 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c index 5e3d4487..0524267 100644 --- a/drivers/vhost/tcm_vhost.c +++ b/drivers/vhost/tcm_vhost.c @@ -67,7 +67,6 @@ struct vhost_scsi { /* Protected by vhost_scsi-dev.mutex */ struct tcm_vhost_tpg *vs_tpg[VHOST_SCSI_MAX_TARGET]; char vs_vhost_wwpn[TRANSPORT_IQN_LEN]; - bool vs_endpoint; struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ]; @@ -91,6 +90,24 @@ static int iov_num_pages(struct iovec *iov) ((unsigned long)iov-iov_base PAGE_MASK)) PAGE_SHIFT; } +static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq) +{ + bool ret = false; + + /* + * We can handle the vq only after the endpoint is setup by calling the + * VHOST_SCSI_SET_ENDPOINT ioctl. + * + * TODO: Check that we are running from vhost_worker which acts + * as read-side critical section for vhost kind of RCU. + * See the comments in struct vhost_virtqueue in drivers/vhost/vhost.h + */ + if (rcu_dereference_check(vq-private_data, 1)) + ret = true; + + return ret; +} + static int tcm_vhost_check_true(struct se_portal_group *se_tpg) { return 1; @@ -581,8 +598,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs, int head, ret; u8 target; - /* Must use ioctl VHOST_SCSI_SET_ENDPOINT */ - if (unlikely(!vs-vs_endpoint)) + if (!tcm_vhost_check_endpoint(vq)) return; I would just move the check to under vq mutex, and avoid rcu completely. In vhost-net we are using private data outside lock so we can't do this, no such issue here. Are you talking about: handle_tx: /* TODO: check that we are running from vhost_worker? 
*/ sock = rcu_dereference_check(vq-private_data, 1); if (!sock) return; wmem = atomic_read(sock-sk-sk_wmem_alloc); if (wmem = sock-sk-sk_sndbuf) { mutex_lock(vq-mutex); tx_poll_start(net, sock); mutex_unlock(vq-mutex); return; } mutex_lock(vq-mutex); Why not do the atomic_read and tx_poll_start under the vq-mutex, and thus do the check under the lock as well. handle_rx: mutex_lock(vq-mutex); /* TODO: check that we are running from vhost_worker? */ struct socket *sock = rcu_dereference_check(vq-private_data, 1); if (!sock) return; mutex_lock(vq-mutex); Can't we can do the check under the vq-mutex here? The rcu is still there but it makes the code easier to read. IMO, If we want to use rcu, use it explicitly and avoid the vhost rcu completely. mutex_lock(vq-mutex); @@ -829,11 +845,12 @@ static int vhost_scsi_set_endpoint( sizeof(vs-vs_vhost_wwpn)); for (i = 0; i VHOST_SCSI_MAX_VQ; i++) {
Re: [PATCH v6 6/6] KVM: Use eoi to track RTC interrupt delivery status
On Fri, Mar 29, 2013 at 03:25:16AM +, Zhang, Yang Z wrote:
> Paolo Bonzini wrote on 2013-03-26:
>> On 22/03/2013 06:24, Yang Zhang wrote:
>>> +static void rtc_irq_ack_eoi(struct kvm_vcpu *vcpu,
>>> +			struct rtc_status *rtc_status, int irq)
>>> +{
>>> +	if (irq != RTC_GSI)
>>> +		return;
>>> +
>>> +	if (test_and_clear_bit(vcpu->vcpu_id, rtc_status->dest_map))
>>> +		--rtc_status->pending_eoi;
>>> +
>>> +	WARN_ON(rtc_status->pending_eoi < 0);
>>> +}
>>
>> This is the only case where you're passing the struct rtc_status instead
>> of the struct kvm_ioapic. Please use the latter, and make it the first
>> argument.
>>
>>> @@ -244,7 +268,14 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
>>>  	irqe.level = 1;
>>>  	irqe.shorthand = 0;
>>>
>>> -	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe, NULL);
>>> +	if (irq == RTC_GSI) {
>>> +		ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe,
>>> +				ioapic->rtc_status.dest_map);
>>> +		ioapic->rtc_status.pending_eoi = ret;
>>
>> I think you should either add a
>>
>>     BUG_ON(ioapic->rtc_status.pending_eoi != 0);
>>
>> or use "ioapic->rtc_status.pending_eoi += ret" (or both).
>
> A malicious guest may write EOI more than once, and then pending_eoi
> will be negative. But that should not be a bug; a WARN_ON is enough,
> and we already do it in ack_eoi, so there is no need to duplicate it
> here.

Since we track vcpus that already called EOI and decrement pending_eoi
only once for each vcpu, a malicious guest cannot trigger it. But we
already do WARN_ON() in rtc_irq_ack_eoi(), so I am not sure we need
another one here. += will be correct (since pending_eoi == 0 here), but
confusing, since it gives the impression that pending_eoi may not be
zero.

--
			Gleb.
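The point made in this exchange, that a per-vcpu destination bitmap stops repeated EOIs from the same vcpu from driving the counter negative, can be modelled in a few lines of C. This is a simplified userspace model under stated assumptions, not the kernel code: the `demo_` names are hypothetical, the bitmap is a single 64-bit word, and delivery is reduced to "every vcpu in the mask received the interrupt".

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_rtc_status {
	int pending_eoi;	/* EOIs still expected */
	uint64_t dest_map;	/* bit n set: vcpu n got the interrupt, no EOI yet */
};

/* Record a delivery to every vcpu whose bit is set in dest_mask. */
void demo_rtc_deliver(struct demo_rtc_status *s, uint64_t dest_mask)
{
	int count = 0;
	uint64_t m;

	/* Count set bits by clearing the lowest one each iteration. */
	for (m = dest_mask; m != 0; m &= m - 1)
		count++;

	s->dest_map = dest_mask;
	s->pending_eoi = count;
}

/* Returns true if this EOI was counted, false if it was ignored. */
bool demo_rtc_ack_eoi(struct demo_rtc_status *s, unsigned int vcpu_id)
{
	uint64_t bit;

	if (vcpu_id >= 64)
		return false;

	bit = UINT64_C(1) << vcpu_id;
	if (!(s->dest_map & bit))
		return false;	/* duplicate EOI, or vcpu was not a destination */

	s->dest_map &= ~bit;	/* test-and-clear: each vcpu counts only once */
	s->pending_eoi--;
	return true;
}
```

In this model a second EOI from the same vcpu finds its bit already cleared and leaves `pending_eoi` untouched, which is why the counter cannot go below zero through guest behaviour alone.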
Re: [PATCH V2 2/2] tcm_vhost: Use vq-private_data to indicate if the endpoint is setup
On Tue, Apr 02, 2013 at 09:27:57AM +1030, Rusty Russell wrote:
> Michael S. Tsirkin <m...@redhat.com> writes:
>> Rusty's currently doing some reorgs of -net let's delay
>> cleanups there to avoid stepping on each other's toys.
>> Let's focus on scsi here.
>> E.g. any chance framing assumptions can be fixed in 3.10?
>
> I am waiting for your removal of the dma-complete ordering stuff in
> vhost-net.
>
> Cheers,
> Rusty.

Sure.
Re: [PATCH 00/11] KVM: s390: More patches for kvm-next.
On Mon, Mar 25, 2013 at 05:22:47PM +0100, Cornelia Huck wrote:
> Hi,
>
> here are some kvm/s390 patches that have accumulated in our queue.
>
> Changes include fixes in the lpsw(e) and stsi handlers, proper handling
> of interrupt injection failures and a gmap optimization.
>
> Also included are patches allowing support for standby memory on kvm
> guests. Standby memory is used for providing hotpluggable memory on s390.
>
> Please consider applying.
>
Applied, thanks.

> Christian Borntraeger (1):
>   KVM: s390: Dont do a gmap update on minor memslot changes
>
> Heiko Carstens (7):
>   KVM: s390: fix 24 bit psw handling in lpsw/lpswe handler
>   KVM: s390: fix psw conversion in lpsw handler
>   KVM: s390: fix return code handling in lpsw/lpswe handlers
>   KVM: s390: make if statements in lpsw/lpswe handlers readable
>   KVM: s390: fix and enforce return code handling for irq injections
>   KVM: s390: fix stsi exception handling
>   KVM: s390: fix compile with !CONFIG_COMPAT
>
> Nick Wang (3):
>   KVM: s390: Change the virtual memory mapping location for virtio devices
>   KVM: s390: Remove the sanity checks for kvm memory slot
>   KVM: s390: Enable KVM_CAP_NR_MEMSLOTS on s390
>
>  arch/s390/kvm/intercept.c     |  12 +--
>  arch/s390/kvm/kvm-s390.c      |  32 ---
>  arch/s390/kvm/kvm-s390.h      |  12 +--
>  arch/s390/kvm/priv.c          | 203 +++---
>  drivers/s390/kvm/kvm_virtio.c |  11 +--
>  5 files changed, 108 insertions(+), 162 deletions(-)
>
> --
> 1.7.12.4

--
			Gleb.
Re: KVM call agenda for 2013-04-02
Juan Quintela <quint...@redhat.com> wrote:
> Hi
>
> Please send in any agenda topics you are interested in.

As there are no items, today's call is cancelled.

Happy hacking.
[PATCH 2/7] ARM: KVM: fix HYP mapping limitations around zero
The current code for creating HYP mapping doesn't like to wrap
around zero, which prevents from mapping anything into the last
page of the virtual address space.

It doesn't take much effort to remove this limitation, making
the code more consistent with the rest of the kernel in the process.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/kvm/mmu.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 24811d1..eb4f8fa 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -131,11 +131,12 @@ static void create_hyp_pte_mappings(pmd_t *pmd, unsigned long start,
 	pte_t *pte;
 	unsigned long addr;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE) {
+	addr = start;
+	do {
 		pte = pte_offset_kernel(pmd, addr);
 		kvm_set_pte(pte, pfn_pte(pfn, prot));
 		pfn++;
-	}
+	} while (addr += PAGE_SIZE, addr != end);
 }
 
 static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
@@ -146,7 +147,8 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 	pte_t *pte;
 	unsigned long addr, next;
 
-	for (addr = start; addr < end; addr = next) {
+	addr = start;
+	do {
 		pmd = pmd_offset(pud, addr);
 
 		BUG_ON(pmd_sect(*pmd));
@@ -164,7 +166,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 
 		create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 		pfn += (next - addr) >> PAGE_SHIFT;
-	}
+	} while (addr = next, addr != end);
 
 	return 0;
 }
@@ -179,11 +181,10 @@ static int __create_hyp_mappings(pgd_t *pgdp,
 	unsigned long addr, next;
 	int err = 0;
 
-	if (start >= end)
-		return -EINVAL;
-
 	mutex_lock(&kvm_hyp_pgd_mutex);
-	for (addr = start & PAGE_MASK; addr < end; addr = next) {
+	addr = start & PAGE_MASK;
+	end = PAGE_ALIGN(end);
+	do {
 		pgd = pgdp + pgd_index(addr);
 		pud = pud_offset(pgd, addr);
 
@@ -202,7 +203,7 @@ static int __create_hyp_mappings(pgd_t *pgdp,
 		if (err)
 			goto out;
 		pfn += (next - addr) >> PAGE_SHIFT;
-	}
+	} while (addr = next, addr != end);
 out:
 	mutex_unlock(&kvm_hyp_pgd_mutex);
 	return err;
@@ -216,8 +217,6 @@ out:
  * The same virtual address as the kernel virtual address is also used
  * in Hyp-mode mapping (modulo HYP_PAGE_OFFSET) to the same underlying
  * physical pages.
- *
- * Note: Wrapping around zero in the "to" address is not supported.
  */
 int create_hyp_mappings(void *from, void *to)
 {
-- 
1.8.1.4
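[Editor's note] Why the loop shape matters: when the range ends at the
very top of the address space, "end" wraps to 0, so a loop guarded by
"addr < end" never runs, while a do/while that stops on "addr != end"
walks every page. The sketch below shows this with 32-bit unsigned
arithmetic in user space. It is an illustration only; SK_PAGE_SIZE,
pages_for_loop() and pages_do_while() are made-up names, and the
do/while form assumes a non-empty, page-aligned range.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SIZE UINT32_C(4096)	/* assumed page size for the sketch */

/* Old shape: the guard fails immediately when end has wrapped to 0. */
static unsigned int pages_for_loop(uint32_t start, uint32_t end)
{
	unsigned int n = 0;
	uint32_t addr;

	for (addr = start; addr < end; addr += SK_PAGE_SIZE)
		n++;
	return n;
}

/* New shape: visits each page until addr reaches end, wrapped or not. */
static unsigned int pages_do_while(uint32_t start, uint32_t end)
{
	unsigned int n = 0;
	uint32_t addr = start;

	/* Preconditions: non-empty range, both ends page aligned. */
	assert(start != end);
	assert((start % SK_PAGE_SIZE) == 0 && (end % SK_PAGE_SIZE) == 0);

	do {
		n++;	/* a real walker would install one PTE here */
	} while (addr += SK_PAGE_SIZE, addr != end);
	return n;
}
```

For the last two pages of a 32-bit space (start 0xffffe000, end wrapped
to 0) the first function visits nothing and the second visits both pages;
for an ordinary range the two agree.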
[PATCH 6/7] ARM: KVM: switch to a dual-step HYP init code
Our HYP init code suffers from two major design issues:
- it cannot support CPU hotplug, as we tear down the idmap very early
- it cannot perform a TLB invalidation when switching from init to
  runtime mappings, as pages are manipulated from PL1 exclusively

The hotplug problem mandates that we keep two sets of page tables
(boot and runtime). The TLB problem mandates that we're able to
transition from one PGD to another while in HYP, invalidating the TLBs
in the process.

To be able to do this, we need to share a page between the two page
tables. A page that will have the same VA in both configurations. All we
need is a VA that has the following properties:
- This VA can't be used to represent a kernel mapping.
- This VA will not conflict with the physical address of the kernel text

The vectors page seems to satisfy this requirement:
- The kernel never maps anything else there
- The kernel text being copied at the beginning of the physical memory,
  it is unlikely to use the last 64kB (I doubt we'll ever support KVM
  on a system with something like 4MB of RAM, but patches are very
  welcome).

Let's call this VA the trampoline VA.

Now, we map our init page at 3 locations:
- idmap in the boot pgd
- trampoline VA in the boot pgd
- trampoline VA in the runtime pgd

The init scenario is now the following:
- We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
  runtime stack, runtime vectors
- Enable the MMU with the boot pgd
- Jump to a target into the trampoline page (remember, this is the same
  physical page!)
- Now switch to the runtime pgd (same VA, and still the same physical
  page!)
- Invalidate TLBs
- Set stack and vectors
- Profit! (or eret, if you only care about the code).

Note that we keep the boot mapping permanently (it is not strictly an
idmap anymore) to allow for CPU hotplug in later patches.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm/include/asm/kvm_host.h | 18 --- arch/arm/include/asm/kvm_mmu.h | 21 ++-- arch/arm/kvm/arm.c | 9 ++ arch/arm/kvm/init.S | 29 +++-- arch/arm/kvm/mmu.c | 71 ++--- 5 files changed, 101 insertions(+), 47 deletions(-) diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h index a7a0bb5..3556684 100644 --- a/arch/arm/include/asm/kvm_host.h +++ b/arch/arm/include/asm/kvm_host.h @@ -190,22 +190,32 @@ int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *); int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run, int exception_index); -static inline void __cpu_init_hyp_mode(unsigned long long pgd_ptr, +static inline void __cpu_init_hyp_mode(unsigned long long boot_pgd_ptr, + unsigned long long pgd_ptr, unsigned long hyp_stack_ptr, unsigned long vector_ptr) { unsigned long pgd_low, pgd_high; - pgd_low = (pgd_ptr ((1ULL 32) - 1)); - pgd_high = (pgd_ptr 32ULL); + pgd_low = (boot_pgd_ptr ((1ULL 32) - 1)); + pgd_high = (boot_pgd_ptr 32ULL); /* * Call initialization code, and switch to the full blown * HYP code. The init code doesn't need to preserve these registers as -* r1-r3 and r12 are already callee save according to the AAPCS. +* r1-r3 and r12 are already callee saved according to the AAPCS. * Note that we slightly misuse the prototype by casing the pgd_low to * a void *. +* +* We don't have enough registers to perform the full init in one go. +* Install the boot PGD first, and then install the runtime PGD, +* stack pointer and vectors. 
*/ + kvm_call_hyp((void *)pgd_low, pgd_high, 0, 0); + + pgd_low = (pgd_ptr ((1ULL 32) - 1)); + pgd_high = (pgd_ptr 32ULL); + kvm_call_hyp((void *)pgd_low, pgd_high, hyp_stack_ptr, vector_ptr); } diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 92eb20d..3567a49 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -19,17 +19,29 @@ #ifndef __ARM_KVM_MMU_H__ #define __ARM_KVM_MMU_H__ -#include asm/cacheflush.h -#include asm/pgalloc.h +#include asm/memory.h +#include asm/page.h /* * We directly use the kernel VA for the HYP, as we can directly share * the mapping (HTTBR covers TTBR1). */ -#define HYP_PAGE_OFFSET_MASK (~0UL) +#define HYP_PAGE_OFFSET_MASK UL(~0) #define HYP_PAGE_OFFSETPAGE_OFFSET #define KERN_TO_HYP(kva) (kva) +/* + * Our virtual mapping for the boot-time MMU-enable code. Must be + * shared across all the page-tables. Conveniently, we use the vectors + * page, where no kernel data will ever be shared with HYP. + */ +#define TRAMPOLINE_VA UL(CONFIG_VECTORS_BASE) + +#ifndef __ASSEMBLY__ + +#include
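[Editor's note] Each of the two kvm_call_hyp() calls in
__cpu_init_hyp_mode() passes a 64-bit PGD pointer as a low and a high
32-bit half. The archive dropped the operators from those lines; the
sketch below assumes they were "&", "<<" and ">>". It is a user-space
illustration, and struct pgd_halves, split_pgd_ptr() and join_pgd_ptr()
are hypothetical names that do not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two register-sized halves. */
struct pgd_halves {
	uint32_t low;	/* bits 31..0 of the PGD pointer */
	uint32_t high;	/* bits 63..32 of the PGD pointer */
};

/* Rebuild the 64-bit pointer from its halves. */
static uint64_t join_pgd_ptr(struct pgd_halves h)
{
	return ((uint64_t)h.high << 32) | h.low;
}

/* Split as the patch appears to do: mask the low word, shift the high one. */
static struct pgd_halves split_pgd_ptr(uint64_t pgd_ptr)
{
	struct pgd_halves h;

	h.low = (uint32_t)(pgd_ptr & ((UINT64_C(1) << 32) - 1));
	h.high = (uint32_t)(pgd_ptr >> 32);

	/* Sanity check: the split must be lossless. */
	assert(join_pgd_ptr(h) == pgd_ptr);
	return h;
}
```

In the patch the same split is done twice, first for boot_pgd_ptr (with
the stack and vector arguments passed as 0) and then for pgd_ptr together
with the real stack and vector pointers.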
[PATCH 1/7] ARM: KVM: simplify HYP mapping population
The way we populate HYP mappings is a bit convoluted, to say the least. Passing a pointer around to keep track of the current PFN is quite odd, and we end-up having two different PTE accessors for no good reason. Simplify the whole thing by unifying the two PTE accessors, passing a pgprot_t around, and moving the various validity checks to the upper layers. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm/kvm/mmu.c | 100 ++--- 1 file changed, 41 insertions(+), 59 deletions(-) diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index 2f12e40..24811d1 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -125,54 +125,34 @@ void free_hyp_pmds(void) } static void create_hyp_pte_mappings(pmd_t *pmd, unsigned long start, - unsigned long end) + unsigned long end, unsigned long pfn, + pgprot_t prot) { pte_t *pte; unsigned long addr; - struct page *page; - for (addr = start PAGE_MASK; addr end; addr += PAGE_SIZE) { - unsigned long hyp_addr = KERN_TO_HYP(addr); - - pte = pte_offset_kernel(pmd, hyp_addr); - BUG_ON(!virt_addr_valid(addr)); - page = virt_to_page(addr); - kvm_set_pte(pte, mk_pte(page, PAGE_HYP)); - } -} - -static void create_hyp_io_pte_mappings(pmd_t *pmd, unsigned long start, - unsigned long end, - unsigned long *pfn_base) -{ - pte_t *pte; - unsigned long addr; - - for (addr = start PAGE_MASK; addr end; addr += PAGE_SIZE) { - unsigned long hyp_addr = KERN_TO_HYP(addr); - - pte = pte_offset_kernel(pmd, hyp_addr); - BUG_ON(pfn_valid(*pfn_base)); - kvm_set_pte(pte, pfn_pte(*pfn_base, PAGE_HYP_DEVICE)); - (*pfn_base)++; + for (addr = start; addr end; addr += PAGE_SIZE) { + pte = pte_offset_kernel(pmd, addr); + kvm_set_pte(pte, pfn_pte(pfn, prot)); + pfn++; } } static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start, - unsigned long end, unsigned long *pfn_base) + unsigned long end, unsigned long pfn, + pgprot_t prot) { pmd_t *pmd; pte_t *pte; unsigned long addr, next; for (addr = start; addr end; addr = next) { - unsigned long 
hyp_addr = KERN_TO_HYP(addr); - pmd = pmd_offset(pud, hyp_addr); + pmd = pmd_offset(pud, addr); BUG_ON(pmd_sect(*pmd)); if (pmd_none(*pmd)) { - pte = pte_alloc_one_kernel(NULL, hyp_addr); + pte = pte_alloc_one_kernel(NULL, addr); if (!pte) { kvm_err(Cannot allocate Hyp pte\n); return -ENOMEM; @@ -182,25 +162,17 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start, next = pmd_addr_end(addr, end); - /* -* If pfn_base is NULL, we map kernel pages into HYP with the -* virtual address. Otherwise, this is considered an I/O -* mapping and we map the physical region starting at -* *pfn_base to [start, end[. -*/ - if (!pfn_base) - create_hyp_pte_mappings(pmd, addr, next); - else - create_hyp_io_pte_mappings(pmd, addr, next, pfn_base); + create_hyp_pte_mappings(pmd, addr, next, pfn, prot); + pfn += (next - addr) PAGE_SHIFT; } return 0; } -static int __create_hyp_mappings(void *from, void *to, unsigned long *pfn_base) +static int __create_hyp_mappings(pgd_t *pgdp, +unsigned long start, unsigned long end, +unsigned long pfn, pgprot_t prot) { - unsigned long start = (unsigned long)from; - unsigned long end = (unsigned long)to; pgd_t *pgd; pud_t *pud; pmd_t *pmd; @@ -209,21 +181,14 @@ static int __create_hyp_mappings(void *from, void *to, unsigned long *pfn_base) if (start = end) return -EINVAL; - /* Check for a valid kernel memory mapping */ - if (!pfn_base (!virt_addr_valid(from) || !virt_addr_valid(to - 1))) - return -EINVAL; - /* Check for a valid kernel IO mapping */ - if (pfn_base (!is_vmalloc_addr(from) || !is_vmalloc_addr(to - 1))) - return -EINVAL; mutex_lock(kvm_hyp_pgd_mutex); - for (addr = start; addr end; addr = next) { - unsigned long hyp_addr = KERN_TO_HYP(addr); -
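[Editor's note] After this patch the PFN is passed by value and each
level advances it by the number of pages it just covered, which the patch
writes as "pfn += (next - addr) >> PAGE_SHIFT" (the ">>" is assumed; the
archive dropped the operator). A minimal user-space sketch of that
accounting follows. The SK_* constants, sk_pmd_addr_end() and sk_walk()
are hypothetical; the 2MB PMD size is an assumption for the example, and
sk_pmd_addr_end() only imitates the clamping behaviour of the kernel's
pmd_addr_end().

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12
#define SK_PMD_SIZE	(UINT32_C(1) << 21)	/* assumed: 2MB per PMD entry */
#define SK_PMD_MASK	(~(SK_PMD_SIZE - 1))

/* Next PMD boundary after addr, clamped to end. */
static uint32_t sk_pmd_addr_end(uint32_t addr, uint32_t end)
{
	uint32_t boundary = (addr + SK_PMD_SIZE) & SK_PMD_MASK;

	/* The "- 1" on both sides keeps the comparison valid if either wraps. */
	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * Walk [start, end) in PMD-sized chunks. Returns the PFN that follows the
 * last mapped page and stores the number of chunks visited in *chunks.
 */
static uint32_t sk_walk(uint32_t start, uint32_t end, uint32_t pfn,
			unsigned int *chunks)
{
	uint32_t addr, next;

	assert(start < end);
	*chunks = 0;

	for (addr = start; addr < end; addr = next) {
		next = sk_pmd_addr_end(addr, end);
		/* A real walker would map [addr, next) starting at pfn here. */
		pfn += (next - addr) >> SK_PAGE_SHIFT;
		(*chunks)++;
	}
	return pfn;
}
```

Because the PFN advances by exactly the pages covered in each chunk, the
caller no longer needs the shared pfn_base pointer that the old
create_hyp_io_pte_mappings() incremented in place.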
[PATCH 7/7] ARM: KVM: perform HYP initilization for hotplugged CPUs
Now that we have the necessary infrastructure to boot a hotplugged CPU
at any point in time, wire a CPU notifier that will perform the HYP
init for the incoming CPU.

Note that this depends on the platform code and/or firmware to boot the
incoming CPU with HYP mode enabled and return to the kernel by following
the normal boot path (HYP stub installed).

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/kvm/arm.c | 47 +++++++++++++++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index f0f3290..6cc076e 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -793,8 +793,9 @@ long kvm_arch_vm_ioctl(struct file *filp,
 	}
 }
 
-static void cpu_init_hyp_mode(void *vector)
+static void cpu_init_hyp_mode(void *dummy)
 {
+	phys_addr_t init_vector_addr = virt_to_phys(__kvm_hyp_init);
 	unsigned long long boot_pgd_ptr;
 	unsigned long long pgd_ptr;
 	unsigned long hyp_stack_ptr;
@@ -802,7 +803,7 @@ static void cpu_init_hyp_mode(void *vector)
 	unsigned long vector_ptr;
 
 	/* Switch from the HYP stub to our own HYP init vector */
-	__hyp_set_vectors((unsigned long)vector);
+	__hyp_set_vectors(init_vector_addr);
 
 	boot_pgd_ptr = (unsigned long long)kvm_mmu_get_boot_httbr();
 	pgd_ptr = (unsigned long long)kvm_mmu_get_httbr();
@@ -813,12 +814,28 @@ static void cpu_init_hyp_mode(void *vector)
 	__cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);
 }
 
+static int hyp_init_cpu_notify(struct notifier_block *self,
+			       unsigned long action, void *cpu)
+{
+	switch (action) {
+	case CPU_STARTING:
+	case CPU_STARTING_FROZEN:
+		cpu_init_hyp_mode(NULL);
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block hyp_init_cpu_nb = {
+	.notifier_call = hyp_init_cpu_notify,
+};
+
 /**
  * Inits Hyp-mode on all online CPUs
  */
 static int init_hyp_mode(void)
 {
-	phys_addr_t init_phys_addr;
 	int cpu;
 	int err = 0;
 
@@ -851,19 +868,6 @@ static int init_hyp_mode(void)
 	}
 
 	/*
-	 * Execute the init code on each CPU.
-	 *
-	 * Note: The stack is not mapped yet, so don't do anything else than
-	 * initializing the hypervisor mode on each CPU using a local stack
-	 * space for temporary storage.
-	 */
-	init_phys_addr = virt_to_phys(__kvm_hyp_init);
-	for_each_online_cpu(cpu) {
-		smp_call_function_single(cpu, cpu_init_hyp_mode,
-					 (void *)(long)init_phys_addr, 1);
-	}
-
-	/*
 	 * Map the Hyp-code called directly from the host
 	 */
 	err = create_hyp_mappings(__kvm_hyp_code_start, __kvm_hyp_code_end);
@@ -908,6 +912,11 @@ static int init_hyp_mode(void)
 	}
 
 	/*
+	 * Execute the init code on each CPU.
+	 */
+	on_each_cpu(cpu_init_hyp_mode, NULL, 1);
+
+	/*
 	 * Init HYP view of VGIC
 	 */
 	err = kvm_vgic_hyp_init();
@@ -963,6 +972,12 @@ int kvm_arch_init(void *opaque)
 	if (err)
 		goto out_err;
 
+	err = register_cpu_notifier(&hyp_init_cpu_nb);
+	if (err) {
+		kvm_err("Cannot register HYP init CPU notifier (%d)\n", err);
+		goto out_err;
+	}
+
 	kvm_coproc_table_init();
 	return 0;
 out_err:
-- 
1.8.1.4
[PATCH 0/7] ARM: KVM: Revamping the HYP init code for fun and profit
Over the past few weeks, I've gradually realized how broken our HYP
idmap code is. Badly broken.

The main problem is about supporting CPU hotplug. Imagine a CPU being
initialized normally, running VMs, and then being powered down. So
far, so good. Now mentally bring it back online. The CPU will come
back via the secondary CPU boot path, and then what? We cannot use it
anymore, because we need an idmap which is long gone, and because our
page tables are now live, containing the world-switch code, VM
structures, and other bits and pieces.

Another fun issue is that we don't have any TLB invalidation in the
HYP init code. And guess what? We cannot do it! HYP TLB invalidation
has to occur in HYP, and once we've installed the runtime page tables,
it is already too late. It is actually fairly easy to construct a
scenario where idmap and runtime pages have colliding translations.

The nail in the coffin was provided by Catalin Marinas who told me how
much he disliked the arm64 HYP idmap code, and made me realize that we
already have all the necessary code in arch/arm/kvm/mmu.c. It just
needs a tiny bit of care and affection. With a chainsaw.

The solution to the first two issues is a bit tricky, but doesn't
involve a lot of code. The hotplug problem mandates that we keep two
sets of page tables (boot and runtime). The TLB problem mandates that
we're able to transition from one PGD to another while in HYP,
invalidating the TLBs in the process.

To be able to do this, we need to share a page between the two page
tables. A page that will have the same VA in both configurations. All
we need is a VA that has the following properties:
- This VA can't be used to represent a kernel mapping.
- This VA will not conflict with the physical address of the kernel
  text

The vectors page VA seems to satisfy this requirement:
- The kernel never maps anything else there
- The kernel text being copied at the beginning of the physical
  memory, it is unlikely to use the last 64kB (I doubt we'll ever
  support KVM on a system with something like 4MB of RAM, but patches
  are very welcome).

Let's call this VA the trampoline VA.

Now, we map our init page at 3 locations:
- idmap in the boot pgd
- trampoline VA in the boot pgd
- trampoline VA in the runtime pgd

The init scenario is now the following:
- We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
  runtime stack, runtime vectors
- Enable the MMU with the boot pgd
- Jump to a target into the trampoline page (remember, this is the
  same physical page!)
- Now switch to the runtime pgd (same VA, and still the same physical
  page!)
- Invalidate TLBs
- Set stack and vectors
- Profit! (or eret, if you only care about the code).

Once we have this infrastructure in place, supporting CPU hot-plug is
a piece of cake. Just wire a cpu-notifier in the existing code.

This has been tested on both arm (VE TC2) and arm64 (Foundation Model).

Marc Zyngier (7):
  ARM: KVM: simplify HYP mapping population
  ARM: KVM: fix HYP mapping limitations around zero
  ARM: KVM: move to a KVM provided HYP idmap
  ARM: KVM: enforce page alignment for identity mapped code
  ARM: KVM: parametrize HYP page table freeing
  ARM: KVM: switch to a dual-step HYP init code
  ARM: KVM: perform HYP initilization for hotplugged CPUs

 arch/arm/include/asm/idmap.h    |   1 -
 arch/arm/include/asm/kvm_host.h |  18 +++-
 arch/arm/include/asm/kvm_mmu.h  |  24 -
 arch/arm/kernel/vmlinux.lds.S   |   2 +-
 arch/arm/kvm/arm.c              |  58 ++
 arch/arm/kvm/init.S             |  36 ++-
 arch/arm/kvm/mmu.c              | 232 +---
 arch/arm/mm/idmap.c             |  31 +-
 8 files changed, 227 insertions(+), 175 deletions(-)

-- 
1.8.1.4
[PATCH 5/7] ARM: KVM: parametrize HYP page table freeing
In order to prepare for having to deal with multiple HYP page tables,
pass the PGD parameter to the function performing the freeing of the
page tables.

Also move the freeing of the PGD itself there, and rename the
free_hyp_pmds to free_hyp_pgds.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_mmu.h |  2 +-
 arch/arm/kvm/arm.c             |  2 +-
 arch/arm/kvm/mmu.c             | 30 +++++++++++++++++-------------
 3 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 3c71a1d..92eb20d 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -32,7 +32,7 @@
 
 int create_hyp_mappings(void *from, void *to);
 int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
-void free_hyp_pmds(void);
+void free_hyp_pgds(void);
 
 int kvm_alloc_stage2_pgd(struct kvm *kvm);
 void kvm_free_stage2_pgd(struct kvm *kvm);
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 2ce90bb..6eba879 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -936,7 +936,7 @@ static int init_hyp_mode(void)
 out_free_context:
 	free_percpu(kvm_host_cpu_state);
 out_free_mappings:
-	free_hyp_pmds();
+	free_hyp_pgds();
 out_free_stack_pages:
 	for_each_possible_cpu(cpu)
 		free_page(per_cpu(kvm_arm_hyp_stack_page, cpu));
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 7d23480..85b3553 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -86,42 +86,46 @@ static void free_ptes(pmd_t *pmd, unsigned long addr)
 	}
 }
 
-static void free_hyp_pgd_entry(unsigned long addr)
+static void free_hyp_pgd_entry(pgd_t *pgdp, unsigned long addr)
 {
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	unsigned long hyp_addr = KERN_TO_HYP(addr);
 
-	pgd = hyp_pgd + pgd_index(hyp_addr);
-	pud = pud_offset(pgd, hyp_addr);
+	pgd = pgdp + pgd_index(addr);
+	pud = pud_offset(pgd, addr);
 
 	if (pud_none(*pud))
 		return;
 	BUG_ON(pud_bad(*pud));
 
-	pmd = pmd_offset(pud, hyp_addr);
+	pmd = pmd_offset(pud, addr);
 	free_ptes(pmd, addr);
 	pmd_free(NULL, pmd);
 	pud_clear(pud);
 }
 
 /**
- * free_hyp_pmds - free a Hyp-mode level-2 tables and child level-3 tables
+ * free_hyp_pgds - free Hyp-mode page tables
  *
- * Assumes this is a page table used strictly in Hyp-mode and therefore contains
+ * Assumes hyp_pgd is a page table used strictly in Hyp-mode and therefore contains
  * either mappings in the kernel memory area (above PAGE_OFFSET), or
  * device mappings in the vmalloc range (from VMALLOC_START to VMALLOC_END).
  */
-void free_hyp_pmds(void)
+void free_hyp_pgds(void)
 {
 	unsigned long addr;
 
 	mutex_lock(&kvm_hyp_pgd_mutex);
-	for (addr = PAGE_OFFSET; virt_addr_valid(addr); addr += PGDIR_SIZE)
-		free_hyp_pgd_entry(addr);
-	for (addr = VMALLOC_START; is_vmalloc_addr((void*)addr); addr += PGDIR_SIZE)
-		free_hyp_pgd_entry(addr);
+
+	if (hyp_pgd) {
+		for (addr = PAGE_OFFSET; virt_addr_valid(addr); addr += PGDIR_SIZE)
+			free_hyp_pgd_entry(hyp_pgd, KERN_TO_HYP(addr));
+		for (addr = VMALLOC_START; is_vmalloc_addr((void*)addr); addr += PGDIR_SIZE)
+			free_hyp_pgd_entry(hyp_pgd, KERN_TO_HYP(addr));
+		kfree(hyp_pgd);
+	}
+
 	mutex_unlock(&kvm_hyp_pgd_mutex);
 }
 
@@ -741,7 +745,7 @@ int kvm_mmu_init(void)
 
 	return 0;
 out:
-	kfree(hyp_pgd);
+	free_hyp_pgds();
 	return err;
 }
 
-- 
1.8.1.4
[PATCH 4/7] ARM: KVM: enforce page alignment for identity mapped code
We're about to move to an init procedure where we rely on the
fact that the init code fits in a single page. Make sure we
align the idmap text on a page boundary, and that the code is
not bigger than a single page.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/kernel/vmlinux.lds.S | 2 +-
 arch/arm/kvm/init.S           | 7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index b571484..d9dd265 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -20,7 +20,7 @@
 	VMLINUX_SYMBOL(__idmap_text_start) = .;				\
 	*(.idmap.text)							\
 	VMLINUX_SYMBOL(__idmap_text_end) = .;				\
-	ALIGN_FUNCTION();						\
+	. = ALIGN(PAGE_SIZE);						\
 	VMLINUX_SYMBOL(__hyp_idmap_text_start) = .;			\
 	*(.hyp.idmap.text)						\
 	VMLINUX_SYMBOL(__hyp_idmap_text_end) = .;
diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
index 9f37a79..35a463f 100644
--- a/arch/arm/kvm/init.S
+++ b/arch/arm/kvm/init.S
@@ -111,4 +111,11 @@ __do_hyp_init:
 
 	.globl __kvm_hyp_init_end
 __kvm_hyp_init_end:
+	/*
+	 * The above code *must* fit in a single page for the trampoline
+	 * madness to work. Whoever decides to change it must make sure
+	 * we map the right amount of memory for the trampoline to work.
+	 * The line below ensures any breakage will get noticed.
+	 */
+	.org	__kvm_hyp_init + PAGE_SIZE
 	.popsection
-- 
1.8.1.4
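[Editor's note] The patch enforces two properties at build time: the HYP
idmap text starts on a page boundary, and the init code does not exceed
one page. The sketch below expresses the same two conditions as plain C
predicates so they can be reasoned about in user space. It is only an
illustration: the patch itself relies on the linker script and on the
assembler's .org directive, and SK_PAGE_SIZE, sk_page_aligned(),
sk_page_align_up() and sk_init_code_ok() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_PAGE_SIZE UINT32_C(4096)	/* assumed page size for the sketch */

/* True if addr sits exactly on a page boundary. */
static bool sk_page_aligned(uint32_t addr)
{
	return (addr & (SK_PAGE_SIZE - 1)) == 0;
}

/* Round addr up to the next page boundary, like ". = ALIGN(PAGE_SIZE)". */
static uint32_t sk_page_align_up(uint32_t addr)
{
	/* Rounding up must not overflow the 32-bit address. */
	assert(addr <= UINT32_MAX - (SK_PAGE_SIZE - 1));
	return (addr + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1);
}

/* Both conditions the trampoline scheme depends on. */
static bool sk_init_code_ok(uint32_t start, uint32_t end)
{
	return sk_page_aligned(start) &&
	       end >= start &&
	       end - start <= SK_PAGE_SIZE;
}
```

A section start of 0x8001 would be moved to 0x9000 by the alignment step;
init code spanning 0x9000 to 0xa004 would then fail the size check, which
is the situation the .org line is meant to flag at build time.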
[PATCH 3/7] ARM: KVM: move to a KVM provided HYP idmap
After the HYP page table rework, it is pretty easy to let the KVM code provide its own idmap, rather than expecting the kernel to provide it. It takes actually less code to do so. Signed-off-by: Marc Zyngier marc.zyng...@arm.com --- arch/arm/include/asm/idmap.h | 1 - arch/arm/include/asm/kvm_mmu.h | 1 - arch/arm/kvm/mmu.c | 24 +++- arch/arm/mm/idmap.c| 31 +-- 4 files changed, 24 insertions(+), 33 deletions(-) diff --git a/arch/arm/include/asm/idmap.h b/arch/arm/include/asm/idmap.h index 1a66f907..bf863ed 100644 --- a/arch/arm/include/asm/idmap.h +++ b/arch/arm/include/asm/idmap.h @@ -8,7 +8,6 @@ #define __idmap __section(.idmap.text) noinline notrace extern pgd_t *idmap_pgd; -extern pgd_t *hyp_pgd; void setup_mm_for_reboot(void); diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h index 970f3b5..3c71a1d 100644 --- a/arch/arm/include/asm/kvm_mmu.h +++ b/arch/arm/include/asm/kvm_mmu.h @@ -21,7 +21,6 @@ #include asm/cacheflush.h #include asm/pgalloc.h -#include asm/idmap.h /* * We directly use the kernel VA for the HYP, as we can directly share diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c index eb4f8fa..7d23480 100644 --- a/arch/arm/kvm/mmu.c +++ b/arch/arm/kvm/mmu.c @@ -32,6 +32,7 @@ extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[]; +static pgd_t *hyp_pgd; static DEFINE_MUTEX(kvm_hyp_pgd_mutex); static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa) @@ -715,12 +716,33 @@ phys_addr_t kvm_mmu_get_httbr(void) int kvm_mmu_init(void) { + unsigned long hyp_idmap_start = virt_to_phys(__hyp_idmap_text_start); + unsigned long hyp_idmap_end = virt_to_phys(__hyp_idmap_text_end); + int err; + + hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL); if (!hyp_pgd) { kvm_err(Hyp mode PGD not allocated\n); - return -ENOMEM; + err = -ENOMEM; + goto out; + } + + /* Create the idmap in the boot page tables */ + err = __create_hyp_mappings(boot_hyp_pgd, + hyp_idmap_start, hyp_idmap_end, + __phys_to_pfn(hyp_idmap_start), + 
PAGE_HYP); + + if (err) { + kvm_err(Failed to idmap %lx-%lx\n, + hyp_idmap_start, hyp_idmap_end); + goto out; } return 0; +out: + kfree(hyp_pgd); + return err; } /** diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c index 5ee505c..9c467d0 100644 --- a/arch/arm/mm/idmap.c +++ b/arch/arm/mm/idmap.c @@ -83,37 +83,10 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start, } while (pgd++, addr = next, addr != end); } -#if defined(CONFIG_ARM_VIRT_EXT) defined(CONFIG_ARM_LPAE) -pgd_t *hyp_pgd; - -extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[]; - -static int __init init_static_idmap_hyp(void) -{ - hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL); - if (!hyp_pgd) - return -ENOMEM; - - pr_info(Setting up static HYP identity map for 0x%p - 0x%p\n, - __hyp_idmap_text_start, __hyp_idmap_text_end); - identity_mapping_add(hyp_pgd, __hyp_idmap_text_start, -__hyp_idmap_text_end, PMD_SECT_AP1); - - return 0; -} -#else -static int __init init_static_idmap_hyp(void) -{ - return 0; -} -#endif - extern char __idmap_text_start[], __idmap_text_end[]; static int __init init_static_idmap(void) { - int ret; - idmap_pgd = pgd_alloc(init_mm); if (!idmap_pgd) return -ENOMEM; @@ -123,12 +96,10 @@ static int __init init_static_idmap(void) identity_mapping_add(idmap_pgd, __idmap_text_start, __idmap_text_end, 0); - ret = init_static_idmap_hyp(); - /* Flush L1 for the hardware to see this page table content */ flush_cache_louis(); - return ret; + return 0; } early_initcall(init_static_idmap); -- 1.8.1.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 WIP 3/3] disable vhost_verify_ring_mappings check
On Mon, Apr 01, 2013 at 06:05:47PM -0700, Nicholas A. Bellinger wrote: On Fri, 2013-03-29 at 09:14 +0100, Paolo Bonzini wrote: Il 29/03/2013 03:53, Nicholas A. Bellinger ha scritto: On Thu, 2013-03-28 at 06:13 -0400, Paolo Bonzini wrote: I think it's the right thing to do, but maybe not the right place to do this, need to reset after all IO is done, before ring memory is write protected. Our emails are crossing each other unfortunately, but I want to reinforce this: ring memory is not write protected. Understood. However, AFAICT the act of write protecting these ranges for ROM generates the offending callbacks to vhost_set_memory(). The part that I'm missing is if ring memory is not being write protected by make_bios_readonly_intel(), why are the vhost_set_memory() calls being invoked..? Because mappings change for the region that contains the ring. vhost doesn't know yet that the changes do not affect ring memory, vhost_set_memory() is called exactly to ascertain that. Hi Paolo Co, Here's a bit more information on what is going on with the same cpu_physical_memory_map() failure in vhost_verify_ring_mappings().. 
So as before, at the point that seabios is marking memory as readonly for ROM in src/shadow.c:make_bios_readonly_intel() with the following call: Calling pci_config_writeb(0x31): bdf: 0x pam: 0x005b the memory API update hook triggers back into vhost_region_del() code, and following occurs: Entering vhost_region_del section: 0x7fd30a213b60 offset_within_region: 0xc size: 2146697216 readonly: 0 vhost_region_del: is_rom: 0, rom_device: 0 vhost_region_del: readable: 1 vhost_region_del: ram_addr 0x0, addr: 0x0 size: 2147483648 vhost_region_del: name: pc.ram Entering vhost_set_memory, section: 0x7fd30a213b60 add: 0, dev-started: 1 Entering verify_ring_mappings: start_addr 0x000c size: 2146697216 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0xed000 ring_size: 5124 verify_ring_mappings: calling cpu_physical_memory_map ring_phys: 0xed000 l: 5124 address_space_map: addr: 0xed000, plen: 5124 address_space_map: l: 4096, len: 5124 phys_page_find got PHYS_MAP_NODE_NIL .. address_space_map: section: 0x7fd30fabaed0 memory_region_is_ram: 0 readonly: 0 address_space_map: section: 0x7fd30fabaed0 offset_within_region: 0x0 section size: 18446744073709551615 Unable to map ring buffer for ring 2, l: 4096 So the interesting part is that phys_page_find() is not able to locate the corresponding page for vq-ring_phys: 0xed000 from the vhost_region_del() callback with section-offset_within_region: 0xc.. Is there any case where this would not be considered a bug..? 
register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0 register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0 register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0 Entering vhost_region_add section: 0x7fd30a213aa0 offset_within_region: 0xc size: 32768 readonly: 1 vhost_region_add: is_rom: 0, rom_device: 0 vhost_region_add: readable: 1 vhost_region_add: ram_addr 0x, addr: 0x 0 size: 2147483648 vhost_region_add: name: pc.ram Entering vhost_set_memory, section: 0x7fd30a213aa0 add: 1, dev-started: 1 Entering verify_ring_mappings: start_addr 0x000c size: 32768 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0xed000 ring_size: 5124 verify_ring_mappings: Got !ranges_overlap, skipping register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0 Entering vhost_region_add section: 0x7fd30a213aa0 offset_within_region: 0xc8000 size: 2146664448 readonly: 0 vhost_region_add: is_rom: 0, rom_device: 0 vhost_region_add: readable: 1 vhost_region_add: ram_addr 0x, addr: 0x 0 size: 2147483648 vhost_region_add: name: pc.ram Entering vhost_set_memory, section: 0x7fd30a213aa0 add: 1, dev-started: 1 Entering verify_ring_mappings: start_addr 0x000c8000 size: 2146664448 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0xed000 ring_size: 5124 verify_ring_mappings: calling cpu_physical_memory_map ring_phys: 0xed000 l: 5124 address_space_map: addr: 0xed000, plen: 5124 address_space_map: l: 4096, len: 5124 address_space_map: section: 0x7fd30fabb020 memory_region_is_ram: 1 readonly: 0 address_space_map: section: 0x7fd30fabb020 offset_within_region: 0xc8000 section size: 2146664448 address_space_map: l: 4096, len: 1028 address_space_map: section: 0x7fd30fabb020 memory_region_is_ram: 1 readonly: 0 address_space_map: section: 0x7fd30fabb020 offset_within_region: 0xc8000 section size: 2146664448 
address_space_map: Calling qemu_ram_ptr_length: raddr: 0x
Re: [PATCH uq/master v2 1/2] kvm: reset state from the CPU's reset method
On Fri, Mar 22, 2013 at 09:37:16PM +0100, Paolo Bonzini wrote: Now that we have a CPU object with a reset method, it is better to keep the KVM reset close to the CPU reset. Using qemu_register_reset as we do now keeps them far apart. As a side effect, a CPU reset (cpu_reset) will reset the KVM state too. Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- include/sysemu/kvm.h | 2 -- kvm-all.c | 11 --- target-arm/kvm.c | 4 target-i386/cpu.c | 5 + target-i386/kvm_i386.h | 1 + target-ppc/kvm.c | 4 target-s390x/cpu.c | 4 target-s390x/cpu.h | 1 + 8 files changed, 11 insertions(+), 21 deletions(-) diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h index f2d97b5..50072c5 100644 --- a/include/sysemu/kvm.h +++ b/include/sysemu/kvm.h @@ -199,8 +199,6 @@ int kvm_arch_init_vcpu(CPUState *cpu); /* Returns VCPU ID to be used on KVM_CREATE_VCPU ioctl() */ unsigned long kvm_arch_vcpu_id(CPUState *cpu); -void kvm_arch_reset_vcpu(CPUState *cpu); - int kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr); int kvm_arch_on_sigbus(int code, void *addr); diff --git a/kvm-all.c b/kvm-all.c index 9b433d3..57616ef 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -207,13 +207,6 @@ static int kvm_set_user_memory_region(KVMState *s, KVMSlot *slot) return kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, mem); } -static void kvm_reset_vcpu(void *opaque) -{ -CPUState *cpu = opaque; - -kvm_arch_reset_vcpu(cpu); -} - int kvm_init_vcpu(CPUState *cpu) { KVMState *s = kvm_state; @@ -253,10 +246,6 @@ int kvm_init_vcpu(CPUState *cpu) } ret = kvm_arch_init_vcpu(cpu); -if (ret == 0) { -qemu_register_reset(kvm_reset_vcpu, cpu); -kvm_arch_reset_vcpu(cpu); -} err: return ret; } diff --git a/target-arm/kvm.c b/target-arm/kvm.c index 82e2e08..841b85f 100644 --- a/target-arm/kvm.c +++ b/target-arm/kvm.c @@ -430,10 +430,6 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run) return 0; } -void kvm_arch_reset_vcpu(CPUState *cs) -{ -} - bool kvm_arch_stop_on_emulation_error(CPUState *cs) { return true; 
diff --git a/target-i386/cpu.c b/target-i386/cpu.c index a0640db..a5746cd 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -24,6 +24,7 @@ #include cpu.h #include sysemu/kvm.h #include sysemu/cpus.h +#include kvm_i386.h #include topology.h #include qemu/option.h @@ -2015,6 +2016,10 @@ static void x86_cpu_reset(CPUState *s) } s-halted = !cpu_is_bsp(cpu); + +if (kvm_enabled()) { +kvm_arch_reset_vcpu(s); +} #endif } diff --git a/target-i386/kvm_i386.h b/target-i386/kvm_i386.h index 4392ab4..3accc2d 100644 --- a/target-i386/kvm_i386.h +++ b/target-i386/kvm_i386.h @@ -14,6 +14,7 @@ #include sysemu/kvm.h bool kvm_allows_irq0_override(void); +void kvm_arch_reset_vcpu(CPUState *cs); int kvm_device_pci_assign(KVMState *s, PCIHostDeviceAddress *dev_addr, uint32_t flags, uint32_t *dev_id); diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c index e663ff0..0adea12 100644 --- a/target-ppc/kvm.c +++ b/target-ppc/kvm.c @@ -424,10 +424,6 @@ int kvm_arch_init_vcpu(CPUState *cs) return ret; } -void kvm_arch_reset_vcpu(CPUState *cpu) -{ -} - static void kvm_sw_tlb_put(PowerPCCPU *cpu) { CPUPPCState *env = cpu-env; diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c index 23fe51f..6321384 100644 --- a/target-s390x/cpu.c +++ b/target-s390x/cpu.c @@ -84,6 +84,10 @@ static void s390_cpu_reset(CPUState *s) * after incrementing the cpu counter */ #if !defined(CONFIG_USER_ONLY) s-halted = 1; + +if (kvm_enabled()) { +kvm_arch_reset_vcpu(s); Does this compile with kvm support disabled? 
+} #endif tlb_flush(env, 1); } diff --git a/target-s390x/cpu.h b/target-s390x/cpu.h index e351005..fc84159 100644 --- a/target-s390x/cpu.h +++ b/target-s390x/cpu.h @@ -352,6 +352,7 @@ void s390x_cpu_timer(void *opaque); int s390_virtio_hypercall(CPUS390XState *env); #ifdef CONFIG_KVM +void kvm_arch_reset_vcpu(CPUState *cs); void kvm_s390_interrupt(S390CPU *cpu, int type, uint32_t code); void kvm_s390_virtio_irq(S390CPU *cpu, int config_change, uint64_t token); void kvm_s390_interrupt_internal(S390CPU *cpu, int type, uint32_t parm, -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
-Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Tuesday, April 02, 2013 1:57 PM To: Bhushan Bharat-R65777 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Thursday, March 28, 2013 10:06 PM To: Bhushan Bharat-R65777 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421; Bhushan Bharat-R65777 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 21.03.2013, at 07:25, Bharat Bhushan wrote: From: Bharat Bhushan bharat.bhus...@freescale.com This patch adds the debug stub support on booke/bookehv. Now QEMU debug stub can use hw breakpoint, watchpoint and software breakpoint to debug guest. Debug registers are saved/restored on vcpu_put()/vcpu_get(). Also the debug registers are saved restored only if guest is using debug resources. Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com --- v2: - save/restore in vcpu_get()/vcpu_put() - some more minor cleanup based on review comments. arch/powerpc/include/asm/kvm_host.h | 10 ++ arch/powerpc/include/uapi/asm/kvm.h | 22 +++- arch/powerpc/kvm/booke.c| 252 - -- arch/powerpc/kvm/e500_emulate.c | 10 ++ 4 files changed, 272 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index f4ba881..8571952 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -504,7 +504,17 @@ struct kvm_vcpu_arch { u32 mmucfg; u32 epr; u32 crit_save; + /* guest debug registers*/ struct kvmppc_booke_debug_reg dbg_reg; + /* shadow debug registers */ + struct kvmppc_booke_debug_reg shadow_dbg_reg; + /* host debug registers*/ + struct kvmppc_booke_debug_reg host_dbg_reg; + /* + * Flag indicating that debug registers are used by guest + * and requires save restore. 
+ */ + bool debug_save_restore; #endif gpa_t paddr_accessed; gva_t vaddr_accessed; diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 15f9a00..d7ce449 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -25,6 +25,7 @@ /* Select powerpc specific features in linux/kvm.h */ #define __KVM_HAVE_SPAPR_TCE #define __KVM_HAVE_PPC_SMT +#define __KVM_HAVE_GUEST_DEBUG struct kvm_regs { __u64 pc; @@ -267,7 +268,24 @@ struct kvm_fpu { __u64 fpr[32]; }; +/* + * Defines for h/w breakpoint, watchpoint (read, write or both) and + * software breakpoint. + * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status + * for KVM_DEBUG_EXIT. + */ +#define KVMPPC_DEBUG_NONE0x0 +#define KVMPPC_DEBUG_BREAKPOINT (1UL 1) +#define KVMPPC_DEBUG_WATCH_WRITE (1UL 2) +#define KVMPPC_DEBUG_WATCH_READ (1UL 3) struct kvm_debug_exit_arch { + __u64 address; + /* + * exiting to userspace because of h/w breakpoint, watchpoint + * (read, write or both) and software breakpoint. + */ + __u32 status; + __u32 reserved; }; /* for KVM_SET_GUEST_DEBUG */ @@ -279,10 +297,6 @@ struct kvm_guest_debug_arch { * Type denotes h/w breakpoint, read watchpoint, write * watchpoint or watchpoint (both read and write). 
*/ -#define KVMPPC_DEBUG_NOTYPE 0x0 -#define KVMPPC_DEBUG_BREAKPOINT (1UL 1) -#define KVMPPC_DEBUG_WATCH_WRITE (1UL 2) -#define KVMPPC_DEBUG_WATCH_READ (1UL 3) __u32 type; __u32 reserved; } bp[16]; diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 1de93a8..bf20056 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu) #endif } +static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu) { + /* Synchronize guest's desire to get debug interrupts into shadow +MSR */ #ifndef CONFIG_KVM_BOOKE_HV + vcpu-arch.shadow_msr = ~MSR_DE; + vcpu-arch.shadow_msr |= vcpu-arch.shared-msr MSR_DE; #endif + + /* Force enable debug interrupts when user space wants to debug */ + if (vcpu-guest_debug) { +#ifdef CONFIG_KVM_BOOKE_HV + /* + * Since there is no shadow MSR, sync MSR_DE into the guest + * visible MSR. Do not allow guest to change MSR[DE]. + */ + vcpu-arch.shared-msr |= MSR_DE; + mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP); This mtspr should really just be a bit or in shadow_mspr when guest_debug gets enabled. It should automatically get synchronized as soon as the next
Re: [PATCH] pmu: prepare for migration support
On Thu, Mar 28, 2013 at 05:18:35PM +0100, Paolo Bonzini wrote: In order to migrate the PMU state correctly, we need to restore the values of MSR_CORE_PERF_GLOBAL_STATUS (a read-only register) and MSR_CORE_PERF_GLOBAL_OVF_CTRL (which has side effects when written). We also need to write the full 40-bit value of the performance counter, which would only be possible with a v3 architectural PMU's full-width counter MSRs. To distinguish host-initiated writes from the guest's, pass the full struct msr_data to kvm_pmu_set_msr. Signed-off-by: Paolo Bonzini pbonz...@redhat.com Applied, thanks. --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/pmu.c | 14 +++--- arch/x86/kvm/x86.c | 4 ++-- 3 files changed, 14 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 36fba01..e2e09f3 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1029,7 +1029,7 @@ void kvm_pmu_reset(struct kvm_vcpu *vcpu); void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu); bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr); int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data); -int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data); +int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info); int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data); void kvm_handle_pmu_event(struct kvm_vcpu *vcpu); void kvm_deliver_pmi(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c index cfc258a..c53e797 100644 --- a/arch/x86/kvm/pmu.c +++ b/arch/x86/kvm/pmu.c @@ -360,10 +360,12 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data) return 1; } -int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data) +int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info) { struct kvm_pmu *pmu = vcpu-arch.pmu; struct kvm_pmc *pmc; + u32 index = msr_info-index; + u64 data = msr_info-data; switch (index) { case MSR_CORE_PERF_FIXED_CTR_CTRL: @@ 
-375,6 +377,10 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data) } break; case MSR_CORE_PERF_GLOBAL_STATUS: + if (msr_info-host_initiated) { + pmu-global_status = data; + return 0; + } break; /* RO MSR */ case MSR_CORE_PERF_GLOBAL_CTRL: if (pmu-global_ctrl == data) @@ -386,7 +392,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data) break; case MSR_CORE_PERF_GLOBAL_OVF_CTRL: if (!(data (pmu-global_ctrl_mask ~(3ull62 { - pmu-global_status = ~data; + if (!msr_info-host_initiated) + pmu-global_status = ~data; pmu-global_ovf_ctrl = data; return 0; } @@ -394,7 +401,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data) default: if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) || (pmc = get_fixed_pmc(pmu, index))) { - data = (s64)(s32)data; + if (!msr_info-host_initiated) + data = (s64)(s32)data; pmc-counter += data - read_pmc(pmc); return 0; } else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) { diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3e0a8ba..1d928af 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2042,7 +2042,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) case MSR_P6_EVNTSEL0: case MSR_P6_EVNTSEL1: if (kvm_pmu_msr(vcpu, msr)) - return kvm_pmu_set_msr(vcpu, msr, data); + return kvm_pmu_set_msr(vcpu, msr_info); if (pr || data != 0) vcpu_unimpl(vcpu, disabled perfctr wrmsr: @@ -2088,7 +2088,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) if (msr (msr == vcpu-kvm-arch.xen_hvm_config.msr)) return xen_hvm_config(vcpu, data); if (kvm_pmu_msr(vcpu, msr)) - return kvm_pmu_set_msr(vcpu, msr, data); + return kvm_pmu_set_msr(vcpu, msr_info); if (!ignore_msrs) { vcpu_unimpl(vcpu, unhandled wrmsr: 0x%x data %llx\n, msr, data); -- 1.8.1.4 -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
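As an aside for readers of the archive, the host/guest distinction this patch introduces for counter writes can be sketched in plain C. This is only an illustration under stated assumptions: `struct msr_write`, `struct counter` and `counter_write()` are hypothetical stand-ins, not the kernel's `struct msr_data`, `struct kvm_pmc` or `kvm_pmu_set_msr()`, and the sketch stores the value directly instead of adjusting the counter relative to `read_pmc()` as the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct msr_data. */
struct msr_write {
	uint64_t data;
	bool host_initiated;	/* true when userspace restores state, e.g. on migration */
};

/* Hypothetical stand-in for a performance counter. */
struct counter {
	uint64_t value;
};

static void counter_write(struct counter *c, const struct msr_write *w)
{
	uint64_t data = w->data;

	/*
	 * A guest write through the legacy counter MSR only carries 32
	 * significant bits, which are sign-extended. A host-initiated
	 * write keeps the full value so saved state round-trips intact.
	 * The narrowing cast relies on the usual two's complement
	 * truncation, as the kernel code does.
	 */
	if (!w->host_initiated)
		data = (uint64_t)(int64_t)(int32_t)data;

	c->value = data;
}
```

With an input of 0x12ffffffff, the guest path stores an all-ones value because the low 32 bits are sign-extended, while the host path stores 0x12ffffffff unchanged.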
Re: [PATCH V2 2/2] tcm_vhost: Use vq-private_data to indicate if the endpoint is setup
On Tue, Apr 02, 2013 at 03:15:31PM +0300, Michael S. Tsirkin wrote: On Mon, Apr 01, 2013 at 10:13:47AM +0800, Asias He wrote: On Sun, Mar 31, 2013 at 11:20:24AM +0300, Michael S. Tsirkin wrote: On Fri, Mar 29, 2013 at 02:22:52PM +0800, Asias He wrote: On Thu, Mar 28, 2013 at 11:18:22AM +0200, Michael S. Tsirkin wrote: On Thu, Mar 28, 2013 at 04:10:02PM +0800, Asias He wrote: On Thu, Mar 28, 2013 at 08:16:59AM +0200, Michael S. Tsirkin wrote: On Thu, Mar 28, 2013 at 10:17:28AM +0800, Asias He wrote: Currently, vs-vs_endpoint is used indicate if the endpoint is setup or not. It is set or cleared in vhost_scsi_set_endpoint() or vhost_scsi_clear_endpoint() under the vs-dev.mutex lock. However, when we check it in vhost_scsi_handle_vq(), we ignored the lock. Instead of using the vs-vs_endpoint and the vs-dev.mutex lock to indicate the status of the endpoint, we use per virtqueue vq-private_data to indicate it. In this way, we can only take the vq-mutex lock which is per queue and make the concurrent multiqueue process having less lock contention. Further, in the read side of vq-private_data, we can even do not take only lock if it is accessed in the vhost worker thread, because it is protected by vhost rcu. 
Signed-off-by: Asias He as...@redhat.com --- drivers/vhost/tcm_vhost.c | 38 +- 1 file changed, 33 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c index 5e3d4487..0524267 100644 --- a/drivers/vhost/tcm_vhost.c +++ b/drivers/vhost/tcm_vhost.c @@ -67,7 +67,6 @@ struct vhost_scsi { /* Protected by vhost_scsi-dev.mutex */ struct tcm_vhost_tpg *vs_tpg[VHOST_SCSI_MAX_TARGET]; char vs_vhost_wwpn[TRANSPORT_IQN_LEN]; - bool vs_endpoint; struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ]; @@ -91,6 +90,24 @@ static int iov_num_pages(struct iovec *iov) ((unsigned long)iov-iov_base PAGE_MASK)) PAGE_SHIFT; } +static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq) +{ + bool ret = false; + + /* +* We can handle the vq only after the endpoint is setup by calling the +* VHOST_SCSI_SET_ENDPOINT ioctl. +* +* TODO: Check that we are running from vhost_worker which acts +* as read-side critical section for vhost kind of RCU. +* See the comments in struct vhost_virtqueue in drivers/vhost/vhost.h +*/ + if (rcu_dereference_check(vq-private_data, 1)) + ret = true; + + return ret; +} + static int tcm_vhost_check_true(struct se_portal_group *se_tpg) { return 1; @@ -581,8 +598,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs, int head, ret; u8 target; - /* Must use ioctl VHOST_SCSI_SET_ENDPOINT */ - if (unlikely(!vs-vs_endpoint)) + if (!tcm_vhost_check_endpoint(vq)) return; I would just move the check to under vq mutex, and avoid rcu completely. In vhost-net we are using private data outside lock so we can't do this, no such issue here. Are you talking about: handle_tx: /* TODO: check that we are running from vhost_worker? 
*/ sock = rcu_dereference_check(vq-private_data, 1); if (!sock) return; wmem = atomic_read(sock-sk-sk_wmem_alloc); if (wmem = sock-sk-sk_sndbuf) { mutex_lock(vq-mutex); tx_poll_start(net, sock); mutex_unlock(vq-mutex); return; } mutex_lock(vq-mutex); Why not do the atomic_read and tx_poll_start under the vq-mutex, and thus do the check under the lock as well. handle_rx: mutex_lock(vq-mutex); /* TODO: check that we are running from vhost_worker? */ struct socket *sock = rcu_dereference_check(vq-private_data, 1); if (!sock) return; mutex_lock(vq-mutex); Can't we can do the check under the vq-mutex
Re: [PATCH V2 2/2] tcm_vhost: Use vq-private_data to indicate if the endpoint is setup
On Tue, Apr 02, 2013 at 11:10:02PM +0800, Asias He wrote: On Tue, Apr 02, 2013 at 03:15:31PM +0300, Michael S. Tsirkin wrote: On Mon, Apr 01, 2013 at 10:13:47AM +0800, Asias He wrote: On Sun, Mar 31, 2013 at 11:20:24AM +0300, Michael S. Tsirkin wrote: On Fri, Mar 29, 2013 at 02:22:52PM +0800, Asias He wrote: On Thu, Mar 28, 2013 at 11:18:22AM +0200, Michael S. Tsirkin wrote: On Thu, Mar 28, 2013 at 04:10:02PM +0800, Asias He wrote: On Thu, Mar 28, 2013 at 08:16:59AM +0200, Michael S. Tsirkin wrote: On Thu, Mar 28, 2013 at 10:17:28AM +0800, Asias He wrote: Currently, vs-vs_endpoint is used indicate if the endpoint is setup or not. It is set or cleared in vhost_scsi_set_endpoint() or vhost_scsi_clear_endpoint() under the vs-dev.mutex lock. However, when we check it in vhost_scsi_handle_vq(), we ignored the lock. Instead of using the vs-vs_endpoint and the vs-dev.mutex lock to indicate the status of the endpoint, we use per virtqueue vq-private_data to indicate it. In this way, we can only take the vq-mutex lock which is per queue and make the concurrent multiqueue process having less lock contention. Further, in the read side of vq-private_data, we can even do not take only lock if it is accessed in the vhost worker thread, because it is protected by vhost rcu. 
Signed-off-by: Asias He as...@redhat.com --- drivers/vhost/tcm_vhost.c | 38 +- 1 file changed, 33 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c index 5e3d4487..0524267 100644 --- a/drivers/vhost/tcm_vhost.c +++ b/drivers/vhost/tcm_vhost.c @@ -67,7 +67,6 @@ struct vhost_scsi { /* Protected by vhost_scsi-dev.mutex */ struct tcm_vhost_tpg *vs_tpg[VHOST_SCSI_MAX_TARGET]; char vs_vhost_wwpn[TRANSPORT_IQN_LEN]; - bool vs_endpoint; struct vhost_dev dev; struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ]; @@ -91,6 +90,24 @@ static int iov_num_pages(struct iovec *iov) ((unsigned long)iov-iov_base PAGE_MASK)) PAGE_SHIFT; } +static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq) +{ + bool ret = false; + + /* + * We can handle the vq only after the endpoint is setup by calling the + * VHOST_SCSI_SET_ENDPOINT ioctl. + * + * TODO: Check that we are running from vhost_worker which acts + * as read-side critical section for vhost kind of RCU. + * See the comments in struct vhost_virtqueue in drivers/vhost/vhost.h + */ + if (rcu_dereference_check(vq-private_data, 1)) + ret = true; + + return ret; +} + static int tcm_vhost_check_true(struct se_portal_group *se_tpg) { return 1; @@ -581,8 +598,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs, int head, ret; u8 target; - /* Must use ioctl VHOST_SCSI_SET_ENDPOINT */ - if (unlikely(!vs-vs_endpoint)) + if (!tcm_vhost_check_endpoint(vq)) return; I would just move the check to under vq mutex, and avoid rcu completely. In vhost-net we are using private data outside lock so we can't do this, no such issue here. Are you talking about: handle_tx: /* TODO: check that we are running from vhost_worker? 
*/ sock = rcu_dereference_check(vq-private_data, 1); if (!sock) return; wmem = atomic_read(sock-sk-sk_wmem_alloc); if (wmem = sock-sk-sk_sndbuf) { mutex_lock(vq-mutex); tx_poll_start(net, sock); mutex_unlock(vq-mutex); return; } mutex_lock(vq-mutex); Why not do the atomic_read and tx_poll_start under the vq-mutex, and thus do the check under the lock as well. handle_rx: mutex_lock(vq-mutex); /* TODO: check that we are running from vhost_worker? */ struct socket *sock = rcu_dereference_check(vq-private_data, 1);
[PATCH] tcm_vhost: Use ACCESS_ONCE for vs-vs_tpg[target] access
In vhost_scsi_handle_vq:

	tv_tpg = vs->vs_tpg[target];
	if (!tv_tpg) {
		return
	}

	tv_cmd = vhost_scsi_allocate_cmd(tv_tpg, &v_req,

1) vs->vs_tpg[target] might change after the NULL check and 2) the above
line might access tv_tpg from vs->vs_tpg[target]. To prevent 2), use
ACCESS_ONCE. Thanks mst for catching this up!

Signed-off-by: Asias He <as...@redhat.com>
---
 drivers/vhost/tcm_vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index 0524267..32d95e3 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -668,7 +668,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
 
 		/* Extract the tpgt */
 		target = v_req.lun[1];
-		tv_tpg = vs->vs_tpg[target];
+		tv_tpg = ACCESS_ONCE(vs->vs_tpg[target]);
 
 		/* Target does not exist, fail the request */
 		if (unlikely(!tv_tpg)) {
-- 
1.8.1.4
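For readers of the archive, the single-read pattern the patch relies on can be sketched in userspace C. This is a minimal sketch under assumptions: the macro below is a local definition that mirrors the volatile-cast idiom and needs the GCC/Clang `__typeof__` extension, and `struct tpg`, `struct scsi_dev` and `handle_request()` are hypothetical stand-ins, not the tcm_vhost types.

```c
#include <stddef.h>

/*
 * Local stand-in for the kernel macro: the volatile cast makes the
 * compiler emit exactly one load of x instead of re-reading it later.
 */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

#define MAX_TARGET 4

/* Hypothetical stand-ins for the target portal group and the device. */
struct tpg {
	int id;
};

struct scsi_dev {
	struct tpg *tpgs[MAX_TARGET];
};

/*
 * Take one snapshot of the slot. The NULL check and the later use both
 * operate on that snapshot, so a concurrent writer clearing the slot
 * cannot make the check pass and the use see NULL.
 */
static int handle_request(struct scsi_dev *dev, unsigned int target)
{
	struct tpg *t;

	if (target >= MAX_TARGET)
		return -1;

	t = ACCESS_ONCE(dev->tpgs[target]);
	if (!t)
		return -1;	/* target does not exist */

	return t->id;
}
```

Note that this only stops the compiler from reloading the pointer. It does nothing to keep the pointed-to object alive, which is point 1) in the description above and is left to other synchronization.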
Re: [PATCH] tcm_vhost: Use ACCESS_ONCE for vs-vs_tpg[target] access
On Tue, Apr 02, 2013 at 11:31:37PM +0800, Asias He wrote:
> In vhost_scsi_handle_vq:
>
> 	tv_tpg = vs->vs_tpg[target];
> 	if (!tv_tpg) {
> 		return
> 	}
>
> 	tv_cmd = vhost_scsi_allocate_cmd(tv_tpg, &v_req,
>
> 1) vs->vs_tpg[target] might change after the NULL check and 2) the above
> line might access tv_tpg from vs->vs_tpg[target]. To prevent 2), use
> ACCESS_ONCE. Thanks mst for catching this up!
>
> Signed-off-by: Asias He <as...@redhat.com>

OK this might be ok for 3.9.
Acked-by: Michael S. Tsirkin <m...@redhat.com>

Nicholas can you pick this up pls?

For 3.10 I still think it's best to get rid of it and stick
vs->vs_tpg in vq->private_data.

> ---
>  drivers/vhost/tcm_vhost.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
> index 0524267..32d95e3 100644
> --- a/drivers/vhost/tcm_vhost.c
> +++ b/drivers/vhost/tcm_vhost.c
> @@ -668,7 +668,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
>  
>  		/* Extract the tpgt */
>  		target = v_req.lun[1];
> -		tv_tpg = vs->vs_tpg[target];
> +		tv_tpg = ACCESS_ONCE(vs->vs_tpg[target]);
>  
>  		/* Target does not exist, fail the request */
>  		if (unlikely(!tv_tpg)) {
> -- 
> 1.8.1.4
Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
On 04/02/2013 04:09 PM, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Tuesday, April 02, 2013 1:57 PM To: Bhushan Bharat-R65777 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Thursday, March 28, 2013 10:06 PM To: Bhushan Bharat-R65777 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421; Bhushan Bharat-R65777 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 21.03.2013, at 07:25, Bharat Bhushan wrote: From: Bharat Bhushanbharat.bhus...@freescale.com This patch adds the debug stub support on booke/bookehv. Now QEMU debug stub can use hw breakpoint, watchpoint and software breakpoint to debug guest. Debug registers are saved/restored on vcpu_put()/vcpu_get(). Also the debug registers are saved restored only if guest is using debug resources. Signed-off-by: Bharat Bhushanbharat.bhus...@freescale.com --- v2: - save/restore in vcpu_get()/vcpu_put() - some more minor cleanup based on review comments. 
arch/powerpc/include/asm/kvm_host.h | 10 ++ arch/powerpc/include/uapi/asm/kvm.h | 22 +++- arch/powerpc/kvm/booke.c| 252 - -- arch/powerpc/kvm/e500_emulate.c | 10 ++ 4 files changed, 272 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index f4ba881..8571952 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -504,7 +504,17 @@ struct kvm_vcpu_arch { u32 mmucfg; u32 epr; u32 crit_save; + /* guest debug registers*/ struct kvmppc_booke_debug_reg dbg_reg; + /* shadow debug registers */ + struct kvmppc_booke_debug_reg shadow_dbg_reg; + /* host debug registers*/ + struct kvmppc_booke_debug_reg host_dbg_reg; + /* +* Flag indicating that debug registers are used by guest +* and requires save restore. + */ + bool debug_save_restore; #endif gpa_t paddr_accessed; gva_t vaddr_accessed; diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 15f9a00..d7ce449 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -25,6 +25,7 @@ /* Select powerpc specific features inlinux/kvm.h */ #define __KVM_HAVE_SPAPR_TCE #define __KVM_HAVE_PPC_SMT +#define __KVM_HAVE_GUEST_DEBUG struct kvm_regs { __u64 pc; @@ -267,7 +268,24 @@ struct kvm_fpu { __u64 fpr[32]; }; +/* + * Defines for h/w breakpoint, watchpoint (read, write or both) and + * software breakpoint. + * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status + * for KVM_DEBUG_EXIT. + */ +#define KVMPPC_DEBUG_NONE 0x0 +#define KVMPPC_DEBUG_BREAKPOINT(1UL 1) +#define KVMPPC_DEBUG_WATCH_WRITE (1UL 2) +#define KVMPPC_DEBUG_WATCH_READ(1UL 3) struct kvm_debug_exit_arch { + __u64 address; + /* +* exiting to userspace because of h/w breakpoint, watchpoint +* (read, write or both) and software breakpoint. 
+*/ + __u32 status; + __u32 reserved; }; /* for KVM_SET_GUEST_DEBUG */ @@ -279,10 +297,6 @@ struct kvm_guest_debug_arch { * Type denotes h/w breakpoint, read watchpoint, write * watchpoint or watchpoint (both read and write). */ -#define KVMPPC_DEBUG_NOTYPE0x0 -#define KVMPPC_DEBUG_BREAKPOINT(1UL 1) -#define KVMPPC_DEBUG_WATCH_WRITE (1UL 2) -#define KVMPPC_DEBUG_WATCH_READ(1UL 3) __u32 type; __u32 reserved; } bp[16]; diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 1de93a8..bf20056 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu) #endif } +static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu) { + /* Synchronize guest's desire to get debug interrupts into shadow +MSR */ #ifndef CONFIG_KVM_BOOKE_HV + vcpu-arch.shadow_msr= ~MSR_DE; + vcpu-arch.shadow_msr |= vcpu-arch.shared-msr MSR_DE; #endif + + /* Force enable debug interrupts when user space wants to debug */ + if (vcpu-guest_debug) { +#ifdef CONFIG_KVM_BOOKE_HV + /* +* Since there is no shadow MSR, sync MSR_DE into the guest +* visible MSR. Do not allow guest to change MSR[DE]. +*/ + vcpu-arch.shared-msr |= MSR_DE; + mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP); This mtspr should really just be a bit or in shadow_mspr when guest_debug gets enabled.
[PATCH v2] ARM: KVM: promote vfp_host pointer to generic host cpu context
We use the vfp_host pointer to store the host VFP context, should
the guest start using VFP itself.

Actually, we can use this pointer in a more generic way to store
CPU specific data, and arm64 is using it to dump the whole host
state before switching to the guest.

Simply rename the vfp_host field to host_cpu_context, and the
corresponding type to kvm_cpu_context_t. No change in functionality.

Signed-off-by: Marc Zyngier <marc.zyng...@arm.com>
---
 arch/arm/include/asm/kvm_host.h | 8 +---
 arch/arm/kernel/asm-offsets.c   | 2 +-
 arch/arm/kvm/arm.c              | 28 ++--
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 78813b8..a7a0bb5 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -87,7 +87,7 @@ struct kvm_vcpu_fault_info {
 	u32 hyp_pc;		/* PC when exception was taken from Hyp mode */
 };
 
-typedef struct vfp_hard_struct kvm_kernel_vfp_t;
+typedef struct vfp_hard_struct kvm_cpu_context_t;
 
 struct kvm_vcpu_arch {
 	struct kvm_regs regs;
@@ -105,8 +105,10 @@ struct kvm_vcpu_arch {
 	struct kvm_vcpu_fault_info fault;
 
 	/* Floating point registers (VFP and Advanced SIMD/NEON) */
-	kvm_kernel_vfp_t vfp_guest;
-	kvm_kernel_vfp_t *vfp_host;
+	struct vfp_hard_struct vfp_guest;
+
+	/* Host FP context */
+	kvm_cpu_context_t *host_cpu_context;
 
 	/* VGIC state */
 	struct vgic_cpu vgic_cpu;
diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c
index ee1ac39..92562a2 100644
--- a/arch/arm/kernel/asm-offsets.c
+++ b/arch/arm/kernel/asm-offsets.c
@@ -154,7 +154,7 @@ int main(void)
 	DEFINE(VCPU_MIDR,		offsetof(struct kvm_vcpu, arch.midr));
 	DEFINE(VCPU_CP15,		offsetof(struct kvm_vcpu, arch.cp15));
 	DEFINE(VCPU_VFP_GUEST,	offsetof(struct kvm_vcpu, arch.vfp_guest));
-	DEFINE(VCPU_VFP_HOST,		offsetof(struct kvm_vcpu, arch.vfp_host));
+	DEFINE(VCPU_VFP_HOST,		offsetof(struct kvm_vcpu, arch.host_cpu_context));
 	DEFINE(VCPU_REGS,		offsetof(struct kvm_vcpu, arch.regs));
 	DEFINE(VCPU_USR_REGS,		offsetof(struct kvm_vcpu, arch.regs.usr_regs));
 	DEFINE(VCPU_SVC_REGS,		offsetof(struct kvm_vcpu, arch.regs.svc_regs));
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index e821c37..2ce90bb 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -48,7 +48,7 @@ __asm__(".arch_extension	virt");
 #endif
 
 static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
-static kvm_kernel_vfp_t __percpu *kvm_host_vfp_state;
+static kvm_cpu_context_t __percpu *kvm_host_cpu_state;
 static unsigned long hyp_default_vectors;
 
 /* Per-CPU variable containing the currently running vcpu. */
@@ -325,7 +325,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	vcpu->cpu = cpu;
-	vcpu->arch.vfp_host = this_cpu_ptr(kvm_host_vfp_state);
+	vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
 
 	/*
 	 * Check whether this vcpu requires the cache to be flushed on
@@ -889,24 +889,24 @@ static int init_hyp_mode(void)
 	}
 
 	/*
-	 * Map the host VFP structures
+	 * Map the host CPU structures
 	 */
-	kvm_host_vfp_state = alloc_percpu(kvm_kernel_vfp_t);
-	if (!kvm_host_vfp_state) {
+	kvm_host_cpu_state = alloc_percpu(kvm_cpu_context_t);
+	if (!kvm_host_cpu_state) {
 		err = -ENOMEM;
-		kvm_err("Cannot allocate host VFP state\n");
+		kvm_err("Cannot allocate host CPU state\n");
 		goto out_free_mappings;
 	}
 
 	for_each_possible_cpu(cpu) {
-		kvm_kernel_vfp_t *vfp;
+		kvm_cpu_context_t *cpu_ctxt;
 
-		vfp = per_cpu_ptr(kvm_host_vfp_state, cpu);
-		err = create_hyp_mappings(vfp, vfp + 1);
+		cpu_ctxt = per_cpu_ptr(kvm_host_cpu_state, cpu);
+		err = create_hyp_mappings(cpu_ctxt, cpu_ctxt + 1);
 
 		if (err) {
-			kvm_err("Cannot map host VFP state: %d\n", err);
-			goto out_free_vfp;
+			kvm_err("Cannot map host CPU state: %d\n", err);
+			goto out_free_context;
 		}
 	}
 
@@ -915,7 +915,7 @@ static int init_hyp_mode(void)
 	 */
 	err = kvm_vgic_hyp_init();
 	if (err)
-		goto out_free_vfp;
+		goto out_free_context;
 
 #ifdef CONFIG_KVM_ARM_VGIC
 	vgic_present = true;
@@ -933,8 +933,8 @@ static int init_hyp_mode(void)
 	kvm_info("Hyp mode initialized successfully\n");
 	return 0;
-out_free_vfp:
-	free_percpu(kvm_host_vfp_state);
+out_free_context:
+	free_percpu(kvm_host_cpu_state);
 out_free_mappings:
RFC: vfio API changes needed for powerpc
Alex,

We are in the process of implementing vfio-pci support for the
Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is
quite different than x86, and will involve creating a 'type 2' vfio
implementation.

For each device's DMA mappings, PAMU has an overall aperture and a
number of windows. All sizes and window counts must be a power of 2.
To illustrate, below is a mapping for a 256MB guest, including guest
memory (backed by 64MB huge pages) and some windows for MSIs:

    Total aperture: 512MB
    # of windows: 8

    win  gphys/
    #    iova         phys           size
    ---  ----         ----           ----
    0    0x00000000   0xX_XX000000   64MB
    1    0x04000000   0xX_XX000000   64MB
    2    0x08000000   0xX_XX000000   64MB
    3    0x0C000000   0xX_XX000000   64MB
    4    0x10000000   0xf_fe044000   4KB    // msi bank 1
    5    0x14000000   0xf_fe045000   4KB    // msi bank 2
    6    0x18000000   0xf_fe046000   4KB    // msi bank 3
    7    -            -              disabled

There are a couple of updates needed to the vfio user-kernel interface
that we would like your feedback on.

1. IOMMU geometry

   The kernel IOMMU driver now has an interface (see domain_set_attr,
   domain_get_attr) that lets us set the domain geometry using
   attributes.

   We want to expose that to user space, so we envision needing a
   couple of new ioctls to do this:
       VFIO_IOMMU_SET_ATTR
       VFIO_IOMMU_GET_ATTR

2. MSI window mappings

   The more problematic question is how to deal with MSIs. We need to
   create mappings for up to 3 MSI banks that a device may need to
   target to generate interrupts. The Linux MSI driver can allocate
   MSIs from the 3 banks any way it wants, and currently user space
   has no way of knowing which bank may be used for a given device.

   There are 3 options we have discussed and would like your
   direction:

   A. Implicit mappings -- with this approach user space would not
      explicitly map MSIs. User space would be required to set the
      geometry so that there are 3 unused windows (the last 3 windows)
      for MSIs, and it would be up to the kernel to create the
      mappings. This approach requires some specific semantics
      (leaving 3 windows) and it potentially gets a little weird --
      when should the kernel actually create the MSI mappings? When
      should they be unmapped? Some convention would need to be
      established.

   B. Explicit mapping using DMA map flags. The idea is that a new
      flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a
      mapping is to be created for the supplied iova. No vaddr is
      given though. So in the above example there would be a dma map
      at 0x10000000 for 24KB (and no vaddr). It's up to the kernel to
      determine which bank gets mapped where. So, this option puts
      user space in control of which windows are used for MSIs and
      when MSIs are mapped/unmapped. There would need to be some
      semantics as to how this is used -- it only makes sense

   C. Explicit mapping using normal DMA map. The last idea is that we
      would introduce a new ioctl to give user-space an fd to the MSI
      bank, which could be mmapped. The flow would be something like
      this:
         - for each group user space calls new ioctl
           VFIO_GROUP_GET_MSI_FD
         - user space mmaps the fd, getting a vaddr
         - user space does a normal DMA map for desired iova
      This approach makes everything explicit, but adds a new ioctl
      applicable most likely only to the PAMU (type2 iommu).

Any feedback or direction?

Thanks,
Stuart
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
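The geometry described in the message above (one aperture divided into a power-of-two number of windows) can be sketched in C as follows. This is only an illustration under stated assumptions: the helper names `pamu_window_size` and `pamu_window_base` are hypothetical and are not part of the PAMU driver or VFIO, the windows are assumed to be equally sized, and the aperture is assumed to start at iova 0 as in the example table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v is a nonzero power of two. */
static bool geom_is_pow2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper: size of each window for a given aperture size
 * and window count.  Returns 0 if either value is not a power of two
 * or if there are more windows than bytes in the aperture.
 */
static uint64_t pamu_window_size(uint64_t aperture, uint32_t nr_windows)
{
	if (!geom_is_pow2(aperture) || !geom_is_pow2(nr_windows) ||
	    nr_windows > aperture)
		return 0;
	return aperture / nr_windows;
}

/*
 * Hypothetical helper: iova at which window idx starts, assuming the
 * aperture begins at iova 0.  The geometry must be valid.
 */
static uint64_t pamu_window_base(uint64_t aperture, uint32_t nr_windows,
				 uint32_t idx)
{
	uint64_t size = pamu_window_size(aperture, nr_windows);

	assert(size != 0 && idx < nr_windows);
	return (uint64_t)idx * size;
}
```

With the example's 512MB aperture and 8 windows, each window is 64MB, so window 4 (the first MSI window in the table) starts at iova 0x10000000.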
Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
On 04/02/2013 09:09:34 AM, Bhushan Bharat-R65777 wrote:
>> -----Original Message-----
>> From: Alexander Graf [mailto:ag...@suse.de]
>> Sent: Tuesday, April 02, 2013 1:57 PM
>> To: Bhushan Bharat-R65777
>> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421
>> Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
>>
>> On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote:
>>
>>>> -----Original Message-----
>>>> From: Alexander Graf [mailto:ag...@suse.de]
>>>> Sent: Thursday, March 28, 2013 10:06 PM
>>>> To: Bhushan Bharat-R65777
>>>> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421;
>>>> Bhushan Bharat-R65777
>>>> Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
>>>>
>>>> How does the normal debug register switching code work in Linux?
>>>> Can't we just reuse that? Or rely on it to restore working state
>>>> when another process gets scheduled in?
>>>
>>> Good point, I can see debug registers loading in function
>>> __switch_to() -> switch_booke_debug_regs() in file
>>> arch/powerpc/kernel/process.c. So as long as we assume that the
>>> host will not use debug resources we can rely on this restore. But
>>> I am not sure that this is a fair assumption. As Scott mentioned
>>> earlier, someone can use debug resources for kernel debugging also.
>>
>> Someone in the kernel can also use floating point registers. But then
>> it's his responsibility to clean up the mess he leaves behind.
>
> I am neither convinced by what you said nor do I have much reason to
> oppose :)
>
> Scott, I remember you mentioned that the host can use debug
> resources; can you comment on this?

I thought the conclusion we reached was that it was OK as long as KVM
waits until it actually needs the debug resources to mess with the
registers.

-Scott
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:
> Alex,
>
> We are in the process of implementing vfio-pci support for the
> Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is
> quite different than x86, and will involve creating a 'type 2' vfio
> implementation.
>
> For each device's DMA mappings, PAMU has an overall aperture and a
> number of windows. All sizes and window counts must be a power of 2.
> To illustrate, below is a mapping for a 256MB guest, including guest
> memory (backed by 64MB huge pages) and some windows for MSIs:
>
>     Total aperture: 512MB
>     # of windows: 8
>
>     win  gphys/
>     #    iova         phys           size
>     ---  ----         ----           ----
>     0    0x00000000   0xX_XX000000   64MB
>     1    0x04000000   0xX_XX000000   64MB
>     2    0x08000000   0xX_XX000000   64MB
>     3    0x0C000000   0xX_XX000000   64MB
>     4    0x10000000   0xf_fe044000   4KB    // msi bank 1
>     5    0x14000000   0xf_fe045000   4KB    // msi bank 2
>     6    0x18000000   0xf_fe046000   4KB    // msi bank 3
>     7    -            -              disabled
>
> There are a couple of updates needed to the vfio user-kernel
> interface that we would like your feedback on.
>
> 1. IOMMU geometry
>
>    The kernel IOMMU driver now has an interface (see domain_set_attr,
>    domain_get_attr) that lets us set the domain geometry using
>    attributes.
>
>    We want to expose that to user space, so we envision needing a
>    couple of new ioctls to do this:
>        VFIO_IOMMU_SET_ATTR
>        VFIO_IOMMU_GET_ATTR

Note that this means attributes need to be updated for user-API
appropriateness, such as using fixed-size types.

> 2. MSI window mappings
>
>    The more problematic question is how to deal with MSIs. We need to
>    create mappings for up to 3 MSI banks that a device may need to
>    target to generate interrupts. The Linux MSI driver can allocate
>    MSIs from the 3 banks any way it wants, and currently user space
>    has no way of knowing which bank may be used for a given device.
>
>    There are 3 options we have discussed and would like your
>    direction:
>
>    A. Implicit mappings -- with this approach user space would not
>       explicitly map MSIs. User space would be required to set the
>       geometry so that there are 3 unused windows (the last 3 windows)

Where does userspace get the number 3 from? E.g. on newer chips there
are 4 MSI banks. Maybe future chips have even more.

>    B. Explicit mapping using DMA map flags. The idea is that a new
>       flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a
>       mapping is to be created for the supplied iova. No vaddr is
>       given though. So in the above example there would be a dma map
>       at 0x10000000 for 24KB (and no vaddr).

A single 24 KiB mapping wouldn't work (and why 24KB? What if only one
MSI group is involved in this VFIO group? What if four MSI groups are
involved?). You'd need to either have a naturally aligned,
power-of-two sized mapping that covers exactly the pages you want to
map and no more, or you'd need to create a separate mapping for each
MSI bank, and due to PAMU subwindow alignment restrictions these
mappings could not be contiguous in iova-space.

>    C. Explicit mapping using normal DMA map. The last idea is that we
>       would introduce a new ioctl to give user-space an fd to the MSI
>       bank, which could be mmapped. The flow would be something like
>       this:
>          - for each group user space calls new ioctl
>            VFIO_GROUP_GET_MSI_FD
>          - user space mmaps the fd, getting a vaddr
>          - user space does a normal DMA map for desired iova
>       This approach makes everything explicit, but adds a new ioctl
>       applicable most likely only to the PAMU (type2 iommu).

The new ioctl isn't really specific to PAMU (or whatever type2 is
supposed to be, which nobody ever explains when I ask), so much as to
the MSI implementation. It just exposes the MSI register as another
device resource (well, technically a groupwide resource, unless we
expose it on a per-device basis and provide enough information for
userspace to recognize when it's the same for other devices in the
group) to be mmapped, which userspace can choose to map in the IOMMU
as well.

Note that in the explicit case, userspace would have to program the
MSI iova into the PCI device's config space (or communicate the chosen
address to the kernel so it can set the config space registers).

-Scott
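The constraint raised in the reply above, that a mapping must be naturally aligned and power-of-two sized, can be checked with a few lines of C. This is a minimal sketch, not driver code: the function name `mapping_is_natural_pow2` is hypothetical, and it tests only the alignment rule stated in the message, not any other PAMU restriction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: true if size is a nonzero power of two and iova
 * is a multiple of size (i.e. the mapping is naturally aligned).
 */
static bool mapping_is_natural_pow2(uint64_t iova, uint64_t size)
{
	bool size_pow2 = size != 0 && (size & (size - 1)) == 0;

	return size_pow2 && (iova & (size - 1)) == 0;
}
```

Under this check a 24KB mapping at iova 0x10000000 is rejected because 24KB is not a power of two, while a single 4KB mapping at the same iova is accepted.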
Re: RFC: vfio API changes needed for powerpc
Hi Stuart,

On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
> Alex,
>
> We are in the process of implementing vfio-pci support for the
> Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is
> quite different than x86, and will involve creating a 'type 2' vfio
> implementation.
>
> For each device's DMA mappings, PAMU has an overall aperture and a
> number of windows. All sizes and window counts must be a power of 2.
> To illustrate, below is a mapping for a 256MB guest, including guest
> memory (backed by 64MB huge pages) and some windows for MSIs:
>
>     Total aperture: 512MB
>     # of windows: 8
>
>     win  gphys/
>     #    iova         phys           size
>     ---  ----         ----           ----
>     0    0x00000000   0xX_XX000000   64MB
>     1    0x04000000   0xX_XX000000   64MB
>     2    0x08000000   0xX_XX000000   64MB
>     3    0x0C000000   0xX_XX000000   64MB
>     4    0x10000000   0xf_fe044000   4KB    // msi bank 1
>     5    0x14000000   0xf_fe045000   4KB    // msi bank 2
>     6    0x18000000   0xf_fe046000   4KB    // msi bank 3
>     7    -            -              disabled
>
> There are a couple of updates needed to the vfio user-kernel
> interface that we would like your feedback on.
>
> 1. IOMMU geometry
>
>    The kernel IOMMU driver now has an interface (see domain_set_attr,
>    domain_get_attr) that lets us set the domain geometry using
>    attributes.
>
>    We want to expose that to user space, so we envision needing a
>    couple of new ioctls to do this:
>        VFIO_IOMMU_SET_ATTR
>        VFIO_IOMMU_GET_ATTR

Any ioctls to the vfiofd (/dev/vfio/vfio) not claimed by vfio-core are
passed to the IOMMU driver. So you can effectively have your own
type2 ioctl extensions. Alexey has already posted patches to do this
for SPAPR that add VFIO_IOMMU_ENABLE/DISABLE to allow him access to
VFIO_IOMMU_GET_INFO to examine locked page requirements. As Scott
notes we need to come up with a clean userspace interface for these
though.

> 2. MSI window mappings
>
>    The more problematic question is how to deal with MSIs. We need to
>    create mappings for up to 3 MSI banks that a device may need to
>    target to generate interrupts. The Linux MSI driver can allocate
>    MSIs from the 3 banks any way it wants, and currently user space
>    has no way of knowing which bank may be used for a given device.
>
>    There are 3 options we have discussed and would like your
>    direction:
>
>    A. Implicit mappings -- with this approach user space would not
>       explicitly map MSIs. User space would be required to set the
>       geometry so that there are 3 unused windows (the last 3 windows)
>       for MSIs, and it would be up to the kernel to create the
>       mappings. This approach requires some specific semantics
>       (leaving 3 windows) and it potentially gets a little weird --
>       when should the kernel actually create the MSI mappings? When
>       should they be unmapped? Some convention would need to be
>       established.

VFIO would have control of SET/GET_ATTR, right? So we could reduce
the number exposed to userspace on GET and transparently add MSI
entries on SET. On x86 the interrupt remapper handles this
transparently when MSI is enabled and userspace never gets direct
access to the device MSI address/data registers. What kind of
restrictions do you have around adding and removing windows while the
aperture is enabled?

>    B. Explicit mapping using DMA map flags. The idea is that a new
>       flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a
>       mapping is to be created for the supplied iova. No vaddr is
>       given though. So in the above example there would be a dma map
>       at 0x10000000 for 24KB (and no vaddr). It's up to the kernel to
>       determine which bank gets mapped where. So, this option puts
>       user space in control of which windows are used for MSIs and
>       when MSIs are mapped/unmapped. There would need to be some
>       semantics as to how this is used -- it only makes sense

This could also be done as another type2 ioctl extension. What's the
value to userspace in determining which windows are used by which
banks? It sounds like the case that there are X banks and if
userspace wants to use MSI it needs to leave X windows available for
that. Is this just buying userspace a few more windows to allow them
the choice between MSI or RAM?

>    C. Explicit mapping using normal DMA map. The last idea is that we
>       would introduce a new ioctl to give user-space an fd to the MSI
>       bank, which could be mmapped. The flow would be something like
>       this:
>          - for each group user space calls new ioctl
>            VFIO_GROUP_GET_MSI_FD
>          - user space mmaps the fd, getting a vaddr
>          - user space does a normal DMA map for desired iova
>       This approach makes everything explicit, but adds a new ioctl
>       applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood <scottw...@freescale.com> wrote:
> On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:
>> Alex,
>>
>> We are in the process of implementing vfio-pci support for the
>> Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is
>> quite different than x86, and will involve creating a 'type 2' vfio
>> implementation.
>>
>> For each device's DMA mappings, PAMU has an overall aperture and a
>> number of windows. All sizes and window counts must be a power of 2.
>> To illustrate, below is a mapping for a 256MB guest, including guest
>> memory (backed by 64MB huge pages) and some windows for MSIs:
>>
>>     Total aperture: 512MB
>>     # of windows: 8
>>
>>     win  gphys/
>>     #    iova         phys           size
>>     ---  ----         ----           ----
>>     0    0x00000000   0xX_XX000000   64MB
>>     1    0x04000000   0xX_XX000000   64MB
>>     2    0x08000000   0xX_XX000000   64MB
>>     3    0x0C000000   0xX_XX000000   64MB
>>     4    0x10000000   0xf_fe044000   4KB    // msi bank 1
>>     5    0x14000000   0xf_fe045000   4KB    // msi bank 2
>>     6    0x18000000   0xf_fe046000   4KB    // msi bank 3
>>     7    -            -              disabled
>>
>> There are a couple of updates needed to the vfio user-kernel
>> interface that we would like your feedback on.
>>
>> 1. IOMMU geometry
>>
>>    The kernel IOMMU driver now has an interface (see domain_set_attr,
>>    domain_get_attr) that lets us set the domain geometry using
>>    attributes.
>>
>>    We want to expose that to user space, so we envision needing a
>>    couple of new ioctls to do this:
>>        VFIO_IOMMU_SET_ATTR
>>        VFIO_IOMMU_GET_ATTR
>
> Note that this means attributes need to be updated for user-API
> appropriateness, such as using fixed-size types.
>
>> 2. MSI window mappings
>>
>>    The more problematic question is how to deal with MSIs. We need to
>>    create mappings for up to 3 MSI banks that a device may need to
>>    target to generate interrupts. The Linux MSI driver can allocate
>>    MSIs from the 3 banks any way it wants, and currently user space
>>    has no way of knowing which bank may be used for a given device.
>>
>>    There are 3 options we have discussed and would like your
>>    direction:
>>
>>    A. Implicit mappings -- with this approach user space would not
>>       explicitly map MSIs. User space would be required to set the
>>       geometry so that there are 3 unused windows (the last 3 windows)
>
> Where does userspace get the number 3 from? E.g. on newer chips there
> are 4 MSI banks. Maybe future chips have even more.

Ok, then make the number 4. The chance of more MSI banks in future
chips is nil, and if it ever happened user space could adjust. Also,
practically speaking, since memory is typically allocated in powers of
2 you need to approximately double the window geometry anyway.

>>    B. Explicit mapping using DMA map flags. The idea is that a new
>>       flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a
>>       mapping is to be created for the supplied iova. No vaddr is
>>       given though. So in the above example there would be a dma map
>>       at 0x10000000 for 24KB (and no vaddr).
>
> A single 24 KiB mapping wouldn't work (and why 24KB? What if only one
> MSI group is involved in this VFIO group? What if four MSI groups are
> involved?). You'd need to either have a naturally aligned,
> power-of-two sized mapping that covers exactly the pages you want to
> map and no more, or you'd need to create a separate mapping for each
> MSI bank, and due to PAMU subwindow alignment restrictions these
> mappings could not be contiguous in iova-space.

You're right, a single 24KB mapping wouldn't work -- in the case of 3
MSI banks perhaps we could just do one 64MB*3 mapping to identify
which windows are used for MSIs.

If only one MSI bank was involved the kernel could get clever and only
enable the banks actually needed.

Stuart
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 03:38:42 PM, Stuart Yoder wrote:
> On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood <scottw...@freescale.com> wrote:
>> On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:
>>> Alex,
>>>
>>> We are in the process of implementing vfio-pci support for the
>>> Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is
>>> quite different than x86, and will involve creating a 'type 2' vfio
>>> implementation.
>>>
>>> For each device's DMA mappings, PAMU has an overall aperture and a
>>> number of windows. All sizes and window counts must be a power of
>>> 2. To illustrate, below is a mapping for a 256MB guest, including
>>> guest memory (backed by 64MB huge pages) and some windows for MSIs:
>>>
>>>     Total aperture: 512MB
>>>     # of windows: 8
>>>
>>>     win  gphys/
>>>     #    iova         phys           size
>>>     ---  ----         ----           ----
>>>     0    0x00000000   0xX_XX000000   64MB
>>>     1    0x04000000   0xX_XX000000   64MB
>>>     2    0x08000000   0xX_XX000000   64MB
>>>     3    0x0C000000   0xX_XX000000   64MB
>>>     4    0x10000000   0xf_fe044000   4KB    // msi bank 1
>>>     5    0x14000000   0xf_fe045000   4KB    // msi bank 2
>>>     6    0x18000000   0xf_fe046000   4KB    // msi bank 3
>>>     7    -            -              disabled
>>>
>>> There are a couple of updates needed to the vfio user-kernel
>>> interface that we would like your feedback on.
>>>
>>> 1. IOMMU geometry
>>>
>>>    The kernel IOMMU driver now has an interface (see
>>>    domain_set_attr, domain_get_attr) that lets us set the domain
>>>    geometry using attributes.
>>>
>>>    We want to expose that to user space, so we envision needing a
>>>    couple of new ioctls to do this:
>>>        VFIO_IOMMU_SET_ATTR
>>>        VFIO_IOMMU_GET_ATTR
>>
>> Note that this means attributes need to be updated for user-API
>> appropriateness, such as using fixed-size types.
>>
>>> 2. MSI window mappings
>>>
>>>    The more problematic question is how to deal with MSIs. We need
>>>    to create mappings for up to 3 MSI banks that a device may need
>>>    to target to generate interrupts. The Linux MSI driver can
>>>    allocate MSIs from the 3 banks any way it wants, and currently
>>>    user space has no way of knowing which bank may be used for a
>>>    given device.
>>>
>>>    There are 3 options we have discussed and would like your
>>>    direction:
>>>
>>>    A. Implicit mappings -- with this approach user space would not
>>>       explicitly map MSIs. User space would be required to set the
>>>       geometry so that there are 3 unused windows (the last 3
>>>       windows)
>>
>> Where does userspace get the number 3 from? E.g. on newer chips
>> there are 4 MSI banks. Maybe future chips have even more.
>
> Ok, then make the number 4. The chance of more MSI banks in future
> chips is nil,

What makes you so sure? Especially since you seem to be presenting
this as not specifically an MPIC API.

> and if it ever happened user space could adjust.

What bit of API is going to tell it that it needs to adjust?

> Also, practically speaking, since memory is typically allocated in
> powers of 2 you need to approximately double the window geometry
> anyway.

Only if your existing mapping needs fit exactly in a power of two.

>>>    B. Explicit mapping using DMA map flags. The idea is that a new
>>>       flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a
>>>       mapping is to be created for the supplied iova. No vaddr is
>>>       given though. So in the above example there would be a dma
>>>       map at 0x10000000 for 24KB (and no vaddr).
>>
>> A single 24 KiB mapping wouldn't work (and why 24KB? What if only
>> one MSI group is involved in this VFIO group? What if four MSI
>> groups are involved?). You'd need to either have a naturally
>> aligned, power-of-two sized mapping that covers exactly the pages
>> you want to map and no more, or you'd need to create a separate
>> mapping for each MSI bank, and due to PAMU subwindow alignment
>> restrictions these mappings could not be contiguous in iova-space.
>
> You're right, a single 24KB mapping wouldn't work -- in the case of 3
> MSI banks perhaps we could just do one 64MB*3 mapping to identify
> which windows are used for MSIs.

Where did the assumption of a 64MiB subwindow size come from?

> If only one MSI bank was involved the kernel could get clever and
> only enable the banks actually needed.

I'd rather see cleverness kept in userspace.

-Scott
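The disagreement above about whether reserving MSI windows doubles the window count can be made concrete with a small C sketch. It rests on assumptions: the helper name `windows_needed` is hypothetical, one window is reserved per MSI bank, and the 256-window upper bound is the figure quoted elsewhere in this thread rather than something taken from the hardware documentation.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WINDOWS 256u	/* upper bound quoted in the thread */

/*
 * Hypothetical helper: smallest power-of-two window count that holds
 * ram_windows guest-memory windows plus one window per MSI bank.
 * Returns 0 if the request is empty or does not fit in MAX_WINDOWS.
 */
static uint32_t windows_needed(uint32_t ram_windows, uint32_t msi_banks)
{
	uint32_t total, count = 1;

	if (ram_windows > MAX_WINDOWS || msi_banks > MAX_WINDOWS)
		return 0;

	total = ram_windows + msi_banks;
	if (total == 0 || total > MAX_WINDOWS)
		return 0;

	/* Round up to the next power of two. */
	while (count < total)
		count <<= 1;
	return count;
}
```

With 4 guest-memory windows and 3 MSI banks the count rises from 4 to 8, which is the doubling described above. With 5 guest-memory windows the count is already 8 before any MSI windows are added, so the MSI banks fit in windows that would otherwise be unused.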
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson <alex.william...@redhat.com> wrote:
>> 2. MSI window mappings
>>
>>    The more problematic question is how to deal with MSIs. We need to
>>    create mappings for up to 3 MSI banks that a device may need to
>>    target to generate interrupts. The Linux MSI driver can allocate
>>    MSIs from the 3 banks any way it wants, and currently user space
>>    has no way of knowing which bank may be used for a given device.
>>
>>    There are 3 options we have discussed and would like your
>>    direction:
>>
>>    A. Implicit mappings -- with this approach user space would not
>>       explicitly map MSIs. User space would be required to set the
>>       geometry so that there are 3 unused windows (the last 3 windows)
>>       for MSIs, and it would be up to the kernel to create the
>>       mappings. This approach requires some specific semantics
>>       (leaving 3 windows) and it potentially gets a little weird --
>>       when should the kernel actually create the MSI mappings? When
>>       should they be unmapped? Some convention would need to be
>>       established.
>
> VFIO would have control of SET/GET_ATTR, right? So we could reduce
> the number exposed to userspace on GET and transparently add MSI
> entries on SET.

The number of windows is always a power of 2 (and max is 256). And to
reduce PAMU cache pressure you want to use the fewest number of
windows you can. So, I don't see practically how we could
transparently steal entries to add the MSIs. Either user space knows
to leave empty windows for MSIs and by convention the kernel knows
which windows those are (as in option #A), or user space explicitly
tells the kernel which windows (as in option #B).

> On x86 the interrupt remapper handles this transparently when MSI is
> enabled and userspace never gets direct access to the device MSI
> address/data registers. What kind of restrictions do you have around
> adding and removing windows while the aperture is enabled?

The windows can be enabled/disabled even while the aperture is
enabled (pretty sure)...

>>    B. Explicit mapping using DMA map flags. The idea is that a new
>>       flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a
>>       mapping is to be created for the supplied iova. No vaddr is
>>       given though. So in the above example there would be a dma map
>>       at 0x10000000 for 24KB (and no vaddr). It's up to the kernel to
>>       determine which bank gets mapped where. So, this option puts
>>       user space in control of which windows are used for MSIs and
>>       when MSIs are mapped/unmapped. There would need to be some
>>       semantics as to how this is used -- it only makes sense
>
> This could also be done as another type2 ioctl extension. What's the
> value to userspace in determining which windows are used by which
> banks? It sounds like the case that there are X banks and if
> userspace wants to use MSI it needs to leave X windows available for
> that. Is this just buying userspace a few more windows to allow them
> the choice between MSI or RAM?

Yes, it would potentially give user space the flexibility of some
more windows. It also makes more explicit when the MSI mappings are
created. In option #A the MSI mappings would probably get created at
the time of the first normal DMA map.

So, you're saying with this approach you'd rather see a new type 2
ioctl instead of adding new flags to DMA map, right?

>>    C. Explicit mapping using normal DMA map. The last idea is that we
>>       would introduce a new ioctl to give user-space an fd to the MSI
>>       bank, which could be mmapped. The flow would be something like
>>       this:
>>          - for each group user space calls new ioctl
>>            VFIO_GROUP_GET_MSI_FD
>>          - user space mmaps the fd, getting a vaddr
>>          - user space does a normal DMA map for desired iova
>>       This approach makes everything explicit, but adds a new ioctl
>>       applicable most likely only to the PAMU (type2 iommu).
>
> And the DMA_MAP of that mmap then allows userspace to select the
> window used? This one seems like a lot of overhead, adding a new
> ioctl, new fd, mmap, special mapping path, etc. It would be less
> overhead to just add an ioctl to enable MSI, maybe letting userspace
> pick which windows get used, but I'm still not sure what the value is
> to userspace in exposing it. Thanks,

Thanks,
Stuart
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
> On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
>> 2. MSI window mappings
>>
>>    The more problematic question is how to deal with MSIs. We need to
>>    create mappings for up to 3 MSI banks that a device may need to
>>    target to generate interrupts. The Linux MSI driver can allocate
>>    MSIs from the 3 banks any way it wants, and currently user space
>>    has no way of knowing which bank may be used for a given device.
>>
>>    There are 3 options we have discussed and would like your
>>    direction:
>>
>>    A. Implicit mappings -- with this approach user space would not
>>       explicitly map MSIs. User space would be required to set the
>>       geometry so that there are 3 unused windows (the last 3 windows)
>>       for MSIs, and it would be up to the kernel to create the
>>       mappings. This approach requires some specific semantics
>>       (leaving 3 windows) and it potentially gets a little weird --
>>       when should the kernel actually create the MSI mappings? When
>>       should they be unmapped? Some convention would need to be
>>       established.
>
> VFIO would have control of SET/GET_ATTR, right? So we could reduce
> the number exposed to userspace on GET and transparently add MSI
> entries on SET.

What do you mean by "reduce the number exposed"? Userspace decides
how many entries there are, but it must be a power of two between 1
and 256.

> On x86 the interrupt remapper handles this transparently when MSI is
> enabled and userspace never gets direct access to the device MSI
> address/data registers.

x86 has a totally different mechanism here, as far as I understand --
even before you get into restrictions on mappings.

> What kind of restrictions do you have around adding and removing
> windows while the aperture is enabled?

Subwindows can be modified while the aperture is enabled, but the
aperture size and number of subwindows cannot be changed.

>>    B. Explicit mapping using DMA map flags. The idea is that a new
>>       flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a
>>       mapping is to be created for the supplied iova. No vaddr is
>>       given though. So in the above example there would be a dma map
>>       at 0x10000000 for 24KB (and no vaddr). It's up to the kernel to
>>       determine which bank gets mapped where. So, this option puts
>>       user space in control of which windows are used for MSIs and
>>       when MSIs are mapped/unmapped. There would need to be some
>>       semantics as to how this is used -- it only makes sense
>
> This could also be done as another type2 ioctl extension.

Again, what is type2, specifically? If someone else is adding their
own IOMMU that is kind of, sort of like PAMU, how would they know if
it's close enough? What assumptions can a user make when they see
that they're dealing with type2?

> What's the value to userspace in determining which windows are used
> by which banks?

That depends on who programs the MSI config space address. What is
important is userspace controlling which iovas will be dedicated to
this, in case it wants to put something else there.

> It sounds like the case that there are X banks and if userspace wants
> to use MSI it needs to leave X windows available for that. Is this
> just buying userspace a few more windows to allow them the choice
> between MSI or RAM?

Well, there could be that. But also, userspace will generally have a
much better idea of the type of mappings it's creating, so it's easier
to keep everything explicit at the kernel/user interface than require
more complicated code in the kernel to figure things out automatically
(not just for MSIs but in general).

If the kernel automatically creates the MSI mappings, when does it
assume that userspace is done creating its own? What if userspace
doesn't need any DMA other than the MSIs? What if userspace wants to
continue dynamically modifying its other mappings?

>>    C. Explicit mapping using normal DMA map. The last idea is that we
>>       would introduce a new ioctl to give user-space an fd to the MSI
>>       bank, which could be mmapped. The flow would be something like
>>       this:
>>          - for each group user space calls new ioctl
>>            VFIO_GROUP_GET_MSI_FD
>>          - user space mmaps the fd, getting a vaddr
>>          - user space does a normal DMA map for desired iova
>>       This approach makes everything explicit, but adds a new ioctl
>>       applicable most likely only to the PAMU (type2 iommu).
>
> And the DMA_MAP of that mmap then allows userspace to select the
> window used? This one seems like a lot of overhead, adding a new
> ioctl, new fd, mmap, special mapping path, etc.

There's going to be special stuff no matter what. This would keep it
separated from the IOMMU map code.

I'm not sure what you mean by overhead here... the runtime overhead of
setting things up is not particularly relevant as long
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 3:47 PM, Scott Wood scottw...@freescale.com wrote: On 04/02/2013 03:38:42 PM, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com wrote: On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote: Alex, We are in the process of implementing vfio-pci support for the Freescale IOMMU (PAMU). It is an aperture/window-based IOMMU and is quite different than x86, and will involve creating a 'type 2' vfio implementation. For each device's DMA mappings, PAMU has an overall aperture and a number of windows. All sizes and window counts must be power of 2. To illustrate, below is a mapping for a 256MB guest, including guest memory (backed by 64MB huge pages) and some windows for MSIs: Total aperture: 512MB # of windows: 8 win gphys/ # iovaphys size --- 0 0x 0xX_XX00 64MB 1 0x0400 0xX_XX00 64MB 2 0x0800 0xX_XX00 64MB 3 0x0C00 0xX_XX00 64MB 4 0x1000 0xf_fe044000 4KB// msi bank 1 5 0x1400 0xf_fe045000 4KB// msi bank 2 6 0x1800 0xf_fe046000 4KB// msi bank 3 7- - disabled There are a couple of updates needed to the vfio user-kernel interface that we would like your feedback on. 1. IOMMU geometry The kernel IOMMU driver now has an interface (see domain_set_attr, domain_get_attr) that lets us set the domain geometry using attributes. We want to expose that to user space, so envision needing a couple of new ioctls to do this: VFIO_IOMMU_SET_ATTR VFIO_IOMMU_GET_ATTR Note that this means attributes need to be updated for user-API appropriateness, such as using fixed-size types. 2. MSI window mappings The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction: A. 
Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) Where does userspace get the number 3 from? E.g. on newer chips there are 4 MSI banks. Maybe future chips have even more. Ok, then make the number 4. The chance of more MSI banks in future chips is nil, What makes you so sure? Especially since you seem to be presenting this as not specifically an MPIC API. and if it ever happened user space could adjust. What bit of API is going to tell it that it needs to adjust? Haven't thought through that completely, but I guess we could add an API to return the number of MSI banks for type 2 iommus. Also, practically speaking since memory is typically allocate in powers of 2 way you need to approximately double the window geometry anyway. Only if your existing mapping needs fit exactly in a power of two. B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a a dma map at 0x1000 for 24KB (and no vaddr). A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI group is involved in this VFIO group? What if four MSI groups are involved?). You'd need to either have a naturally aligned, power-of-two sized mapping that covers exactly the pages you want to map and no more, or you'd need to create a separate mapping for each MSI bank, and due to PAMU subwindow alignment restrictions these mappings could not be contiguous in iova-space. You're right, a single 24KB mapping wouldn't work-- in the case of 3 MSI banks perhaps we could just do one 64MB*3 mapping to identify which windows are used for MSIs. Where did the assumption of a 64MiB subwindow size come from? The example I was using. 
User space would need to create a mapping for window_size * msi_bank_count. Stuart
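The geometry arithmetic discussed in this message can be sketched in a few lines. This is only an illustration under stated assumptions: `pamu_plan_geometry` and `struct pamu_geometry` are hypothetical names, not kernel or VFIO interfaces, and the sketch assumes every window has the same power-of-two size and that the window count is a power of two no larger than 256, as stated elsewhere in the thread.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch; not a kernel structure. */
struct pamu_geometry {
    uint64_t window_count;   /* power of two, at most 256 */
    uint64_t aperture_size;  /* window_count * window_size */
};

/* Round n up to the next power of two (n >= 1). */
static uint64_t round_up_pow2(uint64_t n)
{
    uint64_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Plan an aperture that holds ram_windows windows of guest memory plus one
 * window per MSI bank. Returns 0 on success, -1 if the request does not fit
 * the constraints assumed above.
 */
static int pamu_plan_geometry(uint64_t ram_windows, uint64_t msi_banks,
                              uint64_t window_size, struct pamu_geometry *out)
{
    uint64_t needed = ram_windows + msi_banks;

    if (needed == 0 || needed > 256)
        return -1;
    /* window_size must be a non-zero power of two */
    if (window_size == 0 || (window_size & (window_size - 1)) != 0)
        return -1;

    uint64_t count = round_up_pow2(needed);
    if (window_size > UINT64_MAX / count)
        return -1;  /* aperture size would overflow */

    out->window_count = count;
    out->aperture_size = count * window_size;
    return 0;
}
```

With the numbers from the example mapping (4 guest-memory windows of 64MB and 3 MSI banks) this yields 8 windows and a 512MB aperture; a fourth MSI bank still fits in 8 windows, while a fifth guest-memory window would push the count to 16.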
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote: This could also be done as another type2 ioctl extension. Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2? We will define that as part of the type2 implementation. Highly unlikely anything but a PAMU will comply. What's the value to userspace in determining which windows are used by which banks? That depends on who programs the MSI config space address. What is important is userspace controlling which iovas will be dedicated to this, in case it wants to put something else there. It sounds like the case that there are X banks and if userspace wants to use MSI it needs to leave X windows available for that. Is this just buying userspace a few more windows to allow them the choice between MSI or RAM? Well, there could be that. But also, userspace will generally have a much better idea of the type of mappings it's creating, so it's easier to keep everything explicit at the kernel/user interface than require more complicated code in the kernel to figure things out automatically (not just for MSIs but in general). If the kernel automatically creates the MSI mappings, when does it assume that userspace is done creating its own? What if userspace doesn't need any DMA other than the MSIs? What if userspace wants to continue dynamically modifying its other mappings? C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. 
The flow would be something like this:

    -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
    -user space mmaps the fd, getting a vaddr
    -user space does a normal DMA map for desired iova

This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu). And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc. There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls:

    VFIO_IOMMU_GET_MSI_COUNT
    VFIO_IOMMU_MAP_MSI(iova, size)

Stuart
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson alex.william...@redhat.com wrote: 2. MSI window mappings The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction: A. Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) for MSIs, and it would be up to the kernel to create the mappings. This approach requires some specific semantics (leaving 3 windows) and it potentially gets a little weird-- when should the kernel actually create the MSI mappings? When should they be unmapped? Some convention would need to be established. VFIO would have control of SET/GET_ATTR, right? So we could reduce the number exposed to userspace on GET and transparently add MSI entries on SET. The number of windows is always power of 2 (and max is 256). And to reduce PAMU cache pressure you want to use the fewest number of windows you can.So, I don't see practically how we could transparently steal entries to add the MSIs. Either user space knows to leave empty windows for MSIs and by convention the kernel knows which windows those are (as in option #A) or explicitly tell the kernel which windows (as in option #B). Ok, apparently I don't understand the API. Is it something like userspace calls GET_ATTR and finds out that there are 256 available windows, userspace determines that it needs 8 for RAM and then it has an MSI device, so it needs to call SET_ATTR and ask for 16? 
That seems prone to exploitation by the first userspace to allocate it's aperture, but I'm also not sure why userspace could specify the (non-power of 2) number of windows it needs for RAM, then VFIO would see that the devices attached have MSI and add those windows and align to a power of 2. On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers. What kind of restrictions do you have around adding and removing windows while the aperture is enabled? The windows can be enabled/disabled event while the aperture is enabled (pretty sure)... B. Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a a dma map at 0x1000 for 24KB (and no vaddr). It's up to the kernel to determine which bank gets mapped where. So, this option puts user space in control of which windows are used for MSIs and when MSIs are mapped/unmapped. There would need to be some semantics as to how this is used-- it only makes sense This could also be done as another type2 ioctl extension. What's the value to userspace in determining which windows are used by which banks? It sounds like the case that there are X banks and if userspace wants to use MSI it needs to leave X windows available for that. Is this just buying userspace a few more windows to allow them the choice between MSI or RAM? Yes, it would potentially give user space the flexibility some more windows. It also makes more explicit when the MSI mappings are created. In option #A the MSI mappings would probably get created at the time of the first normal DMA map. So, you're saying with this approach you'd rather see a new type 2 ioctl instead of adding new flags to DMA map, right? I'm not sure I know enough yet to have a suggestion. 
What would be the purpose of userspace specifying the iova and size here? If userspace just needs to know that it needs X addition windows for MSI and can tell the kernel to use banks 0 through (X-1) for MSI, that sounds more like an ioctl interface than a DMA_MAP flag. Thanks, Alex C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this: -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD -user space mmaps the fd, getting a vaddr -user space does a normal DMA map for desired iova This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu).
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote: On 04/02/2013 03:32:17 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 17:32 +, Yoder Stuart-B08248 wrote: 2. MSI window mappings The more problematic question is how to deal with MSIs. We need to create mappings for up to 3 MSI banks that a device may need to target to generate interrupts. The Linux MSI driver can allocate MSIs from the 3 banks any way it wants, and currently user space has no way of knowing which bank may be used for a given device. There are 3 options we have discussed and would like your direction: A. Implicit mappings -- with this approach user space would not explicitly map MSIs. User space would be required to set the geometry so that there are 3 unused windows (the last 3 windows) for MSIs, and it would be up to the kernel to create the mappings. This approach requires some specific semantics (leaving 3 windows) and it potentially gets a little weird-- when should the kernel actually create the MSI mappings? When should they be unmapped? Some convention would need to be established. VFIO would have control of SET/GET_ATTR, right? So we could reduce the number exposed to userspace on GET and transparently add MSI entries on SET. What do you mean by reduce the number exposed? Userspace decides how many entries there are, but it must be a power of two beteen 1 and 256. I didn't understand the API. On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers. x86 has a totally different mechanism here, as far as I understand -- even before you get into restrictions on mappings. So what control will userspace have over programming the actually MSI vectors on PAMU? What kind of restrictions do you have around adding and removing windows while the aperture is enabled? Subwindows can be modified while the aperture is enabled, but the aperture size and number of subwindows cannot be changed. B. 
Explicit mapping using DMA map flags. The idea is that a new flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that a mapping is to be created for the supplied iova. No vaddr is given though. So in the above example there would be a a dma map at 0x1000 for 24KB (and no vaddr). It's up to the kernel to determine which bank gets mapped where. So, this option puts user space in control of which windows are used for MSIs and when MSIs are mapped/unmapped. There would need to be some semantics as to how this is used-- it only makes sense This could also be done as another type2 ioctl extension. Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2? Naming always has and always will be a problem. I assume this is named type2 rather than PAMU because it's trying to expose a generic windowed IOMMU fitting the IOMMU API. Like type1, it doesn't really make sense to name it IOMMU API because that's a kernel internal interface and we're designing a userspace interface that just happens to use that. Tagging it to a piece of hardware makes it less reusable. Type1 is arbitrary. It might as well be named brown and this one can be blue. What's the value to userspace in determining which windows are used by which banks? That depends on who programs the MSI config space address. What is important is userspace controlling which iovas will be dedicated to this, in case it wants to put something else there. So userspace is programming the MSI vectors, targeting a user programmed iova? But an iova selects a window and I thought there were some number of MSI banks and we don't really know which ones we'll need... still confused. It sounds like the case that there are X banks and if userspace wants to use MSI it needs to leave X windows available for that. 
Is this just buying userspace a few more windows to allow them the choice between MSI or RAM? Well, there could be that. But also, userspace will generally have a much better idea of the type of mappings it's creating, so it's easier to keep everything explicit at the kernel/user interface than require more complicated code in the kernel to figure things out automatically (not just for MSIs but in general). If the kernel automatically creates the MSI mappings, when does it assume that userspace is done creating its own? What if userspace doesn't need any DMA other than the MSIs? What if userspace wants to continue
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote: C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this:

    -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
    -user space mmaps the fd, getting a vaddr
    -user space does a normal DMA map for desired iova

This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu). And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc. There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls:

    VFIO_IOMMU_GET_MSI_COUNT
    VFIO_IOMMU_MAP_MSI(iova, size)

How are MSIs related to devices on PAMU? On x86 MSI count is very device specific, which means it would be a VFIO_DEVICE_* ioctl (actually VFIO_DEVICE_GET_IRQ_INFO does this for us on x86). The trouble with it being a device ioctl is that you need to get the device FD, but the IOMMU protection needs to be established before you can get that... so there's an ordering problem if you need it from the device before configuring the IOMMU.
Thanks, Alex
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 04:08:27 PM, Stuart Yoder wrote:
> On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
> > > This could also be done as another type2 ioctl extension.
> >
> > Again, what is type2, specifically? If someone else is adding their own
> > IOMMU that is kind of, sort of like PAMU, how would they know if it's
> > close enough? What assumptions can a user make when they see that
> > they're dealing with type2?
>
> We will define that as part of the type2 implementation. Highly unlikely
> anything but a PAMU will comply.

So then why not just call it pamu instead of being obfuscatory?

> > There's going to be special stuff no matter what. This would keep it
> > separated from the IOMMU map code. I'm not sure what you mean by
> > overhead here... the runtime overhead of setting things up is not
> > particularly relevant as long as it's reasonable. If you mean
> > development and maintenance effort, keeping things well separated
> > should help.
>
> We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl
> that allows user space to set which windows are MSIs,

And what specifically does that ioctl do? It causes new mappings to be created, right? So you're changing (or at least adding to) the DMA map mechanism.

> it seems vastly less complex than an ioctl to supply a new fd, mmap of
> it, etc.

I don't see enough complexity in the mmap approach for anything to be vastly less complex in comparison. I think you're building the mmap approach up in your head to be a lot worse than it would actually be.

-Scott
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 04:16:11 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote: The number of windows is always power of 2 (and max is 256). And to reduce PAMU cache pressure you want to use the fewest number of windows you can.So, I don't see practically how we could transparently steal entries to add the MSIs. Either user space knows to leave empty windows for MSIs and by convention the kernel knows which windows those are (as in option #A) or explicitly tell the kernel which windows (as in option #B). Ok, apparently I don't understand the API. Is it something like userspace calls GET_ATTR and finds out that there are 256 available windows, userspace determines that it needs 8 for RAM and then it has an MSI device, so it needs to call SET_ATTR and ask for 16? That seems prone to exploitation by the first userspace to allocate it's aperture, What exploitation? It's not as if there is a pool of 256 global windows that users allocate from. The subwindow count is just how finely divided the aperture is. The only way one user will affect another is through cache contention (which is why we want the minimum number of subwindows that we can get away with). but I'm also not sure why userspace could specify the (non-power of 2) number of windows it needs for RAM, then VFIO would see that the devices attached have MSI and add those windows and align to a power of 2. If you double the subwindow count without userspace knowing, you have to double the aperture as well (and you may need to grow up or down depending on alignment). This means you also need to halve the maximum aperture that userspace can request. And you need to expose a different number of maximum subwindows in the IOMMU API based on whether we might have MSIs of this type. It's ugly and awkward, and removes the possibility for userspace to place the MSIs in some unused slot in the middle, or not use MSIs at all. 
-Scott
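The point that the subwindow count only determines how finely the aperture is divided can be made concrete with a small sketch. It assumes equal-sized subwindows that split the aperture evenly; `subwindow_index` is a hypothetical helper written for illustration, not part of the PAMU driver or the IOMMU API.

```c
#include <stdint.h>

/*
 * Return the index of the subwindow that contains iova, or -1 if iova lies
 * outside the aperture. Assumes the aperture is divided into
 * subwindow_count equal parts.
 */
static int subwindow_index(uint64_t aperture_base, uint64_t aperture_size,
                           uint32_t subwindow_count, uint64_t iova)
{
    uint64_t subwindow_size;

    if (subwindow_count == 0 || aperture_size == 0)
        return -1;
    if (iova < aperture_base || iova - aperture_base >= aperture_size)
        return -1;

    subwindow_size = aperture_size / subwindow_count;
    if (subwindow_size == 0)
        return -1;

    return (int)((iova - aperture_base) / subwindow_size);
}
```

For a 512MB aperture at base 0 split into 8 subwindows, an iova at 256MB falls in subwindow 4. Doubling the count to 16 without touching the aperture moves the same iova to subwindow 8, because every subwindow shrinks to 32MB; only by doubling the aperture to 1GB as well does the iova land in subwindow 4 again. That is the coupling between subwindow count and aperture size described above.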
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 04:32:04 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote: On 04/02/2013 03:32:17 PM, Alex Williamson wrote: On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers. x86 has a totally different mechanism here, as far as I understand -- even before you get into restrictions on mappings. So what control will userspace have over programming the actually MSI vectors on PAMU? Not sure what you mean -- PAMU doesn't get explicitly involved in MSIs. It's just another 4K page mapping (per relevant MSI bank). If you want isolation, you need to make sure that an MSI group is only used by one VFIO group, and that you're on a chip that has alias pages with just one MSI bank register each (newer chips do, but the first chip to have a PAMU didn't). This could also be done as another type2 ioctl extension. Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2? Naming always has and always will be a problem. I assume this is named type2 rather than PAMU because it's trying to expose a generic windowed IOMMU fitting the IOMMU API. But how closely is the MSI situation related to a generic windowed IOMMU, then? We could just as well have a highly flexible IOMMU in terms of arbitrary 4K page mappings, but still handle MSIs as pages to be mapped rather than a translation table. Or we could have a windowed IOMMU that has an MSI translation table. Like type1, it doesn't really make sense to name it IOMMU API because that's a kernel internal interface and we're designing a userspace interface that just happens to use that. Tagging it to a piece of hardware makes it less reusable. Well, that's my point. Is it reusable at all, anyway? 
If not, then giving it a more obscure name won't change that. If it is reusable, then where is the line drawn between things that are PAMU-specific or MPIC-specific and things that are part of the generic windowed IOMMU abstraction? Type1 is arbitrary. It might as well be named brown and this one can be blue. The difference is that type1 seems to refer to hardware that can do arbitrary 4K page mappings, possibly constrained by an aperture but nothing else. More than one IOMMU can reasonably fit that. The odds that another IOMMU would have exactly the same restrictions as PAMU seem smaller in comparison. In any case, if you had to deal with some Intel-only quirk, would it make sense to call it a type1 attribute? I'm not advocating one way or the other on whether an abstraction is viable here (though Stuart seems to think it's highly unlikely anything but a PAMU will comply), just that if it is to be abstracted rather than a hardware-specific interface, we need to document what is and is not part of the abstraction. Otherwise a non-PAMU-specific user won't know what they can rely on, and someone adding support for a new windowed IOMMU won't know if theirs is close enough, or they need to introduce a type3. What's the value to userspace in determining which windows are used by which banks? That depends on who programs the MSI config space address. What is important is userspace controlling which iovas will be dedicated to this, in case it wants to put something else there. So userspace is programming the MSI vectors, targeting a user programmed iova? But an iova selects a window and I thought there were some number of MSI banks and we don't really know which ones we'll need... still confused. Userspace would also need a way to find out the page offset and data value. That may be an argument in favor of having the two ioctls Stuart later suggested (get MSI count, and map MSI). 
Would there be any complication in the VFIO code from tracking a mapping that doesn't have a userspace virtual address associated with it? There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. Overhead in terms of code required and complexity. More things to reference count and shut down in the proper order on userspace exit. Thanks, That didn't stop others from having me convert the KVM device control API to use file descriptors instead of something more ad-hoc with a better-defined destruction order. :-) I don't know if it necessarily needs to be a separate fd -- it could be just another device resource like BARs, with some way for userspace to
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 04:38:45 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote: On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote: C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this: -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD -user space mmaps the fd, getting a vaddr -user space does a normal DMA map for desired iova This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu). And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc. There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help. We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls: VFIO_IOMMU_GET_MSI_COUNT Do you mean a count of actual MSIs or a count of MSI banks used by the whole VFIO group? VFIO_IOMMU_MAP_MSI(iova, size) Not sure how you mean size to be used -- for MPIC it would be 4K per bank, and you can only map one bank at a time (which bank you're mapping should be a parameter, if only so that the kernel doesn't have to keep iteration state for you). How are MSIs related to devices on PAMU? PAMU doesn't care about MSIs. The relation of individual MSIs to a device is standard PCI stuff. Each MSI bank (which is part of the MPIC, not PAMU) can hold numerous MSIs. 
The VFIO user would want to map all MSI banks that are in use by any of the devices in the group. Ideally we'd let the VFIO grouping influence the allocation of MSIs. On x86 MSI count is very device specific, which means it would be a VFIO_DEVICE_* ioctl (actually VFIO_DEVICE_GET_IRQ_INFO does this for us on x86). The trouble with it being a device ioctl is that you need to get the device FD, but the IOMMU protection needs to be established before you can get that... so there's an ordering problem if you need it from the device before configuring the IOMMU. Thanks, What do you mean by IOMMU protection needs to be established? Wouldn't we just start with no mappings in place? -Scott
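To make the suggestion of an explicit bank parameter concrete, here is a rough sketch of what a per-bank map request could look like from user space. Everything in it is hypothetical: `struct msi_map_request`, `msi_map_request_check` and `msi_plan_requests` are invented for illustration and do not correspond to any existing or proposed VFIO structure. The only facts taken from the thread are that one bank is mapped at a time and that each bank occupies one 4K page.

```c
#include <stdint.h>

#define MSI_BANK_PAGE_SIZE 4096u  /* "4K per bank", as stated in the thread */

/* Hypothetical request: map one MSI bank at a caller-chosen iova. */
struct msi_map_request {
    uint32_t bank;  /* which MSI bank to map */
    uint64_t iova;  /* where userspace wants the bank to appear */
};

/* Return 0 if the request is plausible, -1 otherwise. */
static int msi_map_request_check(const struct msi_map_request *req,
                                 uint32_t bank_count)
{
    if (req->bank >= bank_count)
        return -1;  /* no such bank */
    if (req->iova % MSI_BANK_PAGE_SIZE != 0)
        return -1;  /* a bank is a whole page; iova must be page aligned */
    return 0;
}

/*
 * Build one request per bank, placing bank i at first_iova + i * stride
 * (for instance one subwindow per bank). Returns the number of requests
 * written, or -1 on error.
 */
static int msi_plan_requests(uint32_t bank_count, uint64_t first_iova,
                             uint64_t stride, struct msi_map_request *reqs,
                             uint32_t max_reqs)
{
    uint32_t i;

    if (bank_count > max_reqs)
        return -1;

    for (i = 0; i < bank_count; i++) {
        reqs[i].bank = i;
        reqs[i].iova = first_iova + (uint64_t)i * stride;
        if (msi_map_request_check(&reqs[i], bank_count) != 0)
            return -1;
    }
    return (int)bank_count;
}
```

Passing the bank index explicitly keeps the kernel stateless between calls, which is the reason given above for making the bank a parameter. A count query would supply `bank_count` before the loop runs.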
Re: [PATCH-v2 0/3] virtio/vhost: Add checks for uninitialized VQs
On Tue, 2013-04-02 at 15:01 +0300, Michael S. Tsirkin wrote: On Mon, Apr 01, 2013 at 11:58:21PM +, Nicholas A. Bellinger wrote: From: Nicholas Bellinger n...@linux-iscsi.org

Hi folks,

This series adds a virtio_queue_valid() for use by virtio-pci code in order to prevent operations upon uninitialized VQs, which is currently expected to occur during seabios setup of virtio-scsi with in-flight vhost-scsi-pci device code. On the vhost side, it also adds virtio_queue_valid() sanity checks in vhost_virtqueue_[start,stop]() and vhost_verify_ring_mappings() in order to skip the same uninitialized VQs.

Changes from v1:
 - Remove now unnecessary virtio_queue_get_num() calls in virtio-pci.c
 - Add virtio_queue_valid() calls in vhost_virtqueue_[start,stop]()

Please review. --nab

Looks reasonable. Acked-by: Michael S. Tsirkin m...@redhat.com

Thanks MST! Anthony, do you want to pick these up now..? Or shall I include in the next vhost-scsi-pci PATCH-v3 series..? --nab

So - does this fix the issues you saw with vhost-scsi?

Michael S. Tsirkin (1):
  virtio: add API to check that ring is setup

Nicholas Bellinger (2):
  virtio-pci: Add virtio_queue_valid checks ahead of virtio_queue_get_num
  vhost: Skip uninitialized VQs in vhost_virtqueue_[start,stop]

 hw/vhost.c      | 12
 hw/virtio-pci.c | 34 +++---
 hw/virtio.c     |  5 +
 hw/virtio.h     |  1 +
 4 files changed, 33 insertions(+), 19 deletions(-)

-- 1.7.2.5
Re: [PATCH] tcm_vhost: Use ACCESS_ONCE for vs->vs_tpg[target] access
On Tue, 2013-04-02 at 18:39 +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 02, 2013 at 11:31:37PM +0800, Asias He wrote:
> > In vhost_scsi_handle_vq:
> >
> >     tv_tpg = vs->vs_tpg[target];
> >     if (!tv_tpg) {
> >         return
> >     }
> >
> >     tv_cmd = vhost_scsi_allocate_cmd(tv_tpg, v_req,
> >
> > 1) vs->vs_tpg[target] might change after the NULL check and 2) the above
> > line might access tv_tpg from vs->vs_tpg[target]. To prevent 2), use
> > ACCESS_ONCE. Thanks mst for catching this up!
> >
> > Signed-off-by: Asias He as...@redhat.com
>
> OK this might be ok for 3.9.
> Acked-by: Michael S. Tsirkin m...@redhat.com
>
> Nicholas can you pick this up pls?

Applying to target-pending/master now.

> For 3.10 I still think it's best to get rid of it and stick vs->vs_tpg in
> vq->private_data.

Your call here.

Given that vhost-scsi-pci code + Seabios w/ virtio-scsi enabled will be broken without Asias's two extra vq->private_data and initialize vq->last_used_idx changes on the list, they will certainly need to hit 3.9.x code once you're happy to ACK for v3.10.

Asias, I assume you'll be updating this soon..?

--nab
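The pattern the patch relies on can be shown in a stand-alone form. This is a userspace sketch, not the tcm_vhost code: `struct tpg`, `struct host` and `lookup_tpg_id` are hypothetical stand-ins, and the macro below mirrors the kernel's ACCESS_ONCE definition of that era using the GNU `__typeof__` extension (later kernels use READ_ONCE() instead). It only addresses point 2) from the commit message, namely keeping the compiler from re-reading the shared slot after the NULL check.

```c
#include <stddef.h>

/* Force exactly one load of x by going through a volatile lvalue. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

#define MAX_TARGETS 4

struct tpg {                 /* stand-in for the target portal group */
    int id;
};

struct host {                /* stand-in for the structure holding vs_tpg[] */
    struct tpg *slots[MAX_TARGETS];
};

/*
 * Return the id of the tpg in the given slot, or -1 if the slot is empty or
 * out of range. The slot is read once into a local, so the NULL check and
 * the later dereference are guaranteed to use the same pointer value.
 */
static int lookup_tpg_id(struct host *h, unsigned int target)
{
    struct tpg *t;

    if (target >= MAX_TARGETS)
        return -1;

    t = ACCESS_ONCE(h->slots[target]);
    if (!t)
        return -1;

    return t->id;
}
```

Note that this does nothing about point 1): the slot may still change right after the read, so the object's lifetime has to be protected by other means.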
RE: [PATCH v6 6/6] KVM: Use eoi to track RTC interrupt delivery status
Gleb Natapov wrote on 2013-04-02:
> On Fri, Mar 29, 2013 at 03:25:16AM +, Zhang, Yang Z wrote:
> > Paolo Bonzini wrote on 2013-03-26:
> > > On 22/03/2013 06:24, Yang Zhang wrote:
> > > > +static void rtc_irq_ack_eoi(struct kvm_vcpu *vcpu,
> > > > +			struct rtc_status *rtc_status, int irq)
> > > > +{
> > > > +	if (irq != RTC_GSI)
> > > > +		return;
> > > > +
> > > > +	if (test_and_clear_bit(vcpu->vcpu_id, rtc_status->dest_map))
> > > > +		--rtc_status->pending_eoi;
> > > > +
> > > > +	WARN_ON(rtc_status->pending_eoi < 0);
> > > > +}
> > >
> > > This is the only case where you're passing the struct rtc_status
> > > instead of the struct kvm_ioapic. Please use the latter, and make it
> > > the first argument.
> > >
> > > > @@ -244,7 +268,14 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
> > > >  	irqe.level = 1;
> > > >  	irqe.shorthand = 0;
> > > >
> > > > -	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe, NULL);
> > > > +	if (irq == RTC_GSI) {
> > > > +		ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe,
> > > > +				ioapic->rtc_status.dest_map);
> > > > +		ioapic->rtc_status.pending_eoi = ret;
> > >
> > > I think you should either add a
> > > BUG_ON(ioapic->rtc_status.pending_eoi != 0); or use
> > > ioapic->rtc_status.pending_eoi += ret (or both).
> >
> > A malicious guest may write EOI more than once, and pending_eoi will
> > then be negative. But it should not be a bug; just WARN_ON is enough.
> > And we already do it in ack_eoi, so there is no need to duplicate it
> > here.
>
> Since we track vcpus that already called EOI and decrement pending_eoi
> only once for each vcpu, a malicious guest cannot trigger it, but we
> already do WARN_ON() in rtc_irq_ack_eoi(), so I am not sure we need
> another one here. += will be correct (since pending_eoi == 0 here), but
> confusing since it gives the impression that pending_eoi may not be zero.

Yes, I also got the wrong impression. With the previous implementation, pending_eoi may not be zero: the destination vcpus were calculated by parsing the IOAPIC entry, and in lowest-priority delivery mode all possible vcpus were set in dest_map even if they did not finally receive the interrupt.
At the same time, a malicious guest could send an IPI with the same vector as the RTC to vcpus that are in dest_map but did not receive the RTC interrupt; pending_eoi would then go negative. Now we set dest_map only for the vcpus that really received the interrupt, so the above case cannot happen. So as you and Paolo suggested, it is better to use +=. Best regards, Yang
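The bookkeeping under discussion (a destination bitmap plus a pending counter, decremented at most once per vcpu) can be modelled in user space as follows. This is a simplified sketch, not the KVM code: `struct rtc_status_sketch`, `rtc_deliver` and `rtc_ack_eoi` are hypothetical names, the bitmap is a plain 64-bit word limited to 64 vcpus, and there are no atomic operations, unlike the `test_and_clear_bit()` used in the patch.

```c
#include <stdint.h>

struct rtc_status_sketch {
    int pending_eoi;    /* vcpus that received the interrupt, no EOI yet */
    uint64_t dest_map;  /* bit n set: vcpu n still owes an EOI */
};

/* Record delivery to the vcpus in delivered_mask (only real receivers). */
static void rtc_deliver(struct rtc_status_sketch *s, uint64_t delivered_mask)
{
    uint64_t fresh = delivered_mask & ~s->dest_map;

    s->dest_map |= fresh;
    /* Count newly set bits; clears the lowest set bit on each iteration. */
    for (; fresh != 0; fresh &= fresh - 1)
        s->pending_eoi++;
}

/*
 * Handle an EOI from vcpu_id. Returns 1 if it was counted, 0 if it was a
 * duplicate or came from a vcpu that never received the interrupt.
 */
static int rtc_ack_eoi(struct rtc_status_sketch *s, unsigned int vcpu_id)
{
    uint64_t bit;

    if (vcpu_id >= 64)
        return 0;

    bit = 1ULL << vcpu_id;
    if ((s->dest_map & bit) == 0)
        return 0;

    s->dest_map &= ~bit;
    s->pending_eoi--;
    return 1;
}
```

Because the counter only moves when a bit is actually cleared, repeated EOIs from one vcpu, or EOIs from vcpus outside the map, leave `pending_eoi` unchanged, which is why it cannot go negative once `dest_map` holds only the real receivers.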
Re: [PATCH v4 3/6] KVM: Initialize irqfd from kvm_init().
On 02/28/2013 04:22 AM, Cornelia Huck wrote: Currently, eventfd introduces module_init/module_exit functions to initialize/cleanup the irqfd workqueue. This only works, however, if no other module_init/module_exit functions are built into the same module. Let's just move the initialization and cleanup to kvm_init and kvm_exit. This way, it is also clearer where kvm startup may fail. Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com I'm seeing this during boot: [6.763302] [ cut here ] [6.763763] WARNING: at kernel/workqueue.c:4204 destroy_workqueue+0x1df/0x3d0() [6.764507] Modules linked in: [6.764792] Pid: 1, comm: swapper/0 Tainted: GW 3.9.0-rc5-next-20130402-sasha-00015-g3522ec5 #324 [6.765654] Call Trace: [6.765875] [811074fb] warn_slowpath_common+0x8b/0xc0 [6.766436] [81107545] warn_slowpath_null+0x15/0x20 [6.766947] [8112ca7f] destroy_workqueue+0x1df/0x3d0 [6.768631] [8100d880] kvm_irqfd_exit+0x10/0x20 [6.77] [81004dbb] kvm_init+0x2ab/0x310 [6.770607] [86183dc0] ? cpu_has_kvm_support+0x4d/0x4d [6.771241] [86183fb4] vmx_init+0x1f4/0x437 [6.771709] [86183dc0] ? cpu_has_kvm_support+0x4d/0x4d [6.772266] [810020f2] do_one_initcall+0xb2/0x1b0 [6.772995] [86180021] kernel_init_freeable+0x15d/0x1ef [6.773857] [8617f801] ? loglevel+0x31/0x31 [6.774609] [83d51230] ? rest_init+0x140/0x140 [6.775551] [83d51239] kernel_init+0x9/0xf0 [6.776162] [83dbf37c] ret_from_fork+0x7c/0xb0 [6.776662] [83d51230] ? rest_init+0x140/0x140 [6.777241] ---[ end trace 10bba684ced4346a ]--- And I think it has something to do with this patch. Thanks, Sasha
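The report above does not identify the cause of the warning, so the following is only a generic sketch of the init/cleanup pairing that moving irqfd setup into kvm_init() relies on. All names (`irqfd_init_sketch`, `later_step`, `module_init_sketch`) are hypothetical and this is not the kernel code. The point it illustrates is that the cleanup label must be reachable only from failure paths that come after the matching init step.

```c
#include <assert.h>
#include <stdbool.h>

static bool wq_created;    /* stand-in for the irqfd workqueue */
static int cleanup_calls;  /* how often the cleanup ran */

static int irqfd_init_sketch(void)
{
    wq_created = true;
    return 0;
}

static void irqfd_exit_sketch(void)
{
    assert(wq_created);    /* cleanup without a prior init is a bug */
    wq_created = false;
    cleanup_calls++;
}

/* Hypothetical later init step that may fail. */
static int later_step(bool fail)
{
    return fail ? -1 : 0;
}

static int module_init_sketch(bool fail_later)
{
    int r;

    r = irqfd_init_sketch();
    if (r)
        goto out;

    r = later_step(fail_later);
    if (r)
        goto out_irqfd;

    return 0;              /* success must not fall through into cleanup */

out_irqfd:
    irqfd_exit_sketch();   /* undo only what was set up */
out:
    return r;
}
```

On success the workqueue stand-in stays alive; on a later failure it is torn down exactly once, in reverse order of setup.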
[PATCH v2] KVM: Call kvm_apic_match_dest() to check destination vcpu
From: Yang Zhang yang.z.zh...@intel.com For a given vcpu, kvm_apic_match_dest() quickly tells you whether the vcpu is in the destination list. Drop kvm_calculate_eoi_exitmap() and use kvm_apic_match_dest() instead. Signed-off-by: Yang Zhang yang.z.zh...@intel.com --- arch/x86/kvm/lapic.c | 47 --- arch/x86/kvm/lapic.h | 4 virt/kvm/ioapic.c | 9 +++-- 3 files changed, 3 insertions(+), 57 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index a8e9369..e227474 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -145,53 +145,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic) return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff; } -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu, - struct kvm_lapic_irq *irq, - u64 *eoi_exit_bitmap) -{ - struct kvm_lapic **dst; - struct kvm_apic_map *map; - unsigned long bitmap = 1; - int i; - - rcu_read_lock(); - map = rcu_dereference(vcpu->kvm->arch.apic_map); - - if (unlikely(!map)) { - __set_bit(irq->vector, (unsigned long *)eoi_exit_bitmap); - goto out; - } - - if (irq->dest_mode == 0) { /* physical mode */ - if (irq->delivery_mode == APIC_DM_LOWEST || - irq->dest_id == 0xff) { - __set_bit(irq->vector, - (unsigned long *)eoi_exit_bitmap); - goto out; - } - dst = &map->phys_map[irq->dest_id & 0xff]; - } else { - u32 mda = irq->dest_id << (32 - map->ldr_bits); - - dst = map->logical_map[apic_cluster_id(map, mda)]; - - bitmap = apic_logical_id(map, mda); - } - - for_each_set_bit(i, &bitmap, 16) { - if (!dst[i]) - continue; - if (dst[i]->vcpu == vcpu) { - __set_bit(irq->vector, - (unsigned long *)eoi_exit_bitmap); - break; - } - } - -out: - rcu_read_unlock(); -} - static void recalculate_apic_map(struct kvm *kvm) { struct kvm_apic_map *new, *old = NULL; diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 2c721b9..baa20cf 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -160,10 +160,6 @@ static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr) return ldr & map->lid_mask; } -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu, - struct kvm_lapic_irq *irq, - u64 *eoi_bitmap); - static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu) { return vcpu->arch.apic->pending_events; diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c index 5ba005c..bb3906d 100644 --- a/virt/kvm/ioapic.c +++ b/virt/kvm/ioapic.c @@ -124,7 +124,6 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu *vcpu, { struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic; union kvm_ioapic_redirect_entry *e; - struct kvm_lapic_irq irqe; int index; spin_lock(&ioapic->lock); @@ -135,11 +134,9 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu *vcpu, (e->fields.trig_mode == IOAPIC_LEVEL_TRIG || kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC, index))) { - irqe.dest_id = e->fields.dest_id; - irqe.vector = e->fields.vector; - irqe.dest_mode = e->fields.dest_mode; - irqe.delivery_mode = e->fields.delivery_mode << 8; - kvm_calculate_eoi_exitmap(vcpu, &irqe, eoi_exit_bitmap); + if (kvm_apic_match_dest(vcpu, NULL, 0, + e->fields.dest_id, e->fields.dest_mode)) + __set_bit(e->fields.vector, (unsigned long *)eoi_exit_bitmap); } } spin_unlock(&ioapic->lock); -- 1.7.1
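The idea behind the replacement, asking per vcpu "does this redirection entry target me?" and setting the vector bit if so, can be sketched outside the kernel. The sketch below is a deliberately simplified model and not kvm_apic_match_dest() itself: it assumes only xAPIC physical mode and flat logical mode, ignores shorthands, cluster mode and x2APIC, and `lapic_sketch`, `match_dest_sketch()` and `mark_vector()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, minimal view of a local APIC's addressing state. */
struct lapic_sketch {
    uint8_t apic_id;     /* physical APIC ID */
    uint8_t logical_id;  /* flat-model logical ID (one bit per CPU) */
};

/*
 * Simplified destination match:
 *   dest_mode 0 (physical): exact ID match, or 0xff broadcast
 *   dest_mode 1 (logical, flat model): any common bit matches
 */
static bool match_dest_sketch(const struct lapic_sketch *apic,
                              uint8_t dest_id, int dest_mode)
{
    if (dest_mode == 0)
        return dest_id == 0xff || dest_id == apic->apic_id;

    return (apic->logical_id & dest_id) != 0;
}

/* Set the bit for one of the 256 vectors in a 4 x 64-bit bitmap. */
static void mark_vector(uint64_t bitmap[4], uint8_t vector)
{
    bitmap[vector / 64] |= (uint64_t)1 << (vector % 64);
}
```

With such a predicate, the caller only needs a loop over the redirection entries and one `mark_vector()` call per match, which is the shape of the three added lines in the diff.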
Re: [RFC PATCH v2 1/6] kvm: add device control API
On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. Both device types and individual attributes can be tested without having to create the device or get/set the attribute, without the need for separately managing enumerated capabilities. Signed-off-by: Scott Wood scottw...@freescale.com Some comments below... diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 976eb65..77328aa 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_DEVICE + +Capability: KVM_CAP_DEVICE_CTRL I notice this patch doesn't add this capability; you add it in a later patch. Since this patch adds the KVM_CREATE_DEVICE ioctl, it probably should add the KVM_CAP_DEVICE_CTRL capability too. 
+Type: vm ioctl +Parameters: struct kvm_create_device (in/out) +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device type is unknown or unsupported + EEXIST: Device already created, and this type of device may not + be instantiated multiple times + ENOSPC: Too many devices have been created Is this still a possible error code? --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg { u64 dac[KVMPPC_BOOKE_MAX_DAC]; }; +#define KVMPPC_IRQCHIP_NONE 0 +#define KVMPPC_IRQCHIP_MPIC 1 This define should go in the patch that adds the MPIC device. struct kvm_vcpu_arch { ulong host_stack; u32 host_pid; @@ -549,6 +552,9 @@ struct kvm_vcpu_arch { unsigned long magic_page_pa; /* phys addr to map the magic page to */ unsigned long magic_page_ea; /* effect. addr to map the magic page to */ + int irqchip_type; + void *irqchip_priv; Since you add this (irqchip_priv) only to remove it in a later patch and replace it by a device-specific pointer, why bother adding it here? And why not give irqchip_type the name it ultimately ends up with? diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 16b4595..bdfa526 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu) tasklet_kill(vcpu-arch.tasklet); kvmppc_remove_vcpu_debugfs(vcpu); + + switch (vcpu-arch.irqchip_type) { + case KVMPPC_IRQCHIP_MPIC: + mpic_put(vcpu-arch.irqchip_priv); + break; + } This is going to break bisection, since you don't define mpic_put() in this patch. 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 74d0ff3..20ce2d2 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_EPR 86 #define KVM_CAP_ARM_PSCI 87 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88 +#define KVM_CAP_DEVICE_CTRL 89 #ifdef KVM_CAP_IRQ_ROUTING @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping { #define KVM_ARM_SET_DEVICE_ADDR_IOW(KVMIO, 0xab, struct kvm_arm_device_addr) /* + * Device control API, available with KVM_CAP_DEVICE_CTRL + */ +#define KVM_CREATE_DEVICE_TEST 1 + +struct kvm_create_device { + __u32 type; /* in: KVM_DEV_TYPE_xxx */ + __u32 fd; /* out: device handle */ + __u32 flags; /* in: KVM_CREATE_DEVICE_xxx */ +}; + +struct kvm_device_attr { + __u32 flags; /* no flags currently defined */ + __u32 group; /* device-defined */ + __u64 attr; /* group-defined */ + __u64 addr; /* userspace address of attr data */ +}; + +/* ioctl for vm fd */ +#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device) This define should go with the other VM ioctls, otherwise the next person to add a VM ioctl will probably miss it and reuse the 0xe0 code. Paul. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at
Re: [RFC PATCH v2 1/6] kvm: add device control API
On 04/02/2013 08:02:39 PM, Paul Mackerras wrote: On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. Both device types and individual attributes can be tested without having to create the device or get/set the attribute, without the need for separately managing enumerated capabilities. Signed-off-by: Scott Wood scottw...@freescale.com Some comments below... diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 976eb65..77328aa 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_DEVICE + +Capability: KVM_CAP_DEVICE_CTRL I notice this patch doesn't add this capability; Yes, it does (see below). you add it in a later patch. Maybe you're thinking of KVM_CAP_IRQ_MPIC? 
+Type: vm ioctl +Parameters: struct kvm_create_device (in/out) +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device type is unknown or unsupported + EEXIST: Device already created, and this type of device may not + be instantiated multiple times + ENOSPC: Too many devices have been created Is this still a possible error code? If you mean ENOSPC, probably not -- it'd be replaced with whatever errors can come out of creating a file descriptor. --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg { u64 dac[KVMPPC_BOOKE_MAX_DAC]; }; +#define KVMPPC_IRQCHIP_NONE 0 +#define KVMPPC_IRQCHIP_MPIC 1 This define should go in the patch that adds the MPIC device. struct kvm_vcpu_arch { ulong host_stack; u32 host_pid; @@ -549,6 +552,9 @@ struct kvm_vcpu_arch { unsigned long magic_page_pa; /* phys addr to map the magic page to */ unsigned long magic_page_ea; /* effect. addr to map the magic page to */ + int irqchip_type; + void *irqchip_priv; Since you add this (irqchip_priv) only to remove it in a later patch and replace it by a device-specific pointer, why bother adding it here? And why not give irqchip_type the name it ultimately ends up with? Oops... These were patch shuffling accidents and will be removed from the next iteration. diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 16b4595..bdfa526 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu) tasklet_kill(vcpu-arch.tasklet); kvmppc_remove_vcpu_debugfs(vcpu); + + switch (vcpu-arch.irqchip_type) { + case KVMPPC_IRQCHIP_MPIC: + mpic_put(vcpu-arch.irqchip_priv); + break; + } This is going to break bisection, since you don't define mpic_put() in this patch. Sigh. Something got messed up; I'll try to sort it out and resubmit. 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 74d0ff3..20ce2d2 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_EPR 86 #define KVM_CAP_ARM_PSCI 87 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88 +#define KVM_CAP_DEVICE_CTRL 89 See, here's the capability. :-) /* + * Device control API, available with KVM_CAP_DEVICE_CTRL + */ +#define KVM_CREATE_DEVICE_TEST1 + +struct kvm_create_device { + __u32 type; /* in: KVM_DEV_TYPE_xxx */ + __u32 fd; /* out: device handle */ + __u32 flags; /* in: KVM_CREATE_DEVICE_xxx */ +}; + +struct kvm_device_attr { + __u32 flags; /* no flags currently defined */ + __u32 group; /* device-defined */ + __u64 attr; /* group-defined */ + __u64 addr; /* userspace address of attr data */ +}; + +/* ioctl for vm fd */ +#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device) This define should go with the other VM ioctls, otherwise the next person to add a VM ioctl will probably miss it and reuse the 0xe0 code. That's actually why I moved it to a new
Re: [RFC PATCH v2 1/6] kvm: add device control API
On 04/03/2013 01:30 AM, Scott Wood wrote: On 04/02/2013 01:59:57 AM, tiejun.chen wrote: On 04/02/2013 06:47 AM, Scott Wood wrote: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index ff71541..ed033c0 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2158,6 +2158,17 @@ out: } #endif +static int kvm_ioctl_create_device(struct kvm *kvm, + struct kvm_create_device *cd) +{ +bool test = cd->flags & KVM_CREATE_DEVICE_TEST; + +switch (cd->type) { +default: +return -ENODEV; +} Even after applying patch 5, it looks like something is still missing here, such as: if (test) WARN_ON_ONCE(!cd->type); Why? How does userspace passing in a bad type value mean the kernel needs to report internal badness, why is a value of zero worse than any other bad value, and why only when the test flag is set? I just mean we need to do something here, since it looks like the 'test' variable is defined but unused, right? But please correct this as you see fit :) And if userspace can't guarantee cd->type is never zero, we should return -ENODEV as well after that switch(). Tiejun
Re: [RFC PATCH v2 1/6] kvm: add device control API
On 04/03/2013 09:34 AM, Scott Wood wrote: On 04/02/2013 08:28:01 PM, tiejun.chen wrote: On 04/03/2013 01:30 AM, Scott Wood wrote: On 04/02/2013 01:59:57 AM, tiejun.chen wrote: On 04/02/2013 06:47 AM, Scott Wood wrote: diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index ff71541..ed033c0 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2158,6 +2158,17 @@ out: } #endif +static int kvm_ioctl_create_device(struct kvm *kvm, + struct kvm_create_device *cd) +{ +bool test = cd->flags & KVM_CREATE_DEVICE_TEST; + +switch (cd->type) { +default: +return -ENODEV; +} Even after applying patch 5, it looks like something is still missing here, such as: if (test) WARN_ON_ONCE(!cd->type); Why? How does userspace passing in a bad type value mean the kernel needs to report internal badness, why is a value of zero worse than any other bad value, and why only when the test flag is set? I just mean we need to do something here, since it looks like the 'test' variable is defined but unused, right? But please correct this as you see fit :) Yes, it's unused in this patch, but is used after patch 5 is applied. I didn't think it was worth adding a temporary unused annotation, since this part of the kernel doesn't use -Werror. Yes, that's acceptable in the !-Werror case if we shouldn't warn about anything, as you said. Tiejun
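How the `test` flag is meant to be consumed once a device type exists can be sketched as follows. This is a userspace illustration under assumptions, not the code from patch 5: `DEV_TYPE_EXAMPLE_SKETCH`, `create_device_sketch()` and the placeholder handle values are all hypothetical. It shows the behavior described in the API text: with the test flag set, the call only reports whether the type is supported and creates nothing, and an unknown type (including zero) falls into the default case and returns -ENODEV either way.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define CREATE_DEVICE_TEST_SKETCH 1u  /* mirrors KVM_CREATE_DEVICE_TEST */
#define DEV_TYPE_EXAMPLE_SKETCH   1u  /* hypothetical device type */

/* Same layout as the proposed struct kvm_create_device. */
struct create_device_sketch {
    uint32_t type;   /* in */
    uint32_t fd;     /* out: device handle */
    uint32_t flags;  /* in */
};

static int devices_created;

static int create_device_sketch(struct create_device_sketch *cd)
{
    bool test = cd->flags & CREATE_DEVICE_TEST_SKETCH;

    switch (cd->type) {
    case DEV_TYPE_EXAMPLE_SKETCH:
        if (test)
            return 0;                      /* supported, nothing created */
        cd->fd = 100 + devices_created++;  /* placeholder handle */
        return 0;
    default:
        return -ENODEV;                    /* unknown type, zero included */
    }
}
```

Because the default case already covers every unsupported value, no separate check for a zero type is needed after the switch.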
[RFC PATCH v3 0/6] device control and in-kernel MPIC
Fixed some patch shuffling errors and some minor issues. Scott Wood (6): kvm: add device control API kvm/ppc/mpic: import hw/openpic.c from QEMU kvm/ppc/mpic: remove some obviously unneeded code kvm/ppc/mpic: adapt to kernel style and environment kvm/ppc/mpic: in-kernel MPIC emulation kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC Documentation/virtual/kvm/api.txt | 78 ++ Documentation/virtual/kvm/devices/README | 1 + Documentation/virtual/kvm/devices/mpic.txt | 37 + arch/powerpc/include/asm/kvm_host.h | 16 +- arch/powerpc/include/asm/kvm_ppc.h | 9 + arch/powerpc/kvm/Kconfig | 5 + arch/powerpc/kvm/Makefile | 2 + arch/powerpc/kvm/booke.c | 12 +- arch/powerpc/kvm/mpic.c | 1784 arch/powerpc/kvm/powerpc.c | 38 +- include/linux/kvm_host.h | 2 + include/uapi/linux/kvm.h | 37 + virt/kvm/kvm_main.c | 40 + 13 files changed, 2051 insertions(+), 10 deletions(-) create mode 100644 Documentation/virtual/kvm/devices/README create mode 100644 Documentation/virtual/kvm/devices/mpic.txt create mode 100644 arch/powerpc/kvm/mpic.c -- 1.7.9.5
[RFC PATCH v3 3/6] kvm/ppc/mpic: remove some obviously unneeded code
Remove some parts of the code that are obviously QEMU or Raven specific before fixing style issues, to reduce the style issues that need to be fixed. Signed-off-by: Scott Wood scottw...@freescale.com --- arch/powerpc/kvm/mpic.c | 344 --- 1 file changed, 344 deletions(-) diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c index 57655b9..d6d70a4 100644 --- a/arch/powerpc/kvm/mpic.c +++ b/arch/powerpc/kvm/mpic.c @@ -22,39 +22,6 @@ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN * THE SOFTWARE. */ -/* - * - * Based on OpenPic implementations: - * - Intel GW80314 I/O companion chip developer's manual - * - Motorola MPC8245 MPC8540 user manuals. - * - Motorola MCP750 (aka Raven) programmer manual. - * - Motorola Harrier programmer manuel - * - * Serial interrupts, as implemented in Raven chipset are not supported yet. - * - */ -#include hw.h -#include ppc/mac.h -#include pci/pci.h -#include openpic.h -#include sysbus.h -#include pci/msi.h -#include qemu/bitops.h -#include ppc.h - -//#define DEBUG_OPENPIC - -#ifdef DEBUG_OPENPIC -static const int debug_openpic = 1; -#else -static const int debug_openpic = 0; -#endif - -#define DPRINTF(fmt, ...) 
do { \ -if (debug_openpic) { \ -printf(fmt , ## __VA_ARGS__); \ -} \ -} while (0) #define MAX_CPU 32 #define MAX_SRC 256 @@ -82,21 +49,6 @@ static const int debug_openpic = 0; #define OPENPIC_CPU_REG_START0x2 #define OPENPIC_CPU_REG_SIZE 0x100 + ((MAX_CPU - 1) * 0x1000) -/* Raven */ -#define RAVEN_MAX_CPU 2 -#define RAVEN_MAX_EXT 48 -#define RAVEN_MAX_IRQ 64 -#define RAVEN_MAX_TMR MAX_TMR -#define RAVEN_MAX_IPI MAX_IPI - -/* Interrupt definitions */ -#define RAVEN_FE_IRQ (RAVEN_MAX_EXT) /* Internal functional IRQ */ -#define RAVEN_ERR_IRQ(RAVEN_MAX_EXT + 1) /* Error IRQ */ -#define RAVEN_TMR_IRQ(RAVEN_MAX_EXT + 2) /* First timer IRQ */ -#define RAVEN_IPI_IRQ(RAVEN_TMR_IRQ + RAVEN_MAX_TMR) /* First IPI IRQ */ -/* First doorbell IRQ */ -#define RAVEN_DBL_IRQ(RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI)) - typedef struct FslMpicInfo { int max_ext; } FslMpicInfo; @@ -138,44 +90,6 @@ static FslMpicInfo fsl_mpic_42 = { #define ILR_INTTGT_CINT 0x01 /* critical */ #define ILR_INTTGT_MCP0x02 /* machine check */ -/* The currently supported INTTGT values happen to be the same as QEMU's - * openpic output codes, but don't depend on this. The output codes - * could change (unlikely, but...) or support could be added for - * more INTTGT values. 
- */ -static const int inttgt_output[][2] = { - {ILR_INTTGT_INT, OPENPIC_OUTPUT_INT}, - {ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT}, - {ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK}, -}; - -static int inttgt_to_output(int inttgt) -{ - int i; - - for (i = 0; i ARRAY_SIZE(inttgt_output); i++) { - if (inttgt_output[i][0] == inttgt) { - return inttgt_output[i][1]; - } - } - - fprintf(stderr, %s: unsupported inttgt %d\n, __func__, inttgt); - return OPENPIC_OUTPUT_INT; -} - -static int output_to_inttgt(int output) -{ - int i; - - for (i = 0; i ARRAY_SIZE(inttgt_output); i++) { - if (inttgt_output[i][1] == output) { - return inttgt_output[i][0]; - } - } - - abort(); -} - #define MSIIR_OFFSET 0x140 #define MSIIR_SRS_SHIFT29 #define MSIIR_SRS_MASK (0x7 MSIIR_SRS_SHIFT) @@ -1265,228 +1179,36 @@ static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len) return openpic_cpu_read_internal(opaque, addr, (addr 0x1f000) 12); } -static const MemoryRegionOps openpic_glb_ops_le = { - .write = openpic_gbl_write, - .read = openpic_gbl_read, - .endianness = DEVICE_LITTLE_ENDIAN, - .impl = { -.min_access_size = 4, -.max_access_size = 4, -}, -}; - static const MemoryRegionOps openpic_glb_ops_be = { .write = openpic_gbl_write, .read = openpic_gbl_read, - .endianness = DEVICE_BIG_ENDIAN, - .impl = { -.min_access_size = 4, -.max_access_size = 4, -}, -}; - -static const MemoryRegionOps openpic_tmr_ops_le = { - .write = openpic_tmr_write, - .read = openpic_tmr_read, - .endianness = DEVICE_LITTLE_ENDIAN, - .impl = { -.min_access_size = 4, -.max_access_size = 4, -}, }; static const MemoryRegionOps openpic_tmr_ops_be = { .write = openpic_tmr_write, .read = openpic_tmr_read, - .endianness = DEVICE_BIG_ENDIAN, - .impl = { -.min_access_size = 4, -.max_access_size = 4, -}, -}; - -static const MemoryRegionOps openpic_cpu_ops_le = { -
[RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
Enabling this capability connects the vcpu to the designated in-kernel MPIC. Using explicit connections between vcpus and irqchips allows for flexibility, but the main benefit at the moment is that it simplifies the code -- KVM doesn't need vm-global state to remember which MPIC object is associated with this vm, and it doesn't need to care about ordering between irqchip creation and vcpu creation. Signed-off-by: Scott Wood scottw...@freescale.com --- Documentation/virtual/kvm/api.txt |8 ++ arch/powerpc/include/asm/kvm_host.h |8 ++ arch/powerpc/include/asm/kvm_ppc.h |2 ++ arch/powerpc/kvm/booke.c|4 ++- arch/powerpc/kvm/mpic.c | 49 +++ arch/powerpc/kvm/powerpc.c | 26 +++ include/uapi/linux/kvm.h|1 + 7 files changed, 92 insertions(+), 6 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index d52f3f9..4c326ae 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector. When disabled (args[0] == 0), behavior is as if this facility is unsupported. When this capability is enabled, KVM_EXIT_EPR can occur. + +6.6 KVM_CAP_IRQ_MPIC + +Architectures: ppc +Parameters: args[0] is the MPIC device fd +args[1] is the MPIC CPU number for this vcpu + +This capability connects the vcpu to an in-kernel MPIC device. diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 7e7aef9..2a2e235 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg { u64 dac[KVMPPC_BOOKE_MAX_DAC]; }; +#define KVMPPC_IRQ_DEFAULT 0 +#define KVMPPC_IRQ_MPIC1 + +struct openpic; + struct kvm_vcpu_arch { ulong host_stack; u32 host_pid; @@ -554,6 +559,9 @@ struct kvm_vcpu_arch { unsigned long magic_page_pa; /* phys addr to map the magic page to */ unsigned long magic_page_ea; /* effect. 
addr to map the magic page to */ + int irq_type; /* one of KVM_IRQ_* */ + struct openpic *mpic; /* KVM_IRQ_MPIC */ + #ifdef CONFIG_KVM_BOOK3S_64_HV struct kvm_vcpu_arch_shared shregs; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 3b63b97..f54707f 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -276,6 +276,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr) } void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu); +int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu, +u32 cpu); int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu, struct kvm_config_tlb *cfg); diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index cddc6b3..7d00222 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu, if (update_epr == true) { if (vcpu-arch.epr_flags KVMPPC_EPR_USER) kvm_make_request(KVM_REQ_EPR_EXIT, vcpu); - else if (vcpu-arch.epr_flags KVMPPC_EPR_KERNEL) + else if (vcpu-arch.epr_flags KVMPPC_EPR_KERNEL) { + BUG_ON(vcpu-arch.irq_type != KVMPPC_IRQ_MPIC); kvmppc_mpic_set_epr(vcpu); + } } new_msr = msr_mask; diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c index 8cda2fa..caffe3b 100644 --- a/arch/powerpc/kvm/mpic.c +++ b/arch/powerpc/kvm/mpic.c @@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst, void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu) { - struct openpic *opp = vcpu-arch.irqchip_priv; + struct openpic *opp = vcpu-arch.mpic; int cpu = vcpu-vcpu_id; unsigned long flags; @@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp) static void unmap_mmio(struct openpic *opp) { - BUG_ON(opp-mmio_mapped); - opp-mmio_mapped = false; - - kvm_io_bus_unregister_dev(opp-kvm, KVM_MMIO_BUS, opp-mmio); + if (opp-mmio_mapped) { + opp-mmio_mapped = false; + kvm_io_bus_unregister_dev(opp-kvm, 
KVM_MMIO_BUS, opp-mmio); + } } static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr) @@ -1681,6 +1681,45 @@ static const struct file_operations kvm_mpic_fops = { .release = kvm_mpic_release, }; +int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu, +u32 cpu) +{ + struct openpic *opp = mpic_filp-private_data; + int ret = 0;
[RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
Hook the MPIC code up to the KVM interfaces, add locking, etc. TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE support Signed-off-by: Scott Wood scottw...@freescale.com --- v3: mpic_put - kvmppc_mpic_put Documentation/virtual/kvm/devices/mpic.txt | 37 ++ arch/powerpc/include/asm/kvm_host.h|8 +- arch/powerpc/include/asm/kvm_ppc.h |7 + arch/powerpc/kvm/Kconfig |5 + arch/powerpc/kvm/Makefile |2 + arch/powerpc/kvm/booke.c | 10 +- arch/powerpc/kvm/mpic.c| 814 +--- arch/powerpc/kvm/powerpc.c | 12 +- include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |9 + virt/kvm/kvm_main.c|9 + 11 files changed, 714 insertions(+), 201 deletions(-) create mode 100644 Documentation/virtual/kvm/devices/mpic.txt diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt new file mode 100644 index 000..79e000a --- /dev/null +++ b/Documentation/virtual/kvm/devices/mpic.txt @@ -0,0 +1,37 @@ +MPIC interrupt controller += + +Device types supported: + KVM_DEV_TYPE_FSL_MPIC_20 Freescale MPIC v2.0 + KVM_DEV_TYPE_FSL_MPIC_42 Freescale MPIC v4.2 + +Only one MPIC instance, of any type, may be instantiated. The created +MPIC will act as the system interrupt controller, connecting to each +vcpu's interrupt inputs. + +Groups: + KVM_DEV_MPIC_GRP_MISC + Attributes: +KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit) + Base address of the 256 KiB MPIC register space. Must be + naturally aligned. A value of zero disables the mapping. + Reset value is zero. + + KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit) +Access an MPIC register, as if the access were made from the guest. +attr is the byte offset into the MPIC register space. Accesses +must be 4-byte aligned. + +MSIs may be signaled by using this attribute group to write +to the relevant MSIIR. + + KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit) +IRQ input line for each standard openpic source. 0 is inactive and 1 +is active, regardless of interrupt sense. 
+ +For edge-triggered interrupts: Writing 1 is considered an activating +edge, and writing 0 is ignored. Reading returns 1 if a previously +signaled edge has not been acknowledged, and 0 otherwise. + +attr is the IRQ number. IRQ numbers for standard sources are the +byte offset of the relevant IVPR from EIVPR0, divided by 32. diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index e34f8fe..7e7aef9 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -359,6 +359,11 @@ struct kvmppc_slb { #define KVMPPC_BOOKE_MAX_IAC 4 #define KVMPPC_BOOKE_MAX_DAC 2 +/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */ +#define KVMPPC_EPR_NONE0 /* EPR not supported */ +#define KVMPPC_EPR_USER1 /* exit to userspace to fill EPR */ +#define KVMPPC_EPR_KERNEL 2 /* in-kernel irqchip */ + struct kvmppc_booke_debug_reg { u32 dbcr0; u32 dbcr1; @@ -522,7 +527,7 @@ struct kvm_vcpu_arch { u8 sane; u8 cpu_type; u8 hcall_needed; - u8 epr_enabled; + u8 epr_flags; /* KVMPPC_EPR_xxx */ u8 epr_needed; u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */ @@ -589,5 +594,6 @@ struct kvm_vcpu_arch { #define KVM_MMIO_REG_FQPR 0x0060 #define __KVM_HAVE_ARCH_WQP +#define __KVM_HAVE_CREATE_DEVICE #endif /* __POWERPC_KVM_HOST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index f589307..3b63b97 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu); extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *); +int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq); + /* * Cuts out inst bits with ordering according to spec. * That means the leftmost bit is zero. All given bits are included. 
@@ -245,6 +247,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *); void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid); +struct openpic; +void kvmppc_mpic_put(struct openpic *opp); + #ifdef CONFIG_KVM_BOOK3S_64_HV static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr) { @@ -270,6 +275,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr) #endif } +void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu); + int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu, struct kvm_config_tlb *cfg); int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu, diff --git a/arch/powerpc/kvm/Kconfig
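Two rules from the mpic.txt text in this patch lend themselves to a small worked example: the base address must be naturally aligned to the 256 KiB register space, and the IRQ number of a standard source is the byte offset of its IVPR from EIVPR0 divided by 32. The helpers below are a hedged sketch with hypothetical names (`base_addr_ok`, `ivpr_offset_to_irq`); the EIVPR0 offset is passed in as a parameter rather than assumed, and the offsets used when calling them are arbitrary illustration values, not real register addresses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MPIC_REG_SPACE_SIZE (256u * 1024u)  /* 256 KiB, per mpic.txt */
#define IVPR_STRIDE         32u             /* bytes between IVPRs, per mpic.txt */

/* Naturally aligned means a multiple of the region size (zero included). */
static bool base_addr_ok(uint64_t base)
{
    return (base & (MPIC_REG_SPACE_SIZE - 1)) == 0;
}

/* IRQ number = (byte offset of IVPR - byte offset of EIVPR0) / 32. */
static uint32_t ivpr_offset_to_irq(uint32_t ivpr_off, uint32_t eivpr0_off)
{
    assert(ivpr_off >= eivpr0_off);
    return (ivpr_off - eivpr0_off) / IVPR_STRIDE;
}
```

Note that zero passes the alignment check; according to the documentation text it is the value that disables the mapping, so a caller would treat it separately.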
[RFC PATCH v3 4/6] kvm/ppc/mpic: adapt to kernel style and environment
Remove braces that Linux style doesn't permit, remove space after '*' that Lindent added, keep error/debug strings contiguous, etc. Substitute type names, debug prints, etc. Signed-off-by: Scott Wood scottw...@freescale.com --- arch/powerpc/kvm/mpic.c | 445 ++- 1 file changed, 208 insertions(+), 237 deletions(-) diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c index d6d70a4..1df67ae 100644 --- a/arch/powerpc/kvm/mpic.c +++ b/arch/powerpc/kvm/mpic.c @@ -42,22 +42,22 @@ #define OPENPIC_TMR_REG_SIZE 0x220 #define OPENPIC_MSI_REG_START0x1600 #define OPENPIC_MSI_REG_SIZE 0x200 -#define OPENPIC_SUMMARY_REG_START 0x3800 -#define OPENPIC_SUMMARY_REG_SIZE0x800 +#define OPENPIC_SUMMARY_REG_START0x3800 +#define OPENPIC_SUMMARY_REG_SIZE 0x800 #define OPENPIC_SRC_REG_START0x1 #define OPENPIC_SRC_REG_SIZE (MAX_SRC * 0x20) #define OPENPIC_CPU_REG_START0x2 -#define OPENPIC_CPU_REG_SIZE 0x100 + ((MAX_CPU - 1) * 0x1000) +#define OPENPIC_CPU_REG_SIZE (0x100 + ((MAX_CPU - 1) * 0x1000)) -typedef struct FslMpicInfo { +struct fsl_mpic_info { int max_ext; -} FslMpicInfo; +}; -static FslMpicInfo fsl_mpic_20 = { +static struct fsl_mpic_info fsl_mpic_20 = { .max_ext = 12, }; -static FslMpicInfo fsl_mpic_42 = { +static struct fsl_mpic_info fsl_mpic_42 = { .max_ext = 12, }; @@ -100,44 +100,43 @@ static int get_current_cpu(void) { CPUState *cpu_single_cpu; - if (!cpu_single_env) { + if (!cpu_single_env) return -1; - } cpu_single_cpu = ENV_GET_CPU(cpu_single_env); return cpu_single_cpu-cpu_index; } -static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx); -static void openpic_cpu_write_internal(void *opaque, hwaddr addr, +static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx); +static void openpic_cpu_write_internal(void *opaque, gpa_t addr, uint32_t val, int idx); -typedef enum IRQType { +enum irq_type { IRQ_TYPE_NORMAL = 0, IRQ_TYPE_FSLINT,/* FSL internal interrupt -- level only */ IRQ_TYPE_FSLSPECIAL,/* FSL timer/IPI interrupt, edge, 
no polarity */ -} IRQType; +}; -typedef struct IRQQueue { +struct irq_queue { /* Round up to the nearest 64 IRQs so that the queue length * won't change when moving between 32 and 64 bit hosts. */ unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) ~63)]; int next; int priority; -} IRQQueue; +}; -typedef struct IRQSource { +struct irq_source { uint32_t ivpr; /* IRQ vector/priority register */ uint32_t idr; /* IRQ destination register */ uint32_t destmask; /* bitmap of CPU destinations */ int last_cpu; int output; /* IRQ level, e.g. OPENPIC_OUTPUT_INT */ int pending;/* TRUE if IRQ is pending */ - IRQType type; + enum irq_type type; bool level:1; /* level-triggered */ - bool nomask:1; /* critical interrupts ignore mask on some FSL MPICs */ -} IRQSource; + bool nomask:1; /* critical interrupts ignore mask on some FSL MPICs */ +}; #define IVPR_MASK_SHIFT 31 #define IVPR_MASK_MASK(1 IVPR_MASK_SHIFT) @@ -158,22 +157,19 @@ typedef struct IRQSource { #define IDR_EP 0x8000 /* external pin */ #define IDR_CI 0x4000 /* critical interrupt */ -typedef struct IRQDest { +struct irq_dest { int32_t ctpr; /* CPU current task priority */ - IRQQueue raised; - IRQQueue servicing; + struct irq_queue raised; + struct irq_queue servicing; qemu_irq *irqs; /* Count of IRQ sources asserting on non-INT outputs */ uint32_t outputs_active[OPENPIC_OUTPUT_NB]; -} IRQDest; - -typedef struct OpenPICState { - SysBusDevice busdev; - MemoryRegion mem; +}; +struct openpic { /* Behavior control */ - FslMpicInfo *fsl; + struct fsl_mpic_info *fsl; uint32_t model; uint32_t flags; uint32_t nb_irqs; @@ -186,9 +182,6 @@ typedef struct OpenPICState { uint32_t brr1; uint32_t mpic_mode_mask; - /* Sub-regions */ - MemoryRegion sub_io_mem[6]; - /* Global registers */ uint32_t frr; /* Feature reporting register */ uint32_t gcr; /* Global configuration register */ @@ -196,9 +189,9 @@ typedef struct OpenPICState { uint32_t spve; /* Spurious vector register */ uint32_t tfrr; /* Timer frequency reporting register */ /* 
Source registers */ - IRQSource src[MAX_IRQ]; + struct irq_source src[MAX_IRQ]; /* Local registers per output pin */ - IRQDest dst[MAX_CPU]; + struct irq_dest
[RFC PATCH v3 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
This is QEMU's hw/openpic.c from commit abd8d4a4d6dfea7ddea72f095f993e1de941614e (Update version for 1.4.0-rc0), run through Lindent with no other changes to ease merging future changes between Linux and QEMU. Remaining style issues (including those introduced by Lindent) will be fixed in a later patch. Signed-off-by: Scott Wood scottw...@freescale.com --- arch/powerpc/kvm/mpic.c | 1686 +++ 1 file changed, 1686 insertions(+) create mode 100644 arch/powerpc/kvm/mpic.c diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c new file mode 100644 index 000..57655b9 --- /dev/null +++ b/arch/powerpc/kvm/mpic.c @@ -0,0 +1,1686 @@ +/* + * OpenPIC emulation + * + * Copyright (c) 2004 Jocelyn Mayer + * 2011 Alexander Graf + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ +/* + * + * Based on OpenPic implementations: + * - Intel GW80314 I/O companion chip developer's manual + * - Motorola MPC8245 MPC8540 user manuals. 
+ * - Motorola MCP750 (aka Raven) programmer manual. + * - Motorola Harrier programmer manuel + * + * Serial interrupts, as implemented in Raven chipset are not supported yet. + * + */ +#include hw.h +#include ppc/mac.h +#include pci/pci.h +#include openpic.h +#include sysbus.h +#include pci/msi.h +#include qemu/bitops.h +#include ppc.h + +//#define DEBUG_OPENPIC + +#ifdef DEBUG_OPENPIC +static const int debug_openpic = 1; +#else +static const int debug_openpic = 0; +#endif + +#define DPRINTF(fmt, ...) do { \ +if (debug_openpic) { \ +printf(fmt , ## __VA_ARGS__); \ +} \ +} while (0) + +#define MAX_CPU 32 +#define MAX_SRC 256 +#define MAX_TMR 4 +#define MAX_IPI 4 +#define MAX_MSI 8 +#define MAX_IRQ (MAX_SRC + MAX_IPI + MAX_TMR) +#define VID 0x03 /* MPIC version ID */ + +/* OpenPIC capability flags */ +#define OPENPIC_FLAG_IDR_CRIT (1 0) +#define OPENPIC_FLAG_ILR (2 0) + +/* OpenPIC address map */ +#define OPENPIC_GLB_REG_START0x0 +#define OPENPIC_GLB_REG_SIZE 0x10F0 +#define OPENPIC_TMR_REG_START0x10F0 +#define OPENPIC_TMR_REG_SIZE 0x220 +#define OPENPIC_MSI_REG_START0x1600 +#define OPENPIC_MSI_REG_SIZE 0x200 +#define OPENPIC_SUMMARY_REG_START 0x3800 +#define OPENPIC_SUMMARY_REG_SIZE0x800 +#define OPENPIC_SRC_REG_START0x1 +#define OPENPIC_SRC_REG_SIZE (MAX_SRC * 0x20) +#define OPENPIC_CPU_REG_START0x2 +#define OPENPIC_CPU_REG_SIZE 0x100 + ((MAX_CPU - 1) * 0x1000) + +/* Raven */ +#define RAVEN_MAX_CPU 2 +#define RAVEN_MAX_EXT 48 +#define RAVEN_MAX_IRQ 64 +#define RAVEN_MAX_TMR MAX_TMR +#define RAVEN_MAX_IPI MAX_IPI + +/* Interrupt definitions */ +#define RAVEN_FE_IRQ (RAVEN_MAX_EXT) /* Internal functional IRQ */ +#define RAVEN_ERR_IRQ(RAVEN_MAX_EXT + 1) /* Error IRQ */ +#define RAVEN_TMR_IRQ(RAVEN_MAX_EXT + 2) /* First timer IRQ */ +#define RAVEN_IPI_IRQ(RAVEN_TMR_IRQ + RAVEN_MAX_TMR) /* First IPI IRQ */ +/* First doorbell IRQ */ +#define RAVEN_DBL_IRQ(RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI)) + +typedef struct FslMpicInfo { + int max_ext; +} FslMpicInfo; + 
+static FslMpicInfo fsl_mpic_20 = { + .max_ext = 12, +}; + +static FslMpicInfo fsl_mpic_42 = { + .max_ext = 12, +}; + +#define FRR_NIRQ_SHIFT16 +#define FRR_NCPU_SHIFT 8 +#define FRR_VID_SHIFT 0 + +#define VID_REVISION_1_2 2 +#define VID_REVISION_1_3 3 + +#define VIR_GENERIC 0x/* Generic Vendor ID */ + +#define GCR_RESET0x8000 +#define GCR_MODE_PASS0x +#define GCR_MODE_MIXED 0x2000 +#define GCR_MODE_PROXY 0x6000 + +#define TBCR_CI 0x8000 /* count inhibit */ +#define TCCR_TOG 0x8000 /* toggles when decrement to zero */ + +#define IDR_EP_SHIFT 31 +#define IDR_EP_MASK (1 IDR_EP_SHIFT)
[RFC PATCH v3 1/6] kvm: add device control API
Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus.

Both device types and individual attributes can be tested without having to create the device or get/set the attribute, without the need for separately managing enumerated capabilities.

Signed-off-by: Scott Wood scottw...@freescale.com
---
v3: remove some changes that were merged into this patch by accident, and fix the error documentation for KVM_CREATE_DEVICE.

NOTE: I had some difficulty figuring out what ioctl numbers I should assign... it seems that at one point care was taken to keep vcpu and vm ioctls separate, but some overlap exists now (despite not exhausting the ioctl space). Some of that was my fault, but not all of it. :-) I moved to a new ioctl range for device control -- please let me know if there's something else you'd prefer I do.
---
 Documentation/virtual/kvm/api.txt        | 70 ++
 Documentation/virtual/kvm/devices/README |  1 +
 include/uapi/linux/kvm.h                 | 27
 virt/kvm/kvm_main.c                      | 31 +
 4 files changed, 129 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/README

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 976eb65..d52f3f9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.

+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+          be instantiated multiple times
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Creates an emulated device in the kernel. The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags. Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
+	__u32	fd;	/* out: device handle */
+	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO: The group or attribute is unknown/unsupported for this device
+  EPERM: The attribute cannot (currently) be accessed this way
+         (e.g. read-only attribute, or attribute that only makes
+         sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state. The
+semantics are device-specific. See individual device documentation in
+the devices directory. As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+struct kvm_device_attr {
+	__u32	flags;	/* no flags currently defined */
+	__u32	group;	/* device-defined */
+	__u64	attr;	/* group-defined */
+	__u64	addr;	/* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO: The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute. A successful
+return indicates the attribute is implemented. It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state. addr is ignored.

 4.77 KVM_ARM_VCPU_INIT

diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 000..34a6983
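
For readers who want to see the documented attribute semantics in action, here is a minimal userspace mock, not the kernel implementation. The struct mirrors the kvm_device_attr layout quoted in section 4.80; everything prefixed mock_/MOCK_ (the device, the group and attribute numbers, the handler functions) is hypothetical and exists only for illustration. The handlers return negative errno values, as kernel-side code conventionally does, whereas the real ioctl returns -1 and sets errno.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the layout documented in section 4.80;
 * real code would use struct kvm_device_attr from <linux/kvm.h>. */
struct mock_device_attr {
	uint32_t flags;	/* no flags currently defined */
	uint32_t group;	/* device-defined */
	uint64_t attr;	/* group-defined */
	uint64_t addr;	/* userspace address of attr data */
};

/* Hypothetical group/attribute numbers, for illustration only. */
#define MOCK_GRP_MISC		1
#define MOCK_ATTR_BASE_ADDR	0	/* read/write, 64-bit */
#define MOCK_ATTR_VERSION	1	/* read-only, 32-bit */

struct mock_device {
	uint64_t base_addr;
	uint32_t version;
};

/* Like KVM_HAS_DEVICE_ATTR: 0 if implemented, -ENXIO if not; addr is ignored. */
static int mock_has_attr(const struct mock_device_attr *a)
{
	assert(a != NULL);
	if (a->group != MOCK_GRP_MISC)
		return -ENXIO;
	switch (a->attr) {
	case MOCK_ATTR_BASE_ADDR:
	case MOCK_ATTR_VERSION:
		return 0;
	default:
		return -ENXIO;
	}
}

/* Like KVM_SET_DEVICE_ATTR: the data size is implied by the attribute. */
static int mock_set_attr(struct mock_device *dev,
			 const struct mock_device_attr *a)
{
	int ret = mock_has_attr(a);

	if (ret)
		return ret;
	if (a->attr == MOCK_ATTR_VERSION)
		return -EPERM;	/* read-only attribute */
	memcpy(&dev->base_addr, (const void *)(uintptr_t)a->addr,
	       sizeof(dev->base_addr));
	return 0;
}

/* Like KVM_GET_DEVICE_ATTR: copies the value out to the address in addr. */
static int mock_get_attr(const struct mock_device *dev,
			 const struct mock_device_attr *a)
{
	int ret = mock_has_attr(a);

	if (ret)
		return ret;
	if (a->attr == MOCK_ATTR_VERSION)
		memcpy((void *)(uintptr_t)a->addr, &dev->version,
		       sizeof(dev->version));
	else
		memcpy((void *)(uintptr_t)a->addr, &dev->base_addr,
		       sizeof(dev->base_addr));
	return 0;
}
```

The point of the sketch is the error split: an unknown group or attribute is -ENXIO from every entry point, while an attribute that exists but cannot be written in this way is -EPERM, so a "has" probe can succeed for an attribute that a later "set" still rejects.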
Re: KVM: kvm_set_slave_cpu: Invalid argument when trying direct interrupt delivery
Hi,

Thank you for testing the patch.

Yangminqiang yangminqi...@huawei.com wrote:

Hi Tomoki

I tried your smart patch cpu isolation and direct interrupt delivery, http://article.gmane.org/gmane.linux.kernel/1353803 got output when I run qemu

kvm_set_slave_cpu: Invalid argument

So I wonder
* Did I misuse your patches?
* How is the offlined CPU assigned? or the Guest OS will automatically detect and use it?

Currently it is hard-coded in the patch for qemu-kvm just for testing:

diff -Narup a/qemu-kvm-1.0/qemu-kvm-x86.c b/qemu-kvm-1.0/qemu-kvm-x86.c
--- a/qemu-kvm-1.0/qemu-kvm-x86.c 2011-12-04 19:38:06.0 +0900
+++ b/qemu-kvm-1.0/qemu-kvm-x86.c 2012-09-06 20:19:44.828163734 +0900
@@ -139,12 +139,28 @@ static int kvm_enable_tpr_access_reporti
     return kvm_vcpu_ioctl(env, KVM_TPR_ACCESS_REPORTING, &tac);
 }

+static int kvm_set_slave_cpu(CPUState *env)
+{
+    int r, slave = env->cpu_index == 0 ? 2 : env->cpu_index == 1 ? 3 : -1;

`slave' is the offlined CPU ID assigned, and `env->cpu_index' is the virtual CPU ID. You need to modify here and recompile qemu-kvm (or just offline cpu 2 and 3 for a 2vcpus guest ;) ).

Thanks,
Tomoki Sekiyama

-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 1/6] kvm: add device control API
On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:

+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL

I notice this patch doesn't add this capability;

Yes, it does (see below).

you add it in a later patch.

Maybe you're thinking of KVM_CAP_IRQ_MPIC?

No, I was referring to the addition to kvm_dev_ioctl_check_extension() of a KVM_CAP_DEVICE_CTRL case. Since this patch adds the code to handle KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if userspace queries the KVM_CAP_DEVICE_CTRL capability.

+/* ioctl for vm fd */
+#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)

This define should go with the other VM ioctls, otherwise the next person to add a VM ioctl will probably miss it and reuse the 0xe0 code.

That's actually why I moved it to a new section, with device control ioctls getting their own range, as the legacy device model and some other things did. 0xe0 is not the next ioctl that would be used for either vm or vcpu. The ioctl numbering is actually already a mess, with sometimes care being taken to keep vcpu and vm ioctls from overlapping, but on other places overlapping does happen. I'm not sure what exactly I should do here.

Well, even if you are using a new range, I still think that KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM ioctls. I guess it's ultimately up to the maintainers.

Paul.
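
Paul's request, that the same patch which handles KVM_CREATE_DEVICE also answer 1 when userspace queries KVM_CAP_DEVICE_CTRL, can be illustrated with a small standalone mock. This is only a sketch of the switch-based dispatch pattern: the MOCK_ capability numbers are made up for illustration (the real values live in include/uapi/linux/kvm.h), and mock_check_extension() is a stand-in, not the body of kvm_dev_ioctl_check_extension().

```c
#include <assert.h>

/* Hypothetical capability numbers, for illustration only. */
enum mock_cap {
	MOCK_CAP_OTHER		= 1,
	MOCK_CAP_DEVICE_CTRL	= 2,
	MOCK_CAP_UNKNOWN	= 3,
};

/* Returns 1 for capabilities this (mock) kernel implements, 0 otherwise. */
static int mock_check_extension(long cap)
{
	assert(cap >= 0);
	switch (cap) {
	case MOCK_CAP_OTHER:
	case MOCK_CAP_DEVICE_CTRL:	/* advertised together with the ioctl */
		return 1;
	default:
		return 0;
	}
}

/* What a userspace caller would gate KVM_CREATE_DEVICE on. */
static int mock_can_create_device(void)
{
	return mock_check_extension(MOCK_CAP_DEVICE_CTRL) == 1;
}
```

In real userspace the query goes through the KVM_CHECK_EXTENSION ioctl; if the capability case is missing, a well-behaved client would never attempt KVM_CREATE_DEVICE even though the handler is present.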
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 17:13 -0500, Scott Wood wrote:
On 04/02/2013 04:16:11 PM, Alex Williamson wrote:
On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:

The number of windows is always a power of 2 (and max is 256). And to reduce PAMU cache pressure you want to use the fewest number of windows you can. So, I don't see practically how we could transparently steal entries to add the MSIs. Either user space knows to leave empty windows for MSIs and by convention the kernel knows which windows those are (as in option #A) or explicitly tell the kernel which windows (as in option #B).

Ok, apparently I don't understand the API. Is it something like userspace calls GET_ATTR and finds out that there are 256 available windows, userspace determines that it needs 8 for RAM and then it has an MSI device, so it needs to call SET_ATTR and ask for 16? That seems prone to exploitation by the first userspace to allocate its aperture,

What exploitation? It's not as if there is a pool of 256 global windows that users allocate from. The subwindow count is just how finely divided the aperture is. The only way one user will affect another is through cache contention (which is why we want the minimum number of subwindows that we can get away with).

but I'm also not sure why userspace could specify the (non-power of 2) number of windows it needs for RAM, then VFIO would see that the devices attached have MSI and add those windows and align to a power of 2.

If you double the subwindow count without userspace knowing, you have to double the aperture as well (and you may need to grow up or down depending on alignment). This means you also need to halve the maximum aperture that userspace can request. And you need to expose a different number of maximum subwindows in the IOMMU API based on whether we might have MSIs of this type. It's ugly and awkward, and removes the possibility for userspace to place the MSIs in some unused slot in the middle, or not use MSIs at all.
Ok, I missed this in Stuart's example:

Total aperture: 512MB
# of windows: 8

win  gphys/
#    iova      phys          size
---  ----      ----          ----
0    0x        0xX_XX00      64MB
1    0x0400    0xX_XX00      64MB
2    0x0800    0xX_XX00      64MB
3    0x0C00    0xX_XX00      64MB
4    0x1000    0xf_fe044000  4KB    // msi bank 1
     ^^
5    0x1400    0xf_fe045000  4KB    // msi bank 2
     ^^
6    0x1800    0xf_fe046000  4KB    // msi bank 3
     ^^
7    -         -             disabled

So even though the MSI banks are 4k in this example, they're still on 64MB boundaries. If userspace were to leave this as 256 windows, each would be 2MB and we'd use 128 of them to map the same memory as these 4x64MB windows and thrash the iotlb harder. The picture is becoming clearer.

Thanks,
Alex
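
The arithmetic behind Alex's conclusion can be checked with a few lines of C. This is only a worked example under the constraints stated in the thread (the subwindow count is a power of 2, at most 256, and the aperture is divided evenly); the helper names are made up for illustration and are not part of any VFIO or PAMU interface.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (UINT64_C(1) << 20)

static int is_power_of_2(uint64_t v)
{
	return v != 0 && (v & (v - 1)) == 0;
}

/* Size of each subwindow when the aperture is split into `count` windows. */
static uint64_t subwindow_size(uint64_t aperture, unsigned int count)
{
	assert(is_power_of_2(count) && count <= 256);
	return aperture / count;
}

/* Number of subwindows needed to cover `bytes` of memory (rounded up). */
static uint64_t windows_needed(uint64_t bytes, uint64_t win_size)
{
	assert(win_size != 0);
	return (bytes + win_size - 1) / win_size;
}
```

With a 512MB aperture, 8 windows give 64MB each, so 256MB of RAM takes 4 windows; with 256 windows each is 2MB and the same RAM takes 128 of them, which is the extra translation-cache pressure described above.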
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 17:44 -0500, Scott Wood wrote:
On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
On 04/02/2013 03:32:17 PM, Alex Williamson wrote:

On x86 the interrupt remapper handles this transparently when MSI is enabled and userspace never gets direct access to the device MSI address/data registers.

x86 has a totally different mechanism here, as far as I understand -- even before you get into restrictions on mappings.

So what control will userspace have over programming the actual MSI vectors on PAMU?

Not sure what you mean -- PAMU doesn't get explicitly involved in MSIs. It's just another 4K page mapping (per relevant MSI bank). If you want isolation, you need to make sure that an MSI group is only used by one VFIO group, and that you're on a chip that has alias pages with just one MSI bank register each (newer chips do, but the first chip to have a PAMU didn't).

How does a user figure this out?

This could also be done as another type2 ioctl extension.

Again, what is type2, specifically? If someone else is adding their own IOMMU that is kind of, sort of like PAMU, how would they know if it's close enough? What assumptions can a user make when they see that they're dealing with type2?

Naming always has and always will be a problem. I assume this is named type2 rather than PAMU because it's trying to expose a generic windowed IOMMU fitting the IOMMU API.

But how closely is the MSI situation related to a generic windowed IOMMU, then? We could just as well have a highly flexible IOMMU in terms of arbitrary 4K page mappings, but still handle MSIs as pages to be mapped rather than a translation table. Or we could have a windowed IOMMU that has an MSI translation table.

Like type1, it doesn't really make sense to name it IOMMU API because that's a kernel internal interface and we're designing a userspace interface that just happens to use that. Tagging it to a piece of hardware makes it less reusable.

Well, that's my point. Is it reusable at all, anyway? If not, then giving it a more obscure name won't change that. If it is reusable, then where is the line drawn between things that are PAMU-specific or MPIC-specific and things that are part of the generic windowed IOMMU abstraction?

Type1 is arbitrary. It might as well be named brown and this one can be blue.

The difference is that type1 seems to refer to hardware that can do arbitrary 4K page mappings, possibly constrained by an aperture but nothing else. More than one IOMMU can reasonably fit that. The odds that another IOMMU would have exactly the same restrictions as PAMU seem smaller in comparison. In any case, if you had to deal with some Intel-only quirk, would it make sense to call it a type1 attribute? I'm not advocating one way or the other on whether an abstraction is viable here (though Stuart seems to think it's highly unlikely anything but a PAMU will comply), just that if it is to be abstracted rather than a hardware-specific interface, we need to document what is and is not part of the abstraction. Otherwise a non-PAMU-specific user won't know what they can rely on, and someone adding support for a new windowed IOMMU won't know if theirs is close enough, or they need to introduce a type3.

So Alexey named the SPAPR IOMMU something related to spapr... surprisingly enough. I'm fine with that. If you think it's unique enough, name it something appropriately. I haven't seen the code and don't know the architecture sufficiently to have an opinion.

What's the value to userspace in determining which windows are used by which banks?

That depends on who programs the MSI config space address. What is important is userspace controlling which iovas will be dedicated to this, in case it wants to put something else there.

So userspace is programming the MSI vectors, targeting a user programmed iova? But an iova selects a window and I thought there were some number of MSI banks and we don't really know which ones we'll need... still confused.

Userspace would also need a way to find out the page offset and data value. That may be an argument in favor of having the two ioctls Stuart later suggested (get MSI count, and map MSI).

Connecting the user set iova and host kernel assigned irq number is where I'm still lost, but I'll follow-up with that question in the other thread.

Would there be any complication in the VFIO code from tracking a mapping that doesn't have a userspace virtual address associated with it?

Only the VFIO iommu driver tracks mappings, the QEMU userspace component doesn't (relies on the memory API for type1), nor does any of the kernel framework code.

There's going to be special stuff no matter
Re: RFC: vfio API changes needed for powerpc
On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote:
On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:

C. Explicit mapping using normal DMA map. The last idea is that we would introduce a new ioctl to give user-space an fd to the MSI bank, which could be mmapped. The flow would be something like this:

- for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
- user space mmaps the fd, getting a vaddr
- user space does a normal DMA map for desired iova

This approach makes everything explicit, but adds a new ioctl applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows userspace to select the window used? This one seems like a lot of overhead, adding a new ioctl, new fd, mmap, special mapping path, etc.

There's going to be special stuff no matter what. This would keep it separated from the IOMMU map code. I'm not sure what you mean by overhead here... the runtime overhead of setting things up is not particularly relevant as long as it's reasonable. If you mean development and maintenance effort, keeping things well separated should help.

We don't need to change DMA_MAP. If we can simply add a new type 2 ioctl that allows user space to set which windows are MSIs, it seems vastly less complex than an ioctl to supply a new fd, mmap of it, etc. So maybe 2 ioctls:

VFIO_IOMMU_GET_MSI_COUNT

Do you mean a count of actual MSIs or a count of MSI banks used by the whole VFIO group?

I hope the latter, which would clarify how this is distinct from DEVICE_GET_IRQ_INFO. Is hotplug even on the table? Presumably dynamically adding a device could bring along additional MSI banks?

VFIO_IOMMU_MAP_MSI(iova, size)

Not sure how you mean size to be used -- for MPIC it would be 4K per bank, and you can only map one bank at a time (which bank you're mapping should be a parameter, if only so that the kernel doesn't have to keep iteration state for you).

How are MSIs related to devices on PAMU?

PAMU doesn't care about MSIs. The relation of individual MSIs to a device is standard PCI stuff. Each MSI bank (which is part of the MPIC, not PAMU) can hold numerous MSIs. The VFIO user would want to map all MSI banks that are in use by any of the devices in the group. Ideally we'd let the VFIO grouping influence the allocation of MSIs.

The current VFIO MSI support has the host handling everything about MSI. The user never programs an MSI vector to the physical device, they set up everything through ioctl. On interrupt, we simply trigger an eventfd and leave it to things like KVM irqfd or QEMU to do the right thing in a virtual machine. Here the MSI vector has to go through a PAMU window to hit the correct MSI bank. So that means it has some component of the iova involved, which we're proposing here is controlled by userspace (whether that vector uses an offset from 0x1000 or 0x depending on which window slot is used to make the MSI bank). I assume we're still working in a model where the physical interrupt fires into the host and a host-based interrupt handler triggers an eventfd, right? So that means the vector also has host components so we trigger the correct ISR. How is that coordinated?

Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down? Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers?

On x86 MSI count is very device specific, which means it would be a VFIO_DEVICE_* ioctl (actually VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).

The trouble with it being a device ioctl is that you need to get the device FD, but the IOMMU protection needs to be established before you can get that... so there's an ordering problem if you need it from the device before configuring the IOMMU.

Thanks,

What do you mean by IOMMU protection needs to be established? Wouldn't we just start with no mappings in place?

If no mappings blocks all DMA, sure, that's fine.

Once the VFIO device FD is accessible by userspace we have to protect the host against DMA. If any IOMMU_SET_ATTR calls temporarily disable DMA protection, that could be exploitable.

Thanks,
Alex
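
The layout Alex floats in this message (ordinary DMA mappings grow up from the bottom of the aperture, MSI banks are placed from the top growing down) can be sketched as simple index arithmetic. This is a hypothetical illustration of a proposal that was still only a question in the thread, not a description of any implemented VFIO behaviour; the function names are invented for the example.

```c
#include <assert.h>

/* Windows left for ordinary DMA mappings, which grow up from window 0. */
static int user_windows(int n_windows, int n_msi_banks)
{
	assert(n_windows > 0);
	assert(n_msi_banks >= 0 && n_msi_banks <= n_windows);
	return n_windows - n_msi_banks;
}

/* Window index for MSI bank `bank` (0-based), handed out from the top down. */
static int msi_bank_window(int n_windows, int bank)
{
	assert(bank >= 0 && bank < n_windows);
	return n_windows - 1 - bank;
}
```

For an 8-window aperture with 3 MSI banks, userspace would keep windows 0 to 4 and the banks would land in windows 7, 6 and 5. Under such a scheme userspace only needs the bank count, which is the simplification the question is driving at.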
Re: [PATCH v6 6/6] KVM: Use eoi to track RTC interrupt delivery status
On Wed, Apr 03, 2013 at 12:21:05AM +, Zhang, Yang Z wrote:
Gleb Natapov wrote on 2013-04-02:
On Fri, Mar 29, 2013 at 03:25:16AM +, Zhang, Yang Z wrote:
Paolo Bonzini wrote on 2013-03-26:
On 22/03/2013 06:24, Yang Zhang wrote:

+static void rtc_irq_ack_eoi(struct kvm_vcpu *vcpu,
+			struct rtc_status *rtc_status, int irq)
+{
+	if (irq != RTC_GSI)
+		return;
+
+	if (test_and_clear_bit(vcpu->vcpu_id, rtc_status->dest_map))
+		--rtc_status->pending_eoi;
+
+	WARN_ON(rtc_status->pending_eoi < 0);
+}

This is the only case where you're passing the struct rtc_status instead of the struct kvm_ioapic. Please use the latter, and make it the first argument.

@@ -244,7 +268,14 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
 	irqe.level = 1;
 	irqe.shorthand = 0;

-	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe, NULL);
+	if (irq == RTC_GSI) {
+		ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe,
+				ioapic->rtc_status.dest_map);
+		ioapic->rtc_status.pending_eoi = ret;

I think you should either add a BUG_ON(ioapic->rtc_status.pending_eoi != 0); or use ioapic->rtc_status.pending_eoi += ret (or both).

A malicious guest may write EOI more than once, and then pending_eoi would be negative. But that should not be a bug; a WARN_ON is enough, and we already do it in ack_eoi, so there is no need to duplicate it here.

Since we track vcpus that already called EOI and decrement pending_eoi only once for each vcpu, a malicious guest cannot trigger it, but we already do WARN_ON() in rtc_irq_ack_eoi(), so I am not sure we need another one here. += will be correct (since pending_eoi == 0 here), but confusing since it gives the impression that pending_eoi may not be zero.

Yes, I also had the wrong impression. With the previous implementation, pending_eoi may not be zero: the destination vcpus were calculated by parsing the IOAPIC entry, and in lowest-priority delivery mode all possible vcpus were set in dest_map even if they did not finally receive the interrupt. At the same time, a malicious guest can send an IPI with the same vector as the RTC to those vcpus that are in dest_map but did not get the RTC interrupt. Then pending_eoi will be negative. Now we set dest_map with the vcpus that really received the interrupt, so the above case cannot happen. So as you and Paolo suggested, it is better to use +=.

I am not suggesting that it is better to use +=. We can add BUG_ON(ioapic->rtc_status.pending_eoi != 0); but no need to resend patches just for that.

-- Gleb.
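
The invariant being argued over, that pending_eoi is decremented at most once per vcpu that actually received the interrupt, can be demonstrated with a small userspace mock. This is a sketch of the bookkeeping only, not the KVM code: the mock_ names, the 64-bit bitmap and the use of assert() in place of BUG_ON are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_MAX_VCPUS 64	/* assumption: one bit per vcpu in a 64-bit map */

struct mock_rtc_status {
	int pending_eoi;	/* EOIs still expected for the last RTC interrupt */
	uint64_t dest_map;	/* bit n set: vcpu n received it, no EOI yet */
};

static int count_bits(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Record the vcpus that really received the interrupt. */
static void mock_rtc_deliver(struct mock_rtc_status *s, uint64_t delivered)
{
	assert(s->pending_eoi == 0);	/* stands in for the suggested BUG_ON */
	s->dest_map = delivered;
	s->pending_eoi = count_bits(delivered);
}

/* EOI from a vcpu: counts only if that vcpu is still marked in dest_map. */
static void mock_rtc_ack_eoi(struct mock_rtc_status *s, unsigned int vcpu_id)
{
	uint64_t bit;

	assert(vcpu_id < MOCK_MAX_VCPUS);
	bit = UINT64_C(1) << vcpu_id;
	if (s->dest_map & bit) {
		s->dest_map &= ~bit;
		--s->pending_eoi;
	}
}
```

Because the bit is cleared on the first EOI, a repeated EOI from the same vcpu, or an EOI from a vcpu that never received the interrupt, leaves the counter untouched, so it cannot go negative as long as dest_map only contains real recipients.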
Re: [PATCH V3 WIP 3/3] disable vhost_verify_ring_mappings check
On Tue, 2013-04-02 at 16:27 +0300, Michael S. Tsirkin wrote:
On Mon, Apr 01, 2013 at 06:05:47PM -0700, Nicholas A. Bellinger wrote:
On Fri, 2013-03-29 at 09:14 +0100, Paolo Bonzini wrote:
On 29/03/2013 03:53, Nicholas A. Bellinger wrote:
On Thu, 2013-03-28 at 06:13 -0400, Paolo Bonzini wrote:

I think it's the right thing to do, but maybe not the right place to do this, need to reset after all IO is done, before ring memory is write protected.

Our emails are crossing each other unfortunately, but I want to reinforce this: ring memory is not write protected.

Understood. However, AFAICT the act of write protecting these ranges for ROM generates the offending callbacks to vhost_set_memory(). The part that I'm missing is if ring memory is not being write protected by make_bios_readonly_intel(), why are the vhost_set_memory() calls being invoked..?

Because mappings change for the region that contains the ring. vhost doesn't know yet that the changes do not affect ring memory, vhost_set_memory() is called exactly to ascertain that.

Hi Paolo & Co,

Here's a bit more information on what is going on with the same cpu_physical_memory_map() failure in vhost_verify_ring_mappings()..
So as before, at the point that seabios is marking memory as readonly for ROM in src/shadow.c:make_bios_readonly_intel() with the following call: Calling pci_config_writeb(0x31): bdf: 0x pam: 0x005b the memory API update hook triggers back into vhost_region_del() code, and the following occurs: Entering vhost_region_del section: 0x7fd30a213b60 offset_within_region: 0xc size: 2146697216 readonly: 0 vhost_region_del: is_rom: 0, rom_device: 0 vhost_region_del: readable: 1 vhost_region_del: ram_addr 0x0, addr: 0x0 size: 2147483648 vhost_region_del: name: pc.ram Entering vhost_set_memory, section: 0x7fd30a213b60 add: 0, dev->started: 1 Entering verify_ring_mappings: start_addr 0x000c size: 2146697216 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0xed000 ring_size: 5124 verify_ring_mappings: calling cpu_physical_memory_map ring_phys: 0xed000 l: 5124 address_space_map: addr: 0xed000, plen: 5124 address_space_map: l: 4096, len: 5124 phys_page_find got PHYS_MAP_NODE_NIL .. address_space_map: section: 0x7fd30fabaed0 memory_region_is_ram: 0 readonly: 0 address_space_map: section: 0x7fd30fabaed0 offset_within_region: 0x0 section size: 18446744073709551615 Unable to map ring buffer for ring 2, l: 4096 So the interesting part is that phys_page_find() is not able to locate the corresponding page for vq->ring_phys: 0xed000 from the vhost_region_del() callback with section->offset_within_region: 0xc.. Is there any case where this would not be considered a bug..?
register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0 register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0 register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0 Entering vhost_region_add section: 0x7fd30a213aa0 offset_within_region: 0xc size: 32768 readonly: 1 vhost_region_add: is_rom: 0, rom_device: 0 vhost_region_add: readable: 1 vhost_region_add: ram_addr 0x, addr: 0x 0 size: 2147483648 vhost_region_add: name: pc.ram Entering vhost_set_memory, section: 0x7fd30a213aa0 add: 1, dev->started: 1 Entering verify_ring_mappings: start_addr 0x000c size: 32768 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0xed000 ring_size: 5124 verify_ring_mappings: Got !ranges_overlap, skipping register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0 Entering vhost_region_add section: 0x7fd30a213aa0 offset_within_region: 0xc8000 size: 2146664448 readonly: 0 vhost_region_add: is_rom: 0, rom_device: 0 vhost_region_add: readable: 1 vhost_region_add: ram_addr 0x, addr: 0x 0 size: 2147483648 vhost_region_add: name: pc.ram Entering vhost_set_memory, section: 0x7fd30a213aa0 add: 1, dev->started: 1 Entering verify_ring_mappings: start_addr 0x000c8000 size: 2146664448 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0x0 ring_size: 0 verify_ring_mappings: ring_phys 0xed000 ring_size: 5124 verify_ring_mappings: calling cpu_physical_memory_map ring_phys: 0xed000 l: 5124 address_space_map: addr: 0xed000, plen: 5124 address_space_map: l: 4096, len: 5124 address_space_map: section: 0x7fd30fabb020 memory_region_is_ram: 1 readonly: 0 address_space_map: section: 0x7fd30fabb020 offset_within_region: 0xc8000 section size: 2146664448 address_space_map: l: 4096, len: 1028 address_space_map: section: 0x7fd30fabb020 memory_region_is_ram:
RE: [PATCH 3/3] ARM: EXYNOS5250: Register architected timers
Alexander Graf wrote: On 04/02/2013 12:44 PM, Kukjin Kim wrote: Alexander Graf wrote: When running on an exynos 5250 SoC, we don't initialize the architected timers. The chip however supports architected timers. Yes, exynos5250 can support them, though the MCT (multi core timer) is used. When we don't initialize them, KVM will try to access them and run into NULL pointer dereferences attempting to do so. Yes, right. This patch is really more of a hack than a real fix, but does get me working with KVM on Arndale. Hmm, if you think this is _really_ a hack, you need to add some comments about that for clarity, and since the mct.c file has been moved into drivers/clocksource/, this should be reworked. BTW, I discussed this with Thomas and Giridhar just now, and we concluded that this 3rd patch could be dropped, because the correct way is to add a dts node for the arch timer, which the 2nd patch is already doing, after 3.9-rc1, because of the CLOCKSOURCE_OF_DECLARE macro. So if you're OK with the above, let me know so that I can take only the 1st and 2nd patches to support KVM on exynos5250. I'd say go ahead and take them and I'll verify whether things work on your tree :). OK, I will. What's the git repo of your branch? You can test with my for-next branch, but this series can be seen tomorrow night (KST) in my public tree. Any problems, please let me know. Thanks. - Kukjin
Re: [PATCH V3 WIP 3/3] disable vhost_verify_ring_mappings check
On Tue, 2013-04-02 at 21:04 -0700, Nicholas A. Bellinger wrote: On Tue, 2013-04-02 at 16:27 +0300, Michael S. Tsirkin wrote: On Mon, Apr 01, 2013 at 06:05:47PM -0700, Nicholas A. Bellinger wrote: On Fri, 2013-03-29 at 09:14 +0100, Paolo Bonzini wrote: Il 29/03/2013 03:53, Nicholas A. Bellinger ha scritto: On Thu, 2013-03-28 at 06:13 -0400, Paolo Bonzini wrote: I think it's the right thing to do, but maybe not the right place to do this, need to reset after all IO is done, before ring memory is write protected. Our emails are crossing each other unfortunately, but I want to reinforce this: ring memory is not write protected. Understood. However, AFAICT the act of write protecting these ranges for ROM generates the offending callbacks to vhost_set_memory(). The part that I'm missing is if ring memory is not being write protected by make_bios_readonly_intel(), why are the vhost_set_memory() calls being invoked..? Because mappings change for the region that contains the ring. vhost doesn't know yet that the changes do not affect ring memory, vhost_set_memory() is called exactly to ascertain that. SNIP Is it possible that what is going on here, is that we had a region at address 0x0 size 0x8000, and now a chunk from it is being made readonly, and to this end the whole old region is removed then new ones are added? Yes, I believe this is exactly what is happening.. If yes maybe the problem is that we don't use the atomic begin/commit ops in the memory API. Maybe the following will help? Completely untested, posting just to give you the idea: Mmmm, one question on how vhost_region_del() + vhost_region_add() + vhost_commit() should work.. 
Considering the following when the same seabios code snippet: pci_config_writeb(0x31): bdf: 0x pam: 0x005b is executed to mark a pc.ram area 0xc as readonly: Entering vhost_begin Entering vhost_region_del section: 0x7fd037a4bb60 offset_within_region: 0xc size: 2146697216 readonly: 0 vhost_region_del: is_rom: 0, rom_device: 0 vhost_region_del: readable: 1 vhost_region_del: ram_addr 0x0, addr: 0x0 size: 2147483648 vhost_region_del: name: pc.ram Entering vhost_set_memory, section: 0x7fd037a4bb60 add: 0, dev->started: 1 vhost_set_memory: Setting dev->memory_changed = true for start_addr: 0xc Entering vhost_region_add section: 0x7fd037a4baa0 offset_within_region: 0xc size: 32768 readonly: 1 vhost_region_add is readonly !!! vhost_region_add: is_rom: 0, rom_device: 0 vhost_region_add: readable: 1 vhost_region_add: ram_addr 0x, addr: 0x 0 size: 2147483648 vhost_region_add: name: pc.ram Entering vhost_set_memory, section: 0x7fd037a4baa0 add: 1, dev->started: 1 vhost_dev_assign_memory(); reg->guest_phys_addr: 0xc vhost_set_memory: Setting dev->memory_changed = true for start_addr: 0xc Entering vhost_region_add section: 0x7fd037a4baa0 offset_within_region: 0xc8000 size: 2146664448 readonly: 0 vhost_region_add: is_rom: 0, rom_device: 0 vhost_region_add: readable: 1 vhost_region_add: ram_addr 0x, addr: 0x 0 size: 2147483648 vhost_region_add: name: pc.ram Entering vhost_set_memory, section: 0x7fd037a4baa0 add: 1, dev->started: 1 vhost_set_memory: Setting dev->memory_changed = true for start_addr: 0xc8000 phys_page_find got PHYS_MAP_NODE_NIL .. Entering vhost_commit Note that originally we'd see the cpu_physical_memory_map() failure in vhost_verify_ring_mappings() after the first ->region_del() above.
Adding a hardcoded cpu_physical_memory_map() testcase in vhost_commit() for phys_addr=0xed000, len=5124 (vq ring) does locate the correct *section from address_space_map(), which correctly points to the section generated by the last vhost_region_add() above: Entering vhost_commit address_space_map: addr: 0xed000, plen: 5124 address_space_map: l: 4096, len: 5124 address_space_map: section: 0x7f41b325f020 memory_region_is_ram: 1 readonly: 0 address_space_map: section: 0x7f41b325f020 offset_within_region: 0xc8000 section size: 2146664448 address_space_map: l: 4096, len: 1028 address_space_map: section: 0x7f41b325f020 memory_region_is_ram: 1 readonly: 0 address_space_map: section: 0x7f41b325f020 offset_within_region: 0xc8000 section size: 2146664448 address_space_map: Calling qemu_ram_ptr_length: raddr: 0x ed000 rlen: 5124 address_space_map: After qemu_ram_ptr_length: raddr: 0x ed000 rlen: 5124 cpu_physical_memory_map(0xed000) got l: 5124 So, does using a ->commit callback for MemoryListener mean that vhost_verify_ring_mappings() is OK to be called only from the final ->commit callback, and not from each ->region_del + ->region_add callback..? Eg: I seem to recall something about vhost_verify_ring_mappings() being called during each
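To make the question above concrete, here is a small self-contained model of the "record changes in region_del/region_add, verify once in commit" idea. It is only a sketch and not QEMU code: every name in it is hypothetical, and the region start 0xc0000 is an assumption inferred from the sizes in the log (2147483648 - 2146697216 = 0xc0000), since the log itself shows a truncated 0xc.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if [s1, s1+l1) and [s2, s2+l2) share at least one byte. */
static bool model_ranges_overlap(uint64_t s1, uint64_t l1,
                                 uint64_t s2, uint64_t l2)
{
    if (l1 == 0 || l2 == 0)
        return false;
    return s1 < s2 + l2 && s2 < s1 + l1;
}

/* Hypothetical listener state; not QEMU's struct vhost_dev. */
struct model_listener {
    bool memory_changed;
    uint64_t changed_start;   /* lowest address touched since begin */
    uint64_t changed_end;     /* one past the highest address touched */
    unsigned verify_calls;    /* how often the ring was re-verified */
};

static void model_begin(struct model_listener *l)
{
    l->memory_changed = false;
    l->changed_start = UINT64_MAX;
    l->changed_end = 0;
}

/* Used for both region_del and region_add: only record the range. */
static void model_region_update(struct model_listener *l,
                                uint64_t start, uint64_t size)
{
    assert(start + size >= start);   /* no wrap-around in this model */
    l->memory_changed = true;
    if (start < l->changed_start)
        l->changed_start = start;
    if (start + size > l->changed_end)
        l->changed_end = start + size;
}

/* Verify once, against the final layout. Returns true if a check ran. */
static bool model_commit(struct model_listener *l,
                         uint64_t ring_phys, uint64_t ring_size)
{
    if (!l->memory_changed)
        return false;
    if (!model_ranges_overlap(l->changed_start,
                              l->changed_end - l->changed_start,
                              ring_phys, ring_size))
        return false;
    l->verify_calls++;   /* a real implementation would map the ring here */
    return true;
}
```

With the numbers from the log, the 32768-byte readonly region at the assumed 0xc0000 ends at 0xc8000 and does not overlap the ring at 0xed000, while the region starting at 0xc8000 does. In this model nothing is checked while the old region has been removed and the new ones are not yet in place, which is the window where the original failure appeared.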
Re: [RFC PATCH v2 1/6] kvm: add device control API
On 04/02/2013 06:47 AM, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. Both device types and individual attributes can be tested without having to create the device or get/set the attribute, without the need for separately managing enumerated capabilities. Signed-off-by: Scott Wood scottw...@freescale.com --- Documentation/virtual/kvm/api.txt| 70 ++ Documentation/virtual/kvm/devices/README |1 + arch/powerpc/include/asm/kvm_host.h |6 +++ arch/powerpc/include/asm/kvm_ppc.h |2 + arch/powerpc/kvm/powerpc.c |7 +++ include/uapi/linux/kvm.h | 27 virt/kvm/kvm_main.c | 31 + 7 files changed, 144 insertions(+) create mode 100644 Documentation/virtual/kvm/devices/README diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 976eb65..77328aa 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data written, then `n_invalid' invalid entries, invalidating any previously valid entries found. 
+4.79 KVM_CREATE_DEVICE + +Capability: KVM_CAP_DEVICE_CTRL +Type: vm ioctl +Parameters: struct kvm_create_device (in/out) +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device type is unknown or unsupported + EEXIST: Device already created, and this type of device may not + be instantiated multiple times + ENOSPC: Too many devices have been created + + Other error conditions may be defined by individual device types. + +Creates an emulated device in the kernel. The file descriptor returned +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR. + +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the +device type is supported (not necessarily whether it can be created +in the current vm). + +Individual devices should not define flags. Attributes should be used +for specifying any behavior that is not implied by the device type +number. + +struct kvm_create_device { + __u32 type; /* in: KVM_DEV_TYPE_xxx */ + __u32 fd; /* out: device handle */ + __u32 flags; /* in: KVM_CREATE_DEVICE_xxx */ +}; + +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR + +Capability: KVM_CAP_DEVICE_CTRL +Type: device ioctl +Parameters: struct kvm_device_attr +Returns: 0 on success, -1 on error +Errors: + ENXIO: The group or attribute is unknown/unsupported for this device + EPERM: The attribute cannot (currently) be accessed this way + (e.g. read-only attribute, or attribute that only makes + sense when the device is in a different state) + + Other error conditions may be defined by individual device types. + +Gets/sets a specified piece of device configuration and/or state. The +semantics are device-specific. See individual device documentation in +the devices directory. As with ONE_REG, the size of the data +transferred is defined by the particular attribute. 
+ +struct kvm_device_attr { + __u32 flags; /* no flags currently defined */ + __u32 group; /* device-defined */ + __u64 attr; /* group-defined */ + __u64 addr; /* userspace address of attr data */ +}; + +4.81 KVM_HAS_DEVICE_ATTR + +Capability: KVM_CAP_DEVICE_CTRL +Type: device ioctl +Parameters: struct kvm_device_attr +Returns: 0 on success, -1 on error +Errors: + ENXIO: The group or attribute is unknown/unsupported for this device + +Tests whether a device supports a particular attribute. A successful +return indicates the attribute is implemented. It does not necessarily +indicate that the attribute can be read or written in the device's +current state. addr is ignored. 4.77 KVM_ARM_VCPU_INIT diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README new file mode 100644 index 000..34a6983 --- /dev/null +++ b/Documentation/virtual/kvm/devices/README @@ -0,0 +1 @@ +This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL. diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index e34f8fe..e0caae2 100644 ---
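As a reading aid for the error semantics documented in the quoted api.txt text (ENXIO for an unknown group or attribute, EPERM for an attribute that cannot be accessed that way), here is a small in-memory model. It is not kernel or userspace KVM code: the struct layout is copied from the patch but renamed, all function names are hypothetical, the model returns a negative errno directly instead of -1 with errno set, and it assumes every attribute is a 64-bit value, whereas the real size is defined per attribute.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same fields as struct kvm_device_attr in the patch, renamed for the model. */
struct model_device_attr {
    uint32_t flags;   /* no flags currently defined */
    uint32_t group;   /* device-defined */
    uint64_t attr;    /* group-defined */
    uint64_t addr;    /* userspace address of attr data */
};

/* One attribute a hypothetical device implements. */
struct model_attr_slot {
    uint32_t group;
    uint64_t attr;
    int writable;
    uint64_t value;
};

struct model_device {
    struct model_attr_slot *slots;
    size_t n_slots;
};

static struct model_attr_slot *model_find(struct model_device *dev,
                                          const struct model_device_attr *a)
{
    size_t i;

    assert(dev != NULL && a != NULL);
    for (i = 0; i < dev->n_slots; i++)
        if (dev->slots[i].group == a->group && dev->slots[i].attr == a->attr)
            return &dev->slots[i];
    return NULL;
}

/* Models KVM_HAS_DEVICE_ATTR: addr is ignored, only existence is tested. */
static int model_has_attr(struct model_device *dev,
                          const struct model_device_attr *a)
{
    return model_find(dev, a) ? 0 : -ENXIO;
}

/* Models KVM_GET_DEVICE_ATTR: copy the value out to the caller's buffer. */
static int model_get_attr(struct model_device *dev,
                          const struct model_device_attr *a)
{
    struct model_attr_slot *slot = model_find(dev, a);

    if (!slot)
        return -ENXIO;
    memcpy((void *)(uintptr_t)a->addr, &slot->value, sizeof(slot->value));
    return 0;
}

/* Models KVM_SET_DEVICE_ATTR: read-only attributes are refused with EPERM. */
static int model_set_attr(struct model_device *dev,
                          const struct model_device_attr *a)
{
    struct model_attr_slot *slot = model_find(dev, a);

    if (!slot)
        return -ENXIO;
    if (!slot->writable)
        return -EPERM;
    memcpy(&slot->value, (const void *)(uintptr_t)a->addr, sizeof(slot->value));
    return 0;
}
```

A caller would typically probe with the has-attribute call first and only then get or set, which mirrors the stated goal of testing attributes without separately managed capabilities.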
Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Thursday, March 28, 2013 10:06 PM To: Bhushan Bharat-R65777 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421; Bhushan Bharat-R65777 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 21.03.2013, at 07:25, Bharat Bhushan wrote: From: Bharat Bhushan bharat.bhus...@freescale.com This patch adds the debug stub support on booke/bookehv. Now QEMU debug stub can use hw breakpoint, watchpoint and software breakpoint to debug guest. Debug registers are saved/restored on vcpu_put()/vcpu_get(). Also the debug registers are saved restored only if guest is using debug resources. Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com --- v2: - save/restore in vcpu_get()/vcpu_put() - some more minor cleanup based on review comments. arch/powerpc/include/asm/kvm_host.h | 10 ++ arch/powerpc/include/uapi/asm/kvm.h | 22 +++- arch/powerpc/kvm/booke.c| 252 --- arch/powerpc/kvm/e500_emulate.c | 10 ++ 4 files changed, 272 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index f4ba881..8571952 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -504,7 +504,17 @@ struct kvm_vcpu_arch { u32 mmucfg; u32 epr; u32 crit_save; + /* guest debug registers*/ struct kvmppc_booke_debug_reg dbg_reg; + /* shadow debug registers */ + struct kvmppc_booke_debug_reg shadow_dbg_reg; + /* host debug registers*/ + struct kvmppc_booke_debug_reg host_dbg_reg; + /* +* Flag indicating that debug registers are used by guest +* and requires save restore. 
+ */ + bool debug_save_restore; #endif gpa_t paddr_accessed; gva_t vaddr_accessed; diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 15f9a00..d7ce449 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -25,6 +25,7 @@ /* Select powerpc specific features in linux/kvm.h */ #define __KVM_HAVE_SPAPR_TCE #define __KVM_HAVE_PPC_SMT +#define __KVM_HAVE_GUEST_DEBUG struct kvm_regs { __u64 pc; @@ -267,7 +268,24 @@ struct kvm_fpu { __u64 fpr[32]; }; +/* + * Defines for h/w breakpoint, watchpoint (read, write or both) and + * software breakpoint. + * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status + * for KVM_DEBUG_EXIT. + */ +#define KVMPPC_DEBUG_NONE 0x0 +#define KVMPPC_DEBUG_BREAKPOINT (1UL << 1) +#define KVMPPC_DEBUG_WATCH_WRITE (1UL << 2) +#define KVMPPC_DEBUG_WATCH_READ (1UL << 3) struct kvm_debug_exit_arch { + __u64 address; + /* +* exiting to userspace because of h/w breakpoint, watchpoint +* (read, write or both) and software breakpoint. +*/ + __u32 status; + __u32 reserved; }; /* for KVM_SET_GUEST_DEBUG */ @@ -279,10 +297,6 @@ struct kvm_guest_debug_arch { * Type denotes h/w breakpoint, read watchpoint, write * watchpoint or watchpoint (both read and write).
*/ -#define KVMPPC_DEBUG_NOTYPE 0x0 -#define KVMPPC_DEBUG_BREAKPOINT (1UL << 1) -#define KVMPPC_DEBUG_WATCH_WRITE (1UL << 2) -#define KVMPPC_DEBUG_WATCH_READ (1UL << 3) __u32 type; __u32 reserved; } bp[16]; diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 1de93a8..bf20056 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu) #endif } +static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu) { + /* Synchronize guest's desire to get debug interrupts into shadow +MSR */ #ifndef CONFIG_KVM_BOOKE_HV + vcpu->arch.shadow_msr &= ~MSR_DE; + vcpu->arch.shadow_msr |= vcpu->arch.shared->msr & MSR_DE; #endif + + /* Force enable debug interrupts when user space wants to debug */ + if (vcpu->guest_debug) { +#ifdef CONFIG_KVM_BOOKE_HV + /* +* Since there is no shadow MSR, sync MSR_DE into the guest +* visible MSR. Do not allow guest to change MSR[DE]. +*/ + vcpu->arch.shared->msr |= MSR_DE; + mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP); This mtspr should really just be a bit or in shadow_msrp when guest_debug gets enabled. It should automatically get synchronized as soon as the next vcpu_load() happens. I think this is not required here as shadow_dbsr already have MSRP_DEP set. Will setup shadow_msrp when setting guest_debug and clear shadow_msrp when guest_debug is cleared. But that will also not be sufficient as it is not sure when vcpu_load() will be called after the shadow_msrp is changed. So
RE: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
-Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Tuesday, April 02, 2013 1:57 PM To: Bhushan Bharat-R65777 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Thursday, March 28, 2013 10:06 PM To: Bhushan Bharat-R65777 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421; Bhushan Bharat-R65777 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 21.03.2013, at 07:25, Bharat Bhushan wrote: From: Bharat Bhushan bharat.bhus...@freescale.com This patch adds the debug stub support on booke/bookehv. Now QEMU debug stub can use hw breakpoint, watchpoint and software breakpoint to debug guest. Debug registers are saved/restored on vcpu_put()/vcpu_get(). Also the debug registers are saved restored only if guest is using debug resources. Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com --- v2: - save/restore in vcpu_get()/vcpu_put() - some more minor cleanup based on review comments. arch/powerpc/include/asm/kvm_host.h | 10 ++ arch/powerpc/include/uapi/asm/kvm.h | 22 +++- arch/powerpc/kvm/booke.c| 252 - -- arch/powerpc/kvm/e500_emulate.c | 10 ++ 4 files changed, 272 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index f4ba881..8571952 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -504,7 +504,17 @@ struct kvm_vcpu_arch { u32 mmucfg; u32 epr; u32 crit_save; + /* guest debug registers*/ struct kvmppc_booke_debug_reg dbg_reg; + /* shadow debug registers */ + struct kvmppc_booke_debug_reg shadow_dbg_reg; + /* host debug registers*/ + struct kvmppc_booke_debug_reg host_dbg_reg; + /* + * Flag indicating that debug registers are used by guest + * and requires save restore. 
+ */ + bool debug_save_restore; #endif gpa_t paddr_accessed; gva_t vaddr_accessed; diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 15f9a00..d7ce449 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -25,6 +25,7 @@ /* Select powerpc specific features in linux/kvm.h */ #define __KVM_HAVE_SPAPR_TCE #define __KVM_HAVE_PPC_SMT +#define __KVM_HAVE_GUEST_DEBUG struct kvm_regs { __u64 pc; @@ -267,7 +268,24 @@ struct kvm_fpu { __u64 fpr[32]; }; +/* + * Defines for h/w breakpoint, watchpoint (read, write or both) and + * software breakpoint. + * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status + * for KVM_DEBUG_EXIT. + */ +#define KVMPPC_DEBUG_NONE 0x0 +#define KVMPPC_DEBUG_BREAKPOINT (1UL << 1) +#define KVMPPC_DEBUG_WATCH_WRITE (1UL << 2) +#define KVMPPC_DEBUG_WATCH_READ (1UL << 3) struct kvm_debug_exit_arch { + __u64 address; + /* + * exiting to userspace because of h/w breakpoint, watchpoint + * (read, write or both) and software breakpoint. + */ + __u32 status; + __u32 reserved; }; /* for KVM_SET_GUEST_DEBUG */ @@ -279,10 +297,6 @@ struct kvm_guest_debug_arch { * Type denotes h/w breakpoint, read watchpoint, write * watchpoint or watchpoint (both read and write).
*/ -#define KVMPPC_DEBUG_NOTYPE 0x0 -#define KVMPPC_DEBUG_BREAKPOINT (1UL << 1) -#define KVMPPC_DEBUG_WATCH_WRITE (1UL << 2) -#define KVMPPC_DEBUG_WATCH_READ (1UL << 3) __u32 type; __u32 reserved; } bp[16]; diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 1de93a8..bf20056 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu) #endif } +static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu) { + /* Synchronize guest's desire to get debug interrupts into shadow +MSR */ #ifndef CONFIG_KVM_BOOKE_HV + vcpu->arch.shadow_msr &= ~MSR_DE; + vcpu->arch.shadow_msr |= vcpu->arch.shared->msr & MSR_DE; #endif + + /* Force enable debug interrupts when user space wants to debug */ + if (vcpu->guest_debug) { +#ifdef CONFIG_KVM_BOOKE_HV + /* + * Since there is no shadow MSR, sync MSR_DE into the guest + * visible MSR. Do not allow guest to change MSR[DE]. + */ + vcpu->arch.shared->msr |= MSR_DE; + mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP); This mtspr should really just be a bit or in shadow_msrp when guest_debug gets enabled. It should automatically get synchronized as soon as the next
Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
On 04/02/2013 04:09 PM, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Tuesday, April 02, 2013 1:57 PM To: Bhushan Bharat-R65777 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Thursday, March 28, 2013 10:06 PM To: Bhushan Bharat-R65777 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421; Bhushan Bharat-R65777 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 21.03.2013, at 07:25, Bharat Bhushan wrote: From: Bharat Bhushanbharat.bhus...@freescale.com This patch adds the debug stub support on booke/bookehv. Now QEMU debug stub can use hw breakpoint, watchpoint and software breakpoint to debug guest. Debug registers are saved/restored on vcpu_put()/vcpu_get(). Also the debug registers are saved restored only if guest is using debug resources. Signed-off-by: Bharat Bhushanbharat.bhus...@freescale.com --- v2: - save/restore in vcpu_get()/vcpu_put() - some more minor cleanup based on review comments. 
arch/powerpc/include/asm/kvm_host.h | 10 ++ arch/powerpc/include/uapi/asm/kvm.h | 22 +++- arch/powerpc/kvm/booke.c| 252 - -- arch/powerpc/kvm/e500_emulate.c | 10 ++ 4 files changed, 272 insertions(+), 22 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index f4ba881..8571952 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -504,7 +504,17 @@ struct kvm_vcpu_arch { u32 mmucfg; u32 epr; u32 crit_save; + /* guest debug registers*/ struct kvmppc_booke_debug_reg dbg_reg; + /* shadow debug registers */ + struct kvmppc_booke_debug_reg shadow_dbg_reg; + /* host debug registers*/ + struct kvmppc_booke_debug_reg host_dbg_reg; + /* +* Flag indicating that debug registers are used by guest +* and requires save restore. + */ + bool debug_save_restore; #endif gpa_t paddr_accessed; gva_t vaddr_accessed; diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h index 15f9a00..d7ce449 100644 --- a/arch/powerpc/include/uapi/asm/kvm.h +++ b/arch/powerpc/include/uapi/asm/kvm.h @@ -25,6 +25,7 @@ /* Select powerpc specific features in linux/kvm.h */ #define __KVM_HAVE_SPAPR_TCE #define __KVM_HAVE_PPC_SMT +#define __KVM_HAVE_GUEST_DEBUG struct kvm_regs { __u64 pc; @@ -267,7 +268,24 @@ struct kvm_fpu { __u64 fpr[32]; }; +/* + * Defines for h/w breakpoint, watchpoint (read, write or both) and + * software breakpoint. + * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status + * for KVM_DEBUG_EXIT. + */ +#define KVMPPC_DEBUG_NONE 0x0 +#define KVMPPC_DEBUG_BREAKPOINT (1UL << 1) +#define KVMPPC_DEBUG_WATCH_WRITE (1UL << 2) +#define KVMPPC_DEBUG_WATCH_READ (1UL << 3) struct kvm_debug_exit_arch { + __u64 address; + /* +* exiting to userspace because of h/w breakpoint, watchpoint +* (read, write or both) and software breakpoint.
+*/ + __u32 status; + __u32 reserved; }; /* for KVM_SET_GUEST_DEBUG */ @@ -279,10 +297,6 @@ struct kvm_guest_debug_arch { * Type denotes h/w breakpoint, read watchpoint, write * watchpoint or watchpoint (both read and write). */ -#define KVMPPC_DEBUG_NOTYPE 0x0 -#define KVMPPC_DEBUG_BREAKPOINT (1UL << 1) -#define KVMPPC_DEBUG_WATCH_WRITE (1UL << 2) -#define KVMPPC_DEBUG_WATCH_READ (1UL << 3) __u32 type; __u32 reserved; } bp[16]; diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index 1de93a8..bf20056 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu) #endif } +static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu) { + /* Synchronize guest's desire to get debug interrupts into shadow +MSR */ #ifndef CONFIG_KVM_BOOKE_HV + vcpu->arch.shadow_msr &= ~MSR_DE; + vcpu->arch.shadow_msr |= vcpu->arch.shared->msr & MSR_DE; #endif + + /* Force enable debug interrupts when user space wants to debug */ + if (vcpu->guest_debug) { +#ifdef CONFIG_KVM_BOOKE_HV + /* +* Since there is no shadow MSR, sync MSR_DE into the guest +* visible MSR. Do not allow guest to change MSR[DE]. +*/ + vcpu->arch.shared->msr |= MSR_DE; + mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP); This mtspr should really just be a bit or in shadow_msrp when guest_debug gets
Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
On 04/02/2013 09:09:34 AM, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Tuesday, April 02, 2013 1:57 PM To: Bhushan Bharat-R65777 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Thursday, March 28, 2013 10:06 PM To: Bhushan Bharat-R65777 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421; Bhushan Bharat-R65777 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support How does the normal debug register switching code work in Linux? Can't we just reuse that? Or rely on it to restore working state when another process gets scheduled in? Good point, I can see debug registers being loaded in __switch_to() -> switch_booke_debug_regs() in arch/powerpc/kernel/process.c. So as long as we assume that the host will not use debug resources, we can rely on this restore. But I am not sure that this is a fair assumption. As Scott mentioned earlier, someone can use debug resources for kernel debugging as well. Someone in the kernel can also use floating point registers. But then it's his responsibility to clean up the mess he leaves behind. I am neither convinced by what you said nor do I have much reason to oppose :) Scott, I remember you mentioned that the host can use debug resources; can you comment on this? I thought the conclusion we reached was that it was OK as long as KVM waits until it actually needs the debug resources to mess with the registers. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v2 1/6] kvm: add device control API
On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. Both device types and individual attributes can be tested without having to create the device or get/set the attribute, without the need for separately managing enumerated capabilities. Signed-off-by: Scott Wood scottw...@freescale.com Some comments below... diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 976eb65..77328aa 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_DEVICE + +Capability: KVM_CAP_DEVICE_CTRL I notice this patch doesn't add this capability; you add it in a later patch. Since this patch adds the KVM_CREATE_DEVICE ioctl, it probably should add the KVM_CAP_DEVICE_CTRL capability too. 
+Type: vm ioctl +Parameters: struct kvm_create_device (in/out) +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device type is unknown or unsupported + EEXIST: Device already created, and this type of device may not + be instantiated multiple times + ENOSPC: Too many devices have been created Is this still a possible error code? --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg { u64 dac[KVMPPC_BOOKE_MAX_DAC]; }; +#define KVMPPC_IRQCHIP_NONE 0 +#define KVMPPC_IRQCHIP_MPIC 1 This define should go in the patch that adds the MPIC device. struct kvm_vcpu_arch { ulong host_stack; u32 host_pid; @@ -549,6 +552,9 @@ struct kvm_vcpu_arch { unsigned long magic_page_pa; /* phys addr to map the magic page to */ unsigned long magic_page_ea; /* effect. addr to map the magic page to */ + int irqchip_type; + void *irqchip_priv; Since you add this (irqchip_priv) only to remove it in a later patch and replace it by a device-specific pointer, why bother adding it here? And why not give irqchip_type the name it ultimately ends up with? diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c index 16b4595..bdfa526 100644 --- a/arch/powerpc/kvm/powerpc.c +++ b/arch/powerpc/kvm/powerpc.c @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu) tasklet_kill(&vcpu->arch.tasklet); kvmppc_remove_vcpu_debugfs(vcpu); + + switch (vcpu->arch.irqchip_type) { + case KVMPPC_IRQCHIP_MPIC: + mpic_put(vcpu->arch.irqchip_priv); + break; + } This is going to break bisection, since you don't define mpic_put() in this patch.
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 74d0ff3..20ce2d2 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info { #define KVM_CAP_PPC_EPR 86 #define KVM_CAP_ARM_PSCI 87 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88 +#define KVM_CAP_DEVICE_CTRL 89 #ifdef KVM_CAP_IRQ_ROUTING @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping { #define KVM_ARM_SET_DEVICE_ADDR_IOW(KVMIO, 0xab, struct kvm_arm_device_addr) /* + * Device control API, available with KVM_CAP_DEVICE_CTRL + */ +#define KVM_CREATE_DEVICE_TEST 1 + +struct kvm_create_device { + __u32 type; /* in: KVM_DEV_TYPE_xxx */ + __u32 fd; /* out: device handle */ + __u32 flags; /* in: KVM_CREATE_DEVICE_xxx */ +}; + +struct kvm_device_attr { + __u32 flags; /* no flags currently defined */ + __u32 group; /* device-defined */ + __u64 attr; /* group-defined */ + __u64 addr; /* userspace address of attr data */ +}; + +/* ioctl for vm fd */ +#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device) This define should go with the other VM ioctls, otherwise the next person to add a VM ioctl will probably miss it and reuse the 0xe0 code. Paul. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info
Re: [RFC PATCH v2 1/6] kvm: add device control API
On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device
> > state, it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> >
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> >
> > Both device types and individual attributes can be tested without
> > having to create the device or get/set the attribute, without the
> > need for separately managing enumerated capabilities.
> >
> > Signed-off-by: Scott Wood scottw...@freescale.com
>
> Some comments below...
>
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index 976eb65..77328aa 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
> >  written, then `n_invalid' invalid entries, invalidating any previously
> >  valid entries found.
> >
> > +4.79 KVM_CREATE_DEVICE
> > +
> > +Capability: KVM_CAP_DEVICE_CTRL
>
> I notice this patch doesn't add this capability;

Yes, it does (see below).

> you add it in a later patch.

Maybe you're thinking of KVM_CAP_IRQ_MPIC?

> > +Type: vm ioctl
> > +Parameters: struct kvm_create_device (in/out)
> > +Returns: 0 on success, -1 on error
> > +Errors:
> > +  ENODEV: The device type is unknown or unsupported
> > +  EEXIST: Device already created, and this type of device may not
> > +          be instantiated multiple times
> > +  ENOSPC: Too many devices have been created
>
> Is this still a possible error code?

If you mean ENOSPC, probably not -- it'd be replaced with whatever
errors can come out of creating a file descriptor.

> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
> >          u64 dac[KVMPPC_BOOKE_MAX_DAC];
> >  };
> >
> > +#define KVMPPC_IRQCHIP_NONE 0
> > +#define KVMPPC_IRQCHIP_MPIC 1
>
> This define should go in the patch that adds the MPIC device.
>
> >  struct kvm_vcpu_arch {
> >          ulong host_stack;
> >          u32 host_pid;
> > @@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
> >          unsigned long magic_page_pa; /* phys addr to map the magic page to */
> >          unsigned long magic_page_ea; /* effect. addr to map the magic page to */
> >
> > +        int irqchip_type;
> > +        void *irqchip_priv;
>
> Since you add this (irqchip_priv) only to remove it in a later patch
> and replace it by a device-specific pointer, why bother adding it
> here?  And why not give irqchip_type the name it ultimately ends up
> with?

Oops...  These were patch shuffling accidents and will be removed from
the next iteration.

> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 16b4595..bdfa526 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> >          tasklet_kill(&vcpu->arch.tasklet);
> >
> >          kvmppc_remove_vcpu_debugfs(vcpu);
> > +
> > +        switch (vcpu->arch.irqchip_type) {
> > +        case KVMPPC_IRQCHIP_MPIC:
> > +                mpic_put(vcpu->arch.irqchip_priv);
> > +                break;
> > +        }
>
> This is going to break bisection, since you don't define mpic_put() in
> this patch.

Sigh.  Something got messed up; I'll try to sort it out and resubmit.

> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 74d0ff3..20ce2d2 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> >  #define KVM_CAP_PPC_EPR 86
> >  #define KVM_CAP_ARM_PSCI 87
> >  #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> > +#define KVM_CAP_DEVICE_CTRL 89

See, here's the capability. :-)

> >  /*
> > + * Device control API, available with KVM_CAP_DEVICE_CTRL
> > + */
> > +#define KVM_CREATE_DEVICE_TEST 1
> > +
> > +struct kvm_create_device {
> > +        __u32 type;  /* in: KVM_DEV_TYPE_xxx */
> > +        __u32 fd;    /* out: device handle */
> > +        __u32 flags; /* in: KVM_CREATE_DEVICE_xxx */
> > +};
> > +
> > +struct kvm_device_attr {
> > +        __u32 flags; /* no flags currently defined */
> > +        __u32 group; /* device-defined */
> > +        __u64 attr;  /* group-defined */
> > +        __u64 addr;  /* userspace address of attr data */
> > +};
> > +
> > +/* ioctl for vm fd */
> > +#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
>
> This define should go with the other VM ioctls, otherwise the next
> person to add a VM ioctl will probably miss it and reuse the 0xe0
> code.

That's actually why I moved it to a new
Re: [RFC PATCH v2 1/6] kvm: add device control API
On 04/03/2013 01:30 AM, Scott Wood wrote:
> On 04/02/2013 01:59:57 AM, tiejun.chen wrote:
> > On 04/02/2013 06:47 AM, Scott Wood wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index ff71541..ed033c0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2158,6 +2158,17 @@ out:
> > >  }
> > >  #endif
> > >
> > > +static int kvm_ioctl_create_device(struct kvm *kvm,
> > > +                                   struct kvm_create_device *cd)
> > > +{
> > > +        bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> > > +
> > > +        switch (cd->type) {
> > > +        default:
> > > +                return -ENODEV;
> > > +        }
> >
> > Even after applying patch 5, it looks like something like this is
> > still missing here:
> >
> > if (test)
> >         WARN_ON_ONCE(!cd->type);
>
> Why?  How does userspace passing in a bad type value mean the kernel
> needs to report internal badness, why is a value of zero worse than
> any other bad value, and why only when the test flag is set?

I just mean we need to do something here, since it looks like the
'test' variable is defined but unused, right?  But please correct this
as you see fit :)

And if userspace can't guarantee that cd->type is never zero, we should
return -ENODEV as well after that switch().

Tiejun
Re: [RFC PATCH v2 1/6] kvm: add device control API
On 04/03/2013 09:34 AM, Scott Wood wrote:
> On 04/02/2013 08:28:01 PM, tiejun.chen wrote:
> > On 04/03/2013 01:30 AM, Scott Wood wrote:
> > > On 04/02/2013 01:59:57 AM, tiejun.chen wrote:
> > > > On 04/02/2013 06:47 AM, Scott Wood wrote:
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > index ff71541..ed033c0 100644
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > > @@ -2158,6 +2158,17 @@ out:
> > > > >  }
> > > > >  #endif
> > > > >
> > > > > +static int kvm_ioctl_create_device(struct kvm *kvm,
> > > > > +                                   struct kvm_create_device *cd)
> > > > > +{
> > > > > +        bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> > > > > +
> > > > > +        switch (cd->type) {
> > > > > +        default:
> > > > > +                return -ENODEV;
> > > > > +        }
> > > >
> > > > Even after applying patch 5, it looks like something like this
> > > > is still missing here:
> > > >
> > > > if (test)
> > > >         WARN_ON_ONCE(!cd->type);
> > >
> > > Why?  How does userspace passing in a bad type value mean the
> > > kernel needs to report internal badness, why is a value of zero
> > > worse than any other bad value, and why only when the test flag
> > > is set?
> >
> > I just mean we need to do something here, since it looks like the
> > 'test' variable is defined but unused, right?  But please correct
> > this as you see fit :)
>
> Yes, it's unused in this patch, but is used after patch 5 is applied.
> I didn't think it was worth adding a temporary unused annotation,
> since this part of the kernel doesn't use -Werror.

Yes, it's acceptable in the !-Werror case if we shouldn't warn about
anything, as you said.

Tiejun
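The exchange above turns on how the `test` flag interacts with the type switch. What follows is only a userspace sketch, not the code from patch 5: `demo_create_device()`, `DEMO_DEV_TYPE_MPIC`, the placeholder handle value and the instantiation counter are all hypothetical names made up for illustration. It shows one way the flag could be consumed once a real device type exists, and why a type of zero needs no special case: it already falls into `default`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the value of KVM_CREATE_DEVICE_TEST from the patch. */
#define DEMO_CREATE_DEVICE_TEST 1u

/* Hypothetical device type number, for illustration only. */
#define DEMO_DEV_TYPE_MPIC 1u

/* Local stand-in with the same layout as struct kvm_create_device. */
struct demo_create_device {
	uint32_t type;	/* in: device type */
	uint32_t fd;	/* out: device handle */
	uint32_t flags;	/* in: DEMO_CREATE_DEVICE_xxx */
};

/* Counts real instantiations so a caller can see that "test" creates nothing. */
static int demo_devices_created;

static int demo_create_device(struct demo_create_device *cd)
{
	bool test;

	assert(cd != NULL);
	test = cd->flags & DEMO_CREATE_DEVICE_TEST;

	switch (cd->type) {
	case DEMO_DEV_TYPE_MPIC:
		if (test)
			return 0;	/* type is supported; create nothing */
		demo_devices_created++;
		cd->fd = 3;		/* placeholder handle, not a real fd */
		return 0;
	default:
		/* Any unknown type, including zero, ends up here. */
		return -ENODEV;
	}
}
```

With this shape, the flag is read only inside the cases that know how to create a device, so an unknown type gets -ENODEV whether or not the flag is set.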
[RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC
Enabling this capability connects the vcpu to the designated in-kernel MPIC. Using explicit connections between vcpus and irqchips allows for flexibility, but the main benefit at the moment is that it simplifies the code -- KVM doesn't need vm-global state to remember which MPIC object is associated with this vm, and it doesn't need to care about ordering between irqchip creation and vcpu creation. Signed-off-by: Scott Wood scottw...@freescale.com --- Documentation/virtual/kvm/api.txt |8 ++ arch/powerpc/include/asm/kvm_host.h |8 ++ arch/powerpc/include/asm/kvm_ppc.h |2 ++ arch/powerpc/kvm/booke.c|4 ++- arch/powerpc/kvm/mpic.c | 49 +++ arch/powerpc/kvm/powerpc.c | 26 +++ include/uapi/linux/kvm.h|1 + 7 files changed, 92 insertions(+), 6 deletions(-) diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index d52f3f9..4c326ae 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector. When disabled (args[0] == 0), behavior is as if this facility is unsupported. When this capability is enabled, KVM_EXIT_EPR can occur. + +6.6 KVM_CAP_IRQ_MPIC + +Architectures: ppc +Parameters: args[0] is the MPIC device fd +args[1] is the MPIC CPU number for this vcpu + +This capability connects the vcpu to an in-kernel MPIC device. diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 7e7aef9..2a2e235 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg { u64 dac[KVMPPC_BOOKE_MAX_DAC]; }; +#define KVMPPC_IRQ_DEFAULT 0 +#define KVMPPC_IRQ_MPIC1 + +struct openpic; + struct kvm_vcpu_arch { ulong host_stack; u32 host_pid; @@ -554,6 +559,9 @@ struct kvm_vcpu_arch { unsigned long magic_page_pa; /* phys addr to map the magic page to */ unsigned long magic_page_ea; /* effect. 
addr to map the magic page to */ + int irq_type; /* one of KVM_IRQ_* */ + struct openpic *mpic; /* KVM_IRQ_MPIC */ + #ifdef CONFIG_KVM_BOOK3S_64_HV struct kvm_vcpu_arch_shared shregs; diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index 3b63b97..f54707f 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -276,6 +276,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr) } void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu); +int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu, +u32 cpu); int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu, struct kvm_config_tlb *cfg); diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index cddc6b3..7d00222 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu, if (update_epr == true) { if (vcpu-arch.epr_flags KVMPPC_EPR_USER) kvm_make_request(KVM_REQ_EPR_EXIT, vcpu); - else if (vcpu-arch.epr_flags KVMPPC_EPR_KERNEL) + else if (vcpu-arch.epr_flags KVMPPC_EPR_KERNEL) { + BUG_ON(vcpu-arch.irq_type != KVMPPC_IRQ_MPIC); kvmppc_mpic_set_epr(vcpu); + } } new_msr = msr_mask; diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c index 8cda2fa..caffe3b 100644 --- a/arch/powerpc/kvm/mpic.c +++ b/arch/powerpc/kvm/mpic.c @@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst, void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu) { - struct openpic *opp = vcpu-arch.irqchip_priv; + struct openpic *opp = vcpu-arch.mpic; int cpu = vcpu-vcpu_id; unsigned long flags; @@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp) static void unmap_mmio(struct openpic *opp) { - BUG_ON(opp-mmio_mapped); - opp-mmio_mapped = false; - - kvm_io_bus_unregister_dev(opp-kvm, KVM_MMIO_BUS, opp-mmio); + if (opp-mmio_mapped) { + opp-mmio_mapped = false; + kvm_io_bus_unregister_dev(opp-kvm, 
KVM_MMIO_BUS, opp-mmio); + } } static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr) @@ -1681,6 +1681,45 @@ static const struct file_operations kvm_mpic_fops = { .release = kvm_mpic_release, }; +int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu, +u32 cpu) +{ + struct openpic *opp = mpic_filp-private_data; + int ret = 0;
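The booke.c hunk in this patch relies on the rule stated in patch 5/6 that KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL. A small standalone sketch of that precedence follows. The `DEMO_` names and the action enum are invented here for illustration; the real code requests a userspace exit or calls kvmppc_mpic_set_epr(), and also checks irq_type with BUG_ON, none of which is modelled.

```c
#include <assert.h>

/* Same numeric values as the KVMPPC_EPR_xxx defines quoted in patch 5/6. */
#define DEMO_EPR_NONE   0u	/* EPR not supported */
#define DEMO_EPR_USER   1u	/* exit to userspace to fill EPR */
#define DEMO_EPR_KERNEL 2u	/* in-kernel irqchip fills EPR */

/* Hypothetical result type standing in for the real side effects. */
enum demo_epr_action {
	DEMO_EPR_DO_NOTHING,
	DEMO_EPR_EXIT_TO_USERSPACE,
	DEMO_EPR_ASK_IN_KERNEL_MPIC,
};

static enum demo_epr_action demo_epr_dispatch(unsigned int epr_flags)
{
	/* Only the two known flag bits are expected. */
	assert((epr_flags & ~(DEMO_EPR_USER | DEMO_EPR_KERNEL)) == 0);

	if (epr_flags & DEMO_EPR_USER)
		return DEMO_EPR_EXIT_TO_USERSPACE;	/* checked first: wins */
	if (epr_flags & DEMO_EPR_KERNEL)
		return DEMO_EPR_ASK_IN_KERNEL_MPIC;
	return DEMO_EPR_DO_NOTHING;
}
```

Because the user bit is tested first, setting both bits behaves the same as setting only the user bit.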
[RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation
Hook the MPIC code up to the KVM interfaces, add locking, etc. TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE support Signed-off-by: Scott Wood scottw...@freescale.com --- v3: mpic_put - kvmppc_mpic_put Documentation/virtual/kvm/devices/mpic.txt | 37 ++ arch/powerpc/include/asm/kvm_host.h|8 +- arch/powerpc/include/asm/kvm_ppc.h |7 + arch/powerpc/kvm/Kconfig |5 + arch/powerpc/kvm/Makefile |2 + arch/powerpc/kvm/booke.c | 10 +- arch/powerpc/kvm/mpic.c| 814 +--- arch/powerpc/kvm/powerpc.c | 12 +- include/linux/kvm_host.h |2 + include/uapi/linux/kvm.h |9 + virt/kvm/kvm_main.c|9 + 11 files changed, 714 insertions(+), 201 deletions(-) create mode 100644 Documentation/virtual/kvm/devices/mpic.txt diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt new file mode 100644 index 000..79e000a --- /dev/null +++ b/Documentation/virtual/kvm/devices/mpic.txt @@ -0,0 +1,37 @@ +MPIC interrupt controller += + +Device types supported: + KVM_DEV_TYPE_FSL_MPIC_20 Freescale MPIC v2.0 + KVM_DEV_TYPE_FSL_MPIC_42 Freescale MPIC v4.2 + +Only one MPIC instance, of any type, may be instantiated. The created +MPIC will act as the system interrupt controller, connecting to each +vcpu's interrupt inputs. + +Groups: + KVM_DEV_MPIC_GRP_MISC + Attributes: +KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit) + Base address of the 256 KiB MPIC register space. Must be + naturally aligned. A value of zero disables the mapping. + Reset value is zero. + + KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit) +Access an MPIC register, as if the access were made from the guest. +attr is the byte offset into the MPIC register space. Accesses +must be 4-byte aligned. + +MSIs may be signaled by using this attribute group to write +to the relevant MSIIR. + + KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit) +IRQ input line for each standard openpic source. 0 is inactive and 1 +is active, regardless of interrupt sense. 
+ +For edge-triggered interrupts: Writing 1 is considered an activating +edge, and writing 0 is ignored. Reading returns 1 if a previously +signaled edge has not been acknowledged, and 0 otherwise. + +attr is the IRQ number. IRQ numbers for standard sources are the +byte offset of the relevant IVPR from EIVPR0, divided by 32. diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index e34f8fe..7e7aef9 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -359,6 +359,11 @@ struct kvmppc_slb { #define KVMPPC_BOOKE_MAX_IAC 4 #define KVMPPC_BOOKE_MAX_DAC 2 +/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */ +#define KVMPPC_EPR_NONE0 /* EPR not supported */ +#define KVMPPC_EPR_USER1 /* exit to userspace to fill EPR */ +#define KVMPPC_EPR_KERNEL 2 /* in-kernel irqchip */ + struct kvmppc_booke_debug_reg { u32 dbcr0; u32 dbcr1; @@ -522,7 +527,7 @@ struct kvm_vcpu_arch { u8 sane; u8 cpu_type; u8 hcall_needed; - u8 epr_enabled; + u8 epr_flags; /* KVMPPC_EPR_xxx */ u8 epr_needed; u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */ @@ -589,5 +594,6 @@ struct kvm_vcpu_arch { #define KVM_MMIO_REG_FQPR 0x0060 #define __KVM_HAVE_ARCH_WQP +#define __KVM_HAVE_CREATE_DEVICE #endif /* __POWERPC_KVM_HOST_H__ */ diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h index f589307..3b63b97 100644 --- a/arch/powerpc/include/asm/kvm_ppc.h +++ b/arch/powerpc/include/asm/kvm_ppc.h @@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu); extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *); +int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq); + /* * Cuts out inst bits with ordering according to spec. * That means the leftmost bit is zero. All given bits are included. 
@@ -245,6 +247,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *); void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid); +struct openpic; +void kvmppc_mpic_put(struct openpic *opp); + #ifdef CONFIG_KVM_BOOK3S_64_HV static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr) { @@ -270,6 +275,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr) #endif } +void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu); + int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu, struct kvm_config_tlb *cfg); int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu, diff --git a/arch/powerpc/kvm/Kconfig
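The constraints in mpic.txt above (naturally aligned 256 KiB base address, 4-byte aligned register accesses, IRQ number derived from the IVPR offset) can be sketched as a few userspace helpers. These are illustrative only: the `demo_` function names are not part of the patch, and the upper bound on the register offset is an assumption inferred from the stated size of the register space rather than something the documentation spells out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MPIC_REG_SPACE_SIZE (256u * 1024u)	/* 256 KiB, per mpic.txt */
#define DEMO_MPIC_IVPR_STRIDE    32u		/* divisor given in mpic.txt */

/* KVM_DEV_MPIC_BASE_ADDR: must be naturally aligned; zero disables the mapping. */
static bool demo_mpic_base_addr_ok(uint64_t addr)
{
	return (addr % DEMO_MPIC_REG_SPACE_SIZE) == 0;
}

/*
 * KVM_DEV_MPIC_GRP_REGISTER: attr is a byte offset and must be 4-byte
 * aligned.  The range check is an assumption based on the 256 KiB size.
 */
static bool demo_mpic_reg_offset_ok(uint64_t offset)
{
	return offset < DEMO_MPIC_REG_SPACE_SIZE && (offset % 4) == 0;
}

/*
 * KVM_DEV_MPIC_GRP_IRQ_ACTIVE: attr is the IRQ number, defined as the
 * byte offset of the source's IVPR from EIVPR0, divided by 32.
 */
static uint64_t demo_mpic_irq_from_ivpr_offset(uint64_t offset_from_eivpr0)
{
	assert(offset_from_eivpr0 % DEMO_MPIC_IVPR_STRIDE == 0);
	return offset_from_eivpr0 / DEMO_MPIC_IVPR_STRIDE;
}
```

A caller would run checks like these before filling in struct kvm_device_attr, so that an obviously invalid value is caught before the ioctl is issued.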
[RFC PATCH v3 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU
This is QEMU's hw/openpic.c from commit abd8d4a4d6dfea7ddea72f095f993e1de941614e (Update version for 1.4.0-rc0), run through Lindent with no other changes to ease merging future changes between Linux and QEMU. Remaining style issues (including those introduced by Lindent) will be fixed in a later patch. Signed-off-by: Scott Wood scottw...@freescale.com --- arch/powerpc/kvm/mpic.c | 1686 +++ 1 file changed, 1686 insertions(+) create mode 100644 arch/powerpc/kvm/mpic.c diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c new file mode 100644 index 000..57655b9 --- /dev/null +++ b/arch/powerpc/kvm/mpic.c @@ -0,0 +1,1686 @@ +/* + * OpenPIC emulation + * + * Copyright (c) 2004 Jocelyn Mayer + * 2011 Alexander Graf + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ +/* + * + * Based on OpenPic implementations: + * - Intel GW80314 I/O companion chip developer's manual + * - Motorola MPC8245 MPC8540 user manuals. 
+ * - Motorola MCP750 (aka Raven) programmer manual. + * - Motorola Harrier programmer manuel + * + * Serial interrupts, as implemented in Raven chipset are not supported yet. + * + */ +#include hw.h +#include ppc/mac.h +#include pci/pci.h +#include openpic.h +#include sysbus.h +#include pci/msi.h +#include qemu/bitops.h +#include ppc.h + +//#define DEBUG_OPENPIC + +#ifdef DEBUG_OPENPIC +static const int debug_openpic = 1; +#else +static const int debug_openpic = 0; +#endif + +#define DPRINTF(fmt, ...) do { \ +if (debug_openpic) { \ +printf(fmt , ## __VA_ARGS__); \ +} \ +} while (0) + +#define MAX_CPU 32 +#define MAX_SRC 256 +#define MAX_TMR 4 +#define MAX_IPI 4 +#define MAX_MSI 8 +#define MAX_IRQ (MAX_SRC + MAX_IPI + MAX_TMR) +#define VID 0x03 /* MPIC version ID */ + +/* OpenPIC capability flags */ +#define OPENPIC_FLAG_IDR_CRIT (1 0) +#define OPENPIC_FLAG_ILR (2 0) + +/* OpenPIC address map */ +#define OPENPIC_GLB_REG_START0x0 +#define OPENPIC_GLB_REG_SIZE 0x10F0 +#define OPENPIC_TMR_REG_START0x10F0 +#define OPENPIC_TMR_REG_SIZE 0x220 +#define OPENPIC_MSI_REG_START0x1600 +#define OPENPIC_MSI_REG_SIZE 0x200 +#define OPENPIC_SUMMARY_REG_START 0x3800 +#define OPENPIC_SUMMARY_REG_SIZE0x800 +#define OPENPIC_SRC_REG_START0x1 +#define OPENPIC_SRC_REG_SIZE (MAX_SRC * 0x20) +#define OPENPIC_CPU_REG_START0x2 +#define OPENPIC_CPU_REG_SIZE 0x100 + ((MAX_CPU - 1) * 0x1000) + +/* Raven */ +#define RAVEN_MAX_CPU 2 +#define RAVEN_MAX_EXT 48 +#define RAVEN_MAX_IRQ 64 +#define RAVEN_MAX_TMR MAX_TMR +#define RAVEN_MAX_IPI MAX_IPI + +/* Interrupt definitions */ +#define RAVEN_FE_IRQ (RAVEN_MAX_EXT) /* Internal functional IRQ */ +#define RAVEN_ERR_IRQ(RAVEN_MAX_EXT + 1) /* Error IRQ */ +#define RAVEN_TMR_IRQ(RAVEN_MAX_EXT + 2) /* First timer IRQ */ +#define RAVEN_IPI_IRQ(RAVEN_TMR_IRQ + RAVEN_MAX_TMR) /* First IPI IRQ */ +/* First doorbell IRQ */ +#define RAVEN_DBL_IRQ(RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI)) + +typedef struct FslMpicInfo { + int max_ext; +} FslMpicInfo; + 
+static FslMpicInfo fsl_mpic_20 = { + .max_ext = 12, +}; + +static FslMpicInfo fsl_mpic_42 = { + .max_ext = 12, +}; + +#define FRR_NIRQ_SHIFT16 +#define FRR_NCPU_SHIFT 8 +#define FRR_VID_SHIFT 0 + +#define VID_REVISION_1_2 2 +#define VID_REVISION_1_3 3 + +#define VIR_GENERIC 0x/* Generic Vendor ID */ + +#define GCR_RESET0x8000 +#define GCR_MODE_PASS0x +#define GCR_MODE_MIXED 0x2000 +#define GCR_MODE_PROXY 0x6000 + +#define TBCR_CI 0x8000 /* count inhibit */ +#define TCCR_TOG 0x8000 /* toggles when decrement to zero */ + +#define IDR_EP_SHIFT 31 +#define IDR_EP_MASK (1 IDR_EP_SHIFT)
[RFC PATCH v3 1/6] kvm: add device control API
Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, without the need for
separately managing enumerated capabilities.

Signed-off-by: Scott Wood scottw...@freescale.com
---
v3: remove some changes that were merged into this patch by accident,
and fix the error documentation for KVM_CREATE_DEVICE.

NOTE: I had some difficulty figuring out what ioctl numbers I should
assign... it seems that at one point care was taken to keep vcpu and
vm ioctls separate, but some overlap exists now (despite not exhausting
the ioctl space).  Some of that was my fault, but not all of it. :-)
I moved to a new ioctl range for device control -- please let me know
if there's something else you'd prefer I do.
--- Documentation/virtual/kvm/api.txt| 70 ++ Documentation/virtual/kvm/devices/README |1 + include/uapi/linux/kvm.h | 27 virt/kvm/kvm_main.c | 31 + 4 files changed, 129 insertions(+) create mode 100644 Documentation/virtual/kvm/devices/README diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 976eb65..d52f3f9 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_DEVICE + +Capability: KVM_CAP_DEVICE_CTRL +Type: vm ioctl +Parameters: struct kvm_create_device (in/out) +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device type is unknown or unsupported + EEXIST: Device already created, and this type of device may not + be instantiated multiple times + + Other error conditions may be defined by individual device types or + have their standard meanings. + +Creates an emulated device in the kernel. The file descriptor returned +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR. + +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the +device type is supported (not necessarily whether it can be created +in the current vm). + +Individual devices should not define flags. Attributes should be used +for specifying any behavior that is not implied by the device type +number. + +struct kvm_create_device { + __u32 type; /* in: KVM_DEV_TYPE_xxx */ + __u32 fd; /* out: device handle */ + __u32 flags; /* in: KVM_CREATE_DEVICE_xxx */ +}; + +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR + +Capability: KVM_CAP_DEVICE_CTRL +Type: device ioctl +Parameters: struct kvm_device_attr +Returns: 0 on success, -1 on error +Errors: + ENXIO: The group or attribute is unknown/unsupported for this device + EPERM: The attribute cannot (currently) be accessed this way + (e.g. 
read-only attribute, or attribute that only makes + sense when the device is in a different state) + + Other error conditions may be defined by individual device types. + +Gets/sets a specified piece of device configuration and/or state. The +semantics are device-specific. See individual device documentation in +the devices directory. As with ONE_REG, the size of the data +transferred is defined by the particular attribute. + +struct kvm_device_attr { + __u32 flags; /* no flags currently defined */ + __u32 group; /* device-defined */ + __u64 attr; /* group-defined */ + __u64 addr; /* userspace address of attr data */ +}; + +4.81 KVM_HAS_DEVICE_ATTR + +Capability: KVM_CAP_DEVICE_CTRL +Type: device ioctl +Parameters: struct kvm_device_attr +Returns: 0 on success, -1 on error +Errors: + ENXIO: The group or attribute is unknown/unsupported for this device + +Tests whether a device supports a particular attribute. A successful +return indicates the attribute is implemented. It does not necessarily +indicate that the attribute can be read or written in the device's +current state. addr is ignored. 4.77 KVM_ARM_VCPU_INIT diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README new file mode 100644 index 000..34a6983
Re: [RFC PATCH v2 1/6] kvm: add device control API
On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> > On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> > > +4.79 KVM_CREATE_DEVICE
> > > +
> > > +Capability: KVM_CAP_DEVICE_CTRL
> >
> > I notice this patch doesn't add this capability;
>
> Yes, it does (see below).
>
> > you add it in a later patch.
>
> Maybe you're thinking of KVM_CAP_IRQ_MPIC?

No, I was referring to the addition to kvm_dev_ioctl_check_extension()
of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to
handle the KVM_CREATE_DEVICE ioctl, it should also add the code to
return 1 if userspace queries the KVM_CAP_DEVICE_CTRL capability.

> > > +/* ioctl for vm fd */
> > > +#define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)
> >
> > This define should go with the other VM ioctls, otherwise the next
> > person to add a VM ioctl will probably miss it and reuse the 0xe0
> > code.
>
> That's actually why I moved it to a new section, with device control
> ioctls getting their own range, as the legacy device model and some
> other things did.  0xe0 is not the next ioctl that would be used for
> either vm or vcpu.  The ioctl numbering is actually already a mess,
> with sometimes care being taken to keep vcpu and vm ioctls from
> overlapping, but in other places overlapping does happen.  I'm not
> sure what exactly I should do here.

Well, even if you are using a new range, I still think that
KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
ioctls.  I guess it's ultimately up to the maintainers.

Paul.
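For readers following the numbering argument, here is a small Linux-only sketch of what "reusing the 0xe0 code" means in terms of the ioctl encoding macros. It assumes the kernel's userspace header <linux/ioctl.h> is available. The `DEMO_` and `demo_` names are local stand-ins: the struct mirrors the layout of struct kvm_create_device from the patch, and 0xAE is the value of KVMIO in linux/kvm.h. The collision helper reflects only the convention that a number should be unique within its type; it is not a kernel facility.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/ioctl.h>	/* _IO, _IOW, _IOWR, _IOC_NR, _IOC_TYPE, _IOC_SIZE */

#define DEMO_KVMIO 0xAE		/* same value as KVMIO in linux/kvm.h */

/* Local mirror of struct kvm_create_device from the patch. */
struct demo_create_device {
	uint32_t type;
	uint32_t fd;
	uint32_t flags;
};

static_assert(sizeof(struct demo_create_device) == 12,
	      "three 32-bit fields, no padding expected");

/* Encoded the same way as KVM_CREATE_DEVICE in the patch. */
#define DEMO_CREATE_DEVICE _IOWR(DEMO_KVMIO, 0xe0, struct demo_create_device)

static unsigned int demo_ioctl_type(unsigned long cmd) { return _IOC_TYPE(cmd); }
static unsigned int demo_ioctl_nr(unsigned long cmd)   { return _IOC_NR(cmd); }
static unsigned int demo_ioctl_size(unsigned long cmd) { return _IOC_SIZE(cmd); }

/*
 * Two commands "collide" in the sense discussed in this thread when they
 * share both the type byte and the sequence number, even if their size or
 * direction bits make the full command values differ.
 */
static bool demo_ioctl_collides(unsigned long a, unsigned long b)
{
	return demo_ioctl_type(a) == demo_ioctl_type(b) &&
	       demo_ioctl_nr(a) == demo_ioctl_nr(b);
}
```

Under this convention, a later vm ioctl defined with number 0xe0 would collide with the device-control one regardless of which part of the header it lives in, which is the risk both sides of the thread are trying to avoid by different means.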