Re: [PATCH 2/4] powerpc kvm: added multiple TCEs requests support
On 15/02/13 14:24, Paul Mackerras wrote: On Mon, Feb 11, 2013 at 11:12:41PM +1100, a...@ozlabs.ru wrote: +static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt, + unsigned long ioba, unsigned long tce) +{ + unsigned long idx = ioba SPAPR_TCE_SHIFT; + struct page *page; + u64 *tbl; + + /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p window_size=0x%x\n, */ + /* liobn, stt, stt-window_size); */ + if (ioba = stt-window_size) { + pr_err(%s failed on ioba=%lx\n, __func__, ioba); Doesn't this give the guest a way to spam the host logs? And in fact printk in real mode is potentially problematic. I would just leave out this statement. + return H_PARAMETER; + } + + page = stt-pages[idx / TCES_PER_PAGE]; + tbl = (u64 *)page_address(page); I would like to see an explanation of why we are confident that page_address() will work correctly in real mode, across all the combinations of config options that we can have for a ppc64 book3s kernel. It was there before this patch, I just moved it so I would think it has been explained before :) There is no combination on PPC to get WANT_PAGE_VIRTUAL enabled. CONFIG_HIGHMEM is supported for PPC32 only so HASHED_PAGE_VIRTUAL is not enabled on PPC64 either. So this definition is supposed to work on PPC64: #define page_address(page) lowmem_page_address(page) where lowmem_page_address() is arithmetic operation on a page struct address: static __always_inline void *lowmem_page_address(const struct page *page) { return __va(PFN_PHYS(page_to_pfn(page))); } PPC32 will use page_address() from mm/highmem.c, I need some lesson about memory layout in 32bit but for now I cannot see how it can possibly fail here. + + /* FIXME: Need to validate the TCE itself */ + /* udbg_printf(tce @ %p\n, tbl[idx % TCES_PER_PAGE]); */ + tbl[idx % TCES_PER_PAGE] = tce; + + return H_SUCCESS; +} + +/* + * Real mode handlers */ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvm *kvm = vcpu-kvm; struct kvmppc_spapr_tce_table *stt; - /* udbg_printf(H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n, */ - /* liobn, ioba, tce); */ + stt = find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!stt) + return H_TOO_HARD; + + /* Emulated IO */ + return emulated_h_put_tce(stt, ioba, tce); +} + +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *stt; + long i, ret = 0; + unsigned long *tces; + + stt = find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!stt) + return H_TOO_HARD; - list_for_each_entry(stt, kvm-arch.spapr_tce_tables, list) { - if (stt-liobn == liobn) { - unsigned long idx = ioba SPAPR_TCE_SHIFT; - struct page *page; - u64 *tbl; + tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL); + if (!tces) + return H_TOO_HARD; - /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p window_size=0x%x\n, */ - /* liobn, stt, stt-window_size); */ - if (ioba = stt-window_size) - return H_PARAMETER; + /* Emulated IO */ + for (i = 0; (i npages) !ret; ++i, ioba += IOMMU_PAGE_SIZE) + ret = emulated_h_put_tce(stt, ioba, tces[i]); So, tces is a pointer to somewhere inside a real page. Did we check somewhere that tces[npages-1] is in the same page as tces[0]? If so, I missed it. If we didn't, then we probably should check and do something about it. - page = stt-pages[idx / TCES_PER_PAGE]; - tbl = (u64 *)page_address(page); + return ret; +} - /* FIXME: Need to validate the TCE itself */ - /* udbg_printf(tce @ %p\n, tbl[idx % TCES_PER_PAGE]); */ - tbl[idx % TCES_PER_PAGE] = tce; - return H_SUCCESS; - } - } +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *stt; + long i, ret = 0; + + stt = find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!stt) + return H_TOO_HARD; - /* Didn't find the liobn, punt it to userspace */ - return H_TOO_HARD; + /* Emulated
Re: [PATCH v3] KVM: nVMX: Improve I/O exit handling
On Mon, Feb 18, 2013 at 07:32:53AM +0100, Jan Kiszka wrote: On 2013-02-14 19:46, Jan Kiszka wrote: This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements basic I/O bitmap handling. Repeated string accesses are still reported to L1 unconditionally for now. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v3: - trap unconditionally if bitmap access fails arch/x86/kvm/vmx.c | 55 ++- 1 files changed, 53 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6667042..2633199 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5908,6 +5908,58 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { static const int kvm_vmx_max_exit_handlers = ARRAY_SIZE(kvm_vmx_exit_handlers); +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) +{ + unsigned long exit_qualification; + gpa_t bitmap, last_bitmap; + bool string, rep; + u16 port; + int size; + u8 b; + + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING)) + return 1; + + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) + return 0; + + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + + string = exit_qualification 16; + rep = exit_qualification 32; + + /* TODO: interpret instruction and check range against bitmap */ + if (string rep) + return 1; Nonsense, rep ins/outs always works against the same port. We can simply drop this check and be done with the feature. I'll come up with v4. Actually this reminds me that we should check range of ports depending on operand size, not one port. But here is a catch, older cpus do not provide operand size as part of exit information. Jan + + port = exit_qualification 16; + size = (exit_qualification 7) + 1; + + last_bitmap = (gpa_t)-1; + b = -1; + + while (size 0) { + if (port 0x8000) + bitmap = vmcs12-io_bitmap_a; + else + bitmap = vmcs12-io_bitmap_b; + bitmap += (port 0x7fff) / 8; + + if (last_bitmap != bitmap) + if (kvm_read_guest(vcpu-kvm, bitmap, b, 1)) + return 1; + if (b (1 (port 7))) + return 1; + + port++; + size--; + last_bitmap = bitmap; + } + + return 0; +} + /* * Return 1 if we should exit from L2 to L1 to handle an MSR access access, * rather than handle it ourselves in L0. I.e., check whether L1 expressed @@ -6097,8 +6149,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_DR_ACCESS: return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING); case EXIT_REASON_IO_INSTRUCTION: - /* TODO: support IO bitmaps */ - return 1; + return nested_vmx_exit_handled_io(vcpu, vmcs12); case EXIT_REASON_MSR_READ: case EXIT_REASON_MSR_WRITE: return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason); -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] KVM: nVMX: Improve I/O exit handling
On 2013-02-18 09:44, Gleb Natapov wrote: On Mon, Feb 18, 2013 at 07:32:53AM +0100, Jan Kiszka wrote: On 2013-02-14 19:46, Jan Kiszka wrote: This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements basic I/O bitmap handling. Repeated string accesses are still reported to L1 unconditionally for now. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v3: - trap unconditionally if bitmap access fails arch/x86/kvm/vmx.c | 55 ++- 1 files changed, 53 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6667042..2633199 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5908,6 +5908,58 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { static const int kvm_vmx_max_exit_handlers = ARRAY_SIZE(kvm_vmx_exit_handlers); +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) +{ + unsigned long exit_qualification; + gpa_t bitmap, last_bitmap; + bool string, rep; + u16 port; + int size; + u8 b; + + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING)) + return 1; + + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) + return 0; + + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + + string = exit_qualification 16; + rep = exit_qualification 32; + + /* TODO: interpret instruction and check range against bitmap */ + if (string rep) + return 1; Nonsense, rep ins/outs always works against the same port. We can simply drop this check and be done with the feature. I'll come up with v4. Actually this reminds me that we should check range of ports depending on operand size, not one port. But here is a catch, older cpus do not provide operand size as part of exit information. You mean what bit 54 in VMX_BASIC is telling us? Too bad. OK, will write v5 which takes this into account. Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH v3] KVM: nVMX: Improve I/O exit handling
On Mon, Feb 18, 2013 at 09:53:22AM +0100, Jan Kiszka wrote: On 2013-02-18 09:44, Gleb Natapov wrote: On Mon, Feb 18, 2013 at 07:32:53AM +0100, Jan Kiszka wrote: On 2013-02-14 19:46, Jan Kiszka wrote: This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements basic I/O bitmap handling. Repeated string accesses are still reported to L1 unconditionally for now. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v3: - trap unconditionally if bitmap access fails arch/x86/kvm/vmx.c | 55 ++- 1 files changed, 53 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6667042..2633199 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5908,6 +5908,58 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { static const int kvm_vmx_max_exit_handlers = ARRAY_SIZE(kvm_vmx_exit_handlers); +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu, +struct vmcs12 *vmcs12) +{ + unsigned long exit_qualification; + gpa_t bitmap, last_bitmap; + bool string, rep; + u16 port; + int size; + u8 b; + + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING)) + return 1; + + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) + return 0; + + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + + string = exit_qualification 16; + rep = exit_qualification 32; + + /* TODO: interpret instruction and check range against bitmap */ + if (string rep) + return 1; Nonsense, rep ins/outs always works against the same port. We can simply drop this check and be done with the feature. I'll come up with v4. Actually this reminds me that we should check range of ports depending on operand size, not one port. But here is a catch, older cpus do not provide operand size as part of exit information. You mean what bit 54 in VMX_BASIC is telling us? Too bad. OK, will write v5 which takes this into account. Yes, this one. We can just exit unconditionally on older cpus. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/9] virtio-blk: use virtqueue_start_buf on req path
Il 17/02/2013 07:37, Asias He ha scritto: static int __virtblk_add_req(struct virtqueue *vq, - struct virtblk_req *vbr) + struct virtblk_req *vbr, + struct scatterlist *data_sg, + unsigned data_nents) { struct scatterlist sg; enum dma_data_direction dir; int ret; + int type = vbr-out_hdr.type ~VIRTIO_BLK_T_OUT; unsigned int nents = 2; unsigned int nsg = 2; - if (vbr-nents) { + if (type == VIRTIO_BLK_T_SCSI_CMD) { + BUG_ON(use_bio); Do we really need the BUG_ON? Even if with use_bio=1, VIRTIO_BLK_T_SCSI_CMD cmd can be fired. See this: I stand corrected... will send the patch with this removed. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v5] KVM: nVMX: Improve I/O exit handling
From: Jan Kiszka jan.kis...@siemens.com This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements I/O bitmap handling. We still exit unconditionally in case the CPU does not provide information for ins/outs. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v5: - still exit unconditionally if CPU refuses to provide exit information on ins/outs arch/x86/kvm/vmx.c | 58 ++- 1 files changed, 56 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6667042..ccc7c17 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -651,6 +651,7 @@ static struct vmcs_config { int size; int order; u32 revision_id; + u32 vmx_basic_high; u32 pin_based_exec_ctrl; u32 cpu_based_exec_ctrl; u32 cpu_based_2nd_exec_ctrl; @@ -752,6 +753,11 @@ static inline bool vm_need_tpr_shadow(struct kvm *kvm) return (cpu_has_vmx_tpr_shadow()) (irqchip_in_kernel(kvm)); } +static inline bool cpu_has_stringio_exit_info(void) +{ + return vmcs_config.vmx_basic_high (VMX_BASIC_INOUT 32); +} + static inline bool cpu_has_secondary_exec_ctrls(void) { return vmcs_config.cpu_based_exec_ctrl @@ -2635,6 +2641,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) vmcs_conf-size = vmx_msr_high 0x1fff; vmcs_conf-order = get_order(vmcs_config.size); vmcs_conf-revision_id = vmx_msr_low; + vmcs_conf-vmx_basic_high = vmx_msr_high; vmcs_conf-pin_based_exec_ctrl = _pin_based_exec_control; vmcs_conf-cpu_based_exec_ctrl = _cpu_based_exec_control; @@ -5908,6 +5915,54 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { static const int kvm_vmx_max_exit_handlers = ARRAY_SIZE(kvm_vmx_exit_handlers); +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) +{ + unsigned long exit_qualification; + gpa_t bitmap, last_bitmap; + u16 port; + int size; + u8 b; + + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING)) + return 1; + + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) + return 0; + + /* TODO: for older CPUs, derive access width from instruction */ + if (!cpu_has_stringio_exit_info()) + return 1; + + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + + port = exit_qualification 16; + size = (exit_qualification 7) + 1; + + last_bitmap = (gpa_t)-1; + b = -1; + + while (size 0) { + if (port 0x8000) + bitmap = vmcs12-io_bitmap_a; + else + bitmap = vmcs12-io_bitmap_b; + bitmap += (port 0x7fff) / 8; + + if (last_bitmap != bitmap) + if (kvm_read_guest(vcpu-kvm, bitmap, b, 1)) + return 1; + if (b (1 (port 7))) + return 1; + + port++; + size--; + last_bitmap = bitmap; + } + + return 0; +} + /* * Return 1 if we should exit from L2 to L1 to handle an MSR access access, * rather than handle it ourselves in L0. I.e., check whether L1 expressed @@ -6097,8 +6152,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_DR_ACCESS: return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING); case EXIT_REASON_IO_INSTRUCTION: - /* TODO: support IO bitmaps */ - return 1; + return nested_vmx_exit_handled_io(vcpu, vmcs12); case EXIT_REASON_MSR_READ: case EXIT_REASON_MSR_WRITE: return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason); -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH ppc-next v2 42/52] target-ppc: Convert CPU definitions
Turn the array of model definitions into a set of self-registering QOM types with their own class_init. Unique identifiers are obtained from the combination of PVR, SVR and family identifiers; this requires all alias #defines to be removed from the list. Possibly there are some more left after this commit that are not currently being compiled. Prepares for introducing abstract intermediate CPU types for families. Keep the right-aligned macro line breaks within 78 chars to aid three-way merges. Signed-off-by: Andreas Färber afaer...@suse.de --- target-ppc/cpu-qom.h| 17 - target-ppc/cpu.h| 20 -- target-ppc/kvm.c| 32 + target-ppc/translate_init.c | 163 +-- 4 Dateien geändert, 115 Zeilen hinzugefügt(+), 117 Zeilen entfernt(-) diff --git a/target-ppc/cpu-qom.h b/target-ppc/cpu-qom.h index b338f8f..7220908 100644 --- a/target-ppc/cpu-qom.h +++ b/target-ppc/cpu-qom.h @@ -51,8 +51,21 @@ typedef struct PowerPCCPUClass { void (*parent_reset)(CPUState *cpu); -/* TODO inline fields here */ -ppc_def_t *info; +uint32_t pvr; +uint32_t svr; +uint64_t insns_flags; +uint64_t insns_flags2; +uint64_t msr_mask; +powerpc_mmu_t mmu_model; +powerpc_excp_t excp_model; +powerpc_input_t bus_model; +uint32_t flags; +int bfd_mach; +#if defined(TARGET_PPC64) +const struct ppc_segment_page_sizes *sps; +#endif +void (*init_proc)(CPUPPCState *env); +int (*check_pow)(CPUPPCState *env); } PowerPCCPUClass; /** diff --git a/target-ppc/cpu.h b/target-ppc/cpu.h index 8c081db..86ebd3a 100644 --- a/target-ppc/cpu.h +++ b/target-ppc/cpu.h @@ -307,7 +307,6 @@ enum powerpc_input_t { #define PPC_INPUT(env) (env-bus_model) /*/ -typedef struct ppc_def_t ppc_def_t; typedef struct opc_handler_t opc_handler_t; /*/ @@ -902,25 +901,6 @@ struct ppc_segment_page_sizes { /* The whole PowerPC CPU context */ #define NB_MMU_MODES 3 -struct ppc_def_t { -const char *name; -uint32_t pvr; -uint32_t svr; -uint64_t insns_flags; -uint64_t insns_flags2; -uint64_t msr_mask; -powerpc_mmu_t mmu_model; -powerpc_excp_t excp_model; -powerpc_input_t bus_model; -uint32_t flags; -int bfd_mach; -#if defined(TARGET_PPC64) -const struct ppc_segment_page_sizes *sps; -#endif -void (*init_proc)(CPUPPCState *env); -int (*check_pow)(CPUPPCState *env); -}; - struct CPUPPCState { /* First are the most commonly used resources * during translated code execution diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c index 2c64c63..e601059 100644 --- a/target-ppc/kvm.c +++ b/target-ppc/kvm.c @@ -1263,7 +1263,7 @@ static void kvmppc_host_cpu_initfn(Object *obj) assert(kvm_enabled()); -if (pcc-info-pvr != mfpvr()) { +if (pcc-pvr != mfpvr()) { fprintf(stderr, Your host CPU is unsupported.\n Please choose a supported model instead, see -cpu ?.\n); exit(1); @@ -1275,30 +1275,38 @@ static void kvmppc_host_cpu_class_init(ObjectClass *oc, void *data) PowerPCCPUClass *pcc = POWERPC_CPU_CLASS(oc); uint32_t host_pvr = mfpvr(); PowerPCCPUClass *pvr_pcc; -ppc_def_t *spec; uint32_t vmx = kvmppc_get_vmx(); uint32_t dfp = kvmppc_get_dfp(); -spec = g_malloc0(sizeof(*spec)); - pvr_pcc = ppc_cpu_class_by_pvr(host_pvr); if (pvr_pcc != NULL) { -memcpy(spec, pvr_pcc-info, sizeof(*spec)); +pcc-pvr = pvr_pcc-pvr; +pcc-svr = pvr_pcc-svr; +pcc-insns_flags = pvr_pcc-insns_flags; +pcc-insns_flags2 = pvr_pcc-insns_flags2; +pcc-msr_mask = pvr_pcc-msr_mask; +pcc-mmu_model= pvr_pcc-mmu_model; +pcc-excp_model = pvr_pcc-excp_model; +pcc-bus_model= pvr_pcc-bus_model; +pcc-flags= pvr_pcc-flags; +pcc-bfd_mach = pvr_pcc-bfd_mach; +#ifdef TARGET_PPC64 +pcc-sps = pvr_pcc-sps; +#endif +pcc-init_proc= pvr_pcc-init_proc; +pcc-check_pow= pvr_pcc-check_pow; } -pcc-info = spec; -/* Override the display name for -cpu ? and QMP */ -pcc-info-name = host; -/* Now fix up the spec with information we can query from the host */ +/* Now fix up the class with information we can query from the host */ if (vmx != -1) { /* Only override when we know what the host supports */ -alter_insns(spec-insns_flags, PPC_ALTIVEC, vmx 0); -alter_insns(spec-insns_flags2, PPC2_VSX, vmx 1); +alter_insns(pcc-insns_flags, PPC_ALTIVEC, vmx 0); +alter_insns(pcc-insns_flags2, PPC2_VSX, vmx 1); } if (dfp != -1) { /* Only override when we know what the host supports */ -
Re: [PATCH v5] KVM: nVMX: Improve I/O exit handling
On Mon, Feb 18, 2013 at 10:17:14AM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements I/O bitmap handling. We still exit unconditionally in case the CPU does not provide information for ins/outs. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v5: - still exit unconditionally if CPU refuses to provide exit information on ins/outs arch/x86/kvm/vmx.c | 58 ++- 1 files changed, 56 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6667042..ccc7c17 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -651,6 +651,7 @@ static struct vmcs_config { int size; int order; u32 revision_id; + u32 vmx_basic_high; u32 pin_based_exec_ctrl; u32 cpu_based_exec_ctrl; u32 cpu_based_2nd_exec_ctrl; @@ -752,6 +753,11 @@ static inline bool vm_need_tpr_shadow(struct kvm *kvm) return (cpu_has_vmx_tpr_shadow()) (irqchip_in_kernel(kvm)); } +static inline bool cpu_has_stringio_exit_info(void) +{ + return vmcs_config.vmx_basic_high (VMX_BASIC_INOUT 32); +} + static inline bool cpu_has_secondary_exec_ctrls(void) { return vmcs_config.cpu_based_exec_ctrl @@ -2635,6 +2641,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) vmcs_conf-size = vmx_msr_high 0x1fff; vmcs_conf-order = get_order(vmcs_config.size); vmcs_conf-revision_id = vmx_msr_low; + vmcs_conf-vmx_basic_high = vmx_msr_high; vmcs_conf-pin_based_exec_ctrl = _pin_based_exec_control; vmcs_conf-cpu_based_exec_ctrl = _cpu_based_exec_control; @@ -5908,6 +5915,54 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { static const int kvm_vmx_max_exit_handlers = ARRAY_SIZE(kvm_vmx_exit_handlers); +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu, +struct vmcs12 *vmcs12) +{ + unsigned long exit_qualification; + gpa_t bitmap, last_bitmap; + u16 port; + int size; + u8 b; + + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING)) + return 1; + + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) + return 0; + + /* TODO: for older CPUs, derive access width from instruction */ + if (!cpu_has_stringio_exit_info()) + return 1; + Sigh, actually I am stupid :( The information that is not available on older cpus is address size in VM-exit instruction information, not operand size on exit qualification, so your v4 is correct. Except handling of port wrap around: If the “use I/O bitmaps” VM-execution control is 1, the instruction causes a VM exit if it attempts to access an I/O port corresponding to a bit set to 1 in the appropriate I/O bitmap (see Section 24.6.4). If an I/O operation “wraps around” the 16-bit I/O-port space (accesses ports H and H), the I/O instruction causes a VM exit (the “unconditional I/O exiting” VM-execution control is ignored if the “use I/O bitmaps” VM-execution control is 1). + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + + port = exit_qualification 16; + size = (exit_qualification 7) + 1; + + last_bitmap = (gpa_t)-1; + b = -1; + + while (size 0) { + if (port 0x8000) + bitmap = vmcs12-io_bitmap_a; + else + bitmap = vmcs12-io_bitmap_b; + bitmap += (port 0x7fff) / 8; + + if (last_bitmap != bitmap) + if (kvm_read_guest(vcpu-kvm, bitmap, b, 1)) + return 1; + if (b (1 (port 7))) + return 1; + + port++; + size--; + last_bitmap = bitmap; + } + + return 0; +} + /* * Return 1 if we should exit from L2 to L1 to handle an MSR access access, * rather than handle it ourselves in L0. I.e., check whether L1 expressed @@ -6097,8 +6152,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_DR_ACCESS: return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING); case EXIT_REASON_IO_INSTRUCTION: - /* TODO: support IO bitmaps */ - return 1; + return nested_vmx_exit_handled_io(vcpu, vmcs12); case EXIT_REASON_MSR_READ: case EXIT_REASON_MSR_WRITE: return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason); -- 1.7.3.4 -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 11/18] KVM/MIPS32: Routines to handle specific traps/exceptions while executing the guest.
On Fri, Feb 15, 2013 at 11:10:46AM -0500, Sanjay Lal wrote: On Feb 6, 2013, at 8:20 AM, Gleb Natapov wrote: On Wed, Nov 21, 2012 at 06:34:09PM -0800, Sanjay Lal wrote: +static gpa_t kvm_trap_emul_gva_to_gpa_cb(gva_t gva) +{ + gpa_t gpa; + uint32_t kseg = KSEGX(gva); + + if ((kseg == CKSEG0) || (kseg == CKSEG1)) You seems to be using KVM_GUEST_KSEGX variants on gva in all other places. Why not here? This function is invoked to handle 2 scenarios: (1) Parse the boot code config tables setup by QEMU's Malta emulation. The pointers in the tables are actual KSEG0 addresses (unmapped, cached) and not Guest KSEG0 addresses. Where is it called for that purpose? The only place where gva_to_gpa callback is called is in kvm/kvm_mips_emul.c:kvm_mips_emulate_(store|load) (2) Handle I/O accesses by the guest. On MIPS platforms, I/O device registers are mapped into the KSEG1 address space (unmapped, uncached). Again like (1) these are actual KSEG1 addresses, which cause an exception and are passed onto QEMU for I/O emulation. So guest KSEG1 registers is mapped to 0xA000-0xBFFF ranges just like on a host? Can you give corresponding segment names to those ranges Guest User address space: 0x - 0x4000 (useg?) Guest Kernel Unmapped: 0x4000 - 0x6000 (kseg0?) Guest Kernel Mapped:0x6000 - 0x8000 (?) -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/5] virtio-scsi multiqueue
On 02/12/2013 09:06 PM, Paolo Bonzini wrote: This series implements virtio-scsi queue steering, which gives performance improvements of up to 50% (measured both with QEMU and tcm_vhost backends). The patches build on top of the new virtio APIs at http://permalink.gmane.org/gmane.linux.kernel.virtualization/18431; the new API simplifies the locking of the virtio-scsi driver nicely, thus it makes sense to require them as a prerequisite. Changes from the previous post, which can be found at http://permalink.gmane.org/gmane.linux.kernel.virtualization/17869: - patches 1 and 2 (virtio: add functions for piecewise addition of buffers, virtio-scsi: use functions for piecewise composition of buffers) split into their own series - new cleanup patch virtio-scsi: push vq lock/unlock into virtscsi_vq_done - reorganized code to move ACCESS_ONCE in a clearer place - included Wanlong Gao's CPU hotplug patches Ok for 3.9? It would probably be easier to get it in via Rusty's tree because of the prerequisites. James, can I get your Acked-by? I can't apply this series on top of Rusty's virtio-next, I missed something or needed rebase them ? Thanks, Wanlong Gao Paolo Paolo Bonzini (4): virtio-scsi: redo allocation of target data virtio-scsi: pass struct virtio_scsi to virtqueue completion function virtio-scsi: push vq lock/unlock into virtscsi_vq_done virtio-scsi: introduce multiqueue support Wanlong Gao (1): virtio-scsi: reset virtqueue affinity when doing cpu hotplug drivers/scsi/virtio_scsi.c | 360 +++- 1 files changed, 292 insertions(+), 68 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5] KVM: nVMX: Improve I/O exit handling
On 2013-02-18 10:36, Gleb Natapov wrote: On Mon, Feb 18, 2013 at 10:17:14AM +0100, Jan Kiszka wrote: From: Jan Kiszka jan.kis...@siemens.com This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements I/O bitmap handling. We still exit unconditionally in case the CPU does not provide information for ins/outs. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v5: - still exit unconditionally if CPU refuses to provide exit information on ins/outs arch/x86/kvm/vmx.c | 58 ++- 1 files changed, 56 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6667042..ccc7c17 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -651,6 +651,7 @@ static struct vmcs_config { int size; int order; u32 revision_id; +u32 vmx_basic_high; u32 pin_based_exec_ctrl; u32 cpu_based_exec_ctrl; u32 cpu_based_2nd_exec_ctrl; @@ -752,6 +753,11 @@ static inline bool vm_need_tpr_shadow(struct kvm *kvm) return (cpu_has_vmx_tpr_shadow()) (irqchip_in_kernel(kvm)); } +static inline bool cpu_has_stringio_exit_info(void) +{ +return vmcs_config.vmx_basic_high (VMX_BASIC_INOUT 32); +} + static inline bool cpu_has_secondary_exec_ctrls(void) { return vmcs_config.cpu_based_exec_ctrl @@ -2635,6 +2641,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf) vmcs_conf-size = vmx_msr_high 0x1fff; vmcs_conf-order = get_order(vmcs_config.size); vmcs_conf-revision_id = vmx_msr_low; +vmcs_conf-vmx_basic_high = vmx_msr_high; vmcs_conf-pin_based_exec_ctrl = _pin_based_exec_control; vmcs_conf-cpu_based_exec_ctrl = _cpu_based_exec_control; @@ -5908,6 +5915,54 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { static const int kvm_vmx_max_exit_handlers = ARRAY_SIZE(kvm_vmx_exit_handlers); +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) +{ +unsigned long exit_qualification; +gpa_t bitmap, last_bitmap; +u16 port; +int size; +u8 b; + +if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING)) +return 1; + +if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) +return 0; + +/* TODO: for older CPUs, derive access width from instruction */ +if (!cpu_has_stringio_exit_info()) +return 1; + Sigh, actually I am stupid :( The information that is not available on older cpus is address size in VM-exit instruction information, not operand size on exit qualification, so your v4 is correct. Except handling of port wrap around: If the “use I/O bitmaps” VM-execution control is 1, the instruction causes a VM exit if it attempts to access an I/O port corresponding to a bit set to 1 in the appropriate I/O bitmap (see Section 24.6.4). If an I/O operation “wraps around” the 16-bit I/O-port space (accesses ports H and H), the I/O instruction causes a VM exit (the “unconditional I/O exiting” VM-execution control is ignored if the “use I/O bitmaps” VM-execution control is 1). Ah, indeed. OK, let's see if this simple patch manages to reach two-digit revision numbers... Jan signature.asc Description: OpenPGP digital signature
Re: [PATCH v3 0/5] virtio-scsi multiqueue
Il 18/02/2013 10:32, Wanlong Gao ha scritto: Ok for 3.9? It would probably be easier to get it in via Rusty's tree because of the prerequisites. James, can I get your Acked-by? I can't apply this series on top of Rusty's virtio-next, I missed something or needed rebase them ? It's on top of the patches at http://permalink.gmane.org/gmane.linux.kernel.virtualization/18431. Paolo -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v6] KVM: nVMX: Improve I/O exit handling
This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements I/O bitmap handling. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v6: - drop the bogus check of vmx_basix.54 again - exit unconditionally on I/O address wrap around arch/x86/kvm/vmx.c | 49 +++-- 1 files changed, 47 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6667042..b4ce43c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5908,6 +5908,52 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { static const int kvm_vmx_max_exit_handlers = ARRAY_SIZE(kvm_vmx_exit_handlers); +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu, + struct vmcs12 *vmcs12) +{ + unsigned long exit_qualification; + gpa_t bitmap, last_bitmap; + unsigned int port; + int size; + u8 b; + + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING)) + return 1; + + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) + return 0; + + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + + port = exit_qualification 16; + size = (exit_qualification 7) + 1; + + last_bitmap = (gpa_t)-1; + b = -1; + + while (size 0) { + if (port 0x8000) + bitmap = vmcs12-io_bitmap_a; + else if (port 0x1) + bitmap = vmcs12-io_bitmap_b; + else + return 1; + bitmap += (port 0x7fff) / 8; + + if (last_bitmap != bitmap) + if (kvm_read_guest(vcpu-kvm, bitmap, b, 1)) + return 1; + if (b (1 (port 7))) + return 1; + + port++; + size--; + last_bitmap = bitmap; + } + + return 0; +} + /* * Return 1 if we should exit from L2 to L1 to handle an MSR access access, * rather than handle it ourselves in L0. I.e., check whether L1 expressed @@ -6097,8 +6143,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_DR_ACCESS: return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING); case EXIT_REASON_IO_INSTRUCTION: - /* TODO: support IO bitmaps */ - return 1; + return nested_vmx_exit_handled_io(vcpu, vmcs12); case EXIT_REASON_MSR_READ: case EXIT_REASON_MSR_WRITE: return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason); -- 1.7.3.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6] KVM: nVMX: Improve I/O exit handling
On Mon, Feb 18, 2013 at 11:21:16AM +0100, Jan Kiszka wrote: This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements I/O bitmap handling. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Reviewed-by: Gleb Natapov g...@redhat.com --- Changes in v6: - drop the bogus check of vmx_basix.54 again - exit unconditionally on I/O address wrap around arch/x86/kvm/vmx.c | 49 +++-- 1 files changed, 47 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index 6667042..b4ce43c 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5908,6 +5908,52 @@ static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = { static const int kvm_vmx_max_exit_handlers = ARRAY_SIZE(kvm_vmx_exit_handlers); +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu, +struct vmcs12 *vmcs12) +{ + unsigned long exit_qualification; + gpa_t bitmap, last_bitmap; + unsigned int port; + int size; + u8 b; + + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING)) + return 1; + + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS)) + return 0; + + exit_qualification = vmcs_readl(EXIT_QUALIFICATION); + + port = exit_qualification 16; + size = (exit_qualification 7) + 1; + + last_bitmap = (gpa_t)-1; + b = -1; + + while (size 0) { + if (port 0x8000) + bitmap = vmcs12-io_bitmap_a; + else if (port 0x1) + bitmap = vmcs12-io_bitmap_b; + else + return 1; + bitmap += (port 0x7fff) / 8; + + if (last_bitmap != bitmap) + if (kvm_read_guest(vcpu-kvm, bitmap, b, 1)) + return 1; + if (b (1 (port 7))) + return 1; + + port++; + size--; + last_bitmap = bitmap; + } + + return 0; +} + /* * Return 1 if we should exit from L2 to L1 to handle an MSR access access, * rather than handle it ourselves in L0. I.e., check whether L1 expressed @@ -6097,8 +6143,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu) case EXIT_REASON_DR_ACCESS: return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING); case EXIT_REASON_IO_INSTRUCTION: - /* TODO: support IO bitmaps */ - return 1; + return nested_vmx_exit_handled_io(vcpu, vmcs12); case EXIT_REASON_MSR_READ: case EXIT_REASON_MSR_WRITE: return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason); -- 1.7.3.4 -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip
Can you tell us why mpic should be in kernel? Is it used often by modern guests or may be they prefer MSI for interrupt delivery (hmm may be MSIs are delivered through mpic too)? On x86 we actually would've preferred to move PIC/IOAPIC form the kernel and leave only LAPIC there (but for historical reasons creation of PIC/IOAPIC/LAPIC are bundled together) hence my question. On Wed, Feb 13, 2013 at 11:49:14PM -0600, Scott Wood wrote: Scott Wood (6): kvm: add device control API kvm/ppc: add a notifier chain for vcpu creation/destruction. kvm/ppc/mpic: import hw/openpic.c from QEMU kvm/ppc/mpic: remove some obviously unneeded code kvm/ppc/mpic: adapt to kernel style and environment kvm/ppc/mpic: in-kernel MPIC emulation Documentation/virtual/kvm/api.txt | 76 ++ Documentation/virtual/kvm/devices/README |1 + Documentation/virtual/kvm/devices/mpic.txt | 36 + arch/powerpc/include/asm/kvm_host.h|9 +- arch/powerpc/include/asm/kvm_ppc.h |4 + arch/powerpc/kvm/Kconfig |5 + arch/powerpc/kvm/Makefile |2 + arch/powerpc/kvm/booke.c | 10 +- arch/powerpc/kvm/mpic.c| 1890 arch/powerpc/kvm/powerpc.c | 31 +- include/linux/kvm_host.h | 44 +- include/uapi/linux/kvm.h | 34 + virt/kvm/kvm_main.c| 141 +++ 13 files changed, 2268 insertions(+), 15 deletions(-) create mode 100644 Documentation/virtual/kvm/devices/README create mode 100644 Documentation/virtual/kvm/devices/mpic.txt create mode 100644 arch/powerpc/kvm/mpic.c -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
Copying Christoffer since ARM has in kernel irq chip too. On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. You are not only provide different way to create in kernel irq chip you also use an alternate way to trigger interrupt lines. Before going into interface specifics lets think about whether it is really worth it? x86 obviously support old way and will have to for some, very long, time. ARM vGIC code, that is ready to go upstream, uses old way too. So it will be 2 archs against one. Christoffer do you think the proposed way it better for your needs. Are you willing to make vGIC use it? Scott, what other devices are you planning to support with this interface? -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Alter steal-time reporting in the guest
2013/2/5 Michael Wolf m...@linux.vnet.ibm.com: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. Tthe steal time will only be altered if hard limits (cfs bandwidth control) is used. The period and the quota used to separate the consigned time (expected steal) from the steal time are taken from the cfs bandwidth control settings. Any other steal time accruing during that period will show as the traditional steal time. I'm also a bit confused here. steal time will then only account the cpu time lost due to quotas from cfs bandwidth control? Also what do you exactly mean by expected steal time? Is it steal time due to overcommitting minus scheduler quotas? Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
KVM call agenda for 2013-02-19
Hi Please send in any agenda topics you are interested in. Later, Juan. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] qemu vfio-pci: PCIe and generic config space mangling
This series generalizes the places where we already rely on QEMU config space emulation to make it easier to add various other fields. It turns out that when we start running on Q35 Windows cares quite a bit about the PCIe type we expose. Drivers will fail to load if not set to something valid and appropriate for the bus. We don't have anything in place yet to tell what the bus is, so for now we rely on a temporary option that will eventually get replaced by something automatic. Thanks, Alex --- Alex Williamson (2): vfio-pci: Generalize PCI config mangling vfio-pci: Add PCIe capability mangling based on bus type hw/vfio_pci.c | 210 ++--- 1 file changed, 171 insertions(+), 39 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] vfio-pci: Generalize PCI config mangling
Kernel-side vfio virtualizes all of config space, but some parts are unique to Qemu. For instance we may or may not expose the ROM BAR, Qemu manages MSI/MSIX, and Qemu manages the multi-function bit so that single function devices can appear as multi-function and vica versa. Generalize this into a bitmap of Qemu emulated bits. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/vfio_pci.c | 80 ++--- 1 file changed, 42 insertions(+), 38 deletions(-) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 66537b7..53b23f3 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -117,6 +117,7 @@ typedef struct VFIODevice { int fd; VFIOINTx intx; unsigned int config_size; +uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */ off_t config_offset; /* Offset of config space region within device fd */ unsigned int rom_size; off_t rom_offset; /* Offset of ROM region within device fd */ @@ -963,44 +964,29 @@ static const MemoryRegionOps vfio_bar_ops = { static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len) { VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); -uint32_t val = 0; +uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val; -/* - * We only need QEMU PCI config support for the ROM BAR, the MSI and MSIX - * capabilities, and the multifunction bit below. We let VFIO handle - * virtualizing everything else. Performance is not a concern here. - */ -if (ranges_overlap(addr, len, PCI_ROM_ADDRESS, 4) || -(pdev-cap_present QEMU_PCI_CAP_MSIX - ranges_overlap(addr, len, pdev-msix_cap, MSIX_CAP_LENGTH)) || -(pdev-cap_present QEMU_PCI_CAP_MSI - ranges_overlap(addr, len, pdev-msi_cap, vdev-msi_cap_size))) { +memcpy(emu_bits, vdev-emulated_config_bits + addr, len); +emu_bits = le32_to_cpu(emu_bits); -val = pci_default_read_config(pdev, addr, len); -} else { -if (pread(vdev-fd, val, len, vdev-config_offset + addr) != len) { +if (emu_bits) { +emu_val = pci_default_read_config(pdev, addr, len); +} + +if (~emu_bits (0xU (32 - len * 8))) { +ssize_t ret; + +ret = pread(vdev-fd, phys_val, len, vdev-config_offset + addr); +if (ret != len) { error_report(%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %m\n, __func__, vdev-host.domain, vdev-host.bus, vdev-host.slot, vdev-host.function, addr, len); return -errno; } -val = le32_to_cpu(val); +phys_val = le32_to_cpu(phys_val); } -/* Multifunction bit is virualized in QEMU */ -if (unlikely(ranges_overlap(addr, len, PCI_HEADER_TYPE, 1))) { -uint32_t mask = PCI_HEADER_TYPE_MULTI_FUNCTION; - -if (len == 4) { -mask = 16; -} - -if (pdev-cap_present QEMU_PCI_CAP_MULTIFUNCTION) { -val |= mask; -} else { -val = ~mask; -} -} +val = (emu_val emu_bits) | (phys_val ~emu_bits); DPRINTF(%s(%04x:%02x:%02x.%x, @0x%x, len=0x%x) %x\n, __func__, vdev-host.domain, vdev-host.bus, vdev-host.slot, @@ -1026,12 +1012,6 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, vdev-host.slot, vdev-host.function, addr, val, len); } -/* Write standard header bits to emulation */ -if (addr PCI_CONFIG_HEADER_SIZE) { -pci_default_write_config(pdev, addr, val, len); -return; -} - /* MSI/MSI-X Enabling/Disabling */ if (pdev-cap_present QEMU_PCI_CAP_MSI ranges_overlap(addr, len, pdev-msi_cap, vdev-msi_cap_size)) { @@ -1046,9 +1026,7 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, } else if (was_enabled !is_enabled) { vfio_disable_msi(vdev); } -} - -if (pdev-cap_present QEMU_PCI_CAP_MSIX +} else if (pdev-cap_present QEMU_PCI_CAP_MSIX ranges_overlap(addr, len, pdev-msix_cap, MSIX_CAP_LENGTH)) { int is_enabled, was_enabled = msix_enabled(pdev); @@ -1061,6 +1039,9 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, } else if (was_enabled !is_enabled) { vfio_disable_msix(vdev); } +} else { +/* Write everything to QEMU to keep emulated bits correct */ +pci_default_write_config(pdev, addr, val, len); } } @@ -2003,6 +1984,16 @@ static int vfio_initfn(PCIDevice *pdev) goto out_put; } +/* vfio emulates a lot for us, but some bits need extra love */ +vdev-emulated_config_bits = g_malloc0(vdev-config_size); + +/* QEMU can choose to expose the ROM or not */ +memset(vdev-emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4); + +/* QEMU can change multi-function devices to single function, or reverse */ +
[PATCH 2/2] vfio-pci: Add PCIe capability mangling based on bus type
Windows seems to pay particular interest to the PCIe header type of devices and will fail to load drivers if we attached Endpoint devices or Legacy Endpoint devices to the Root Complex. We don't yet have a good way to determine the bus type, so for now we add an experimental x-bustype option which will later be replaced by some mechanism to determine this automatcally. The new option is defined as: x-bustype=n where n is one of: 0: Legacy PCI [default] 1: PCI Express 2: PCI Express Root Complex Conversion of PCIe types is does as follows: * Legacy PCI * No change, capability is unmodified for compatibility. * PCI Express * Integrated Root Complex Endpoint - Endpoint * PCI Express Root Complext * Endpoint - Integrated Root Complex Endpoint * Legacy Endpoint - none, capability hidden We also take this opportunity to explicitly limit supported devices to Endpoints, Legacy Endpoints, and Root Complex Integrated Endpoints. We don't currently have support for other types and users often cause themselves problems by assigning them. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/vfio_pci.c | 130 + 1 file changed, 129 insertions(+), 1 deletion(-) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 53b23f3..7d6468b 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -130,6 +130,7 @@ typedef struct VFIODevice { PCIHostDeviceAddress host; QLIST_ENTRY(VFIODevice) next; struct VFIOGroup *group; +uint8_t bustype; bool reset_works; } VFIODevice; @@ -1506,6 +1507,123 @@ static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos) return next - pos; } +static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask) +{ +pci_set_word(buf, (pci_get_word(buf) ~mask) | val); +} + +static void vfio_add_emulated_word(VFIODevice *vdev, int pos, + uint16_t val, uint16_t mask) +{ +vfio_set_word_bits(vdev-pdev.config + pos, val, mask); +vfio_set_word_bits(vdev-pdev.wmask + pos, ~mask, mask); +vfio_set_word_bits(vdev-emulated_config_bits + pos, mask, mask); +} + +static void vfio_set_long_bits(uint8_t *buf, uint32_t val, uint32_t mask) +{ +pci_set_long(buf, (pci_get_long(buf) ~mask) | val); +} + +static void vfio_add_emulated_long(VFIODevice *vdev, int pos, + uint32_t val, uint32_t mask) +{ +vfio_set_long_bits(vdev-pdev.config + pos, val, mask); +vfio_set_long_bits(vdev-pdev.wmask + pos, ~mask, mask); +vfio_set_long_bits(vdev-emulated_config_bits + pos, mask, mask); +} + +static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size) +{ +uint16_t flags; +uint8_t type; +enum { +VFIO_BUS_TYPE_PCI, +VFIO_BUS_TYPE_PCIE, +VFIO_BUS_TYPE_PCIE_RC, +}; + +flags = pci_get_word(vdev-pdev.config + pos + PCI_CAP_FLAGS); +type = (flags PCI_EXP_FLAGS_TYPE) 4; + +switch (type) { +case PCI_EXP_TYPE_ENDPOINT: +case PCI_EXP_TYPE_LEG_END: +case PCI_EXP_TYPE_RC_END: +break; +default: +error_report(vfio: Assignment of PCIe type 0x%x devices is not + currently supported\n, type); +return -EINVAL; +} + +if (vdev-bustype VFIO_BUS_TYPE_PCIE_RC) { +error_report(vfio: Unknown x-bustype %d. Accepted values:\n + \t%d: Legacy PCI [default]\n + \t%d: PCI Express\n + \t%d: PCI Express Root-Complex\n, vdev-bustype, + VFIO_BUS_TYPE_PCI, VFIO_BUS_TYPE_PCIE, + VFIO_BUS_TYPE_PCIE_RC); +return -EINVAL; +} + +switch (vdev-bustype) { +case VFIO_BUS_TYPE_PCI: +/* + * Use express capability as-is on PCI bus. It doesn't make much + * sense to even expose, but some drivers (ex. tg3) depend on it + * and guests don't seem to be particular about it. We'll need + * to revist this if we ever expose an IOMMU to the guest. + */ +return pci_add_capability(vdev-pdev, PCI_CAP_ID_EXP, pos, size); + +case VFIO_BUS_TYPE_PCIE: +if (type == PCI_EXP_TYPE_RC_END) { +/* Type becomes non-Integrated Endpoint */ +vfio_add_emulated_word(vdev, pos + PCI_CAP_FLAGS, + PCI_EXP_TYPE_ENDPOINT 4, + PCI_EXP_FLAGS_TYPE); +/* XXX Implement LNKCAP fields? */ +} + +return pci_add_capability(vdev-pdev, PCI_CAP_ID_EXP, pos, size); + +case VFIO_BUS_TYPE_PCIE_RC: +switch (type) { +case PCI_EXP_TYPE_ENDPOINT: +/* Type becomes Integrated Endpoint */ +vfio_add_emulated_word(vdev, pos + PCI_CAP_FLAGS, + PCI_EXP_TYPE_RC_END 4, + PCI_EXP_FLAGS_TYPE); + +/* Link Capabilities, Status, and
[PATCH v2 0/2] qemu vfio-pci: PCIe and generic config space mangling
v2: Removing trailing \n from error_report calls This series generalizes the places where we already rely on QEMU config space emulation to make it easier to add various other fields. It turns out that when we start running on Q35 Windows cares quite a bit about the PCIe type we expose. Drivers will fail to load if not set to something valid and appropriate for the bus. We don't have anything in place yet to tell what the bus is, so for now we rely on a temporary option that will eventually get replaced by something automatic. Thanks, Alex --- Alex Williamson (2): vfio-pci: Generalize PCI config mangling vfio-pci: Add PCIe capability mangling based on bus type hw/vfio_pci.c | 209 ++--- 1 file changed, 170 insertions(+), 39 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] vfio-pci: Generalize PCI config mangling
Kernel-side vfio virtualizes all of config space, but some parts are unique to Qemu. For instance we may or may not expose the ROM BAR, Qemu manages MSI/MSIX, and Qemu manages the multi-function bit so that single function devices can appear as multi-function and vica versa. Generalize this into a bitmap of Qemu emulated bits. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/vfio_pci.c | 80 ++--- 1 file changed, 42 insertions(+), 38 deletions(-) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index ad9ae36..2106b87 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -117,6 +117,7 @@ typedef struct VFIODevice { int fd; VFIOINTx intx; unsigned int config_size; +uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */ off_t config_offset; /* Offset of config space region within device fd */ unsigned int rom_size; off_t rom_offset; /* Offset of ROM region within device fd */ @@ -963,44 +964,29 @@ static const MemoryRegionOps vfio_bar_ops = { static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len) { VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); -uint32_t val = 0; +uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val; -/* - * We only need QEMU PCI config support for the ROM BAR, the MSI and MSIX - * capabilities, and the multifunction bit below. We let VFIO handle - * virtualizing everything else. Performance is not a concern here. - */ -if (ranges_overlap(addr, len, PCI_ROM_ADDRESS, 4) || -(pdev-cap_present QEMU_PCI_CAP_MSIX - ranges_overlap(addr, len, pdev-msix_cap, MSIX_CAP_LENGTH)) || -(pdev-cap_present QEMU_PCI_CAP_MSI - ranges_overlap(addr, len, pdev-msi_cap, vdev-msi_cap_size))) { +memcpy(emu_bits, vdev-emulated_config_bits + addr, len); +emu_bits = le32_to_cpu(emu_bits); -val = pci_default_read_config(pdev, addr, len); -} else { -if (pread(vdev-fd, val, len, vdev-config_offset + addr) != len) { +if (emu_bits) { +emu_val = pci_default_read_config(pdev, addr, len); +} + +if (~emu_bits (0xU (32 - len * 8))) { +ssize_t ret; + +ret = pread(vdev-fd, phys_val, len, vdev-config_offset + addr); +if (ret != len) { error_report(%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %m, __func__, vdev-host.domain, vdev-host.bus, vdev-host.slot, vdev-host.function, addr, len); return -errno; } -val = le32_to_cpu(val); +phys_val = le32_to_cpu(phys_val); } -/* Multifunction bit is virualized in QEMU */ -if (unlikely(ranges_overlap(addr, len, PCI_HEADER_TYPE, 1))) { -uint32_t mask = PCI_HEADER_TYPE_MULTI_FUNCTION; - -if (len == 4) { -mask = 16; -} - -if (pdev-cap_present QEMU_PCI_CAP_MULTIFUNCTION) { -val |= mask; -} else { -val = ~mask; -} -} +val = (emu_val emu_bits) | (phys_val ~emu_bits); DPRINTF(%s(%04x:%02x:%02x.%x, @0x%x, len=0x%x) %x\n, __func__, vdev-host.domain, vdev-host.bus, vdev-host.slot, @@ -1026,12 +1012,6 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, vdev-host.slot, vdev-host.function, addr, val, len); } -/* Write standard header bits to emulation */ -if (addr PCI_CONFIG_HEADER_SIZE) { -pci_default_write_config(pdev, addr, val, len); -return; -} - /* MSI/MSI-X Enabling/Disabling */ if (pdev-cap_present QEMU_PCI_CAP_MSI ranges_overlap(addr, len, pdev-msi_cap, vdev-msi_cap_size)) { @@ -1046,9 +1026,7 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, } else if (was_enabled !is_enabled) { vfio_disable_msi(vdev); } -} - -if (pdev-cap_present QEMU_PCI_CAP_MSIX +} else if (pdev-cap_present QEMU_PCI_CAP_MSIX ranges_overlap(addr, len, pdev-msix_cap, MSIX_CAP_LENGTH)) { int is_enabled, was_enabled = msix_enabled(pdev); @@ -1061,6 +1039,9 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, } else if (was_enabled !is_enabled) { vfio_disable_msix(vdev); } +} else { +/* Write everything to QEMU to keep emulated bits correct */ +pci_default_write_config(pdev, addr, val, len); } } @@ -2003,6 +1984,16 @@ static int vfio_initfn(PCIDevice *pdev) goto out_put; } +/* vfio emulates a lot for us, but some bits need extra love */ +vdev-emulated_config_bits = g_malloc0(vdev-config_size); + +/* QEMU can choose to expose the ROM or not */ +memset(vdev-emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4); + +/* QEMU can change multi-function devices to single function, or reverse */ +
[PATCH v2 2/2] vfio-pci: Add PCIe capability mangling based on bus type
Windows seems to pay particular interest to the PCIe header type of devices and will fail to load drivers if we attached Endpoint devices or Legacy Endpoint devices to the Root Complex. We don't yet have a good way to determine the bus type, so for now we add an experimental x-bustype option which will later be replaced by some mechanism to determine this automatcally. The new option is defined as: x-bustype=n where n is one of: 0: Legacy PCI [default] 1: PCI Express 2: PCI Express Root Complex Conversion of PCIe types is does as follows: * Legacy PCI * No change, capability is unmodified for compatibility. * PCI Express * Integrated Root Complex Endpoint - Endpoint * PCI Express Root Complext * Endpoint - Integrated Root Complex Endpoint * Legacy Endpoint - none, capability hidden We also take this opportunity to explicitly limit supported devices to Endpoints, Legacy Endpoints, and Root Complex Integrated Endpoints. We don't currently have support for other types and users often cause themselves problems by assigning them. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/vfio_pci.c | 129 + 1 file changed, 128 insertions(+), 1 deletion(-) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 2106b87..330a687 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -130,6 +130,7 @@ typedef struct VFIODevice { PCIHostDeviceAddress host; QLIST_ENTRY(VFIODevice) next; struct VFIOGroup *group; +uint8_t bustype; bool reset_works; } VFIODevice; @@ -1506,6 +1507,122 @@ static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, uint8_t pos) return next - pos; } +static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask) +{ +pci_set_word(buf, (pci_get_word(buf) ~mask) | val); +} + +static void vfio_add_emulated_word(VFIODevice *vdev, int pos, + uint16_t val, uint16_t mask) +{ +vfio_set_word_bits(vdev-pdev.config + pos, val, mask); +vfio_set_word_bits(vdev-pdev.wmask + pos, ~mask, mask); +vfio_set_word_bits(vdev-emulated_config_bits + pos, mask, mask); +} + +static void vfio_set_long_bits(uint8_t *buf, uint32_t val, uint32_t mask) +{ +pci_set_long(buf, (pci_get_long(buf) ~mask) | val); +} + +static void vfio_add_emulated_long(VFIODevice *vdev, int pos, + uint32_t val, uint32_t mask) +{ +vfio_set_long_bits(vdev-pdev.config + pos, val, mask); +vfio_set_long_bits(vdev-pdev.wmask + pos, ~mask, mask); +vfio_set_long_bits(vdev-emulated_config_bits + pos, mask, mask); +} + +static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size) +{ +uint16_t flags; +uint8_t type; +enum { +VFIO_BUS_TYPE_PCI, +VFIO_BUS_TYPE_PCIE, +VFIO_BUS_TYPE_PCIE_RC, +}; + +flags = pci_get_word(vdev-pdev.config + pos + PCI_CAP_FLAGS); +type = (flags PCI_EXP_FLAGS_TYPE) 4; + +switch (type) { +case PCI_EXP_TYPE_ENDPOINT: +case PCI_EXP_TYPE_LEG_END: +case PCI_EXP_TYPE_RC_END: +break; +default: +error_report(vfio: Assignment of PCIe type 0x%x + devices is not currently supported, type); +return -EINVAL; +} + +if (vdev-bustype VFIO_BUS_TYPE_PCIE_RC) { +error_report(vfio: Unknown bustype %d. Accepted values: + %d (Legacy PCI [default]), %d (PCI Express), + %d (PCI Express Root-Complex), vdev-bustype, + VFIO_BUS_TYPE_PCI, VFIO_BUS_TYPE_PCIE, + VFIO_BUS_TYPE_PCIE_RC); +return -EINVAL; +} + +switch (vdev-bustype) { +case VFIO_BUS_TYPE_PCI: +/* + * Use express capability as-is on PCI bus. It doesn't make much + * sense to even expose, but some drivers (ex. tg3) depend on it + * and guests don't seem to be particular about it. We'll need + * to revist this if we ever expose an IOMMU to the guest. + */ +return pci_add_capability(vdev-pdev, PCI_CAP_ID_EXP, pos, size); + +case VFIO_BUS_TYPE_PCIE: +if (type == PCI_EXP_TYPE_RC_END) { +/* Type becomes non-Integrated Endpoint */ +vfio_add_emulated_word(vdev, pos + PCI_CAP_FLAGS, + PCI_EXP_TYPE_ENDPOINT 4, + PCI_EXP_FLAGS_TYPE); +/* XXX Implement LNKCAP fields? */ +} + +return pci_add_capability(vdev-pdev, PCI_CAP_ID_EXP, pos, size); + +case VFIO_BUS_TYPE_PCIE_RC: +switch (type) { +case PCI_EXP_TYPE_ENDPOINT: +/* Type becomes Integrated Endpoint */ +vfio_add_emulated_word(vdev, pos + PCI_CAP_FLAGS, + PCI_EXP_TYPE_RC_END 4, + PCI_EXP_FLAGS_TYPE); + +/* Link Capabilities, Status, and Control goes away */ +if
[PATCH 0/2] qemu vfio-pci: VGA passthrough support
I've pushed kernel side vfio vga support to my next branch and intend to try to get those in for kernel v3.9. Those changes can be found here: git://github.com/awilliam/linux-vfio.git (next) This is the matching QEMU updates. Patch 1/2 adds basic VGA range support. This can be used with the option -vga none to disable emulated VGA and also requires vfio-pci device option x-vga=on. As noted by the x- prefix, this support is entirely experimental. Expectations should range anywhere from working perfectly to crashing the host. Patch 2/2 contains numerous quirks for devices that I have onhand to test, but hopefully covers a faily broad spectrum (at least for ATI/AMD and Nvidia). These quirks trap various device specific addresses and insert device virtual addresses in place of device physical addresses. If we could identity map devices we could avoid these, but that also puts numerous constraints on our configuration. The vfio kernel VGA support makes use of the VGA arbiter to switch between host VGA devices and I have been successful in running multiple devices simultaneously. At this point only secondary devices have been tested. The primary graphics likely needs extra work to completely turn off host kernel access. Intel integrated graphics have not been tested and are most certainly broken as they require access to devices other than the VGA device alone. Nvidia devices will work with nouveau in Linux guests, but do not work with Nvidia drivers in Windows (probably Linux too). I think we need to put these behind emulated PCIe root ports before they'll work, but Q35 doesn't work well enough for that yet. ATI/AMD cards seem to have the best hope of working with their Catalyst drivers. For best results, use Q35, -vga none, x-vga=on, and x-bustype=2, for example: -M q35 -vga none -device vfio-pci,host=x:xx.x,x-vga=on,x-bustype=2 All the qemu patches necessary for this can be found in this branch: git://github.com/awilliam/qemu-vfio.git (vfio-vga-v3) This series will depend on a kernel header update which I'll send patches for once the kernel patches are upstream. Thanks, Alex --- Alex Williamson (2): qemu vfio-pci: Add support for VGA MMIO and I/O port access qemu vfio-pci: Graphics device quirks hw/vfio_pci.c | 651 + 1 file changed, 649 insertions(+), 2 deletions(-) -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] qemu vfio-pci: Add support for VGA MMIO and I/O port access
Most VGA cards need some kind of quirk to fully operate since they hide backdoors to get to other registers outside of PCI config space within the registers, but this provides the base infrastructure. If we could identity map PCI resources for assigned devices we would need a lot fewer quirks. To enable this, use a kernel side vfio-pci driver that incorporates VGA support (patches posted), and use the -vga none option and add the x-vga=on option for the vfio-pci device. The x- denotes this as an experimental feature. You may also need to use a cached copy of the VGA BIOS for your device, passing it to vfio-pci using the romfile= option. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/vfio_pci.c | 173 + 1 file changed, 173 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 330a687..307cbc1 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -59,6 +59,18 @@ typedef struct VFIOBAR { uint8_t nr; /* cache the BAR number for debug */ } VFIOBAR; +typedef struct VFIOVGARegion { +MemoryRegion mem; +off_t offset; +int nr; +} VFIOVGARegion; + +typedef struct VFIOVGA { +off_t fd_offset; +int fd; +VFIOVGARegion region[3]; +} VFIOVGA; + typedef struct VFIOINTx { bool pending; /* interrupt pending */ bool kvm_accel; /* set when QEMU bypass through KVM enabled */ @@ -127,11 +139,16 @@ typedef struct VFIODevice { int nr_vectors; /* Number of MSI/MSIX vectors currently in use */ int interrupt; /* Current interrupt type */ VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */ +VFIOVGA vga; /* 0xa, 0x3b0, 0x3c0 */ PCIHostDeviceAddress host; QLIST_ENTRY(VFIODevice) next; struct VFIOGroup *group; +uint32_t features; +#define VFIO_FEATURE_ENABLE_VGA_BIT 0 +#define VFIO_FEATURE_ENABLE_VGA (1 VFIO_FEATURE_ENABLE_VGA_BIT) uint8_t bustype; bool reset_works; +bool has_vga; } VFIODevice; typedef struct VFIOGroup { @@ -959,6 +976,89 @@ static const MemoryRegionOps vfio_bar_ops = { .endianness = DEVICE_LITTLE_ENDIAN, }; +static void vfio_vga_write(void *opaque, hwaddr addr, + uint64_t data, unsigned size) +{ +VFIOVGARegion *region = opaque; +VFIOVGA *vga = container_of(region, VFIOVGA, region[region-nr]); +union { +uint8_t byte; +uint16_t word; +uint32_t dword; +uint64_t qword; +} buf; +off_t offset = vga-fd_offset + region-offset + addr; + +switch (size) { +case 1: +buf.byte = data; +break; +case 2: +buf.word = cpu_to_le16(data); +break; +case 4: +buf.dword = cpu_to_le32(data); +break; +default: +hw_error(vfio: unsupported write size, %d bytes\n, size); +break; +} + +if (pwrite(vga-fd, buf, size, offset) != size) { +error_report(%s(,0x%HWADDR_PRIx, 0x%PRIx64, %d) failed: %m, + __func__, region-offset + addr, data, size); +} + +DPRINTF(%s(0x%HWADDR_PRIx, 0x%PRIx64, %d)\n, +__func__, region-offset + addr, data, size); +} + +static uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size) +{ +VFIOVGARegion *region = opaque; +VFIOVGA *vga = container_of(region, VFIOVGA, region[region-nr]); +union { +uint8_t byte; +uint16_t word; +uint32_t dword; +uint64_t qword; +} buf; +uint64_t data = 0; +off_t offset = vga-fd_offset + region-offset + addr; + +if (pread(vga-fd, buf, size, offset) != size) { +error_report(%s(,0x%HWADDR_PRIx, %d) failed: %m, + __func__, region-offset + addr, size); +return (uint64_t)-1; +} + +switch (size) { +case 1: +data = buf.byte; +break; +case 2: +data = le16_to_cpu(buf.word); +break; +case 4: +data = le32_to_cpu(buf.dword); +break; +default: +hw_error(vfio: unsupported read size, %d bytes\n, size); +break; +} + +DPRINTF(%s(0x%HWADDR_PRIx, %d) = 0x%PRIx64\n, +__func__, region-offset + addr, size, data); + +return data; +} + +static const MemoryRegionOps vfio_vga_ops = { +.read = vfio_vga_read, +.write = vfio_vga_write, +.endianness = DEVICE_LITTLE_ENDIAN, +}; + /* * PCI config space */ @@ -1479,6 +1579,27 @@ static void vfio_map_bars(VFIODevice *vdev) for (i = 0; i PCI_ROM_SLOT; i++) { vfio_map_bar(vdev, i); } + +if (vdev-has_vga) { +memory_region_init_io(vdev-vga.region[0].mem, vfio_vga_ops, + vdev-vga.region[0], vfio-vga-mmio@0xa, + 0xb - 0xa + 1); +memory_region_add_subregion_overlap(pci_address_space(vdev-pdev), +0xa, vdev-vga.region[0].mem, +1); + +
[PATCH 2/2] qemu vfio-pci: Graphics device quirks
Apparently graphics vendors need to come up with new ways to retrieve PCI BAR addresses on every revision of their chip. These are the ones that I've found on the following assortment of cards: Advanced Micro Devices [AMD] nee ATI Cedar PRO [Radeon HD 5450/6350] Advanced Micro Devices [AMD] nee ATI RV370 [Radeon X550] NVIDIA Corporation G98 [GeForce 8400 GS] NVIDIA Corporation G86 [Quadro NVS 290] NVIDIA Corporation G72 [GeForce 7300 LE] With these quirks, most ATI/AMD and Nvidia cards will post and work with standard VGA drivers. ATI/AMD devices may also work with the Catalyst driver (try x-bustype=2). I suspect Nvidia accelerated drivers will require working root ports, they currently report error code 43 under Windows. It's relatively easy to figure out many of these quirks. First enable DEBUG_UNASSIGNED in exec.c, then enable DEBUG_VFIO in hw/vfio_pci.c. Log the output and can kill Qemu when Unassigned access errors start to spew (ignore the ones at very low offsets). If the unassigned access matches a range covered by the device (consult lspci or /proc/iomem), then look back in the vfio-pci debug output for read from the device that returned an address or partial address matching the unassigned access. Then follow these examples for creating quirks to trap that access and return the emulated BAR address. None of these would be necessary if we could identity map devices. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/vfio_pci.c | 478 + 1 file changed, 476 insertions(+), 2 deletions(-) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 307cbc1..769ab05 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -48,6 +48,16 @@ do { } while (0) #endif +struct VFIODevice; + +typedef struct VFIOQuirk { +MemoryRegion mem; +struct VFIODevice *vdev; +QLIST_ENTRY(VFIOQuirk) next; +uint32_t data; +uint32_t data2; +} VFIOQuirk; + typedef struct VFIOBAR { off_t fd_offset; /* offset of BAR within device fd */ int fd; /* device fd, allows us to pass VFIOBAR as opaque data */ @@ -57,12 +67,14 @@ typedef struct VFIOBAR { size_t size; uint32_t flags; /* VFIO region flags (rd/wr/mmap) */ uint8_t nr; /* cache the BAR number for debug */ +QLIST_HEAD(, VFIOQuirk) quirks; } VFIOBAR; typedef struct VFIOVGARegion { MemoryRegion mem; off_t offset; int nr; +QLIST_HEAD(, VFIOQuirk) quirks; } VFIOVGARegion; typedef struct VFIOVGA { @@ -82,8 +94,6 @@ typedef struct VFIOINTx { QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */ } VFIOINTx; -struct VFIODevice; - typedef struct VFIOMSIVector { EventNotifier interrupt; /* eventfd triggered on interrupt */ struct VFIODevice *vdev; /* back pointer to device */ @@ -1060,6 +1070,460 @@ static const MemoryRegionOps vfio_vga_ops = { }; /* + * Device specific quirks + */ + +/* + * Device 1002:68f9 (Advanced Micro Devices [AMD] nee ATI Cedar PRO [Radeon + * HD 5450/6350]) reports the upper byte of the physical address of the + * I/O port BAR4 through VGA register 0x3c3. The BAR is 256 bytes, so the + * lower byte is known to be zero. Test for this quirk on all ATI/AMD + * devices. + */ +static uint64_t vfio_ati_3c3_quirk_read(void *opaque, +hwaddr addr, unsigned size) +{ +VFIOQuirk *quirk = opaque; +VFIODevice *vdev = quirk-vdev; +PCIDevice *pdev = vdev-pdev; +uint64_t data = vfio_vga_read(vdev-vga.region[2], addr + 0x3, size); + +if (data == quirk-data) { +data = pci_get_byte(pdev-config + PCI_BASE_ADDRESS_4 + 1); +DPRINTF(%s(0x3c3, 1) = 0x%PRIx64\n, __func__, data); +} + +return data; +} + +static const MemoryRegionOps vfio_ati_3c3_quirk = { +.read = vfio_ati_3c3_quirk_read, +.endianness = DEVICE_LITTLE_ENDIAN, +}; + +/* + * Some devices seem to read 0xff from 0x3c3 during init so probing + * is unreliable. Save the physical address and return the virtual + * address any time the read matches the physical address. + */ +static void vfio_vga_probe_ati_3c3_quirk(VFIODevice *vdev) +{ +PCIDevice *pdev = vdev-pdev; +off_t physoffset = vdev-config_offset + PCI_BASE_ADDRESS_4; +uint32_t physbar; +VFIOQuirk *quirk; + +if (pci_get_word(pdev-config + PCI_VENDOR_ID) != 0x1002 || +vdev-bars[4].size 256) { +return; +} + +/* Get I/O port BAR physical address */ +if (pread(vdev-fd, physbar, 4, physoffset) != 4) { +error_report(vfio: probe failed for ATI/AMD 0x3c3 quirk on device + %04x:%02x:%02x.%x, vdev-host.domain, + vdev-host.bus, vdev-host.slot, vdev-host.function); +return; +} + +quirk = g_malloc0(sizeof(*quirk)); +quirk-vdev = vdev; +quirk-data = (physbar 8) 0xff; + +memory_region_init_io(quirk-mem, vfio_ati_3c3_quirk, quirk, + vfio-ati-3c3-quirk, 1); +
[PATCH] qemu vfio-pci: Add extra debugging
Often when debugging it's useful to be able to disable bypass paths so no interactions with the device are missed. Add some extra debug options to do this. Also add device info on read/write BAR accesses, which is useful when debugging more than one assigned device. A couple DPRINTFs also had redundant vfio: prefixes. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/vfio_pci.c | 40 ++-- 1 file changed, 30 insertions(+), 10 deletions(-) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 769ab05..7aa3d88 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -48,6 +48,10 @@ do { } while (0) #endif +/* Extra debugging, trap acceleration paths for more logging */ +#define VFIO_ALLOW_MMAP 1 +#define VFIO_ALLOW_KVM_INTX 1 + struct VFIODevice; typedef struct VFIOQuirk { @@ -304,7 +308,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev) int ret, argsz; int32_t *pfd; -if (!kvm_irqfds_enabled() || +if (!VFIO_ALLOW_KVM_INTX || !kvm_irqfds_enabled() || vdev-intx.route.mode != PCI_INTX_ENABLED || !kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) { return; @@ -924,8 +928,16 @@ static void vfio_bar_write(void *opaque, hwaddr addr, __func__, addr, data, size); } -DPRINTF(%s(BAR%d+0x%HWADDR_PRIx, 0x%PRIx64, %d)\n, -__func__, bar-nr, addr, data, size); +#ifdef DEBUG_VFIO +{ +VFIODevice *vdev = container_of(bar, VFIODevice, bars[bar-nr]); + +DPRINTF(%s(%04x:%02x:%02x.%x:BAR%d+0x%HWADDR_PRIx, 0x%PRIx64 +, %d)\n, __func__, vdev-host.domain, vdev-host.bus, +vdev-host.slot, vdev-host.function, bar-nr, addr, +data, size); +} +#endif /* * A read or write to a BAR always signals an INTx EOI. This will @@ -971,8 +983,16 @@ static uint64_t vfio_bar_read(void *opaque, break; } -DPRINTF(%s(BAR%d+0x%HWADDR_PRIx, %d) = 0x%PRIx64\n, -__func__, bar-nr, addr, size, data); +#ifdef DEBUG_VFIO +{ +VFIODevice *vdev = container_of(bar, VFIODevice, bars[bar-nr]); + +DPRINTF(%s(%04x:%02x:%02x.%x:BAR%d+0x%HWADDR_PRIx +, %d) = 0x%PRIx64\n, __func__, vdev-host.domain, +vdev-host.bus, vdev-host.slot, vdev-host.function, +bar-nr, addr, size, data); +} +#endif /* Same as write above */ vfio_eoi(container_of(bar, VFIODevice, bars[bar-nr])); @@ -1676,7 +1696,7 @@ static void vfio_listener_region_add(MemoryListener *listener, int ret; if (vfio_listener_skipped_section(section)) { -DPRINTF(vfio: SKIPPING region_add %HWADDR_PRIx - %PRIx64\n, +DPRINTF(SKIPPING region_add %HWADDR_PRIx - %PRIx64\n, section-offset_within_address_space, section-offset_within_address_space + section-size - 1); return; @@ -1700,7 +1720,7 @@ static void vfio_listener_region_add(MemoryListener *listener, section-offset_within_region + (iova - section-offset_within_address_space); -DPRINTF(vfio: region_add %HWADDR_PRIx - %HWADDR_PRIx [%p]\n, +DPRINTF(region_add %HWADDR_PRIx - %HWADDR_PRIx [%p]\n, iova, end - 1, vaddr); ret = vfio_dma_map(container, iova, end - iova, vaddr, section-readonly); @@ -1720,7 +1740,7 @@ static void vfio_listener_region_del(MemoryListener *listener, int ret; if (vfio_listener_skipped_section(section)) { -DPRINTF(vfio: SKIPPING region_del %HWADDR_PRIx - %PRIx64\n, +DPRINTF(SKIPPING region_del %HWADDR_PRIx - %PRIx64\n, section-offset_within_address_space, section-offset_within_address_space + section-size - 1); return; @@ -1740,7 +1760,7 @@ static void vfio_listener_region_del(MemoryListener *listener, return; } -DPRINTF(vfio: region_del %HWADDR_PRIx - %HWADDR_PRIx\n, +DPRINTF(region_del %HWADDR_PRIx - %HWADDR_PRIx\n, iova, end - 1); ret = vfio_dma_unmap(container, iova, end - iova); @@ -1943,7 +1963,7 @@ static int vfio_mmap_bar(VFIOBAR *bar, MemoryRegion *mem, MemoryRegion *submem, { int ret = 0; -if (size bar-flags VFIO_REGION_INFO_FLAG_MMAP) { +if (VFIO_ALLOW_MMAP size bar-flags VFIO_REGION_INFO_FLAG_MMAP) { int prot = 0; if (bar-flags VFIO_REGION_INFO_FLAG_READ) { -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] qemu vfio-pci: Move devices to D0 on reset
Guests may leave devices in a low power state at reboot, but we expect devices to be woken up for the next boot. Make this happen. Signed-off-by: Alex Williamson alex.william...@redhat.com --- hw/vfio_pci.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c index 7aa3d88..0a45500 100644 --- a/hw/vfio_pci.c +++ b/hw/vfio_pci.c @@ -161,6 +161,7 @@ typedef struct VFIODevice { #define VFIO_FEATURE_ENABLE_VGA_BIT 0 #define VFIO_FEATURE_ENABLE_VGA (1 VFIO_FEATURE_ENABLE_VGA_BIT) uint8_t bustype; +uint8_t pm_cap; bool reset_works; bool has_vga; } VFIODevice; @@ -2297,6 +2298,8 @@ static int vfio_add_std_cap(VFIODevice *vdev, uint8_t pos) case PCI_CAP_ID_MSIX: ret = vfio_setup_msix(vdev, pos); break; +case PCI_CAP_ID_PM: +vdev-pm_cap = pos; default: ret = pci_add_capability(pdev, cap_id, pos, size); break; @@ -2869,6 +2872,26 @@ static void vfio_pci_reset(DeviceState *dev) vfio_disable_interrupts(vdev); +/* Make sure the device is in D0 */ +if (vdev-pm_cap) { +uint16_t pmcsr; +uint8_t state; + +pmcsr = vfio_pci_read_config(pdev, vdev-pm_cap + PCI_PM_CTRL, 2); +state = pmcsr PCI_PM_CTRL_STATE_MASK; +if (state) { +pmcsr = ~PCI_PM_CTRL_STATE_MASK; +vfio_pci_write_config(pdev, vdev-pm_cap + PCI_PM_CTRL, pmcsr, 2); +/* vfio handles the necessary delay here */ +pmcsr = vfio_pci_read_config(pdev, vdev-pm_cap + PCI_PM_CTRL, 2); +state = pmcsr PCI_PM_CTRL_STATE_MASK; +if (state) { +error_report(vfio: Unable to power on device, stuck in D%d\n, + state); +} +} +} + /* * Stop any ongoing DMA by disconecting I/O, MMIO, and bus master. * Also put INTx Disable in known state. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 10:51:16 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote: On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. You can have cascaded MPICs; I once had a powerbook that had one MPIC cascaded into another. OK. How is the explicit request made in this patchset? The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Hooking up to the CPU's interrupt lines is implicit in creating an MPIC (and I'm fine with changing that), not in creating any device. I don't see how it's worse than being implicit in calling KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips). The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... With 24 bits of interrupt source number, you don't have to flatten it very far. IPIs are handled separately and don't appear in the interrupt source space. They do need to appear in the interrupt source space if we want to inject or irqfd them. Most users won't want to do that, but we have had a customer directly assign IPIs (to communicate with an OS running on the other CPU in an AMP setup -- host Linux was non-SMP so wasn't using them) and MPIC timers to a guest. How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? No. Accesses by the guest get handled in the kernel. Accesses in QEMU, including MSIs generated by virtio, get forwarded to the kernel. There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. There's enough space for MPIC to have 16 bits of vector and some flags. We don't need to overdesign this. I view anything other than passing the actual MPIC register values around as overdesign here, given that it is communication between hw/kvm/mpic.c on the QEMU side and arch/powerpc/kvm/mpic.c on the kernel side. interrupts awaiting service (or even in service, if different priorities). We have both current task priority (which is a user-set mask-by-priority register) and the priority of the highest-prio in-service interrupt -- which would current processor priorty be? Etc. It would be the current task priority. I assume MPIC maintains a 16-bit map of the interrupt priorities in service, so that would need to be added. We don't maintain such a map in the emulation code. We have a per-CPU bitmap of the actual interrupt sources pending/active, which is another attribute that would need to be added in order to support migration on MPIC. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
On 02/18/2013 06:21:59 AM, Gleb Natapov wrote: Copying Christoffer since ARM has in kernel irq chip too. On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. You are not only provide different way to create in kernel irq chip you also use an alternate way to trigger interrupt lines. Before going into interface specifics lets think about whether it is really worth it? Which it do you mean here? The ability to set/get attributes is needed. Sorry, but get or set one blob of data, up to 512 bytes, for the entire irqchip is just not good enough -- assuming you don't want us to start sticking pointers and commands in *that* data. :-) If you mean the way to inject interrupts, it's simpler this way. Why go out of our way to inject common glue code into a communication path between hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM? Or rather, why make that common glue be specific to this one function when we could reuse the same communication glue used for other things, such as device state? And that's just for regular interrupts. MSIs are vastly simpler on MPIC than what x86 does. x86 obviously support old way and will have to for some, very long, time. Sure. ARM vGIC code, that is ready to go upstream, uses old way too. So it will be 2 archs against one. I wasn't aware that that's how it worked. :-P I was trying to be considerate by not making the entire thing gratuitously PPC or MPIC specific, as some others seem inclined to do (e.g. see irqfd and APIC). We already had a discussion on ARM's set address ioctl and rather than extend *that* interface, they preferred to just stick something ARM-specific in ASAP with the understanding that it would be replaced (or more accurately, kept around as a thin wrapper around the new stuff) later. Christoffer do you think the proposed way it better for your needs. Are you willing to make vGIC use it? Scott, what other devices are you planning to support with this interface? At the moment I do not have plans for other devices, though what does it hurt for the capability to be there? -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip
On 02/18/2013 06:04:51 AM, Gleb Natapov wrote: Can you tell us why mpic should be in kernel? Is it used often by modern guests or may be they prefer MSI for interrupt delivery (hmm may be MSIs are delivered through mpic too)? Yes, MSIs are delivered through the mpic. Plus, MSIs are only (normally) for PCI(e). We have embedded system on a chips with important non-PCI devices, which cannot use MSIs. On x86 we actually would've preferred to move PIC/IOAPIC form the kernel and leave only LAPIC there (but for historical reasons creation of PIC/IOAPIC/LAPIC are bundled together) hence my question. We don't have that same split on this hardware. MPIC is one device that covers all of it. Some of the functionality is per-CPU, but it is not easily extracted from the rest. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
On Mon, Feb 18, 2013 at 3:01 PM, Scott Wood scottw...@freescale.com wrote: On 02/18/2013 06:21:59 AM, Gleb Natapov wrote: Copying Christoffer since ARM has in kernel irq chip too. On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. You are not only provide different way to create in kernel irq chip you also use an alternate way to trigger interrupt lines. Before going into interface specifics lets think about whether it is really worth it? Which it do you mean here? The ability to set/get attributes is needed. Sorry, but get or set one blob of data, up to 512 bytes, for the entire irqchip is just not good enough -- assuming you don't want us to start sticking pointers and commands in *that* data. :-) If you mean the way to inject interrupts, it's simpler this way. Why go out of our way to inject common glue code into a communication path between hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM? Or rather, why make that common glue be specific to this one function when we could reuse the same communication glue used for other things, such as device state? And that's just for regular interrupts. MSIs are vastly simpler on MPIC than what x86 does. x86 obviously support old way and will have to for some, very long, time. Sure. ARM vGIC code, that is ready to go upstream, uses old way too. So it will be 2 archs against one. I wasn't aware that that's how it worked. :-P I was trying to be considerate by not making the entire thing gratuitously PPC or MPIC specific, as some others seem inclined to do (e.g. see irqfd and APIC). We already had a discussion on ARM's set address ioctl and rather than extend *that* interface, they preferred to just stick something ARM-specific in ASAP with the understanding that it would be replaced (or more accurately, kept around as a thin wrapper around the new stuff) later. Christoffer do you think the proposed way it better for your needs. Are you willing to make vGIC use it? Well it won't improve functionality much from the current hardware point of view, but the proposed interface is superior to what we have now. Adding and coordinating new interfaces is indeed a pain, so a generic interface which is flexible enough to cater for a certain group of needs, is welcome imho. And this does seem to fit the bill. I can imagine that if there's support for a future ARM gic version or if we add in-kernel support for other stuff on ARM, then this interface will be useful, and in fact using the current interface to support two separate, but similar, interrupt controllers could get messy from a user space point of view. I am definitely willing to change to use this interface, the agreement on the KVM_ARM_SET_DEVICE_ADDR ioctl was exactly because of this. I had some nits on the RFC, which I'll send separately. Scott, what other devices are you planning to support with this interface? At the moment I do not have plans for other devices, though what does it hurt for the capability to be there? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. Both device types and individual attributes can be tested without having to create the device or get/set the attribute, without the need for separately managing enumerated capabilities. Signed-off-by: Scott Wood scottw...@freescale.com --- Documentation/virtual/kvm/api.txt| 76 ++ Documentation/virtual/kvm/devices/README |1 + include/linux/kvm_host.h | 21 + include/uapi/linux/kvm.h | 25 ++ virt/kvm/kvm_main.c | 127 ++ 5 files changed, 250 insertions(+) create mode 100644 Documentation/virtual/kvm/devices/README diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index c2534c3..5bcdb42 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2122,6 +2122,82 @@ header; first `n_valid' valid entries with contents from the data written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_DEVICE + +Capability: KVM_CAP_DEVICE_CTRL +Type: vm ioctl +Parameters: struct kvm_create_device (in/out) +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device type is unknown or unsupported + EEXIST: Device already created, and this type of device may not + be instantiated multiple times + ENOSPC: Too many devices have been created + + Other error conditions may be defined by individual device types. + +Creates an emulated device in the kernel. The returned handle +can be used with KVM_SET/GET_DEVICE_ATTR. + +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the +device type is supported (not necessarily whether it can be created +in the current vm). + +Individual devices should not define flags. Attributes should be used +for specifying any behavior that is not implied by the device type +number. + +struct kvm_create_device { + __u32 type; /* in: KVM_DEV_TYPE_xxx */ + __u32 id; /* out: device handle */ + __u32 flags; /* in: KVM_CREATE_DEVICE_xxx */ +}; + +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR + +Capability: KVM_CAP_DEVICE_CTRL +Type: vm ioctl +Parameters: struct kvm_device_attr +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device id is invalid + ENXIO: The group or attribute is unknown/unsupported for this device + EPERM: The attribute cannot (currently) be accessed this way + (e.g. read-only attribute, or attribute that only makes + sense when the device is in a different state) + + Other error conditions may be defined by individual device types. + +Gets/sets a specified piece of device configuration and/or state. The +semantics are device-specific except for certain global attributes. See +individual device documentation in the devices directory. As with +ONE_REG, the size of the data transferred is defined by the particular +attribute. + +Attributes in group KVM_DEV_ATTR_COMMON are not device-specific: + KVM_DEV_ATTR_TYPE (ro, 32-bit): the device type passed to KVM_CREATE_DEVICE + +struct kvm_device_attr { + __u32 dev;/* id from KVM_CREATE_DEVICE */ + __u32 group; /* KVM_DEV_ATTR_COMMON or device-defined */ + __u64 attr; /* group-defined */ + __u64 addr; /* userspace address of attr data */ +}; + +4.81 KVM_HAS_DEVICE_ATTR + +Capability: KVM_CAP_DEVICE_CTRL +Type: vm ioctl +Parameters: struct kvm_device_attr +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device id is invalid + ENXIO: The group or attribute is unknown/unsupported for this device + +Tests whether a device supports a particular attribute. A successful +return indicates the attribute is implemented. It does not necessarily +indicate that the attribute can be read or written in the device's +current state. addr is ignored. 5. The kvm_run structure diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README new file mode 100644 index
Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
On Tue, Feb 05, 2013 at 03:49:41PM -0600, Michael Wolf wrote: Add a helper routine to scheduler/core.c to allow the kvm module to retrieve the cpu hardlimit settings. The values will be used to set up a timer that is used to separate the consigned from the steal time. 1) Can you please describe, in english, the mechanics of subtracting cpu hardlimit values from steal time reported via run_delay supposed to work? The period and the quota used to separate the consigned time (expected steal) from the steal time are taken from the cfs bandwidth control settings. Any other steal time accruing during that period will show as the traditional steal time. There is no expected steal time over a fixed period of real time. 2) From the description of patch 1: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This is outdated, right? Because overcommitted environment is exactly what steal time should report. Thanks -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d
On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote: On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote: On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote: On 02/12/2013 04:26 PM, Peter Hurley wrote: With -next-20130204+ in ubuntu 12.10 VM (so the 80x25 VGA device/console): [0.666410] udevd[97]: starting version 175 [0.674043] udevd[97]: udevd:[97]: segfault at ff5fd020 ip 7fff069e277f sp 7fff068c9ef8 error d and boots to an initramfs prompt. git bisect (log attached) blames: commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f Merge: 3596f5b 949db15 Author: H. Peter Anvin h...@linux.intel.com Date: Fri Jan 25 16:31:21 2013 -0800 Merge tag 'v3.8-rc5' into x86/mm The __pa() fixup series that follows touches KVM code that is not present in the existing branch based on v3.7-rc5, so merge in the current upstream from Linus. Signed-off-by: H. Peter Anvin h...@linux.intel.com This only happens with the VGA device/console but that is the default configuration for Ubuntu/KVM because it blacklists pretty much every fb driver. I am guessing this is another bad use of __pa()... need to look into that. Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible there? He is using 64bit guest and on those __pa() happens to be working. Is it possible that slow_virt_to_phys() does not work as expected? Peter (the bug reporter :)) can you run your guest kernel with loglevel=7 and attach send me console output? Attached. BTW, this message happens on 'good' boots too: [0.00] [ cut here ] [0.00] WARNING: at /home/peter/src/kernels/next/arch/x86/kernel/pvclock.c:182 pvclock_init_vsyscall+0x22/0x60() [0.00] Hardware name: Bochs [0.00] Modules linked in: [0.00] Pid: 0, comm: swapper Not tainted 3.8.0-next-20130204-xeon #20130204 [0.00] Call Trace: [0.00] [8105812f] warn_slowpath_common+0x7f/0xc0 [0.00] [8105818a] warn_slowpath_null+0x1a/0x20 [0.00] [81d20521] pvclock_init_vsyscall+0x22/0x60 [0.00] [81d20480] kvm_setup_vsyscall_timeinfo+0x74/0xd8 [0.00] [81d201d1] kvm_guest_init+0xd0/0xe9 [0.00] [81d13f7c] setup_arch+0xbee/0xcaf [0.00] [816cbceb] ? printk+0x61/0x63 [0.00] [81d0cbc3] start_kernel+0xd3/0x3f0 [0.00] [81d0c5e4] x86_64_start_reservations+0x2a/0x2c [0.00] [81d0c6d7] x86_64_start_kernel+0xf1/0x100 [0.00] ---[ end trace b47bb564b2d6ec76 ]--- Regards, Peter Hurley Sending a patch for this, thanks for the report. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PULL 00/14] ppc patch queue 2013-02-15
Pulled, thanks. On Fri, Feb 15, 2013 at 01:16:16AM +0100, Alexander Graf wrote: Hi Marcelo / Gleb, This is my current patch queue for ppc. Please pull. Highlights of this queue drop are: - BookE: Fast mapping support for 4k backed memory - BookE: Handle alignment interrupts Alex The following changes since commit cbd29cb6e38af6119df2cdac0c58acf0e85c177e: Jan Kiszka (1): KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr are available in the git repository at: git://github.com/agraf/linux-2.6.git kvm-ppc-next Alexander Graf (11): KVM: PPC: E500: Move write_stlbe higher KVM: PPC: E500: Explicitly mark shadow maps invalid KVM: PPC: E500: Propagate errors when shadow mapping KVM: PPC: e500: Call kvmppc_mmu_map for initial mapping KVM: PPC: E500: Split host and guest MMU parts KVM: PPC: e500: Implement TLB1-in-TLB0 mapping KVM: PPC: E500: Make clear_tlb_refs and clear_tlb1_bitmap static KVM: PPC: E500: Remove kvmppc_e500_tlbil_all usage from guest TLB code Merge commit 'origin/next' into kvm-ppc-next KVM: PPC: BookE: Handle alignment interrupts Merge commit 'origin/next' into kvm-ppc-next Bharat Bhushan (3): KVM: PPC: booke: use vcpu reference from thread_struct KVM: PPC: booke: Allow multiple exception types booke: Added DBCR4 SPR number arch/powerpc/include/asm/kvm_ppc.h |2 - arch/powerpc/include/asm/reg.h |2 - arch/powerpc/include/asm/reg_booke.h |1 + arch/powerpc/kernel/asm-offsets.c|2 +- arch/powerpc/kvm/Makefile|9 +- arch/powerpc/kvm/booke.c | 30 +- arch/powerpc/kvm/booke.h |1 + arch/powerpc/kvm/booke_interrupts.S | 49 +- arch/powerpc/kvm/e500.c | 16 +- arch/powerpc/kvm/e500.h |1 + arch/powerpc/kvm/e500_mmu.c | 809 +++ arch/powerpc/kvm/e500_mmu_host.c | 699 + arch/powerpc/kvm/e500_mmu_host.h | 18 + arch/powerpc/kvm/e500_tlb.c | 1430 -- 14 files changed, 1610 insertions(+), 1459 deletions(-) create mode 100644 arch/powerpc/kvm/e500_mmu.c create mode 100644 arch/powerpc/kvm/e500_mmu_host.c create mode 100644 arch/powerpc/kvm/e500_mmu_host.h delete mode 100644 arch/powerpc/kvm/e500_tlb.c -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
On 02/18/2013 06:44:20 PM, Christoffer Dall wrote: On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com wrote: index 0350e0d..dbaf012 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -335,6 +335,25 @@ struct kvm_memslots { short id_to_index[KVM_MEM_SLOTS_NUM]; }; +/* + * The worst case number of simultaneous devices will likely be very low + * (usually zero or one) for the forseeable future. If the worst case + * exceeds this, then it can be increased, or we can convert to idr. + */ This comment is on the heavy side (if at all needed). If you want to remind people of idr, just put that in a single line. A define is a define is a define. OK. +#define KVM_CREATE_DEVICE_IOWR(KVMIO, 0xac, struct kvm_create_device) +#define KVM_SET_DEVICE_ATTR _IOW(KVMIO, 0xad, struct kvm_device_attr) +#define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xae, struct kvm_device_attr) _IOWR ? struct kvm_device_attr itself is write-only, though the data pointed to by the addr field goes the other way for GET. ONE_REG is in the same situation and also uses _IOW for both. +static int kvm_ioctl_create_device(struct kvm *kvm, + struct kvm_create_device *cd) +{ + struct kvm_device *dev = NULL; + bool test = cd-flags KVM_CREATE_DEVICE_TEST; + int id; + int r; + + mutex_lock(kvm-lock); + + id = kvm-num_devices; + if (id = KVM_MAX_DEVICES !test) { + r = -ENOSPC; + goto out; + } + + switch (cd-type) { + default: + r = -ENODEV; + goto out; + } do we really believe that there will be any arch-generic recognition of types? shouldn't this be a call to an arch-specific function instead. Which makes me wonder whether the device type IDs should be arch specific as well... I prefer to look at it from the other direction -- is there any reason why this *should* be architecture specific? What will that make easier? By doing device recognition here we don't need a separate copy of this per arch (including some #ifdef or modifying every arch at once -- including ARM which I can't modify yet because it's not merged), and *if* we should end up with an in-kernel-emulated device that gets used across multiple architectures, it would be annoying to have to modify all relevant architectures (and worse to deal with per-arch numberspaces). -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d
On Mon, 2013-02-18 at 19:59 -0300, Marcelo Tosatti wrote: On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote: On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote: On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote: On 02/12/2013 04:26 PM, Peter Hurley wrote: With -next-20130204+ in ubuntu 12.10 VM (so the 80x25 VGA device/console): [0.666410] udevd[97]: starting version 175 [0.674043] udevd[97]: udevd:[97]: segfault at ff5fd020 ip 7fff069e277f sp 7fff068c9ef8 error d and boots to an initramfs prompt. git bisect (log attached) blames: commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f Merge: 3596f5b 949db15 Author: H. Peter Anvin h...@linux.intel.com Date: Fri Jan 25 16:31:21 2013 -0800 Merge tag 'v3.8-rc5' into x86/mm The __pa() fixup series that follows touches KVM code that is not present in the existing branch based on v3.7-rc5, so merge in the current upstream from Linus. Signed-off-by: H. Peter Anvin h...@linux.intel.com This only happens with the VGA device/console but that is the default configuration for Ubuntu/KVM because it blacklists pretty much every fb driver. I am guessing this is another bad use of __pa()... need to look into that. Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible there? This is in the linux-next repo (any git tag after 'next-20130204' will reproduce this). It's a pretty large merge commit. This doesn't happen on 3.8-rc7. I'll try to repro this on kvm.git sometime this week. Otherwise, we can wait for it to show up in 3.9. Regards, Peter Hurley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] Alter steal-time reporting in the guest
On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote: 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com: In the case of where you have a system that is running in a capped or overcommitted environment the user may see steal time being reported in accounting tools such as top or vmstat. This can cause confusion for the end user. Sorry, I'm no expert in this area. But I don't really understand what is confusing for the end user here. I suppose that what is wanted is to subtract stolen time due to 'known reasons' from steal time reporting. 'Known reasons' being, for example, hard caps. So a vcpu executing instructions with no halt, but limited to 80% of available bandwidth, would not have 20% of stolen time reported. But yes, a description of the scenario that is being dealt with, with details, is important. To ease the confusion this patch set adds the idea of consigned (expected steal) time. The host will separate the consigned time from the steal time. Tthe steal time will only be altered if hard limits (cfs bandwidth control) is used. The period and the quota used to separate the consigned time (expected steal) from the steal time are taken from the cfs bandwidth control settings. Any other steal time accruing during that period will show as the traditional steal time. I'm also a bit confused here. steal time will then only account the cpu time lost due to quotas from cfs bandwidth control? Also what do you exactly mean by expected steal time? Is it steal time due to overcommitting minus scheduler quotas? Thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
x86: pvclock kvm: align allocation size to page size
To match whats mapped via vsyscalls to userspace. Reported-by: Peter Hurley pe...@hurleysoftware.com Signed-off-by: Marcelo Tosatti mtosa...@redhat.com diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index 220a360..5bedbdd 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -218,6 +218,9 @@ static void kvm_shutdown(void) void __init kvmclock_init(void) { unsigned long mem; + int size; + + size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS); if (!kvm_para_available()) return; @@ -231,16 +234,14 @@ void __init kvmclock_init(void) printk(KERN_INFO kvm-clock: Using msrs %x and %x, msr_kvm_system_time, msr_kvm_wall_clock); - mem = memblock_alloc(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS, -PAGE_SIZE); + mem = memblock_alloc(size, PAGE_SIZE); if (!mem) return; hv_clock = __va(mem); if (kvm_register_clock(boot clock)) { hv_clock = NULL; - memblock_free(mem, - sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS); + memblock_free(mem, size); return; } pv_time_ops.sched_clock = kvm_clock_read; @@ -275,7 +276,7 @@ int __init kvm_setup_vsyscall_timeinfo(void) struct pvclock_vcpu_time_info *vcpu_time; unsigned int size; - size = sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS; + size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS); preempt_disable(); cpu = smp_processor_id(); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v6] KVM: nVMX: Improve I/O exit handling
On Mon, Feb 18, 2013 at 12:32:37PM +0200, Gleb Natapov wrote: On Mon, Feb 18, 2013 at 11:21:16AM +0100, Jan Kiszka wrote: This prevents trapping L2 I/O exits if L1 has neither unconditional nor bitmap-based exiting enabled. Furthermore, it implements I/O bitmap handling. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Reviewed-by: Gleb Natapov g...@redhat.com Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: nVMX: Trap unconditionally if msr bitmap access fails
On Sun, Feb 17, 2013 at 10:56:36AM +0200, Gleb Natapov wrote: On Thu, Feb 14, 2013 at 07:46:27PM +0100, Jan Kiszka wrote: This avoids basing decisions on uninitialized variables, potentially leaking kernel data to the L1 guest. Signed-off-by: Jan Kiszka jan.kis...@siemens.com Reviewed-by: Gleb Natapov g...@redhat.com Applied, thanks. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 11/18] KVM/MIPS32: Routines to handle specific traps/exceptions while executing the guest.
On Feb 18, 2013, at 1:44 AM, Gleb Natapov wrote: On Fri, Feb 15, 2013 at 11:10:46AM -0500, Sanjay Lal wrote: On Feb 6, 2013, at 8:20 AM, Gleb Natapov wrote: On Wed, Nov 21, 2012 at 06:34:09PM -0800, Sanjay Lal wrote: +static gpa_t kvm_trap_emul_gva_to_gpa_cb(gva_t gva) +{ + gpa_t gpa; + uint32_t kseg = KSEGX(gva); + + if ((kseg == CKSEG0) || (kseg == CKSEG1)) You seems to be using KVM_GUEST_KSEGX variants on gva in all other places. Why not here? This function is invoked to handle 2 scenarios: (1) Parse the boot code config tables setup by QEMU's Malta emulation. The pointers in the tables are actual KSEG0 addresses (unmapped, cached) and not Guest KSEG0 addresses. Where is it called for that purpose? The only place where gva_to_gpa callback is called is in kvm/kvm_mips_emul.c:kvm_mips_emulate_(store|load) Load/stores from/to KSEG1 generate the Address Error Load/Store exceptions. The handler calls kvm_mips_emul.c:kvm_mips_emulate_(store|load) which then call the gva_to_gpa callback. (2) Handle I/O accesses by the guest. On MIPS platforms, I/O device registers are mapped into the KSEG1 address space (unmapped, uncached). Again like (1) these are actual KSEG1 addresses, which cause an exception and are passed onto QEMU for I/O emulation. So guest KSEG1 registers is mapped to 0xA000-0xBFFF ranges just like on a host? Can you give corresponding segment names to those ranges Guest User address space: 0x - 0x4000 (useg?) Guest Kernel Unmapped: 0x4000 - 0x6000 (kseg0?) Guest Kernel Mapped:0x6000 - 0x8000 (?) Yes, now that you mention it :-). I'll add a corresponding Guest Kernel KSEG1 segment name. Regards Sanjay -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 54061] New: guest panic after live migration
https://bugzilla.kernel.org/show_bug.cgi?id=54061 Summary: guest panic after live migration Product: Virtualization Version: unspecified Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: kvm AssignedTo: virtualization_...@kernel-bugs.osdl.org ReportedBy: yongjie@intel.com Regression: No Created an attachment (id=93511) -- (https://bugzilla.kernel.org/attachment.cgi?id=93511) guest panic after migration Environment: Host OS (ia32/ia32e/IA64):ia32e Guest OS (ia32/ia32e/IA64):ia32e Guest OS Type (Linux/Windows):Linux (e.g. RHEL6.3) kvm.git next branch Commit:cbd29cb6e38af6119df2cdac0c58acf0e85c177e qemu-kvm.git Commit:4d9367b76f71c6d938cf8201392abe4bfb1136cb Hardware:SandyBridge-EP, Westmere-EP Bug detailed description: -- After live migration, guest will panic. This should be a KVM kernel bug. kvm + qemu-kvm = result cbd29cb6 + 4d9367b7 = bad b0da5bec + 4d9367b7 = good Reproduce steps: 1. start up a host with kvm (commit: cbd29cb6) 2. Start a TCP daemon for migration: qemu-system-x86_64 -m 1024 -smp 2 -net nic,macaddr=00:12:32:45:12:54 -net tap /root/rhel6u3.img -incoming tcp:localhost: 3. create a guest qemu-system-x86_64 -m 1024 -smp 2 -net nic,macaddr=00:12:32:45:12:54 -net tap /root/rhel6u3.img 4. ctrl+Alt+2 switch to QEMU monitor 5. in monitor: migrate tcp:localhost: Current result: after live migration, guest panic Expected result: after live migration, guest work fine. Basic root-causing log: -- WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: GB W --- ) Hardware name: Bochs list_add corruption. prev-next should be next (88003fae0ac0), but was 8800365c3000. (prev=8800365f9040). Modules linked in: autofs4 sunrpc ipv6 uinput ppdev parport_pc parport microcode sg 8139too 8139cp mii i2c_piix4 i2c_core ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib] Pid: 12, comm: events/1 Tainted: GB W --- 2.6.32-279.el6.x86_64 #1 Call Trace: [8106b747] ? warn_slowpath_common+0x87/0xc0 [8106b836] ? warn_slowpath_fmt+0x46/0x50 [8128301f] ? __list_add+0x8f/0xa0 [81163f64] ? free_block+0x154/0x170 [811641b1] ? drain_array+0xc1/0x100 [8116517e] ? cache_reap+0x8e/0x260 [81137090] ? vmstat_update+0x0/0x40 [811650f0] ? cache_reap+0x0/0x260 [8108c760] ? worker_thread+0x170/0x2a0 [810920d0] ? autoremove_wake_function+0x0/0x40 [8108c5f0] ? worker_thread+0x0/0x2a0 [81091d66] ? kthread+0x96/0xa0 [8100c14a] ? child_rip+0xa/0x20 [81091cd0] ? kthread+0x0/0xa0 [8100c140] ? child_rip+0x0/0x20 ---[ end trace f17758832a0dcb5e ]--- general protection fault: [#1] SMP last sysfs file: /sys/devices/pci:00/:00:03.0/irq CPU 1 Modules linked in: autofs4 sunrpc ipv6 uinput ppdev parport_pc parport microcode sg 8139too 8139cp mii i2c_piix4 i2c_core ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib] Pid: 1173, comm: rs:main Q:Reg Tainted: GB W --- 2.6.32-279.el6.x86_64 #1 Bochs Bochs RIP: 0010:[81282f00] [81282f00] list_del+0x10/0xa0 RSP: 0018:880037547a78 EFLAGS: 00010096 RAX: dead00200200 RBX: eaceb940 RCX: RDX: 0010 RSI: 88003edd00d0 RDI: eaceb940 RBP: 880037547a88 R08: R09: R10: R11: R12: 88003edd00c0 R13: 880116c0 R14: 362e R15: eaceb918 FS: 7fc44b7cc700() GS:88000230() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fc44c5aba10 CR3: 3dc44000 CR4: 06e0 DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400 Process rs:main Q:Reg (pid: 1173, threadinfo 880037546000, task 880037062ae0) Stack: 0282 0001 880037547ba8 811258a8 d 880037547ab8 0001 88003728b400 d 00c7f118 0040 88033c28 Call Trace: [811258a8] get_page_from_freelist+0x288/0x820 [a00869f6] ? jbd2_journal_stop+0x1e6/0x2b0 [jbd2] [81126f31] __alloc_pages_nodemask+0x111/0x940 [81161d62] kmem_getpages+0x62/0x170 [811623cf] cache_grow+0x2cf/0x320 [81162622]
[PATCH] kvm/powerpc/e500mc: fix tlb invalidation on cpu migration
The existing check handles the case where we've migrated to a different core than we last ran on, but it doesn't handle the case where we're still on the same cpu we last ran on, but some other vcpu has run on this cpu in the meantime. Signed-off-by: Scott Wood scottw...@freescale.com --- This seems to have been the cause of the userspace segfaults I was seeing (the other TLB patches I posted are still needed as well). arch/powerpc/kvm/e500mc.c |6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c index 1f89d26..8637689 100644 --- a/arch/powerpc/kvm/e500mc.c +++ b/arch/powerpc/kvm/e500mc.c @@ -111,6 +111,7 @@ void kvmppc_mmu_msr_notify(struct kvm_vcpu *vcpu, u32 old_msr) void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu); + static struct kvm_vcpu *last_vcpu_on_cpu[NR_CPUS]; kvmppc_booke_vcpu_load(vcpu, cpu); @@ -136,8 +137,11 @@ void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu) mtspr(SPRN_GDEAR, vcpu-arch.shared-dar); mtspr(SPRN_GESR, vcpu-arch.shared-esr); - if (vcpu-arch.oldpir != mfspr(SPRN_PIR)) + if (vcpu-arch.oldpir != mfspr(SPRN_PIR) || + last_vcpu_on_cpu[smp_processor_id()] != vcpu) { kvmppc_e500_tlbil_all(vcpu_e500); + last_vcpu_on_cpu[smp_processor_id()] = vcpu; + } kvmppc_load_guest_fp(vcpu); } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
buildbot failure in kvm on next-i386
The Buildbot has detected a new failure on builder next-i386 while building kvm. Full details are available at: http://buildbot.b1-systems.de/kvm/builders/next-i386/builds/804 Buildbot URL: http://buildbot.b1-systems.de/kvm/ Buildslave for this Build: b1_kvm_1 Build Reason: The Nightly scheduler named 'nightly_next' triggered this build Build Source Stamp: [branch next] HEAD Blamelist: BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in kvm on next-ppc44x
The Buildbot has detected a new failure on builder next-ppc44x while building kvm. Full details are available at: http://buildbot.b1-systems.de/kvm/builders/next-ppc44x/builds/804 Buildbot URL: http://buildbot.b1-systems.de/kvm/ Buildslave for this Build: b1_kvm_1 Build Reason: The Nightly scheduler named 'nightly_next' triggered this build Build Source Stamp: [branch next] HEAD Blamelist: BUILD FAILED: failed git sincerely, -The Buildbot
buildbot failure in kvm on next-ppc64
The Buildbot has detected a new failure on builder next-ppc64 while building kvm. Full details are available at: http://buildbot.b1-systems.de/kvm/builders/next-ppc64/builds/805 Buildbot URL: http://buildbot.b1-systems.de/kvm/ Buildslave for this Build: b1_kvm_1 Build Reason: The Nightly scheduler named 'nightly_next' triggered this build Build Source Stamp: [branch next] HEAD Blamelist: BUILD FAILED: failed git sincerely, -The Buildbot N�r��yb�X��ǧv�^�){.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf
buildbot failure in kvm on next-x86_64
The Buildbot has detected a new failure on builder next-x86_64 while building kvm. Full details are available at: http://buildbot.b1-systems.de/kvm/builders/next-x86_64/builds/804 Buildbot URL: http://buildbot.b1-systems.de/kvm/ Buildslave for this Build: b1_kvm_1 Build Reason: The Nightly scheduler named 'nightly_next' triggered this build Build Source Stamp: [branch next] HEAD Blamelist: BUILD FAILED: failed git sincerely, -The Buildbot
Re: [RFC PATCH 1/6] kvm: add device control API
On Mon, Feb 18, 2013 at 4:53 PM, Scott Wood scottw...@freescale.com wrote: On 02/18/2013 06:44:20 PM, Christoffer Dall wrote: On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com wrote: index 0350e0d..dbaf012 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -335,6 +335,25 @@ struct kvm_memslots { short id_to_index[KVM_MEM_SLOTS_NUM]; }; +/* + * The worst case number of simultaneous devices will likely be very low + * (usually zero or one) for the forseeable future. If the worst case + * exceeds this, then it can be increased, or we can convert to idr. + */ This comment is on the heavy side (if at all needed). If you want to remind people of idr, just put that in a single line. A define is a define is a define. OK. +#define KVM_CREATE_DEVICE_IOWR(KVMIO, 0xac, struct kvm_create_device) +#define KVM_SET_DEVICE_ATTR _IOW(KVMIO, 0xad, struct kvm_device_attr) +#define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xae, struct kvm_device_attr) _IOWR ? struct kvm_device_attr itself is write-only, though the data pointed to by the addr field goes the other way for GET. ONE_REG is in the same situation and also uses _IOW for both. ok. Btw., what about the size of the attr? implicitly defined through the attr id? +static int kvm_ioctl_create_device(struct kvm *kvm, + struct kvm_create_device *cd) +{ + struct kvm_device *dev = NULL; + bool test = cd-flags KVM_CREATE_DEVICE_TEST; + int id; + int r; + + mutex_lock(kvm-lock); + + id = kvm-num_devices; + if (id = KVM_MAX_DEVICES !test) { + r = -ENOSPC; + goto out; + } + + switch (cd-type) { + default: + r = -ENODEV; + goto out; + } do we really believe that there will be any arch-generic recognition of types? shouldn't this be a call to an arch-specific function instead. Which makes me wonder whether the device type IDs should be arch specific as well... I prefer to look at it from the other direction -- is there any reason why this *should* be architecture specific? What will that make easier? The fact that you don't have to create static inlines for the architectures that don't define the functions that get called or have to similar #ifdef tricks, and I also think it's easier to read the arch-specific bits of the code that way, instead of some arbitrary function that you have to trace through to figure out where it's called from. By doing device recognition here we don't need a separate copy of this per arch (including some #ifdef or modifying every arch at once -- including ARM which I can't modify yet because it's not merged), and *if* we should end up with an in-kernel-emulated device that gets used across multiple architectures, it would be annoying to have to modify all relevant architectures (and worse to deal with per-arch numberspaces). I would say that's exactly what you're going to need with your approach: switch (cd-type) { case KVM_ARM_VGIC_V2_0: kvm_arm_vgic_v2_0_create(...); } are you going to ifdef here in this function, or? I think it's cleaner to have the single arch-specific hooks and handle the cases there. The use case of having a single device which is so central to the system that we emulate it inside the kernel and is shared across multiple archs is pretty far fetched to me. However, this is internal and can always be changed, so if everyone agrees on the overall API, whichever way you implement it is fine with me. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 54071] New: kvm guests stuck (not in io) and cannot be killed using -9
https://bugzilla.kernel.org/show_bug.cgi?id=54071 Summary: kvm guests stuck (not in io) and cannot be killed using -9 Product: Virtualization Version: unspecified Kernel Version: 3.5.0-23.35 Platform: All OS/Version: Linux Tree: Mainline Status: NEW Severity: high Priority: P1 Component: kvm AssignedTo: virtualization_...@kernel-bugs.osdl.org ReportedBy: i...@corinlangosch.com Regression: No I have two stuck kvm guests hanging in state R (for several hours now) on two different hosts. Actually they are not consuming any CPU and cannot be killed with kill -9. They are not responding to anything. Both systems are Ubuntu 12.10 Quantal running on a AMD II X4 965 Processor with 16GB RAM. Stack trace of the first stuck guest: cat /proc/2647/stack [81683e06] retint_careful+0x14/0x32 [] 0x Stack trace of the second stuck guest: cat /proc/11932/stack [81084d2a] __cond_resched+0x2a/0x40 [811544aa] try_to_unmap_file+0x3a/0x6d0 [811553c1] try_to_unmap+0x31/0x70 [811722a8] migrate_pages+0x318/0x500 [811453b4] compact_zone+0x1e4/0x350 [811458bf] try_to_compact_pages+0x16f/0x1d0 [81677a94] __alloc_pages_direct_compact+0xaa/0x191 [8112bee3] __alloc_pages_nodemask+0x473/0x920 [81164b30] alloc_pages_current+0xb0/0x120 [8116bf5d] new_slab+0x22d/0x2c0 [816790ea] __slab_alloc+0x314/0x46e [811708a1] __kmalloc_node_track_caller+0xa1/0x1b0 [81566e99] __alloc_skb+0x79/0x230 [81560a31] sock_alloc_send_pskb+0x1d1/0x320 [81499ec1] tun_get_user+0x131/0x4a0 [8149a759] tun_chr_aio_write+0x69/0x90 [8118297a] do_sync_readv_writev+0xda/0x120 [81182c64] do_readv_writev+0xd4/0x1e0 [81182da5] vfs_writev+0x35/0x60 [81182f2a] sys_writev+0x4a/0xb0 [8168bb69] system_call_fastpath+0x16/0x1b [] 0x Packages: Kernel: 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:15:40 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux KVM: 1.2.0+noroms-0ubuntu2.12.10.2 Command line: kvm -name vm-16762 -machine pc-1.2 -enable-kvm -cpu kvm64 -smp sockets=1,cores=2 -m 1024 -vga cirrus -drive id=drive1245,if=none,cache=writeback,aio=native,format=raw,media=disk,file=rbd:kvm1/vm-16762-disk-1 -device virtio-blk-pci,id=hdrive1245,addr=0x12,drive=drive1245 -netdev tap,id=netdev1243,ifname=tap1243,script=,downscript= -device virtio-net-pci,id=nic1243,addr=0x03,mac=02:48:6e:2b:4e:55,netdev=netdev1243 -chardev socket,id=qmp,path=/var/run/kvm/vm-16762/qmp,server,nowait -mon chardev=qmp,mode=control -monitor unix:/var/run/kvm/vm-16762/mon,server,nowait -vnc 10.0.0.60:21,password -usbdevice tablet -nodefaults -pidfile /var/run/kvm/vm-16762/pid -boot menu=on -k de -chroot /var/run/kvm/vm-16762 -runas kvm -daemonize -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH V4 RESEND 09/22] net: multiqueue support
On 02/13/2013 05:21 AM, Alexander Graf wrote: On 01.02.2013, at 08:39, Jason Wang wrote: This patch adds basic multiqueue support for qemu. The idea is simple, an array of NetClientStates were introduced in NICState, parse_netdev() were extended to find and match all NetClientStates belongs to the backend and place their pointers in NICConf. Then qemu_new_nic can setup a N:N mapping between NICStates that belongs to a nic and NICStates belongs to the netdev. And a queue_index were introduced in NetClientState to track its index. After this, each peers of a NICState were abstracted as a queue. After this change, all NetClientState that belongs to the same backend/nic has the same id. When use want to change the link status, all NetClientStates that belongs to the same backend/nic will be also changed. When user want to delete a device or netdev, all NetClientStates that belongs to the same backend/nic will be deleted also. Changing or deleting an specific queue is not allowed. Signed-off-by: Jason Wang jasow...@redhat.com Signed-off-by: Michael S. Tsirkin m...@redhat.com --- hw/dp8393x.c|2 +- hw/mcf_fec.c|2 +- hw/qdev-properties-system.c | 46 +++--- hw/qdev-properties.h|6 +- include/net/net.h | 18 +-- net/net.c | 113 +++ 6 files changed, 139 insertions(+), 48 deletions(-) diff --git a/hw/dp8393x.c b/hw/dp8393x.c index 0273fad..808157b 100644 --- a/hw/dp8393x.c +++ b/hw/dp8393x.c @@ -900,7 +900,7 @@ void dp83932_init(NICInfo *nd, hwaddr base, int it_shift, s-regs[SONIC_SR] = 0x0004; /* only revision recognized by Linux */ s-conf.macaddr = nd-macaddr; -s-conf.peer = nd-netdev; +s-conf.peers.ncs[0] = nd-netdev; s-nic = qemu_new_nic(net_dp83932_info, s-conf, nd-model, nd-name, s); diff --git a/hw/mcf_fec.c b/hw/mcf_fec.c index 909e32b..8e60f09 100644 --- a/hw/mcf_fec.c +++ b/hw/mcf_fec.c @@ -472,7 +472,7 @@ void mcf_fec_init(MemoryRegion *sysmem, NICInfo *nd, memory_region_add_subregion(sysmem, base, s-iomem); s-conf.macaddr = nd-macaddr; -s-conf.peer = nd-netdev; +s-conf.peers.ncs[0] = nd-netdev; s-nic = qemu_new_nic(net_mcf_fec_info, s-conf, nd-model, nd-name, s); diff --git a/hw/qdev-properties-system.c b/hw/qdev-properties-system.c index ce0f793..ce3af22 100644 --- a/hw/qdev-properties-system.c +++ b/hw/qdev-properties-system.c @@ -173,16 +173,47 @@ PropertyInfo qdev_prop_chr = { static int parse_netdev(DeviceState *dev, const char *str, void **ptr) { -NetClientState *netdev = qemu_find_netdev(str); +NICPeers *peers_ptr = (NICPeers *)ptr; +NICConf *conf = container_of(peers_ptr, NICConf, peers); +NetClientState **ncs = peers_ptr-ncs; +NetClientState *peers[MAX_QUEUE_NUM]; +int queues, i = 0; +int ret; -if (netdev == NULL) { -return -ENOENT; +queues = qemu_find_net_clients_except(str, peers, + NET_CLIENT_OPTIONS_KIND_NIC, + MAX_QUEUE_NUM); +if (queues == 0) { +ret = -ENOENT; +goto err; } -if (netdev-peer) { -return -EEXIST; + +if (queues MAX_QUEUE_NUM) { +ret = -E2BIG; +goto err; +} + +for (i = 0; i queues; i++) { +if (peers[i] == NULL) { +ret = -ENOENT; +goto err; +} + +if (peers[i]-peer) { +ret = -EEXIST; +goto err; +} + +ncs[i] = peers[i]; +ncs[i]-queue_index = i; } -*ptr = netdev; + +conf-queues = queues; + return 0; + +err: +return ret; } static const char *print_netdev(void *ptr) @@ -249,7 +280,8 @@ static void set_vlan(Object *obj, Visitor *v, void *opaque, { DeviceState *dev = DEVICE(obj); Property *prop = opaque; -NetClientState **ptr = qdev_get_prop_ptr(dev, prop); +NICPeers *peers_ptr = qdev_get_prop_ptr(dev, prop); +NetClientState **ptr = peers_ptr-ncs[0]; Error *local_err = NULL; int32_t id; NetClientState *hubport; diff --git a/hw/qdev-properties.h b/hw/qdev-properties.h index ddcf774..20c67f3 100644 --- a/hw/qdev-properties.h +++ b/hw/qdev-properties.h @@ -31,7 +31,7 @@ extern PropertyInfo qdev_prop_pci_host_devaddr; .name = (_name),\ .info = (_prop), \ .offset= offsetof(_state, _field)\ -+ type_check(_type,typeof_field(_state, _field)),\ ++ type_check(_type, typeof_field(_state, _field)), \ } #define DEFINE_PROP_DEFAULT(_name, _state, _field, _defval, _prop, _type) { \ .name = (_name), \ @@ -77,9 +77,9 @@
[Bug 54071] kvm guests stuck (not in io) and cannot be killed using -9
https://bugzilla.kernel.org/show_bug.cgi?id=54071 Gleb g...@redhat.com changed: What|Removed |Added CC||g...@redhat.com --- Comment #1 from Gleb g...@redhat.com 2013-02-19 07:55:17 --- Report this to your OS vendor. Close this bugs unless it is reproducible on upstream kernel. -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/4] powerpc kvm: added multiple TCEs requests support
On 15/02/13 14:24, Paul Mackerras wrote: On Mon, Feb 11, 2013 at 11:12:41PM +1100, a...@ozlabs.ru wrote: +static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt, + unsigned long ioba, unsigned long tce) +{ + unsigned long idx = ioba SPAPR_TCE_SHIFT; + struct page *page; + u64 *tbl; + + /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p window_size=0x%x\n, */ + /* liobn, stt, stt-window_size); */ + if (ioba = stt-window_size) { + pr_err(%s failed on ioba=%lx\n, __func__, ioba); Doesn't this give the guest a way to spam the host logs? And in fact printk in real mode is potentially problematic. I would just leave out this statement. + return H_PARAMETER; + } + + page = stt-pages[idx / TCES_PER_PAGE]; + tbl = (u64 *)page_address(page); I would like to see an explanation of why we are confident that page_address() will work correctly in real mode, across all the combinations of config options that we can have for a ppc64 book3s kernel. It was there before this patch, I just moved it so I would think it has been explained before :) There is no combination on PPC to get WANT_PAGE_VIRTUAL enabled. CONFIG_HIGHMEM is supported for PPC32 only so HASHED_PAGE_VIRTUAL is not enabled on PPC64 either. So this definition is supposed to work on PPC64: #define page_address(page) lowmem_page_address(page) where lowmem_page_address() is arithmetic operation on a page struct address: static __always_inline void *lowmem_page_address(const struct page *page) { return __va(PFN_PHYS(page_to_pfn(page))); } PPC32 will use page_address() from mm/highmem.c, I need some lesson about memory layout in 32bit but for now I cannot see how it can possibly fail here. + + /* FIXME: Need to validate the TCE itself */ + /* udbg_printf(tce @ %p\n, tbl[idx % TCES_PER_PAGE]); */ + tbl[idx % TCES_PER_PAGE] = tce; + + return H_SUCCESS; +} + +/* + * Real mode handlers */ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn, unsigned long ioba, unsigned long tce) { - struct kvm *kvm = vcpu-kvm; struct kvmppc_spapr_tce_table *stt; - /* udbg_printf(H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n, */ - /* liobn, ioba, tce); */ + stt = find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!stt) + return H_TOO_HARD; + + /* Emulated IO */ + return emulated_h_put_tce(stt, ioba, tce); +} + +long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_list, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *stt; + long i, ret = 0; + unsigned long *tces; + + stt = find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!stt) + return H_TOO_HARD; - list_for_each_entry(stt, kvm-arch.spapr_tce_tables, list) { - if (stt-liobn == liobn) { - unsigned long idx = ioba SPAPR_TCE_SHIFT; - struct page *page; - u64 *tbl; + tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL); + if (!tces) + return H_TOO_HARD; - /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p window_size=0x%x\n, */ - /* liobn, stt, stt-window_size); */ - if (ioba = stt-window_size) - return H_PARAMETER; + /* Emulated IO */ + for (i = 0; (i npages) !ret; ++i, ioba += IOMMU_PAGE_SIZE) + ret = emulated_h_put_tce(stt, ioba, tces[i]); So, tces is a pointer to somewhere inside a real page. Did we check somewhere that tces[npages-1] is in the same page as tces[0]? If so, I missed it. If we didn't, then we probably should check and do something about it. - page = stt-pages[idx / TCES_PER_PAGE]; - tbl = (u64 *)page_address(page); + return ret; +} - /* FIXME: Need to validate the TCE itself */ - /* udbg_printf(tce @ %p\n, tbl[idx % TCES_PER_PAGE]); */ - tbl[idx % TCES_PER_PAGE] = tce; - return H_SUCCESS; - } - } +long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu, + unsigned long liobn, unsigned long ioba, + unsigned long tce_value, unsigned long npages) +{ + struct kvmppc_spapr_tce_table *stt; + long i, ret = 0; + + stt = find_tce_table(vcpu, liobn); + /* Didn't find the liobn, put it to userspace */ + if (!stt) + return H_TOO_HARD; - /* Didn't find the liobn, punt it to userspace */ - return H_TOO_HARD; + /* Emulated
Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip
Can you tell us why mpic should be in kernel? Is it used often by modern guests or may be they prefer MSI for interrupt delivery (hmm may be MSIs are delivered through mpic too)? On x86 we actually would've preferred to move PIC/IOAPIC form the kernel and leave only LAPIC there (but for historical reasons creation of PIC/IOAPIC/LAPIC are bundled together) hence my question. On Wed, Feb 13, 2013 at 11:49:14PM -0600, Scott Wood wrote: Scott Wood (6): kvm: add device control API kvm/ppc: add a notifier chain for vcpu creation/destruction. kvm/ppc/mpic: import hw/openpic.c from QEMU kvm/ppc/mpic: remove some obviously unneeded code kvm/ppc/mpic: adapt to kernel style and environment kvm/ppc/mpic: in-kernel MPIC emulation Documentation/virtual/kvm/api.txt | 76 ++ Documentation/virtual/kvm/devices/README |1 + Documentation/virtual/kvm/devices/mpic.txt | 36 + arch/powerpc/include/asm/kvm_host.h|9 +- arch/powerpc/include/asm/kvm_ppc.h |4 + arch/powerpc/kvm/Kconfig |5 + arch/powerpc/kvm/Makefile |2 + arch/powerpc/kvm/booke.c | 10 +- arch/powerpc/kvm/mpic.c| 1890 arch/powerpc/kvm/powerpc.c | 31 +- include/linux/kvm_host.h | 44 +- include/uapi/linux/kvm.h | 34 + virt/kvm/kvm_main.c| 141 +++ 13 files changed, 2268 insertions(+), 15 deletions(-) create mode 100644 Documentation/virtual/kvm/devices/README create mode 100644 Documentation/virtual/kvm/devices/mpic.txt create mode 100644 arch/powerpc/kvm/mpic.c -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
Copying Christoffer since ARM has in kernel irq chip too. On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. You are not only provide different way to create in kernel irq chip you also use an alternate way to trigger interrupt lines. Before going into interface specifics lets think about whether it is really worth it? x86 obviously support old way and will have to for some, very long, time. ARM vGIC code, that is ready to go upstream, uses old way too. So it will be 2 archs against one. Christoffer do you think the proposed way it better for your needs. Are you willing to make vGIC use it? Scott, what other devices are you planning to support with this interface? -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller
On 02/15/2013 10:51:16 PM, Paul Mackerras wrote: On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote: On 02/15/2013 08:56:14 PM, Paul Mackerras wrote: I have no particular objection to the device control API per se, but I have two objections to using it as the primary interface to the XICS emulation. First, I dislike the magical side-effect where creating a device of a particular type (e.g. MPIC or XICS) automatically attaches it to the interrupt lines of the vcpus. I prefer an explicit request to do in-kernel interrupt control. OK. This is device-specific behavior, so you could define it differently for XICS than MPIC. I suppose we could change it for MPIC as well, to leave an opening for the unlikely case where we'd want to model an MPIC that isn't directly connected to the CPUs. You can have cascaded MPICs; I once had a powerbook that had one MPIC cascaded into another. OK. How is the explicit request made in this patchset? The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a specific interrupt controller architecture connected to the vcpus' external interrupt inputs. In that sense it's explicit, compared to a generic create device ioctl that could be for any device. Hooking up to the CPU's interrupt lines is implicit in creating an MPIC (and I'm fine with changing that), not in creating any device. I don't see how it's worse than being implicit in calling KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips). The standardized interface doesn't make things any easier (as noted above, the caller is already mpic-specific code), and we'd have to come up with a scheme for flattening our interrupt numberspace (rather than introduce new attribute groups for things like IPI and timer interrupts). It may still be necessary when it comes to irqfd, though... With 24 bits of interrupt source number, you don't have to flatten it very far. IPIs are handled separately and don't appear in the interrupt source space. They do need to appear in the interrupt source space if we want to inject or irqfd them. Most users won't want to do that, but we have had a customer directly assign IPIs (to communicate with an OS running on the other CPU in an AMP setup -- host Linux was non-SMP so wasn't using them) and MPIC timers to a guest. How do MSIs get injected? Just like other interrupts - from the point of view of the interrupt controller they're edge-triggered interrupt sources. Ah right, I guess this is all set up via hcalls for XICS. With MPIC exposing its registers via the device control api, everything just works -- the PCI device generates a write to the MPIC's memory region, the QEMU MPIC stub sends the write to the kernel as for any other MMIO access (this passthrough is also useful for debugging), the in-kernel MPIC sees the write to the generate an MSI register and does its thing. Compare that to all special the MSI code for APIC... :-) You're doing a round trip to userspace for every MPIC register access by the guest? Seriously? No. Accesses by the guest get handled in the kernel. Accesses in QEMU, including MSIs generated by virtio, get forwarded to the kernel. There are plenty of bits free in the 64 bits per source that I have allowed. We can accommodate those things. MPIC vector numbers take up 16 of the bits. The architected interrupt level field is 8 bits, though only a handful of values are actually needed. Add a couple binary flags, and it gets pretty tight if a third type of interrupt controller starts wanting something new. There's enough space for MPIC to have 16 bits of vector and some flags. We don't need to overdesign this. I view anything other than passing the actual MPIC register values around as overdesign here, given that it is communication between hw/kvm/mpic.c on the QEMU side and arch/powerpc/kvm/mpic.c on the kernel side. interrupts awaiting service (or even in service, if different priorities). We have both current task priority (which is a user-set mask-by-priority register) and the priority of the highest-prio in-service interrupt -- which would current processor priorty be? Etc. It would be the current task priority. I assume MPIC maintains a 16-bit map of the interrupt priorities in service, so that would need to be added. We don't maintain such a map in the emulation code. We have a per-CPU bitmap of the actual interrupt sources pending/active, which is another attribute that would need to be added in order to support migration on MPIC. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
On 02/18/2013 06:21:59 AM, Gleb Natapov wrote: Copying Christoffer since ARM has in kernel irq chip too. On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. You are not only provide different way to create in kernel irq chip you also use an alternate way to trigger interrupt lines. Before going into interface specifics lets think about whether it is really worth it? Which it do you mean here? The ability to set/get attributes is needed. Sorry, but get or set one blob of data, up to 512 bytes, for the entire irqchip is just not good enough -- assuming you don't want us to start sticking pointers and commands in *that* data. :-) If you mean the way to inject interrupts, it's simpler this way. Why go out of our way to inject common glue code into a communication path between hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM? Or rather, why make that common glue be specific to this one function when we could reuse the same communication glue used for other things, such as device state? And that's just for regular interrupts. MSIs are vastly simpler on MPIC than what x86 does. x86 obviously support old way and will have to for some, very long, time. Sure. ARM vGIC code, that is ready to go upstream, uses old way too. So it will be 2 archs against one. I wasn't aware that that's how it worked. :-P I was trying to be considerate by not making the entire thing gratuitously PPC or MPIC specific, as some others seem inclined to do (e.g. see irqfd and APIC). We already had a discussion on ARM's set address ioctl and rather than extend *that* interface, they preferred to just stick something ARM-specific in ASAP with the understanding that it would be replaced (or more accurately, kept around as a thin wrapper around the new stuff) later. Christoffer do you think the proposed way it better for your needs. Are you willing to make vGIC use it? Scott, what other devices are you planning to support with this interface? At the moment I do not have plans for other devices, though what does it hurt for the capability to be there? -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip
On 02/18/2013 06:04:51 AM, Gleb Natapov wrote: Can you tell us why mpic should be in kernel? Is it used often by modern guests or may be they prefer MSI for interrupt delivery (hmm may be MSIs are delivered through mpic too)? Yes, MSIs are delivered through the mpic. Plus, MSIs are only (normally) for PCI(e). We have embedded system on a chips with important non-PCI devices, which cannot use MSIs. On x86 we actually would've preferred to move PIC/IOAPIC form the kernel and leave only LAPIC there (but for historical reasons creation of PIC/IOAPIC/LAPIC are bundled together) hence my question. We don't have that same split on this hardware. MPIC is one device that covers all of it. Some of the functionality is per-CPU, but it is not easily extracted from the rest. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
On Mon, Feb 18, 2013 at 3:01 PM, Scott Wood scottw...@freescale.com wrote: On 02/18/2013 06:21:59 AM, Gleb Natapov wrote: Copying Christoffer since ARM has in kernel irq chip too. On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. You are not only provide different way to create in kernel irq chip you also use an alternate way to trigger interrupt lines. Before going into interface specifics lets think about whether it is really worth it? Which it do you mean here? The ability to set/get attributes is needed. Sorry, but get or set one blob of data, up to 512 bytes, for the entire irqchip is just not good enough -- assuming you don't want us to start sticking pointers and commands in *that* data. :-) If you mean the way to inject interrupts, it's simpler this way. Why go out of our way to inject common glue code into a communication path between hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM? Or rather, why make that common glue be specific to this one function when we could reuse the same communication glue used for other things, such as device state? And that's just for regular interrupts. MSIs are vastly simpler on MPIC than what x86 does. x86 obviously support old way and will have to for some, very long, time. Sure. ARM vGIC code, that is ready to go upstream, uses old way too. So it will be 2 archs against one. I wasn't aware that that's how it worked. :-P I was trying to be considerate by not making the entire thing gratuitously PPC or MPIC specific, as some others seem inclined to do (e.g. see irqfd and APIC). We already had a discussion on ARM's set address ioctl and rather than extend *that* interface, they preferred to just stick something ARM-specific in ASAP with the understanding that it would be replaced (or more accurately, kept around as a thin wrapper around the new stuff) later. Christoffer do you think the proposed way it better for your needs. Are you willing to make vGIC use it? Well it won't improve functionality much from the current hardware point of view, but the proposed interface is superior to what we have now. Adding and coordinating new interfaces is indeed a pain, so a generic interface which is flexible enough to cater for a certain group of needs, is welcome imho. And this does seem to fit the bill. I can imagine that if there's support for a future ARM gic version or if we add in-kernel support for other stuff on ARM, then this interface will be useful, and in fact using the current interface to support two separate, but similar, interrupt controllers could get messy from a user space point of view. I am definitely willing to change to use this interface, the agreement on the KVM_ARM_SET_DEVICE_ADDR ioctl was exactly because of this. I had some nits on the RFC, which I'll send separately. Scott, what other devices are you planning to support with this interface? At the moment I do not have plans for other devices, though what does it hurt for the capability to be there? -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com wrote: Currently, devices that are emulated inside KVM are configured in a hardcoded manner based on an assumption that any given architecture only has one way to do it. If there's any need to access device state, it is done through inflexible one-purpose-only IOCTLs (e.g. KVM_GET/SET_LAPIC). Defining new IOCTLs for every little thing is cumbersome and depletes a limited numberspace. This API provides a mechanism to instantiate a device of a certain type, returning an ID that can be used to set/get attributes of the device. Attributes may include configuration parameters (e.g. register base address), device state, operational commands, etc. It is similar to the ONE_REG API, except that it acts on devices rather than vcpus. Both device types and individual attributes can be tested without having to create the device or get/set the attribute, without the need for separately managing enumerated capabilities. Signed-off-by: Scott Wood scottw...@freescale.com --- Documentation/virtual/kvm/api.txt| 76 ++ Documentation/virtual/kvm/devices/README |1 + include/linux/kvm_host.h | 21 + include/uapi/linux/kvm.h | 25 ++ virt/kvm/kvm_main.c | 127 ++ 5 files changed, 250 insertions(+) create mode 100644 Documentation/virtual/kvm/devices/README diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index c2534c3..5bcdb42 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -2122,6 +2122,82 @@ header; first `n_valid' valid entries with contents from the data written, then `n_invalid' invalid entries, invalidating any previously valid entries found. +4.79 KVM_CREATE_DEVICE + +Capability: KVM_CAP_DEVICE_CTRL +Type: vm ioctl +Parameters: struct kvm_create_device (in/out) +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device type is unknown or unsupported + EEXIST: Device already created, and this type of device may not + be instantiated multiple times + ENOSPC: Too many devices have been created + + Other error conditions may be defined by individual device types. + +Creates an emulated device in the kernel. The returned handle +can be used with KVM_SET/GET_DEVICE_ATTR. + +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the +device type is supported (not necessarily whether it can be created +in the current vm). + +Individual devices should not define flags. Attributes should be used +for specifying any behavior that is not implied by the device type +number. + +struct kvm_create_device { + __u32 type; /* in: KVM_DEV_TYPE_xxx */ + __u32 id; /* out: device handle */ + __u32 flags; /* in: KVM_CREATE_DEVICE_xxx */ +}; + +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR + +Capability: KVM_CAP_DEVICE_CTRL +Type: vm ioctl +Parameters: struct kvm_device_attr +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device id is invalid + ENXIO: The group or attribute is unknown/unsupported for this device + EPERM: The attribute cannot (currently) be accessed this way + (e.g. read-only attribute, or attribute that only makes + sense when the device is in a different state) + + Other error conditions may be defined by individual device types. + +Gets/sets a specified piece of device configuration and/or state. The +semantics are device-specific except for certain global attributes. See +individual device documentation in the devices directory. As with +ONE_REG, the size of the data transferred is defined by the particular +attribute. + +Attributes in group KVM_DEV_ATTR_COMMON are not device-specific: + KVM_DEV_ATTR_TYPE (ro, 32-bit): the device type passed to KVM_CREATE_DEVICE + +struct kvm_device_attr { + __u32 dev;/* id from KVM_CREATE_DEVICE */ + __u32 group; /* KVM_DEV_ATTR_COMMON or device-defined */ + __u64 attr; /* group-defined */ + __u64 addr; /* userspace address of attr data */ +}; + +4.81 KVM_HAS_DEVICE_ATTR + +Capability: KVM_CAP_DEVICE_CTRL +Type: vm ioctl +Parameters: struct kvm_device_attr +Returns: 0 on success, -1 on error +Errors: + ENODEV: The device id is invalid + ENXIO: The group or attribute is unknown/unsupported for this device + +Tests whether a device supports a particular attribute. A successful +return indicates the attribute is implemented. It does not necessarily +indicate that the attribute can be read or written in the device's +current state. addr is ignored. 5. The kvm_run structure diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README new file mode 100644 index
Re: [PULL 00/14] ppc patch queue 2013-02-15
Pulled, thanks. On Fri, Feb 15, 2013 at 01:16:16AM +0100, Alexander Graf wrote: Hi Marcelo / Gleb, This is my current patch queue for ppc. Please pull. Highlights of this queue drop are: - BookE: Fast mapping support for 4k backed memory - BookE: Handle alignment interrupts Alex The following changes since commit cbd29cb6e38af6119df2cdac0c58acf0e85c177e: Jan Kiszka (1): KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr are available in the git repository at: git://github.com/agraf/linux-2.6.git kvm-ppc-next Alexander Graf (11): KVM: PPC: E500: Move write_stlbe higher KVM: PPC: E500: Explicitly mark shadow maps invalid KVM: PPC: E500: Propagate errors when shadow mapping KVM: PPC: e500: Call kvmppc_mmu_map for initial mapping KVM: PPC: E500: Split host and guest MMU parts KVM: PPC: e500: Implement TLB1-in-TLB0 mapping KVM: PPC: E500: Make clear_tlb_refs and clear_tlb1_bitmap static KVM: PPC: E500: Remove kvmppc_e500_tlbil_all usage from guest TLB code Merge commit 'origin/next' into kvm-ppc-next KVM: PPC: BookE: Handle alignment interrupts Merge commit 'origin/next' into kvm-ppc-next Bharat Bhushan (3): KVM: PPC: booke: use vcpu reference from thread_struct KVM: PPC: booke: Allow multiple exception types booke: Added DBCR4 SPR number arch/powerpc/include/asm/kvm_ppc.h |2 - arch/powerpc/include/asm/reg.h |2 - arch/powerpc/include/asm/reg_booke.h |1 + arch/powerpc/kernel/asm-offsets.c|2 +- arch/powerpc/kvm/Makefile|9 +- arch/powerpc/kvm/booke.c | 30 +- arch/powerpc/kvm/booke.h |1 + arch/powerpc/kvm/booke_interrupts.S | 49 +- arch/powerpc/kvm/e500.c | 16 +- arch/powerpc/kvm/e500.h |1 + arch/powerpc/kvm/e500_mmu.c | 809 +++ arch/powerpc/kvm/e500_mmu_host.c | 699 + arch/powerpc/kvm/e500_mmu_host.h | 18 + arch/powerpc/kvm/e500_tlb.c | 1430 -- 14 files changed, 1610 insertions(+), 1459 deletions(-) create mode 100644 arch/powerpc/kvm/e500_mmu.c create mode 100644 arch/powerpc/kvm/e500_mmu_host.c create mode 100644 arch/powerpc/kvm/e500_mmu_host.h delete mode 100644 arch/powerpc/kvm/e500_tlb.c -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
On 02/18/2013 06:44:20 PM, Christoffer Dall wrote: On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com wrote: index 0350e0d..dbaf012 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -335,6 +335,25 @@ struct kvm_memslots { short id_to_index[KVM_MEM_SLOTS_NUM]; }; +/* + * The worst case number of simultaneous devices will likely be very low + * (usually zero or one) for the forseeable future. If the worst case + * exceeds this, then it can be increased, or we can convert to idr. + */ This comment is on the heavy side (if at all needed). If you want to remind people of idr, just put that in a single line. A define is a define is a define. OK. +#define KVM_CREATE_DEVICE_IOWR(KVMIO, 0xac, struct kvm_create_device) +#define KVM_SET_DEVICE_ATTR _IOW(KVMIO, 0xad, struct kvm_device_attr) +#define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xae, struct kvm_device_attr) _IOWR ? struct kvm_device_attr itself is write-only, though the data pointed to by the addr field goes the other way for GET. ONE_REG is in the same situation and also uses _IOW for both. +static int kvm_ioctl_create_device(struct kvm *kvm, + struct kvm_create_device *cd) +{ + struct kvm_device *dev = NULL; + bool test = cd-flags KVM_CREATE_DEVICE_TEST; + int id; + int r; + + mutex_lock(kvm-lock); + + id = kvm-num_devices; + if (id = KVM_MAX_DEVICES !test) { + r = -ENOSPC; + goto out; + } + + switch (cd-type) { + default: + r = -ENODEV; + goto out; + } do we really believe that there will be any arch-generic recognition of types? shouldn't this be a call to an arch-specific function instead. Which makes me wonder whether the device type IDs should be arch specific as well... I prefer to look at it from the other direction -- is there any reason why this *should* be architecture specific? What will that make easier? By doing device recognition here we don't need a separate copy of this per arch (including some #ifdef or modifying every arch at once -- including ARM which I can't modify yet because it's not merged), and *if* we should end up with an in-kernel-emulated device that gets used across multiple architectures, it would be annoying to have to modify all relevant architectures (and worse to deal with per-arch numberspaces). -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm/powerpc/e500mc: fix tlb invalidation on cpu migration
The existing check handles the case where we've migrated to a different core than we last ran on, but it doesn't handle the case where we're still on the same cpu we last ran on, but some other vcpu has run on this cpu in the meantime. Signed-off-by: Scott Wood scottw...@freescale.com --- This seems to have been the cause of the userspace segfaults I was seeing (the other TLB patches I posted are still needed as well). arch/powerpc/kvm/e500mc.c |6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c index 1f89d26..8637689 100644 --- a/arch/powerpc/kvm/e500mc.c +++ b/arch/powerpc/kvm/e500mc.c @@ -111,6 +111,7 @@ void kvmppc_mmu_msr_notify(struct kvm_vcpu *vcpu, u32 old_msr) void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu) { struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu); + static struct kvm_vcpu *last_vcpu_on_cpu[NR_CPUS]; kvmppc_booke_vcpu_load(vcpu, cpu); @@ -136,8 +137,11 @@ void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu) mtspr(SPRN_GDEAR, vcpu-arch.shared-dar); mtspr(SPRN_GESR, vcpu-arch.shared-esr); - if (vcpu-arch.oldpir != mfspr(SPRN_PIR)) + if (vcpu-arch.oldpir != mfspr(SPRN_PIR) || + last_vcpu_on_cpu[smp_processor_id()] != vcpu) { kvmppc_e500_tlbil_all(vcpu_e500); + last_vcpu_on_cpu[smp_processor_id()] = vcpu; + } kvmppc_load_guest_fp(vcpu); } -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 1/6] kvm: add device control API
On Mon, Feb 18, 2013 at 4:53 PM, Scott Wood scottw...@freescale.com wrote: On 02/18/2013 06:44:20 PM, Christoffer Dall wrote: On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com wrote: index 0350e0d..dbaf012 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -335,6 +335,25 @@ struct kvm_memslots { short id_to_index[KVM_MEM_SLOTS_NUM]; }; +/* + * The worst case number of simultaneous devices will likely be very low + * (usually zero or one) for the forseeable future. If the worst case + * exceeds this, then it can be increased, or we can convert to idr. + */ This comment is on the heavy side (if at all needed). If you want to remind people of idr, just put that in a single line. A define is a define is a define. OK. +#define KVM_CREATE_DEVICE_IOWR(KVMIO, 0xac, struct kvm_create_device) +#define KVM_SET_DEVICE_ATTR _IOW(KVMIO, 0xad, struct kvm_device_attr) +#define KVM_GET_DEVICE_ATTR _IOW(KVMIO, 0xae, struct kvm_device_attr) _IOWR ? struct kvm_device_attr itself is write-only, though the data pointed to by the addr field goes the other way for GET. ONE_REG is in the same situation and also uses _IOW for both. ok. Btw., what about the size of the attr? implicitly defined through the attr id? +static int kvm_ioctl_create_device(struct kvm *kvm, + struct kvm_create_device *cd) +{ + struct kvm_device *dev = NULL; + bool test = cd-flags KVM_CREATE_DEVICE_TEST; + int id; + int r; + + mutex_lock(kvm-lock); + + id = kvm-num_devices; + if (id = KVM_MAX_DEVICES !test) { + r = -ENOSPC; + goto out; + } + + switch (cd-type) { + default: + r = -ENODEV; + goto out; + } do we really believe that there will be any arch-generic recognition of types? shouldn't this be a call to an arch-specific function instead. Which makes me wonder whether the device type IDs should be arch specific as well... I prefer to look at it from the other direction -- is there any reason why this *should* be architecture specific? What will that make easier? The fact that you don't have to create static inlines for the architectures that don't define the functions that get called or have to similar #ifdef tricks, and I also think it's easier to read the arch-specific bits of the code that way, instead of some arbitrary function that you have to trace through to figure out where it's called from. By doing device recognition here we don't need a separate copy of this per arch (including some #ifdef or modifying every arch at once -- including ARM which I can't modify yet because it's not merged), and *if* we should end up with an in-kernel-emulated device that gets used across multiple architectures, it would be annoying to have to modify all relevant architectures (and worse to deal with per-arch numberspaces). I would say that's exactly what you're going to need with your approach: switch (cd-type) { case KVM_ARM_VGIC_V2_0: kvm_arm_vgic_v2_0_create(...); } are you going to ifdef here in this function, or? I think it's cleaner to have the single arch-specific hooks and handle the cases there. The use case of having a single device which is so central to the system that we emulate it inside the kernel and is shared across multiple archs is pretty far fetched to me. However, this is internal and can always be changed, so if everyone agrees on the overall API, whichever way you implement it is fine with me. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html