Re: [PATCH 2/4] powerpc kvm: added multiple TCEs requests support

2013-02-18 Thread Alexey Kardashevskiy

On 15/02/13 14:24, Paul Mackerras wrote:

On Mon, Feb 11, 2013 at 11:12:41PM +1100, a...@ozlabs.ru wrote:


+static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
+   unsigned long ioba, unsigned long tce)
+{
+   unsigned long idx = ioba  SPAPR_TCE_SHIFT;
+   struct page *page;
+   u64 *tbl;
+
+   /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p  window_size=0x%x\n, 
*/
+   /*  liobn, stt, stt-window_size); */
+   if (ioba = stt-window_size) {
+   pr_err(%s failed on ioba=%lx\n, __func__, ioba);


Doesn't this give the guest a way to spam the host logs?  And in fact
printk in real mode is potentially problematic.  I would just leave
out this statement.


+   return H_PARAMETER;
+   }
+
+   page = stt-pages[idx / TCES_PER_PAGE];
+   tbl = (u64 *)page_address(page);


I would like to see an explanation of why we are confident that
page_address() will work correctly in real mode, across all the
combinations of config options that we can have for a ppc64 book3s
kernel.


It was there before this patch, I just moved it so I would think it has 
been explained before :)


There is no combination on PPC to get WANT_PAGE_VIRTUAL enabled.
CONFIG_HIGHMEM is supported for PPC32 only so HASHED_PAGE_VIRTUAL is not 
enabled on PPC64 either.


So this definition is supposed to work on PPC64:
#define page_address(page) lowmem_page_address(page)

where lowmem_page_address() is arithmetic operation on a page struct address:
static __always_inline void *lowmem_page_address(const struct page *page)
{
return __va(PFN_PHYS(page_to_pfn(page)));
}

PPC32 will use page_address() from mm/highmem.c, I need some lesson about 
memory layout in 32bit but for now I cannot see how it can possibly fail here.





+
+   /* FIXME: Need to validate the TCE itself */
+   /* udbg_printf(tce @ %p\n, tbl[idx % TCES_PER_PAGE]); */
+   tbl[idx % TCES_PER_PAGE] = tce;
+
+   return H_SUCCESS;
+}
+
+/*
+ * Real mode handlers
   */
  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba, unsigned long tce)
  {
-   struct kvm *kvm = vcpu-kvm;
struct kvmppc_spapr_tce_table *stt;

-   /* udbg_printf(H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n, */
-   /*  liobn, ioba, tce); */
+   stt = find_tce_table(vcpu, liobn);
+   /* Didn't find the liobn, put it to userspace */
+   if (!stt)
+   return H_TOO_HARD;
+
+   /* Emulated IO */
+   return emulated_h_put_tce(stt, ioba, tce);
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_list, unsigned long npages)
+{
+   struct kvmppc_spapr_tce_table *stt;
+   long i, ret = 0;
+   unsigned long *tces;
+
+   stt = find_tce_table(vcpu, liobn);
+   /* Didn't find the liobn, put it to userspace */
+   if (!stt)
+   return H_TOO_HARD;

-   list_for_each_entry(stt, kvm-arch.spapr_tce_tables, list) {
-   if (stt-liobn == liobn) {
-   unsigned long idx = ioba  SPAPR_TCE_SHIFT;
-   struct page *page;
-   u64 *tbl;
+   tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
+   if (!tces)
+   return H_TOO_HARD;

-   /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p  
window_size=0x%x\n, */
-   /*  liobn, stt, stt-window_size); */
-   if (ioba = stt-window_size)
-   return H_PARAMETER;
+   /* Emulated IO */
+   for (i = 0; (i  npages)  !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+   ret = emulated_h_put_tce(stt, ioba, tces[i]);


So, tces is a pointer to somewhere inside a real page.  Did we check
somewhere that tces[npages-1] is in the same page as tces[0]?  If so,
I missed it.  If we didn't, then we probably should check and do
something about it.



-   page = stt-pages[idx / TCES_PER_PAGE];
-   tbl = (u64 *)page_address(page);
+   return ret;
+}

-   /* FIXME: Need to validate the TCE itself */
-   /* udbg_printf(tce @ %p\n, tbl[idx % 
TCES_PER_PAGE]); */
-   tbl[idx % TCES_PER_PAGE] = tce;
-   return H_SUCCESS;
-   }
-   }
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_value, unsigned long npages)
+{
+   struct kvmppc_spapr_tce_table *stt;
+   long i, ret = 0;
+
+   stt = find_tce_table(vcpu, liobn);
+   /* Didn't find the liobn, put it to userspace */
+   if (!stt)
+   return H_TOO_HARD;

-   /* Didn't find the liobn, punt it to userspace */
-   return H_TOO_HARD;
+   /* Emulated 

Re: [PATCH v3] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Gleb Natapov
On Mon, Feb 18, 2013 at 07:32:53AM +0100, Jan Kiszka wrote:
 On 2013-02-14 19:46, Jan Kiszka wrote:
  This prevents trapping L2 I/O exits if L1 has neither unconditional nor
  bitmap-based exiting enabled. Furthermore, it implements basic I/O
  bitmap handling. Repeated string accesses are still reported to L1
  unconditionally for now.
  
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  ---
  
  Changes in v3:
   - trap unconditionally if bitmap access fails
  
   arch/x86/kvm/vmx.c |   55 
  ++-
   1 files changed, 53 insertions(+), 2 deletions(-)
  
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 6667042..2633199 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -5908,6 +5908,58 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
  kvm_vcpu *vcpu) = {
   static const int kvm_vmx_max_exit_handlers =
  ARRAY_SIZE(kvm_vmx_exit_handlers);
   
  +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
  +  struct vmcs12 *vmcs12)
  +{
  +   unsigned long exit_qualification;
  +   gpa_t bitmap, last_bitmap;
  +   bool string, rep;
  +   u16 port;
  +   int size;
  +   u8 b;
  +
  +   if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING))
  +   return 1;
  +
  +   if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
  +   return 0;
  +
  +   exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
  +
  +   string = exit_qualification  16;
  +   rep = exit_qualification  32;
  +
  +   /* TODO: interpret instruction and check range against bitmap */
  +   if (string  rep)
  +   return 1;
 
 Nonsense, rep ins/outs always works against the same port. We can simply
 drop this check and be done with the feature. I'll come up with v4.
 
Actually this reminds me that we should check range of ports depending
on operand size, not one port. But here is a catch, older cpus do not
provide operand size as part of exit information.

 Jan
 
  +
  +   port = exit_qualification  16;
  +   size = (exit_qualification  7) + 1;
  +
  +   last_bitmap = (gpa_t)-1;
  +   b = -1;
  +
  +   while (size  0) {
  +   if (port  0x8000)
  +   bitmap = vmcs12-io_bitmap_a;
  +   else
  +   bitmap = vmcs12-io_bitmap_b;
  +   bitmap += (port  0x7fff) / 8;
  +
  +   if (last_bitmap != bitmap)
  +   if (kvm_read_guest(vcpu-kvm, bitmap, b, 1))
  +   return 1;
  +   if (b  (1  (port  7)))
  +   return 1;
  +
  +   port++;
  +   size--;
  +   last_bitmap = bitmap;
  +   }
  +
  +   return 0;
  +}
  +
   /*
* Return 1 if we should exit from L2 to L1 to handle an MSR access access,
* rather than handle it ourselves in L0. I.e., check whether L1 expressed
  @@ -6097,8 +6149,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu 
  *vcpu)
  case EXIT_REASON_DR_ACCESS:
  return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING);
  case EXIT_REASON_IO_INSTRUCTION:
  -   /* TODO: support IO bitmaps */
  -   return 1;
  +   return nested_vmx_exit_handled_io(vcpu, vmcs12);
  case EXIT_REASON_MSR_READ:
  case EXIT_REASON_MSR_WRITE:
  return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
  
 
 



--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Jan Kiszka
On 2013-02-18 09:44, Gleb Natapov wrote:
 On Mon, Feb 18, 2013 at 07:32:53AM +0100, Jan Kiszka wrote:
 On 2013-02-14 19:46, Jan Kiszka wrote:
 This prevents trapping L2 I/O exits if L1 has neither unconditional nor
 bitmap-based exiting enabled. Furthermore, it implements basic I/O
 bitmap handling. Repeated string accesses are still reported to L1
 unconditionally for now.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---

 Changes in v3:
  - trap unconditionally if bitmap access fails

  arch/x86/kvm/vmx.c |   55 
 ++-
  1 files changed, 53 insertions(+), 2 deletions(-)

 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 6667042..2633199 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -5908,6 +5908,58 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
 kvm_vcpu *vcpu) = {
  static const int kvm_vmx_max_exit_handlers =
 ARRAY_SIZE(kvm_vmx_exit_handlers);
  
 +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
 +  struct vmcs12 *vmcs12)
 +{
 +   unsigned long exit_qualification;
 +   gpa_t bitmap, last_bitmap;
 +   bool string, rep;
 +   u16 port;
 +   int size;
 +   u8 b;
 +
 +   if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING))
 +   return 1;
 +
 +   if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
 +   return 0;
 +
 +   exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 +
 +   string = exit_qualification  16;
 +   rep = exit_qualification  32;
 +
 +   /* TODO: interpret instruction and check range against bitmap */
 +   if (string  rep)
 +   return 1;

 Nonsense, rep ins/outs always works against the same port. We can simply
 drop this check and be done with the feature. I'll come up with v4.

 Actually this reminds me that we should check range of ports depending
 on operand size, not one port. But here is a catch, older cpus do not
 provide operand size as part of exit information.

You mean what bit 54 in VMX_BASIC is telling us? Too bad. OK, will write
v5 which takes this into account.

Jan




signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Gleb Natapov
On Mon, Feb 18, 2013 at 09:53:22AM +0100, Jan Kiszka wrote:
 On 2013-02-18 09:44, Gleb Natapov wrote:
  On Mon, Feb 18, 2013 at 07:32:53AM +0100, Jan Kiszka wrote:
  On 2013-02-14 19:46, Jan Kiszka wrote:
  This prevents trapping L2 I/O exits if L1 has neither unconditional nor
  bitmap-based exiting enabled. Furthermore, it implements basic I/O
  bitmap handling. Repeated string accesses are still reported to L1
  unconditionally for now.
 
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
  ---
 
  Changes in v3:
   - trap unconditionally if bitmap access fails
 
   arch/x86/kvm/vmx.c |   55 
  ++-
   1 files changed, 53 insertions(+), 2 deletions(-)
 
  diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
  index 6667042..2633199 100644
  --- a/arch/x86/kvm/vmx.c
  +++ b/arch/x86/kvm/vmx.c
  @@ -5908,6 +5908,58 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
  kvm_vcpu *vcpu) = {
   static const int kvm_vmx_max_exit_handlers =
ARRAY_SIZE(kvm_vmx_exit_handlers);
   
  +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
  +struct vmcs12 *vmcs12)
  +{
  + unsigned long exit_qualification;
  + gpa_t bitmap, last_bitmap;
  + bool string, rep;
  + u16 port;
  + int size;
  + u8 b;
  +
  + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING))
  + return 1;
  +
  + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
  + return 0;
  +
  + exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
  +
  + string = exit_qualification  16;
  + rep = exit_qualification  32;
  +
  + /* TODO: interpret instruction and check range against bitmap */
  + if (string  rep)
  + return 1;
 
  Nonsense, rep ins/outs always works against the same port. We can simply
  drop this check and be done with the feature. I'll come up with v4.
 
  Actually this reminds me that we should check range of ports depending
  on operand size, not one port. But here is a catch, older cpus do not
  provide operand size as part of exit information.
 
 You mean what bit 54 in VMX_BASIC is telling us? Too bad. OK, will write
 v5 which takes this into account.
 
Yes, this one. We can just exit unconditionally on older cpus.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/9] virtio-blk: use virtqueue_start_buf on req path

2013-02-18 Thread Paolo Bonzini
Il 17/02/2013 07:37, Asias He ha scritto:
   static int __virtblk_add_req(struct virtqueue *vq,
  -   struct virtblk_req *vbr)
  +   struct virtblk_req *vbr,
  +   struct scatterlist *data_sg,
  +   unsigned data_nents)
   {
 struct scatterlist sg;
 enum dma_data_direction dir;
 int ret;
   
  +  int type = vbr-out_hdr.type  ~VIRTIO_BLK_T_OUT;
 unsigned int nents = 2;
 unsigned int nsg = 2;
   
  -  if (vbr-nents) {
  +  if (type == VIRTIO_BLK_T_SCSI_CMD) {
  +  BUG_ON(use_bio);
 Do we really need the BUG_ON?  Even if with use_bio=1,
 VIRTIO_BLK_T_SCSI_CMD cmd can be fired. See this:

I stand corrected... will send the patch with this removed.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v5] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

This prevents trapping L2 I/O exits if L1 has neither unconditional nor
bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
handling. We still exit unconditionally in case the CPU does not provide
information for ins/outs.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

Changes in v5:
 - still exit unconditionally if CPU refuses to provide exit
   information on ins/outs

 arch/x86/kvm/vmx.c |   58 ++-
 1 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..ccc7c17 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -651,6 +651,7 @@ static struct vmcs_config {
int size;
int order;
u32 revision_id;
+   u32 vmx_basic_high;
u32 pin_based_exec_ctrl;
u32 cpu_based_exec_ctrl;
u32 cpu_based_2nd_exec_ctrl;
@@ -752,6 +753,11 @@ static inline bool vm_need_tpr_shadow(struct kvm *kvm)
return (cpu_has_vmx_tpr_shadow())  (irqchip_in_kernel(kvm));
 }
 
+static inline bool cpu_has_stringio_exit_info(void)
+{
+   return vmcs_config.vmx_basic_high  (VMX_BASIC_INOUT  32);
+}
+
 static inline bool cpu_has_secondary_exec_ctrls(void)
 {
return vmcs_config.cpu_based_exec_ctrl 
@@ -2635,6 +2641,7 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
vmcs_conf-size = vmx_msr_high  0x1fff;
vmcs_conf-order = get_order(vmcs_config.size);
vmcs_conf-revision_id = vmx_msr_low;
+   vmcs_conf-vmx_basic_high = vmx_msr_high;
 
vmcs_conf-pin_based_exec_ctrl = _pin_based_exec_control;
vmcs_conf-cpu_based_exec_ctrl = _cpu_based_exec_control;
@@ -5908,6 +5915,54 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
kvm_vcpu *vcpu) = {
 static const int kvm_vmx_max_exit_handlers =
ARRAY_SIZE(kvm_vmx_exit_handlers);
 
+static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
+  struct vmcs12 *vmcs12)
+{
+   unsigned long exit_qualification;
+   gpa_t bitmap, last_bitmap;
+   u16 port;
+   int size;
+   u8 b;
+
+   if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING))
+   return 1;
+
+   if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
+   return 0;
+
+   /* TODO: for older CPUs, derive access width from instruction */
+   if (!cpu_has_stringio_exit_info())
+   return 1;
+
+   exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+   port = exit_qualification  16;
+   size = (exit_qualification  7) + 1;
+
+   last_bitmap = (gpa_t)-1;
+   b = -1;
+
+   while (size  0) {
+   if (port  0x8000)
+   bitmap = vmcs12-io_bitmap_a;
+   else
+   bitmap = vmcs12-io_bitmap_b;
+   bitmap += (port  0x7fff) / 8;
+
+   if (last_bitmap != bitmap)
+   if (kvm_read_guest(vcpu-kvm, bitmap, b, 1))
+   return 1;
+   if (b  (1  (port  7)))
+   return 1;
+
+   port++;
+   size--;
+   last_bitmap = bitmap;
+   }
+
+   return 0;
+}
+
 /*
  * Return 1 if we should exit from L2 to L1 to handle an MSR access access,
  * rather than handle it ourselves in L0. I.e., check whether L1 expressed
@@ -6097,8 +6152,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_DR_ACCESS:
return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING);
case EXIT_REASON_IO_INSTRUCTION:
-   /* TODO: support IO bitmaps */
-   return 1;
+   return nested_vmx_exit_handled_io(vcpu, vmcs12);
case EXIT_REASON_MSR_READ:
case EXIT_REASON_MSR_WRITE:
return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
-- 
1.7.3.4
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH ppc-next v2 42/52] target-ppc: Convert CPU definitions

2013-02-18 Thread Andreas Färber
Turn the array of model definitions into a set of self-registering QOM
types with their own class_init. Unique identifiers are obtained from
the combination of PVR, SVR and family identifiers; this requires all
alias #defines to be removed from the list. Possibly there are some more
left after this commit that are not currently being compiled.

Prepares for introducing abstract intermediate CPU types for families.

Keep the right-aligned macro line breaks within 78 chars to aid
three-way merges.

Signed-off-by: Andreas Färber afaer...@suse.de
---
 target-ppc/cpu-qom.h|   17 -
 target-ppc/cpu.h|   20 --
 target-ppc/kvm.c|   32 +
 target-ppc/translate_init.c |  163 +--
 4 Dateien geändert, 115 Zeilen hinzugefügt(+), 117 Zeilen entfernt(-)

diff --git a/target-ppc/cpu-qom.h b/target-ppc/cpu-qom.h
index b338f8f..7220908 100644
--- a/target-ppc/cpu-qom.h
+++ b/target-ppc/cpu-qom.h
@@ -51,8 +51,21 @@ typedef struct PowerPCCPUClass {
 
 void (*parent_reset)(CPUState *cpu);
 
-/* TODO inline fields here */
-ppc_def_t *info;
+uint32_t pvr;
+uint32_t svr;
+uint64_t insns_flags;
+uint64_t insns_flags2;
+uint64_t msr_mask;
+powerpc_mmu_t   mmu_model;
+powerpc_excp_t  excp_model;
+powerpc_input_t bus_model;
+uint32_t flags;
+int bfd_mach;
+#if defined(TARGET_PPC64)
+const struct ppc_segment_page_sizes *sps;
+#endif
+void (*init_proc)(CPUPPCState *env);
+int  (*check_pow)(CPUPPCState *env);
 } PowerPCCPUClass;
 
 /**
diff --git a/target-ppc/cpu.h b/target-ppc/cpu.h
index 8c081db..86ebd3a 100644
--- a/target-ppc/cpu.h
+++ b/target-ppc/cpu.h
@@ -307,7 +307,6 @@ enum powerpc_input_t {
 #define PPC_INPUT(env) (env-bus_model)
 
 /*/
-typedef struct ppc_def_t ppc_def_t;
 typedef struct opc_handler_t opc_handler_t;
 
 /*/
@@ -902,25 +901,6 @@ struct ppc_segment_page_sizes {
 /* The whole PowerPC CPU context */
 #define NB_MMU_MODES 3
 
-struct ppc_def_t {
-const char *name;
-uint32_t pvr;
-uint32_t svr;
-uint64_t insns_flags;
-uint64_t insns_flags2;
-uint64_t msr_mask;
-powerpc_mmu_t   mmu_model;
-powerpc_excp_t  excp_model;
-powerpc_input_t bus_model;
-uint32_t flags;
-int bfd_mach;
-#if defined(TARGET_PPC64)
-const struct ppc_segment_page_sizes *sps;
-#endif
-void (*init_proc)(CPUPPCState *env);
-int  (*check_pow)(CPUPPCState *env);
-};
-
 struct CPUPPCState {
 /* First are the most commonly used resources
  * during translated code execution
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index 2c64c63..e601059 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -1263,7 +1263,7 @@ static void kvmppc_host_cpu_initfn(Object *obj)
 
 assert(kvm_enabled());
 
-if (pcc-info-pvr != mfpvr()) {
+if (pcc-pvr != mfpvr()) {
 fprintf(stderr, Your host CPU is unsupported.\n
 Please choose a supported model instead, see -cpu ?.\n);
 exit(1);
@@ -1275,30 +1275,38 @@ static void kvmppc_host_cpu_class_init(ObjectClass *oc, 
void *data)
 PowerPCCPUClass *pcc = POWERPC_CPU_CLASS(oc);
 uint32_t host_pvr = mfpvr();
 PowerPCCPUClass *pvr_pcc;
-ppc_def_t *spec;
 uint32_t vmx = kvmppc_get_vmx();
 uint32_t dfp = kvmppc_get_dfp();
 
-spec = g_malloc0(sizeof(*spec));
-
 pvr_pcc = ppc_cpu_class_by_pvr(host_pvr);
 if (pvr_pcc != NULL) {
-memcpy(spec, pvr_pcc-info, sizeof(*spec));
+pcc-pvr  = pvr_pcc-pvr;
+pcc-svr  = pvr_pcc-svr;
+pcc-insns_flags  = pvr_pcc-insns_flags;
+pcc-insns_flags2 = pvr_pcc-insns_flags2;
+pcc-msr_mask = pvr_pcc-msr_mask;
+pcc-mmu_model= pvr_pcc-mmu_model;
+pcc-excp_model   = pvr_pcc-excp_model;
+pcc-bus_model= pvr_pcc-bus_model;
+pcc-flags= pvr_pcc-flags;
+pcc-bfd_mach = pvr_pcc-bfd_mach;
+#ifdef TARGET_PPC64
+pcc-sps  = pvr_pcc-sps;
+#endif
+pcc-init_proc= pvr_pcc-init_proc;
+pcc-check_pow= pvr_pcc-check_pow;
 }
-pcc-info = spec;
-/* Override the display name for -cpu ? and QMP */
-pcc-info-name = host;
 
-/* Now fix up the spec with information we can query from the host */
+/* Now fix up the class with information we can query from the host */
 
 if (vmx != -1) {
 /* Only override when we know what the host supports */
-alter_insns(spec-insns_flags, PPC_ALTIVEC, vmx  0);
-alter_insns(spec-insns_flags2, PPC2_VSX, vmx  1);
+alter_insns(pcc-insns_flags, PPC_ALTIVEC, vmx  0);
+alter_insns(pcc-insns_flags2, PPC2_VSX, vmx  1);
 }
 if (dfp != -1) {
 /* Only override when we know what the host supports */
-

Re: [PATCH v5] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Gleb Natapov
On Mon, Feb 18, 2013 at 10:17:14AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 This prevents trapping L2 I/O exits if L1 has neither unconditional nor
 bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
 handling. We still exit unconditionally in case the CPU does not provide
 information for ins/outs.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
 
 Changes in v5:
  - still exit unconditionally if CPU refuses to provide exit
information on ins/outs
 
  arch/x86/kvm/vmx.c |   58 ++-
  1 files changed, 56 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 6667042..ccc7c17 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -651,6 +651,7 @@ static struct vmcs_config {
   int size;
   int order;
   u32 revision_id;
 + u32 vmx_basic_high;
   u32 pin_based_exec_ctrl;
   u32 cpu_based_exec_ctrl;
   u32 cpu_based_2nd_exec_ctrl;
 @@ -752,6 +753,11 @@ static inline bool vm_need_tpr_shadow(struct kvm *kvm)
   return (cpu_has_vmx_tpr_shadow())  (irqchip_in_kernel(kvm));
  }
  
 +static inline bool cpu_has_stringio_exit_info(void)
 +{
 + return vmcs_config.vmx_basic_high  (VMX_BASIC_INOUT  32);
 +}
 +
  static inline bool cpu_has_secondary_exec_ctrls(void)
  {
   return vmcs_config.cpu_based_exec_ctrl 
 @@ -2635,6 +2641,7 @@ static __init int setup_vmcs_config(struct vmcs_config 
 *vmcs_conf)
   vmcs_conf-size = vmx_msr_high  0x1fff;
   vmcs_conf-order = get_order(vmcs_config.size);
   vmcs_conf-revision_id = vmx_msr_low;
 + vmcs_conf-vmx_basic_high = vmx_msr_high;
  
   vmcs_conf-pin_based_exec_ctrl = _pin_based_exec_control;
   vmcs_conf-cpu_based_exec_ctrl = _cpu_based_exec_control;
 @@ -5908,6 +5915,54 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
 kvm_vcpu *vcpu) = {
  static const int kvm_vmx_max_exit_handlers =
   ARRAY_SIZE(kvm_vmx_exit_handlers);
  
 +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
 +struct vmcs12 *vmcs12)
 +{
 + unsigned long exit_qualification;
 + gpa_t bitmap, last_bitmap;
 + u16 port;
 + int size;
 + u8 b;
 +
 + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING))
 + return 1;
 +
 + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
 + return 0;
 +
 + /* TODO: for older CPUs, derive access width from instruction */
 + if (!cpu_has_stringio_exit_info())
 + return 1;
 +

Sigh, actually I am stupid :( The information that is not available
on older cpus is address size in VM-exit instruction information,
not operand size on exit qualification, so your v4 is correct. Except
handling of port wrap around:

  If the “use I/O bitmaps” VM-execution control is 1, the instruction
  causes a VM exit if it attempts to access an I/O port corresponding to a
  bit set to 1 in the appropriate I/O bitmap (see Section 24.6.4). If an
  I/O operation “wraps around” the 16-bit I/O-port space (accesses
  ports H and H), the I/O instruction causes a VM exit (the
  “unconditional I/O exiting” VM-execution control is ignored if the
  “use I/O bitmaps” VM-execution control is 1).


 + exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 +
 + port = exit_qualification  16;
 + size = (exit_qualification  7) + 1;
 +
 + last_bitmap = (gpa_t)-1;
 + b = -1;
 +
 + while (size  0) {
 + if (port  0x8000)
 + bitmap = vmcs12-io_bitmap_a;
 + else
 + bitmap = vmcs12-io_bitmap_b;
 + bitmap += (port  0x7fff) / 8;
 +
 + if (last_bitmap != bitmap)
 + if (kvm_read_guest(vcpu-kvm, bitmap, b, 1))
 + return 1;
 + if (b  (1  (port  7)))
 + return 1;
 +
 + port++;
 + size--;
 + last_bitmap = bitmap;
 + }
 +
 + return 0;
 +}
 +
  /*
   * Return 1 if we should exit from L2 to L1 to handle an MSR access access,
   * rather than handle it ourselves in L0. I.e., check whether L1 expressed
 @@ -6097,8 +6152,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu 
 *vcpu)
   case EXIT_REASON_DR_ACCESS:
   return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING);
   case EXIT_REASON_IO_INSTRUCTION:
 - /* TODO: support IO bitmaps */
 - return 1;
 + return nested_vmx_exit_handled_io(vcpu, vmcs12);
   case EXIT_REASON_MSR_READ:
   case EXIT_REASON_MSR_WRITE:
   return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
 -- 
 1.7.3.4

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 11/18] KVM/MIPS32: Routines to handle specific traps/exceptions while executing the guest.

2013-02-18 Thread Gleb Natapov
On Fri, Feb 15, 2013 at 11:10:46AM -0500, Sanjay Lal wrote:
 
 On Feb 6, 2013, at 8:20 AM, Gleb Natapov wrote:
 
  On Wed, Nov 21, 2012 at 06:34:09PM -0800, Sanjay Lal wrote:
  +static gpa_t kvm_trap_emul_gva_to_gpa_cb(gva_t gva)
  +{
  +  gpa_t gpa;
  +  uint32_t kseg = KSEGX(gva);
  +
  +  if ((kseg == CKSEG0) || (kseg == CKSEG1))
  You seems to be using KVM_GUEST_KSEGX variants on gva in all other
  places. Why not here?
 
 This function is invoked to handle 2 scenarios:
 (1) Parse the boot code config tables setup by QEMU's Malta emulation. The 
 pointers in the tables are actual KSEG0 addresses (unmapped, cached) and not 
 Guest KSEG0 addresses.
 
Where is it called for that purpose? The only place where gva_to_gpa
callback is called is in kvm/kvm_mips_emul.c:kvm_mips_emulate_(store|load)

 (2) Handle I/O accesses by the guest.  On MIPS platforms, I/O device 
 registers are mapped into the KSEG1 address space (unmapped, uncached).  
 Again like (1) these are actual KSEG1 addresses, which cause an exception and 
 are passed onto QEMU for I/O emulation.
 
So guest KSEG1 registers is mapped to 0xA000-0xBFFF ranges just
like on a host? Can you give corresponding segment names to those ranges

Guest User address space:   0x - 0x4000 (useg?)
Guest Kernel Unmapped:  0x4000 - 0x6000 (kseg0?)
Guest Kernel Mapped:0x6000 - 0x8000 (?)


--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 0/5] virtio-scsi multiqueue

2013-02-18 Thread Wanlong Gao
On 02/12/2013 09:06 PM, Paolo Bonzini wrote:
 This series implements virtio-scsi queue steering, which gives
 performance improvements of up to 50% (measured both with QEMU and
 tcm_vhost backends).  The patches build on top of the new virtio APIs
 at http://permalink.gmane.org/gmane.linux.kernel.virtualization/18431;
 the new API simplifies the locking of the virtio-scsi driver nicely,
 thus it makes sense to require them as a prerequisite.
 
 Changes from the previous post, which can be found at
 http://permalink.gmane.org/gmane.linux.kernel.virtualization/17869:
 
 - patches 1 and 2 (virtio: add functions for piecewise addition of
   buffers, virtio-scsi: use functions for piecewise composition of
   buffers) split into their own series
 
 - new cleanup patch virtio-scsi: push vq lock/unlock into virtscsi_vq_done
 
 - reorganized code to move ACCESS_ONCE in a clearer place
 
 - included Wanlong Gao's CPU hotplug patches
 
 Ok for 3.9?  It would probably be easier to get it in via Rusty's tree
 because of the prerequisites.  James, can I get your Acked-by?

I can't apply this series on top of Rusty's virtio-next, I missed something
or needed rebase them ?

Thanks,
Wanlong Gao

 
 Paolo
 
 Paolo Bonzini (4):
   virtio-scsi: redo allocation of target data
   virtio-scsi: pass struct virtio_scsi to virtqueue completion function
   virtio-scsi: push vq lock/unlock into virtscsi_vq_done
   virtio-scsi: introduce multiqueue support
 
 Wanlong Gao (1):
   virtio-scsi: reset virtqueue affinity when doing cpu hotplug
 
  drivers/scsi/virtio_scsi.c |  360 
 +++-
  1 files changed, 292 insertions(+), 68 deletions(-)
 
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Jan Kiszka
On 2013-02-18 10:36, Gleb Natapov wrote:
 On Mon, Feb 18, 2013 at 10:17:14AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com

 This prevents trapping L2 I/O exits if L1 has neither unconditional nor
 bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
 handling. We still exit unconditionally in case the CPU does not provide
 information for ins/outs.

 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---

 Changes in v5:
  - still exit unconditionally if CPU refuses to provide exit
information on ins/outs

  arch/x86/kvm/vmx.c |   58 
 ++-
  1 files changed, 56 insertions(+), 2 deletions(-)

 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 6667042..ccc7c17 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -651,6 +651,7 @@ static struct vmcs_config {
  int size;
  int order;
  u32 revision_id;
 +u32 vmx_basic_high;
  u32 pin_based_exec_ctrl;
  u32 cpu_based_exec_ctrl;
  u32 cpu_based_2nd_exec_ctrl;
 @@ -752,6 +753,11 @@ static inline bool vm_need_tpr_shadow(struct kvm *kvm)
  return (cpu_has_vmx_tpr_shadow())  (irqchip_in_kernel(kvm));
  }
  
 +static inline bool cpu_has_stringio_exit_info(void)
 +{
 +return vmcs_config.vmx_basic_high  (VMX_BASIC_INOUT  32);
 +}
 +
  static inline bool cpu_has_secondary_exec_ctrls(void)
  {
  return vmcs_config.cpu_based_exec_ctrl 
 @@ -2635,6 +2641,7 @@ static __init int setup_vmcs_config(struct vmcs_config 
 *vmcs_conf)
  vmcs_conf-size = vmx_msr_high  0x1fff;
  vmcs_conf-order = get_order(vmcs_config.size);
  vmcs_conf-revision_id = vmx_msr_low;
 +vmcs_conf-vmx_basic_high = vmx_msr_high;
  
  vmcs_conf-pin_based_exec_ctrl = _pin_based_exec_control;
  vmcs_conf-cpu_based_exec_ctrl = _cpu_based_exec_control;
 @@ -5908,6 +5915,54 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
 kvm_vcpu *vcpu) = {
  static const int kvm_vmx_max_exit_handlers =
  ARRAY_SIZE(kvm_vmx_exit_handlers);
  
 +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
 +   struct vmcs12 *vmcs12)
 +{
 +unsigned long exit_qualification;
 +gpa_t bitmap, last_bitmap;
 +u16 port;
 +int size;
 +u8 b;
 +
 +if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING))
 +return 1;
 +
 +if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
 +return 0;
 +
 +/* TODO: for older CPUs, derive access width from instruction */
 +if (!cpu_has_stringio_exit_info())
 +return 1;
 +
 
 Sigh, actually I am stupid :( The information that is not available
 on older cpus is address size in VM-exit instruction information,
 not operand size on exit qualification, so your v4 is correct. Except
 handling of port wrap around:
 
   If the “use I/O bitmaps” VM-execution control is 1, the instruction
   causes a VM exit if it attempts to access an I/O port corresponding to a
   bit set to 1 in the appropriate I/O bitmap (see Section 24.6.4). If an
   I/O operation “wraps around” the 16-bit I/O-port space (accesses
   ports H and H), the I/O instruction causes a VM exit (the
   “unconditional I/O exiting” VM-execution control is ignored if the
   “use I/O bitmaps” VM-execution control is 1).

Ah, indeed.

OK, let's see if this simple patch manages to reach two-digit revision
numbers...

Jan




signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3 0/5] virtio-scsi multiqueue

2013-02-18 Thread Paolo Bonzini
Il 18/02/2013 10:32, Wanlong Gao ha scritto:
  Ok for 3.9?  It would probably be easier to get it in via Rusty's tree
  because of the prerequisites.  James, can I get your Acked-by?
 I can't apply this series on top of Rusty's virtio-next, I missed something
 or needed rebase them ?

It's on top of the patches at
http://permalink.gmane.org/gmane.linux.kernel.virtualization/18431.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v6] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Jan Kiszka
This prevents trapping L2 I/O exits if L1 has neither unconditional nor
bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
handling.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---

Changes in v6:
 - drop the bogus check of vmx_basix.54 again
 - exit unconditionally on I/O address wrap around

 arch/x86/kvm/vmx.c |   49 +++--
 1 files changed, 47 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..b4ce43c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5908,6 +5908,52 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
kvm_vcpu *vcpu) = {
 static const int kvm_vmx_max_exit_handlers =
ARRAY_SIZE(kvm_vmx_exit_handlers);
 
+static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
+  struct vmcs12 *vmcs12)
+{
+   unsigned long exit_qualification;
+   gpa_t bitmap, last_bitmap;
+   unsigned int port;
+   int size;
+   u8 b;
+
+   if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING))
+   return 1;
+
+   if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
+   return 0;
+
+   exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+
+   port = exit_qualification  16;
+   size = (exit_qualification  7) + 1;
+
+   last_bitmap = (gpa_t)-1;
+   b = -1;
+
+   while (size  0) {
+   if (port  0x8000)
+   bitmap = vmcs12-io_bitmap_a;
+   else if (port  0x1)
+   bitmap = vmcs12-io_bitmap_b;
+   else
+   return 1;
+   bitmap += (port  0x7fff) / 8;
+
+   if (last_bitmap != bitmap)
+   if (kvm_read_guest(vcpu-kvm, bitmap, b, 1))
+   return 1;
+   if (b  (1  (port  7)))
+   return 1;
+
+   port++;
+   size--;
+   last_bitmap = bitmap;
+   }
+
+   return 0;
+}
+
 /*
  * Return 1 if we should exit from L2 to L1 to handle an MSR access access,
  * rather than handle it ourselves in L0. I.e., check whether L1 expressed
@@ -6097,8 +6143,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu *vcpu)
case EXIT_REASON_DR_ACCESS:
return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING);
case EXIT_REASON_IO_INSTRUCTION:
-   /* TODO: support IO bitmaps */
-   return 1;
+   return nested_vmx_exit_handled_io(vcpu, vmcs12);
case EXIT_REASON_MSR_READ:
case EXIT_REASON_MSR_WRITE:
return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
-- 
1.7.3.4
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Gleb Natapov
On Mon, Feb 18, 2013 at 11:21:16AM +0100, Jan Kiszka wrote:
 This prevents trapping L2 I/O exits if L1 has neither unconditional nor
 bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
 handling.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
Reviewed-by: Gleb Natapov g...@redhat.com

 ---
 
 Changes in v6:
  - drop the bogus check of vmx_basix.54 again
  - exit unconditionally on I/O address wrap around
 
  arch/x86/kvm/vmx.c |   49 +++--
  1 files changed, 47 insertions(+), 2 deletions(-)
 
 diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
 index 6667042..b4ce43c 100644
 --- a/arch/x86/kvm/vmx.c
 +++ b/arch/x86/kvm/vmx.c
 @@ -5908,6 +5908,52 @@ static int (*const kvm_vmx_exit_handlers[])(struct 
 kvm_vcpu *vcpu) = {
  static const int kvm_vmx_max_exit_handlers =
   ARRAY_SIZE(kvm_vmx_exit_handlers);
  
 +static bool nested_vmx_exit_handled_io(struct kvm_vcpu *vcpu,
 +struct vmcs12 *vmcs12)
 +{
 + unsigned long exit_qualification;
 + gpa_t bitmap, last_bitmap;
 + unsigned int port;
 + int size;
 + u8 b;
 +
 + if (nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING))
 + return 1;
 +
 + if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
 + return 0;
 +
 + exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
 +
 + port = exit_qualification  16;
 + size = (exit_qualification  7) + 1;
 +
 + last_bitmap = (gpa_t)-1;
 + b = -1;
 +
 + while (size  0) {
 + if (port  0x8000)
 + bitmap = vmcs12-io_bitmap_a;
 + else if (port  0x1)
 + bitmap = vmcs12-io_bitmap_b;
 + else
 + return 1;
 + bitmap += (port  0x7fff) / 8;
 +
 + if (last_bitmap != bitmap)
 + if (kvm_read_guest(vcpu-kvm, bitmap, b, 1))
 + return 1;
 + if (b  (1  (port  7)))
 + return 1;
 +
 + port++;
 + size--;
 + last_bitmap = bitmap;
 + }
 +
 + return 0;
 +}
 +
  /*
   * Return 1 if we should exit from L2 to L1 to handle an MSR access access,
   * rather than handle it ourselves in L0. I.e., check whether L1 expressed
 @@ -6097,8 +6143,7 @@ static bool nested_vmx_exit_handled(struct kvm_vcpu 
 *vcpu)
   case EXIT_REASON_DR_ACCESS:
   return nested_cpu_has(vmcs12, CPU_BASED_MOV_DR_EXITING);
   case EXIT_REASON_IO_INSTRUCTION:
 - /* TODO: support IO bitmaps */
 - return 1;
 + return nested_vmx_exit_handled_io(vcpu, vmcs12);
   case EXIT_REASON_MSR_READ:
   case EXIT_REASON_MSR_WRITE:
   return nested_vmx_exit_handled_msr(vcpu, vmcs12, exit_reason);
 -- 
 1.7.3.4

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip

2013-02-18 Thread Gleb Natapov
Can you tell us why mpic should be in kernel? Is it used often by modern
guests or may be they prefer MSI for interrupt delivery (hmm may be MSIs
are delivered through mpic too)? On x86 we actually would've preferred
to move PIC/IOAPIC form the kernel and leave only LAPIC there (but for
historical reasons creation of PIC/IOAPIC/LAPIC are bundled together)
hence my question.

On Wed, Feb 13, 2013 at 11:49:14PM -0600, Scott Wood wrote:
 Scott Wood (6):
   kvm: add device control API
   kvm/ppc: add a notifier chain for vcpu creation/destruction.
   kvm/ppc/mpic: import hw/openpic.c from QEMU
   kvm/ppc/mpic: remove some obviously unneeded code
   kvm/ppc/mpic: adapt to kernel style and environment
   kvm/ppc/mpic: in-kernel MPIC emulation
 
  Documentation/virtual/kvm/api.txt  |   76 ++
  Documentation/virtual/kvm/devices/README   |1 +
  Documentation/virtual/kvm/devices/mpic.txt |   36 +
  arch/powerpc/include/asm/kvm_host.h|9 +-
  arch/powerpc/include/asm/kvm_ppc.h |4 +
  arch/powerpc/kvm/Kconfig   |5 +
  arch/powerpc/kvm/Makefile  |2 +
  arch/powerpc/kvm/booke.c   |   10 +-
  arch/powerpc/kvm/mpic.c| 1890 
 
  arch/powerpc/kvm/powerpc.c |   31 +-
  include/linux/kvm_host.h   |   44 +-
  include/uapi/linux/kvm.h   |   34 +
  virt/kvm/kvm_main.c|  141 +++
  13 files changed, 2268 insertions(+), 15 deletions(-)
  create mode 100644 Documentation/virtual/kvm/devices/README
  create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
  create mode 100644 arch/powerpc/kvm/mpic.c
 
 -- 
 1.7.9.5
 
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Gleb Natapov
Copying Christoffer since ARM has in kernel irq chip too.

On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
 Currently, devices that are emulated inside KVM are configured in a
 hardcoded manner based on an assumption that any given architecture
 only has one way to do it.  If there's any need to access device state,
 it is done through inflexible one-purpose-only IOCTLs (e.g.
 KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
 cumbersome and depletes a limited numberspace.
 
 This API provides a mechanism to instantiate a device of a certain
 type, returning an ID that can be used to set/get attributes of the
 device.  Attributes may include configuration parameters (e.g.
 register base address), device state, operational commands, etc.  It
 is similar to the ONE_REG API, except that it acts on devices rather
 than vcpus.
You are not only provide different way to create in kernel irq chip you
also use an alternate way to trigger interrupt lines. Before going into
interface specifics lets think about whether it is really worth it? x86
obviously support old way and will have to for some, very long, time.
ARM vGIC code, that is ready to go upstream, uses old way too. So it will
be 2 archs against one. Christoffer do you think the proposed way it
better for your needs. Are you willing to make vGIC use it?

Scott, what other devices are you planning to support with this
interface?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-02-18 Thread Frederic Weisbecker
2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
 In the case of where you have a system that is running in a
 capped or overcommitted environment the user may see steal time
 being reported in accounting tools such as top or vmstat.  This can
 cause confusion for the end user.

Sorry, I'm no expert in this area. But I don't really understand what
is confusing for the end user here.

 To ease the confusion this patch set
 adds the idea of consigned (expected steal) time.  The host will separate
 the consigned time from the steal time.  Tthe steal time will only be altered
 if hard limits (cfs bandwidth control) is used.  The period and the quota used
 to separate the consigned time (expected steal) from the steal time are taken
 from the cfs bandwidth control settings. Any other steal time accruing during
 that period will show as the traditional steal time.

I'm also a bit confused here. steal time will then only account the
cpu time lost due to quotas from cfs bandwidth control? Also what do
you exactly mean by expected steal time? Is it steal time due to
overcommitting minus scheduler quotas?

Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM call agenda for 2013-02-19

2013-02-18 Thread Juan Quintela


Hi

Please send in any agenda topics you are interested in.

Later, Juan.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] qemu vfio-pci: PCIe and generic config space mangling

2013-02-18 Thread Alex Williamson
This series generalizes the places where we already rely on QEMU
config space emulation to make it easier to add various other fields.
It turns out that when we start running on Q35 Windows cares quite
a bit about the PCIe type we expose.  Drivers will fail to load if
not set to something valid and appropriate for the bus.  We don't
have anything in place yet to tell what the bus is, so for now we
rely on a temporary option that will eventually get replaced by
something automatic.  Thanks,

Alex

---

Alex Williamson (2):
  vfio-pci: Generalize PCI config mangling
  vfio-pci: Add PCIe capability mangling based on bus type


 hw/vfio_pci.c |  210 ++---
 1 file changed, 171 insertions(+), 39 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] vfio-pci: Generalize PCI config mangling

2013-02-18 Thread Alex Williamson
Kernel-side vfio virtualizes all of config space, but some parts are
unique to Qemu.  For instance we may or may not expose the ROM BAR,
Qemu manages MSI/MSIX, and Qemu manages the multi-function bit so that
single function devices can appear as multi-function and vica versa.
Generalize this into a bitmap of Qemu emulated bits.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |   80 ++---
 1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 66537b7..53b23f3 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -117,6 +117,7 @@ typedef struct VFIODevice {
 int fd;
 VFIOINTx intx;
 unsigned int config_size;
+uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
 off_t config_offset; /* Offset of config space region within device fd */
 unsigned int rom_size;
 off_t rom_offset; /* Offset of ROM region within device fd */
@@ -963,44 +964,29 @@ static const MemoryRegionOps vfio_bar_ops = {
 static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
 VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
-uint32_t val = 0;
+uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
 
-/*
- * We only need QEMU PCI config support for the ROM BAR, the MSI and MSIX
- * capabilities, and the multifunction bit below.  We let VFIO handle
- * virtualizing everything else.  Performance is not a concern here.
- */
-if (ranges_overlap(addr, len, PCI_ROM_ADDRESS, 4) ||
-(pdev-cap_present  QEMU_PCI_CAP_MSIX 
- ranges_overlap(addr, len, pdev-msix_cap, MSIX_CAP_LENGTH)) ||
-(pdev-cap_present  QEMU_PCI_CAP_MSI 
- ranges_overlap(addr, len, pdev-msi_cap, vdev-msi_cap_size))) {
+memcpy(emu_bits, vdev-emulated_config_bits + addr, len);
+emu_bits = le32_to_cpu(emu_bits);
 
-val = pci_default_read_config(pdev, addr, len);
-} else {
-if (pread(vdev-fd, val, len, vdev-config_offset + addr) != len) {
+if (emu_bits) {
+emu_val = pci_default_read_config(pdev, addr, len);
+}
+
+if (~emu_bits  (0xU  (32 - len * 8))) {
+ssize_t ret;
+
+ret = pread(vdev-fd, phys_val, len, vdev-config_offset + addr);
+if (ret != len) {
 error_report(%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %m\n,
  __func__, vdev-host.domain, vdev-host.bus,
  vdev-host.slot, vdev-host.function, addr, len);
 return -errno;
 }
-val = le32_to_cpu(val);
+phys_val = le32_to_cpu(phys_val);
 }
 
-/* Multifunction bit is virualized in QEMU */
-if (unlikely(ranges_overlap(addr, len, PCI_HEADER_TYPE, 1))) {
-uint32_t mask = PCI_HEADER_TYPE_MULTI_FUNCTION;
-
-if (len == 4) {
-mask = 16;
-}
-
-if (pdev-cap_present  QEMU_PCI_CAP_MULTIFUNCTION) {
-val |= mask;
-} else {
-val = ~mask;
-}
-}
+val = (emu_val  emu_bits) | (phys_val  ~emu_bits);
 
 DPRINTF(%s(%04x:%02x:%02x.%x, @0x%x, len=0x%x) %x\n, __func__,
 vdev-host.domain, vdev-host.bus, vdev-host.slot,
@@ -1026,12 +1012,6 @@ static void vfio_pci_write_config(PCIDevice *pdev, 
uint32_t addr,
  vdev-host.slot, vdev-host.function, addr, val, len);
 }
 
-/* Write standard header bits to emulation */
-if (addr  PCI_CONFIG_HEADER_SIZE) {
-pci_default_write_config(pdev, addr, val, len);
-return;
-}
-
 /* MSI/MSI-X Enabling/Disabling */
 if (pdev-cap_present  QEMU_PCI_CAP_MSI 
 ranges_overlap(addr, len, pdev-msi_cap, vdev-msi_cap_size)) {
@@ -1046,9 +1026,7 @@ static void vfio_pci_write_config(PCIDevice *pdev, 
uint32_t addr,
 } else if (was_enabled  !is_enabled) {
 vfio_disable_msi(vdev);
 }
-}
-
-if (pdev-cap_present  QEMU_PCI_CAP_MSIX 
+} else if (pdev-cap_present  QEMU_PCI_CAP_MSIX 
 ranges_overlap(addr, len, pdev-msix_cap, MSIX_CAP_LENGTH)) {
 int is_enabled, was_enabled = msix_enabled(pdev);
 
@@ -1061,6 +1039,9 @@ static void vfio_pci_write_config(PCIDevice *pdev, 
uint32_t addr,
 } else if (was_enabled  !is_enabled) {
 vfio_disable_msix(vdev);
 }
+} else {
+/* Write everything to QEMU to keep emulated bits correct */
+pci_default_write_config(pdev, addr, val, len);
 }
 }
 
@@ -2003,6 +1984,16 @@ static int vfio_initfn(PCIDevice *pdev)
 goto out_put;
 }
 
+/* vfio emulates a lot for us, but some bits need extra love */
+vdev-emulated_config_bits = g_malloc0(vdev-config_size);
+
+/* QEMU can choose to expose the ROM or not */
+memset(vdev-emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
+
+/* QEMU can change multi-function devices to single function, or reverse */
+

[PATCH 2/2] vfio-pci: Add PCIe capability mangling based on bus type

2013-02-18 Thread Alex Williamson
Windows seems to pay particular interest to the PCIe header type of
devices and will fail to load drivers if we attached Endpoint devices
or Legacy Endpoint devices to the Root Complex.  We don't yet have a
good way to determine the bus type, so for now we add an experimental
x-bustype option which will later be replaced by some mechanism to
determine this automatcally.  The new option is defined as:

x-bustype=n where n is one of:
0: Legacy PCI [default]
1: PCI Express
2: PCI Express Root Complex

Conversion of PCIe types is does as follows:

* Legacy PCI
  * No change, capability is unmodified for compatibility.
* PCI Express
  * Integrated Root Complex Endpoint - Endpoint
* PCI Express Root Complext
  * Endpoint - Integrated Root Complex Endpoint
  * Legacy Endpoint - none, capability hidden

We also take this opportunity to explicitly limit supported devices
to Endpoints, Legacy Endpoints, and Root Complex Integrated Endpoints.
We don't currently have support for other types and users often cause
themselves problems by assigning them.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |  130 +
 1 file changed, 129 insertions(+), 1 deletion(-)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 53b23f3..7d6468b 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -130,6 +130,7 @@ typedef struct VFIODevice {
 PCIHostDeviceAddress host;
 QLIST_ENTRY(VFIODevice) next;
 struct VFIOGroup *group;
+uint8_t bustype;
 bool reset_works;
 } VFIODevice;
 
@@ -1506,6 +1507,123 @@ static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, 
uint8_t pos)
 return next - pos;
 }
 
+static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
+{
+pci_set_word(buf, (pci_get_word(buf)  ~mask) | val);
+}
+
+static void vfio_add_emulated_word(VFIODevice *vdev, int pos,
+   uint16_t val, uint16_t mask)
+{
+vfio_set_word_bits(vdev-pdev.config + pos, val, mask);
+vfio_set_word_bits(vdev-pdev.wmask + pos, ~mask, mask);
+vfio_set_word_bits(vdev-emulated_config_bits + pos, mask, mask);
+}
+
+static void vfio_set_long_bits(uint8_t *buf, uint32_t val, uint32_t mask)
+{
+pci_set_long(buf, (pci_get_long(buf)  ~mask) | val);
+}
+
+static void vfio_add_emulated_long(VFIODevice *vdev, int pos,
+   uint32_t val, uint32_t mask)
+{
+vfio_set_long_bits(vdev-pdev.config + pos, val, mask);
+vfio_set_long_bits(vdev-pdev.wmask + pos, ~mask, mask);
+vfio_set_long_bits(vdev-emulated_config_bits + pos, mask, mask);
+}
+
+static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size)
+{
+uint16_t flags;
+uint8_t type;
+enum {
+VFIO_BUS_TYPE_PCI,
+VFIO_BUS_TYPE_PCIE,
+VFIO_BUS_TYPE_PCIE_RC,
+};
+
+flags = pci_get_word(vdev-pdev.config + pos + PCI_CAP_FLAGS);
+type = (flags  PCI_EXP_FLAGS_TYPE)  4;
+
+switch (type) {
+case PCI_EXP_TYPE_ENDPOINT:
+case PCI_EXP_TYPE_LEG_END:
+case PCI_EXP_TYPE_RC_END:
+break;
+default:
+error_report(vfio: Assignment of PCIe type 0x%x devices is not 
+ currently supported\n, type);
+return -EINVAL;
+}
+
+if (vdev-bustype  VFIO_BUS_TYPE_PCIE_RC) {
+error_report(vfio: Unknown x-bustype %d.  Accepted values:\n
+ \t%d: Legacy PCI [default]\n
+ \t%d: PCI Express\n
+ \t%d: PCI Express Root-Complex\n, vdev-bustype,
+ VFIO_BUS_TYPE_PCI, VFIO_BUS_TYPE_PCIE,
+ VFIO_BUS_TYPE_PCIE_RC);
+return -EINVAL;
+}
+
+switch (vdev-bustype) {
+case VFIO_BUS_TYPE_PCI:
+/*
+ * Use express capability as-is on PCI bus.  It doesn't make much
+ * sense to even expose, but some drivers (ex. tg3) depend on it
+ * and guests don't seem to be particular about it.  We'll need
+ * to revist this if we ever expose an IOMMU to the guest.
+ */
+return pci_add_capability(vdev-pdev, PCI_CAP_ID_EXP, pos, size);
+
+case VFIO_BUS_TYPE_PCIE:
+if (type == PCI_EXP_TYPE_RC_END) {
+/* Type becomes non-Integrated Endpoint */
+vfio_add_emulated_word(vdev, pos + PCI_CAP_FLAGS,
+   PCI_EXP_TYPE_ENDPOINT  4,
+   PCI_EXP_FLAGS_TYPE);
+/* XXX Implement LNKCAP fields? */
+}
+
+return pci_add_capability(vdev-pdev, PCI_CAP_ID_EXP, pos, size);
+
+case VFIO_BUS_TYPE_PCIE_RC:
+switch (type) {
+case PCI_EXP_TYPE_ENDPOINT:
+/* Type becomes Integrated Endpoint */
+vfio_add_emulated_word(vdev, pos + PCI_CAP_FLAGS,
+   PCI_EXP_TYPE_RC_END  4,
+   PCI_EXP_FLAGS_TYPE);
+
+/* Link Capabilities, Status, and 

[PATCH v2 0/2] qemu vfio-pci: PCIe and generic config space mangling

2013-02-18 Thread Alex Williamson
v2: Removing trailing \n from error_report calls

This series generalizes the places where we already rely on QEMU
config space emulation to make it easier to add various other fields.
It turns out that when we start running on Q35 Windows cares quite
a bit about the PCIe type we expose.  Drivers will fail to load if
not set to something valid and appropriate for the bus.  We don't
have anything in place yet to tell what the bus is, so for now we
rely on a temporary option that will eventually get replaced by
something automatic.  Thanks,

Alex

---

Alex Williamson (2):
  vfio-pci: Generalize PCI config mangling
  vfio-pci: Add PCIe capability mangling based on bus type


 hw/vfio_pci.c |  209 ++---
 1 file changed, 170 insertions(+), 39 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/2] vfio-pci: Generalize PCI config mangling

2013-02-18 Thread Alex Williamson
Kernel-side vfio virtualizes all of config space, but some parts are
unique to Qemu.  For instance we may or may not expose the ROM BAR,
Qemu manages MSI/MSIX, and Qemu manages the multi-function bit so that
single function devices can appear as multi-function and vica versa.
Generalize this into a bitmap of Qemu emulated bits.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |   80 ++---
 1 file changed, 42 insertions(+), 38 deletions(-)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index ad9ae36..2106b87 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -117,6 +117,7 @@ typedef struct VFIODevice {
 int fd;
 VFIOINTx intx;
 unsigned int config_size;
+uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */
 off_t config_offset; /* Offset of config space region within device fd */
 unsigned int rom_size;
 off_t rom_offset; /* Offset of ROM region within device fd */
@@ -963,44 +964,29 @@ static const MemoryRegionOps vfio_bar_ops = {
 static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
 {
 VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
-uint32_t val = 0;
+uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
 
-/*
- * We only need QEMU PCI config support for the ROM BAR, the MSI and MSIX
- * capabilities, and the multifunction bit below.  We let VFIO handle
- * virtualizing everything else.  Performance is not a concern here.
- */
-if (ranges_overlap(addr, len, PCI_ROM_ADDRESS, 4) ||
-(pdev-cap_present  QEMU_PCI_CAP_MSIX 
- ranges_overlap(addr, len, pdev-msix_cap, MSIX_CAP_LENGTH)) ||
-(pdev-cap_present  QEMU_PCI_CAP_MSI 
- ranges_overlap(addr, len, pdev-msi_cap, vdev-msi_cap_size))) {
+memcpy(emu_bits, vdev-emulated_config_bits + addr, len);
+emu_bits = le32_to_cpu(emu_bits);
 
-val = pci_default_read_config(pdev, addr, len);
-} else {
-if (pread(vdev-fd, val, len, vdev-config_offset + addr) != len) {
+if (emu_bits) {
+emu_val = pci_default_read_config(pdev, addr, len);
+}
+
+if (~emu_bits  (0xU  (32 - len * 8))) {
+ssize_t ret;
+
+ret = pread(vdev-fd, phys_val, len, vdev-config_offset + addr);
+if (ret != len) {
 error_report(%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %m,
  __func__, vdev-host.domain, vdev-host.bus,
  vdev-host.slot, vdev-host.function, addr, len);
 return -errno;
 }
-val = le32_to_cpu(val);
+phys_val = le32_to_cpu(phys_val);
 }
 
-/* Multifunction bit is virualized in QEMU */
-if (unlikely(ranges_overlap(addr, len, PCI_HEADER_TYPE, 1))) {
-uint32_t mask = PCI_HEADER_TYPE_MULTI_FUNCTION;
-
-if (len == 4) {
-mask = 16;
-}
-
-if (pdev-cap_present  QEMU_PCI_CAP_MULTIFUNCTION) {
-val |= mask;
-} else {
-val = ~mask;
-}
-}
+val = (emu_val  emu_bits) | (phys_val  ~emu_bits);
 
 DPRINTF(%s(%04x:%02x:%02x.%x, @0x%x, len=0x%x) %x\n, __func__,
 vdev-host.domain, vdev-host.bus, vdev-host.slot,
@@ -1026,12 +1012,6 @@ static void vfio_pci_write_config(PCIDevice *pdev, 
uint32_t addr,
  vdev-host.slot, vdev-host.function, addr, val, len);
 }
 
-/* Write standard header bits to emulation */
-if (addr  PCI_CONFIG_HEADER_SIZE) {
-pci_default_write_config(pdev, addr, val, len);
-return;
-}
-
 /* MSI/MSI-X Enabling/Disabling */
 if (pdev-cap_present  QEMU_PCI_CAP_MSI 
 ranges_overlap(addr, len, pdev-msi_cap, vdev-msi_cap_size)) {
@@ -1046,9 +1026,7 @@ static void vfio_pci_write_config(PCIDevice *pdev, 
uint32_t addr,
 } else if (was_enabled  !is_enabled) {
 vfio_disable_msi(vdev);
 }
-}
-
-if (pdev-cap_present  QEMU_PCI_CAP_MSIX 
+} else if (pdev-cap_present  QEMU_PCI_CAP_MSIX 
 ranges_overlap(addr, len, pdev-msix_cap, MSIX_CAP_LENGTH)) {
 int is_enabled, was_enabled = msix_enabled(pdev);
 
@@ -1061,6 +1039,9 @@ static void vfio_pci_write_config(PCIDevice *pdev, 
uint32_t addr,
 } else if (was_enabled  !is_enabled) {
 vfio_disable_msix(vdev);
 }
+} else {
+/* Write everything to QEMU to keep emulated bits correct */
+pci_default_write_config(pdev, addr, val, len);
 }
 }
 
@@ -2003,6 +1984,16 @@ static int vfio_initfn(PCIDevice *pdev)
 goto out_put;
 }
 
+/* vfio emulates a lot for us, but some bits need extra love */
+vdev-emulated_config_bits = g_malloc0(vdev-config_size);
+
+/* QEMU can choose to expose the ROM or not */
+memset(vdev-emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
+
+/* QEMU can change multi-function devices to single function, or reverse */
+

[PATCH v2 2/2] vfio-pci: Add PCIe capability mangling based on bus type

2013-02-18 Thread Alex Williamson
Windows seems to pay particular interest to the PCIe header type of
devices and will fail to load drivers if we attached Endpoint devices
or Legacy Endpoint devices to the Root Complex.  We don't yet have a
good way to determine the bus type, so for now we add an experimental
x-bustype option which will later be replaced by some mechanism to
determine this automatcally.  The new option is defined as:

x-bustype=n where n is one of:
0: Legacy PCI [default]
1: PCI Express
2: PCI Express Root Complex

Conversion of PCIe types is does as follows:

* Legacy PCI
  * No change, capability is unmodified for compatibility.
* PCI Express
  * Integrated Root Complex Endpoint - Endpoint
* PCI Express Root Complext
  * Endpoint - Integrated Root Complex Endpoint
  * Legacy Endpoint - none, capability hidden

We also take this opportunity to explicitly limit supported devices
to Endpoints, Legacy Endpoints, and Root Complex Integrated Endpoints.
We don't currently have support for other types and users often cause
themselves problems by assigning them.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |  129 +
 1 file changed, 128 insertions(+), 1 deletion(-)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 2106b87..330a687 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -130,6 +130,7 @@ typedef struct VFIODevice {
 PCIHostDeviceAddress host;
 QLIST_ENTRY(VFIODevice) next;
 struct VFIOGroup *group;
+uint8_t bustype;
 bool reset_works;
 } VFIODevice;
 
@@ -1506,6 +1507,122 @@ static uint8_t vfio_std_cap_max_size(PCIDevice *pdev, 
uint8_t pos)
 return next - pos;
 }
 
+static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask)
+{
+pci_set_word(buf, (pci_get_word(buf)  ~mask) | val);
+}
+
+static void vfio_add_emulated_word(VFIODevice *vdev, int pos,
+   uint16_t val, uint16_t mask)
+{
+vfio_set_word_bits(vdev-pdev.config + pos, val, mask);
+vfio_set_word_bits(vdev-pdev.wmask + pos, ~mask, mask);
+vfio_set_word_bits(vdev-emulated_config_bits + pos, mask, mask);
+}
+
+static void vfio_set_long_bits(uint8_t *buf, uint32_t val, uint32_t mask)
+{
+pci_set_long(buf, (pci_get_long(buf)  ~mask) | val);
+}
+
+static void vfio_add_emulated_long(VFIODevice *vdev, int pos,
+   uint32_t val, uint32_t mask)
+{
+vfio_set_long_bits(vdev-pdev.config + pos, val, mask);
+vfio_set_long_bits(vdev-pdev.wmask + pos, ~mask, mask);
+vfio_set_long_bits(vdev-emulated_config_bits + pos, mask, mask);
+}
+
+static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size)
+{
+uint16_t flags;
+uint8_t type;
+enum {
+VFIO_BUS_TYPE_PCI,
+VFIO_BUS_TYPE_PCIE,
+VFIO_BUS_TYPE_PCIE_RC,
+};
+
+flags = pci_get_word(vdev-pdev.config + pos + PCI_CAP_FLAGS);
+type = (flags  PCI_EXP_FLAGS_TYPE)  4;
+
+switch (type) {
+case PCI_EXP_TYPE_ENDPOINT:
+case PCI_EXP_TYPE_LEG_END:
+case PCI_EXP_TYPE_RC_END:
+break;
+default:
+error_report(vfio: Assignment of PCIe type 0x%x 
+ devices is not currently supported, type);
+return -EINVAL;
+}
+
+if (vdev-bustype  VFIO_BUS_TYPE_PCIE_RC) {
+error_report(vfio: Unknown bustype %d.  Accepted values: 
+ %d (Legacy PCI [default]), %d (PCI Express), 
+ %d (PCI Express Root-Complex), vdev-bustype,
+ VFIO_BUS_TYPE_PCI, VFIO_BUS_TYPE_PCIE,
+ VFIO_BUS_TYPE_PCIE_RC);
+return -EINVAL;
+}
+
+switch (vdev-bustype) {
+case VFIO_BUS_TYPE_PCI:
+/*
+ * Use express capability as-is on PCI bus.  It doesn't make much
+ * sense to even expose, but some drivers (ex. tg3) depend on it
+ * and guests don't seem to be particular about it.  We'll need
+ * to revist this if we ever expose an IOMMU to the guest.
+ */
+return pci_add_capability(vdev-pdev, PCI_CAP_ID_EXP, pos, size);
+
+case VFIO_BUS_TYPE_PCIE:
+if (type == PCI_EXP_TYPE_RC_END) {
+/* Type becomes non-Integrated Endpoint */
+vfio_add_emulated_word(vdev, pos + PCI_CAP_FLAGS,
+   PCI_EXP_TYPE_ENDPOINT  4,
+   PCI_EXP_FLAGS_TYPE);
+/* XXX Implement LNKCAP fields? */
+}
+
+return pci_add_capability(vdev-pdev, PCI_CAP_ID_EXP, pos, size);
+
+case VFIO_BUS_TYPE_PCIE_RC:
+switch (type) {
+case PCI_EXP_TYPE_ENDPOINT:
+/* Type becomes Integrated Endpoint */
+vfio_add_emulated_word(vdev, pos + PCI_CAP_FLAGS,
+   PCI_EXP_TYPE_RC_END  4,
+   PCI_EXP_FLAGS_TYPE);
+
+/* Link Capabilities, Status, and Control goes away */
+if 

[PATCH 0/2] qemu vfio-pci: VGA passthrough support

2013-02-18 Thread Alex Williamson
I've pushed kernel side vfio vga support to my next branch and intend
to try to get those in for kernel v3.9.  Those changes can be found
here: git://github.com/awilliam/linux-vfio.git (next)

This is the matching QEMU updates.  Patch 1/2 adds basic VGA range
support.  This can be used with the option -vga none to disable
emulated VGA and also requires vfio-pci device option x-vga=on.  As
noted by the x- prefix, this support is entirely experimental.
Expectations should range anywhere from working perfectly to crashing
the host.

Patch 2/2 contains numerous quirks for devices that I have onhand to
test, but hopefully covers a faily broad spectrum (at least for
ATI/AMD and Nvidia).  These quirks trap various device specific
addresses and insert device virtual addresses in place of device
physical addresses.  If we could identity map devices we could avoid
these, but that also puts numerous constraints on our configuration.

The vfio kernel VGA support makes use of the VGA arbiter to switch
between host VGA devices and I have been successful in running
multiple devices simultaneously.  At this point only secondary
devices have been tested.  The primary graphics likely needs extra
work to completely turn off host kernel access.  Intel integrated
graphics have not been tested and are most certainly broken as
they require access to devices other than the VGA device alone.
Nvidia devices will work with nouveau in Linux guests, but do not
work with Nvidia drivers in Windows (probably Linux too).  I think
we need to put these behind emulated PCIe root ports before they'll
work, but Q35 doesn't work well enough for that yet.  ATI/AMD cards
seem to have the best hope of working with their Catalyst drivers.

For best results, use Q35, -vga none, x-vga=on, and x-bustype=2,
for example:

 -M q35 -vga none -device vfio-pci,host=x:xx.x,x-vga=on,x-bustype=2

All the qemu patches necessary for this can be found in this branch:

git://github.com/awilliam/qemu-vfio.git (vfio-vga-v3)

This series will depend on a kernel header update which I'll send
patches for once the kernel patches are upstream.  Thanks,

Alex

---

Alex Williamson (2):
  qemu vfio-pci: Add support for VGA MMIO and I/O port access
  qemu vfio-pci: Graphics device quirks


 hw/vfio_pci.c |  651 +
 1 file changed, 649 insertions(+), 2 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] qemu vfio-pci: Add support for VGA MMIO and I/O port access

2013-02-18 Thread Alex Williamson
Most VGA cards need some kind of quirk to fully operate since they
hide backdoors to get to other registers outside of PCI config space
within the registers, but this provides the base infrastructure.  If
we could identity map PCI resources for assigned devices we would need
a lot fewer quirks.

To enable this, use a kernel side vfio-pci driver that incorporates
VGA support (patches posted), and use the -vga none option and add the
x-vga=on option for the vfio-pci device.  The x- denotes this as an
experimental feature.  You may also need to use a cached copy of the
VGA BIOS for your device, passing it to vfio-pci using the romfile=
option.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |  173 +
 1 file changed, 173 insertions(+)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 330a687..307cbc1 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -59,6 +59,18 @@ typedef struct VFIOBAR {
 uint8_t nr; /* cache the BAR number for debug */
 } VFIOBAR;
 
+typedef struct VFIOVGARegion {
+MemoryRegion mem;
+off_t offset;
+int nr;
+} VFIOVGARegion;
+
+typedef struct VFIOVGA {
+off_t fd_offset;
+int fd;
+VFIOVGARegion region[3];
+} VFIOVGA;
+
 typedef struct VFIOINTx {
 bool pending; /* interrupt pending */
 bool kvm_accel; /* set when QEMU bypass through KVM enabled */
@@ -127,11 +139,16 @@ typedef struct VFIODevice {
 int nr_vectors; /* Number of MSI/MSIX vectors currently in use */
 int interrupt; /* Current interrupt type */
 VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */
+VFIOVGA vga; /* 0xa, 0x3b0, 0x3c0 */
 PCIHostDeviceAddress host;
 QLIST_ENTRY(VFIODevice) next;
 struct VFIOGroup *group;
+uint32_t features;
+#define VFIO_FEATURE_ENABLE_VGA_BIT 0
+#define VFIO_FEATURE_ENABLE_VGA (1  VFIO_FEATURE_ENABLE_VGA_BIT)
 uint8_t bustype;
 bool reset_works;
+bool has_vga;
 } VFIODevice;
 
 typedef struct VFIOGroup {
@@ -959,6 +976,89 @@ static const MemoryRegionOps vfio_bar_ops = {
 .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
+static void vfio_vga_write(void *opaque, hwaddr addr,
+   uint64_t data, unsigned size)
+{
+VFIOVGARegion *region = opaque;
+VFIOVGA *vga = container_of(region, VFIOVGA, region[region-nr]);
+union {
+uint8_t byte;
+uint16_t word;
+uint32_t dword;
+uint64_t qword;
+} buf;
+off_t offset = vga-fd_offset + region-offset + addr;
+
+switch (size) {
+case 1:
+buf.byte = data;
+break;
+case 2:
+buf.word = cpu_to_le16(data);
+break;
+case 4:
+buf.dword = cpu_to_le32(data);
+break;
+default:
+hw_error(vfio: unsupported write size, %d bytes\n, size);
+break;
+}
+
+if (pwrite(vga-fd, buf, size, offset) != size) {
+error_report(%s(,0x%HWADDR_PRIx, 0x%PRIx64, %d) failed: %m,
+ __func__, region-offset + addr, data, size);
+}
+
+DPRINTF(%s(0x%HWADDR_PRIx, 0x%PRIx64, %d)\n,
+__func__, region-offset + addr, data, size);
+}
+
+static uint64_t vfio_vga_read(void *opaque, hwaddr addr, unsigned size)
+{
+VFIOVGARegion *region = opaque;
+VFIOVGA *vga = container_of(region, VFIOVGA, region[region-nr]);
+union {
+uint8_t byte;
+uint16_t word;
+uint32_t dword;
+uint64_t qword;
+} buf;
+uint64_t data = 0;
+off_t offset = vga-fd_offset + region-offset + addr;
+
+if (pread(vga-fd, buf, size, offset) != size) {
+error_report(%s(,0x%HWADDR_PRIx, %d) failed: %m,
+ __func__, region-offset + addr, size);
+return (uint64_t)-1;
+}
+
+switch (size) {
+case 1:
+data = buf.byte;
+break;
+case 2:
+data = le16_to_cpu(buf.word);
+break;
+case 4:
+data = le32_to_cpu(buf.dword);
+break;
+default:
+hw_error(vfio: unsupported read size, %d bytes\n, size);
+break;
+}
+
+DPRINTF(%s(0x%HWADDR_PRIx, %d) = 0x%PRIx64\n,
+__func__, region-offset + addr, size, data);
+
+return data;
+}
+
+static const MemoryRegionOps vfio_vga_ops = {
+.read = vfio_vga_read,
+.write = vfio_vga_write,
+.endianness = DEVICE_LITTLE_ENDIAN,
+};
+
 /*
  * PCI config space
  */
@@ -1479,6 +1579,27 @@ static void vfio_map_bars(VFIODevice *vdev)
 for (i = 0; i  PCI_ROM_SLOT; i++) {
 vfio_map_bar(vdev, i);
 }
+
+if (vdev-has_vga) {
+memory_region_init_io(vdev-vga.region[0].mem, vfio_vga_ops,
+  vdev-vga.region[0], vfio-vga-mmio@0xa,
+  0xb - 0xa + 1);
+memory_region_add_subregion_overlap(pci_address_space(vdev-pdev),
+0xa, vdev-vga.region[0].mem,
+1);
+
+

[PATCH 2/2] qemu vfio-pci: Graphics device quirks

2013-02-18 Thread Alex Williamson
Apparently graphics vendors need to come up with new ways to retrieve
PCI BAR addresses on every revision of their chip.  These are the ones
that I've found on the following assortment of cards:

Advanced Micro Devices [AMD] nee ATI Cedar PRO [Radeon HD 5450/6350]
Advanced Micro Devices [AMD] nee ATI RV370 [Radeon X550]
NVIDIA Corporation G98 [GeForce 8400 GS]
NVIDIA Corporation G86 [Quadro NVS 290]
NVIDIA Corporation G72 [GeForce 7300 LE]

With these quirks, most ATI/AMD and Nvidia cards will post and work
with standard VGA drivers.  ATI/AMD devices may also work with the
Catalyst driver (try x-bustype=2).  I suspect Nvidia accelerated
drivers will require working root ports, they currently report error
code 43 under Windows.

It's relatively easy to figure out many of these quirks.  First enable
DEBUG_UNASSIGNED in exec.c, then enable DEBUG_VFIO in hw/vfio_pci.c.
Log the output and can kill Qemu when Unassigned access errors start
to spew (ignore the ones at very low offsets).  If the unassigned
access matches a range covered by the device (consult lspci or
/proc/iomem), then look back in the vfio-pci debug output for read
from the device that returned an address or partial address matching
the unassigned access.  Then follow these examples for creating quirks
to trap that access and return the emulated BAR address.

None of these would be necessary if we could identity map devices.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |  478 +
 1 file changed, 476 insertions(+), 2 deletions(-)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 307cbc1..769ab05 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -48,6 +48,16 @@
 do { } while (0)
 #endif
 
+struct VFIODevice;
+
+typedef struct VFIOQuirk {
+MemoryRegion mem;
+struct VFIODevice *vdev;
+QLIST_ENTRY(VFIOQuirk) next;
+uint32_t data;
+uint32_t data2;
+} VFIOQuirk;
+
 typedef struct VFIOBAR {
 off_t fd_offset; /* offset of BAR within device fd */
 int fd; /* device fd, allows us to pass VFIOBAR as opaque data */
@@ -57,12 +67,14 @@ typedef struct VFIOBAR {
 size_t size;
 uint32_t flags; /* VFIO region flags (rd/wr/mmap) */
 uint8_t nr; /* cache the BAR number for debug */
+QLIST_HEAD(, VFIOQuirk) quirks;
 } VFIOBAR;
 
 typedef struct VFIOVGARegion {
 MemoryRegion mem;
 off_t offset;
 int nr;
+QLIST_HEAD(, VFIOQuirk) quirks;
 } VFIOVGARegion;
 
 typedef struct VFIOVGA {
@@ -82,8 +94,6 @@ typedef struct VFIOINTx {
 QEMUTimer *mmap_timer; /* enable mmaps after periods w/o interrupts */
 } VFIOINTx;
 
-struct VFIODevice;
-
 typedef struct VFIOMSIVector {
 EventNotifier interrupt; /* eventfd triggered on interrupt */
 struct VFIODevice *vdev; /* back pointer to device */
@@ -1060,6 +1070,460 @@ static const MemoryRegionOps vfio_vga_ops = {
 };
 
 /*
+ * Device specific quirks
+ */
+
+/*
+ * Device 1002:68f9 (Advanced Micro Devices [AMD] nee ATI Cedar PRO [Radeon
+ * HD 5450/6350]) reports the upper byte of the physical address of the
+ * I/O port BAR4 through VGA register 0x3c3.  The BAR is 256 bytes, so the
+ * lower byte is known to be zero.  Test for this quirk on all ATI/AMD
+ * devices.
+ */
+static uint64_t vfio_ati_3c3_quirk_read(void *opaque,
+hwaddr addr, unsigned size)
+{
+VFIOQuirk *quirk = opaque;
+VFIODevice *vdev = quirk-vdev;
+PCIDevice *pdev = vdev-pdev;
+uint64_t data = vfio_vga_read(vdev-vga.region[2], addr + 0x3, size);
+
+if (data == quirk-data) {
+data = pci_get_byte(pdev-config + PCI_BASE_ADDRESS_4 + 1);
+DPRINTF(%s(0x3c3, 1) = 0x%PRIx64\n, __func__, data);
+}
+
+return data;
+}
+
+static const MemoryRegionOps vfio_ati_3c3_quirk = {
+.read = vfio_ati_3c3_quirk_read,
+.endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+/*
+ * Some devices seem to read 0xff from 0x3c3 during init so probing
+ * is unreliable.  Save the physical address and return the virtual
+ * address any time the read matches the physical address.
+ */
+static void vfio_vga_probe_ati_3c3_quirk(VFIODevice *vdev)
+{
+PCIDevice *pdev = vdev-pdev;
+off_t physoffset = vdev-config_offset + PCI_BASE_ADDRESS_4;
+uint32_t physbar;
+VFIOQuirk *quirk;
+
+if (pci_get_word(pdev-config + PCI_VENDOR_ID) != 0x1002 ||
+vdev-bars[4].size  256) {
+return;
+}
+
+/* Get I/O port BAR physical address */
+if (pread(vdev-fd, physbar, 4, physoffset) != 4) {
+error_report(vfio: probe failed for ATI/AMD 0x3c3 quirk on device 
+ %04x:%02x:%02x.%x, vdev-host.domain,
+ vdev-host.bus, vdev-host.slot, vdev-host.function);
+return;
+}
+
+quirk = g_malloc0(sizeof(*quirk));
+quirk-vdev = vdev;
+quirk-data = (physbar  8)  0xff;
+
+memory_region_init_io(quirk-mem, vfio_ati_3c3_quirk, quirk,
+  vfio-ati-3c3-quirk, 1);
+  

[PATCH] qemu vfio-pci: Add extra debugging

2013-02-18 Thread Alex Williamson
Often when debugging it's useful to be able to disable bypass paths
so no interactions with the device are missed.  Add some extra debug
options to do this.  Also add device info on read/write BAR accesses,
which is useful when debugging more than one assigned device.  A
couple DPRINTFs also had redundant vfio: prefixes.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |   40 ++--
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 769ab05..7aa3d88 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -48,6 +48,10 @@
 do { } while (0)
 #endif
 
+/* Extra debugging, trap acceleration paths for more logging */
+#define VFIO_ALLOW_MMAP 1
+#define VFIO_ALLOW_KVM_INTX 1
+
 struct VFIODevice;
 
 typedef struct VFIOQuirk {
@@ -304,7 +308,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev)
 int ret, argsz;
 int32_t *pfd;
 
-if (!kvm_irqfds_enabled() ||
+if (!VFIO_ALLOW_KVM_INTX || !kvm_irqfds_enabled() ||
 vdev-intx.route.mode != PCI_INTX_ENABLED ||
 !kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
 return;
@@ -924,8 +928,16 @@ static void vfio_bar_write(void *opaque, hwaddr addr,
  __func__, addr, data, size);
 }
 
-DPRINTF(%s(BAR%d+0x%HWADDR_PRIx, 0x%PRIx64, %d)\n,
-__func__, bar-nr, addr, data, size);
+#ifdef DEBUG_VFIO
+{
+VFIODevice *vdev = container_of(bar, VFIODevice, bars[bar-nr]);
+
+DPRINTF(%s(%04x:%02x:%02x.%x:BAR%d+0x%HWADDR_PRIx, 0x%PRIx64
+, %d)\n, __func__, vdev-host.domain, vdev-host.bus,
+vdev-host.slot, vdev-host.function, bar-nr, addr,
+data, size);
+}
+#endif
 
 /*
  * A read or write to a BAR always signals an INTx EOI.  This will
@@ -971,8 +983,16 @@ static uint64_t vfio_bar_read(void *opaque,
 break;
 }
 
-DPRINTF(%s(BAR%d+0x%HWADDR_PRIx, %d) = 0x%PRIx64\n,
-__func__, bar-nr, addr, size, data);
+#ifdef DEBUG_VFIO
+{
+VFIODevice *vdev = container_of(bar, VFIODevice, bars[bar-nr]);
+
+DPRINTF(%s(%04x:%02x:%02x.%x:BAR%d+0x%HWADDR_PRIx
+, %d) = 0x%PRIx64\n, __func__, vdev-host.domain,
+vdev-host.bus, vdev-host.slot, vdev-host.function,
+bar-nr, addr, size, data);
+}
+#endif
 
 /* Same as write above */
 vfio_eoi(container_of(bar, VFIODevice, bars[bar-nr]));
@@ -1676,7 +1696,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 int ret;
 
 if (vfio_listener_skipped_section(section)) {
-DPRINTF(vfio: SKIPPING region_add %HWADDR_PRIx - %PRIx64\n,
+DPRINTF(SKIPPING region_add %HWADDR_PRIx - %PRIx64\n,
 section-offset_within_address_space,
 section-offset_within_address_space + section-size - 1);
 return;
@@ -1700,7 +1720,7 @@ static void vfio_listener_region_add(MemoryListener 
*listener,
 section-offset_within_region +
 (iova - section-offset_within_address_space);
 
-DPRINTF(vfio: region_add %HWADDR_PRIx - %HWADDR_PRIx [%p]\n,
+DPRINTF(region_add %HWADDR_PRIx - %HWADDR_PRIx [%p]\n,
 iova, end - 1, vaddr);
 
 ret = vfio_dma_map(container, iova, end - iova, vaddr, section-readonly);
@@ -1720,7 +1740,7 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
 int ret;
 
 if (vfio_listener_skipped_section(section)) {
-DPRINTF(vfio: SKIPPING region_del %HWADDR_PRIx - %PRIx64\n,
+DPRINTF(SKIPPING region_del %HWADDR_PRIx - %PRIx64\n,
 section-offset_within_address_space,
 section-offset_within_address_space + section-size - 1);
 return;
@@ -1740,7 +1760,7 @@ static void vfio_listener_region_del(MemoryListener 
*listener,
 return;
 }
 
-DPRINTF(vfio: region_del %HWADDR_PRIx - %HWADDR_PRIx\n,
+DPRINTF(region_del %HWADDR_PRIx - %HWADDR_PRIx\n,
 iova, end - 1);
 
 ret = vfio_dma_unmap(container, iova, end - iova);
@@ -1943,7 +1963,7 @@ static int vfio_mmap_bar(VFIOBAR *bar, MemoryRegion *mem, 
MemoryRegion *submem,
 {
 int ret = 0;
 
-if (size  bar-flags  VFIO_REGION_INFO_FLAG_MMAP) {
+if (VFIO_ALLOW_MMAP  size  bar-flags  VFIO_REGION_INFO_FLAG_MMAP) {
 int prot = 0;
 
 if (bar-flags  VFIO_REGION_INFO_FLAG_READ) {

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] qemu vfio-pci: Move devices to D0 on reset

2013-02-18 Thread Alex Williamson
Guests may leave devices in a low power state at reboot, but we expect
devices to be woken up for the next boot.  Make this happen.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/vfio_pci.c |   23 +++
 1 file changed, 23 insertions(+)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 7aa3d88..0a45500 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -161,6 +161,7 @@ typedef struct VFIODevice {
 #define VFIO_FEATURE_ENABLE_VGA_BIT 0
 #define VFIO_FEATURE_ENABLE_VGA (1  VFIO_FEATURE_ENABLE_VGA_BIT)
 uint8_t bustype;
+uint8_t pm_cap;
 bool reset_works;
 bool has_vga;
 } VFIODevice;
@@ -2297,6 +2298,8 @@ static int vfio_add_std_cap(VFIODevice *vdev, uint8_t pos)
 case PCI_CAP_ID_MSIX:
 ret = vfio_setup_msix(vdev, pos);
 break;
+case PCI_CAP_ID_PM:
+vdev-pm_cap = pos;
 default:
 ret = pci_add_capability(pdev, cap_id, pos, size);
 break;
@@ -2869,6 +2872,26 @@ static void vfio_pci_reset(DeviceState *dev)
 
 vfio_disable_interrupts(vdev);
 
+/* Make sure the device is in D0 */
+if (vdev-pm_cap) {
+uint16_t pmcsr;
+uint8_t state;
+
+pmcsr = vfio_pci_read_config(pdev, vdev-pm_cap + PCI_PM_CTRL, 2);
+state = pmcsr  PCI_PM_CTRL_STATE_MASK;
+if (state) {
+pmcsr = ~PCI_PM_CTRL_STATE_MASK;
+vfio_pci_write_config(pdev, vdev-pm_cap + PCI_PM_CTRL, pmcsr, 2);
+/* vfio handles the necessary delay here */
+pmcsr = vfio_pci_read_config(pdev, vdev-pm_cap + PCI_PM_CTRL, 2);
+state = pmcsr  PCI_PM_CTRL_STATE_MASK;
+if (state) {
+error_report(vfio: Unable to power on device, stuck in D%d\n,
+ state);
+}
+}
+}
+
 /*
  * Stop any ongoing DMA by disconecting I/O, MMIO, and bus master.
  * Also put INTx Disable in known state.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-18 Thread Scott Wood

On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:

On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
 On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
 I have no particular objection to the device control API per se,  
but
 I have two objections to using it as the primary interface to the  
XICS

 emulation.
 
 First, I dislike the magical side-effect where creating a device  
of a
 particular type (e.g. MPIC or XICS) automatically attaches it to  
the

 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control.

 OK.  This is device-specific behavior, so you could define it
 differently for XICS than MPIC.  I suppose we could change it for
 MPIC as well, to leave an opening for the unlikely case where we'd
 want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.


OK.


 How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic create device ioctl that could be for any device.


Hooking up to the CPU's interrupt lines is implicit in creating an MPIC  
(and I'm fine with changing that), not in creating any device.  I don't  
see how it's worse than being implicit in calling  
KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).



 The standardized interface doesn't make things any easier (as noted
 above, the caller is already mpic-specific code), and we'd have to
 come up with a scheme for flattening our interrupt numberspace
 (rather than introduce new attribute groups for things like IPI and
 timer interrupts).  It may still be necessary when it comes to
 irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.


They do need to appear in the interrupt source space if we want to  
inject or irqfd them.  Most users won't want to do that, but we have  
had a customer directly assign IPIs (to communicate with an OS running  
on the other CPU in an AMP setup -- host Linux was non-SMP so wasn't  
using them) and MPIC timers to a guest.



  How do MSIs get injected?
 
 Just like other interrupts - from the point of view of the  
interrupt

 controller they're edge-triggered interrupt sources.

 Ah right, I guess this is all set up via hcalls for XICS.

 With MPIC exposing its registers via the device control api,
 everything just works -- the PCI device generates a write to the
 MPIC's memory region, the QEMU MPIC stub sends the write to the
 kernel as for any other MMIO access (this passthrough is also useful
 for debugging), the in-kernel MPIC sees the write to the generate
 an MSI register and does its thing.  Compare that to all special
 the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?


No.  Accesses by the guest get handled in the kernel.  Accesses in  
QEMU, including MSIs generated by virtio, get forwarded to the kernel.



 There are plenty of bits free in the 64 bits per source that I have
 allowed.  We can accommodate those things.

 MPIC vector numbers take up 16 of the bits.  The architected
 interrupt level field is 8 bits, though only a handful of values are
 actually needed.  Add a couple binary flags, and it gets pretty
 tight if a third type of interrupt controller starts wanting
 something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.


I view anything other than passing the actual MPIC register values  
around as overdesign here, given that it is communication between  
hw/kvm/mpic.c on the QEMU side and arch/powerpc/kvm/mpic.c on the  
kernel side.



 interrupts awaiting service (or even in service, if different
 priorities).  We have both current task priority (which is a
 user-set mask-by-priority register) and the priority of the
 highest-prio in-service interrupt -- which would current processor
 priorty be?  Etc.

It would be the current task priority.  I assume MPIC maintains a
16-bit map of the interrupt priorities in service, so that would need
to be added.


We don't maintain such a map in the emulation code.  We have a per-CPU  
bitmap of the actual interrupt sources pending/active, which is another  
attribute that would need to be added in order to support migration on  
MPIC.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Scott Wood

On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:

Copying Christoffer since ARM has in kernel irq chip too.

On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
 Currently, devices that are emulated inside KVM are configured in a
 hardcoded manner based on an assumption that any given architecture
 only has one way to do it.  If there's any need to access device  
state,

 it is done through inflexible one-purpose-only IOCTLs (e.g.
 KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
 cumbersome and depletes a limited numberspace.

 This API provides a mechanism to instantiate a device of a certain
 type, returning an ID that can be used to set/get attributes of the
 device.  Attributes may include configuration parameters (e.g.
 register base address), device state, operational commands, etc.  It
 is similar to the ONE_REG API, except that it acts on devices rather
 than vcpus.
You are not only provide different way to create in kernel irq chip  
you
also use an alternate way to trigger interrupt lines. Before going  
into

interface specifics lets think about whether it is really worth it?


Which it do you mean here?

The ability to set/get attributes is needed.  Sorry, but get or set  
one blob of data, up to 512 bytes, for the entire irqchip is just not  
good enough -- assuming you don't want us to start sticking pointers  
and commands in *that* data. :-)


If you mean the way to inject interrupts, it's simpler this way.  Why  
go out of our way to inject common glue code into a communication path  
between hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM?  Or  
rather, why make that common glue be specific to this one function when  
we could reuse the same communication glue used for other things, such  
as device state?


And that's just for regular interrupts.  MSIs are vastly simpler on  
MPIC than what x86 does.


x86 obviously support old way and will have to for some, very long,  
time.


Sure.

ARM vGIC code, that is ready to go upstream, uses old way too. So it  
will

be 2 archs against one.


I wasn't aware that that's how it worked. :-P

I was trying to be considerate by not making the entire thing  
gratuitously PPC or MPIC specific, as some others seem inclined to do  
(e.g. see irqfd and APIC).  We already had a discussion on ARM's set  
address ioctl and rather than extend *that* interface, they preferred  
to just stick something ARM-specific in ASAP with the understanding  
that it would be replaced (or more accurately, kept around as a thin  
wrapper around the new stuff) later.



Christoffer do you think the proposed way it
better for your needs. Are you willing to make vGIC use it?

Scott, what other devices are you planning to support with this
interface?


At the moment I do not have plans for other devices, though what does  
it hurt for the capability to be there?


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip

2013-02-18 Thread Scott Wood

On 02/18/2013 06:04:51 AM, Gleb Natapov wrote:
Can you tell us why mpic should be in kernel? Is it used often by  
modern
guests or may be they prefer MSI for interrupt delivery (hmm may be  
MSIs

are delivered through mpic too)?


Yes, MSIs are delivered through the mpic.

Plus, MSIs are only (normally) for PCI(e).  We have embedded system on  
a chips with important non-PCI devices, which cannot use MSIs.



On x86 we actually would've preferred
to move PIC/IOAPIC form the kernel and leave only LAPIC there (but for
historical reasons creation of PIC/IOAPIC/LAPIC are bundled together)
hence my question.


We don't have that same split on this hardware.  MPIC is one device  
that covers all of it.  Some of the functionality is per-CPU, but it is  
not easily extracted from the rest.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Christoffer Dall
On Mon, Feb 18, 2013 at 3:01 PM, Scott Wood scottw...@freescale.com wrote:
 On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:

 Copying Christoffer since ARM has in kernel irq chip too.

 On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
  Currently, devices that are emulated inside KVM are configured in a
  hardcoded manner based on an assumption that any given architecture
  only has one way to do it.  If there's any need to access device state,
  it is done through inflexible one-purpose-only IOCTLs (e.g.
  KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
  cumbersome and depletes a limited numberspace.
 
  This API provides a mechanism to instantiate a device of a certain
  type, returning an ID that can be used to set/get attributes of the
  device.  Attributes may include configuration parameters (e.g.
  register base address), device state, operational commands, etc.  It
  is similar to the ONE_REG API, except that it acts on devices rather
  than vcpus.
 You are not only provide different way to create in kernel irq chip you
 also use an alternate way to trigger interrupt lines. Before going into
 interface specifics lets think about whether it is really worth it?


 Which it do you mean here?

 The ability to set/get attributes is needed.  Sorry, but get or set one
 blob of data, up to 512 bytes, for the entire irqchip is just not good
 enough -- assuming you don't want us to start sticking pointers and commands
 in *that* data. :-)

 If you mean the way to inject interrupts, it's simpler this way.  Why go out
 of our way to inject common glue code into a communication path between
 hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM?  Or rather, why
 make that common glue be specific to this one function when we could reuse
 the same communication glue used for other things, such as device state?

 And that's just for regular interrupts.  MSIs are vastly simpler on MPIC
 than what x86 does.


 x86 obviously support old way and will have to for some, very long, time.


 Sure.


 ARM vGIC code, that is ready to go upstream, uses old way too. So it will
 be 2 archs against one.


 I wasn't aware that that's how it worked. :-P

 I was trying to be considerate by not making the entire thing gratuitously
 PPC or MPIC specific, as some others seem inclined to do (e.g. see irqfd and
 APIC).  We already had a discussion on ARM's set address ioctl and rather
 than extend *that* interface, they preferred to just stick something
 ARM-specific in ASAP with the understanding that it would be replaced (or
 more accurately, kept around as a thin wrapper around the new stuff) later.


 Christoffer do you think the proposed way it
 better for your needs. Are you willing to make vGIC use it?


Well it won't improve functionality much from the current hardware
point of view, but the proposed interface is superior to what we have
now. Adding and coordinating new interfaces is indeed a pain, so a
generic interface which is flexible enough to cater for a certain
group of needs, is welcome imho. And this does seem to fit the bill.

I can imagine that if there's support for a future ARM gic version or
if we add in-kernel support for other stuff on ARM, then this
interface will be useful, and in fact using the current interface to
support two separate, but similar, interrupt controllers could get
messy from a user space point of view.

I am definitely willing to change to use this interface, the agreement
on the KVM_ARM_SET_DEVICE_ADDR ioctl was exactly because of this.

I had some nits on the RFC, which I'll send separately.

 Scott, what other devices are you planning to support with this
 interface?


 At the moment I do not have plans for other devices, though what does it
 hurt for the capability to be there?

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Christoffer Dall
On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com wrote:
 Currently, devices that are emulated inside KVM are configured in a
 hardcoded manner based on an assumption that any given architecture
 only has one way to do it.  If there's any need to access device state,
 it is done through inflexible one-purpose-only IOCTLs (e.g.
 KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
 cumbersome and depletes a limited numberspace.

 This API provides a mechanism to instantiate a device of a certain
 type, returning an ID that can be used to set/get attributes of the
 device.  Attributes may include configuration parameters (e.g.
 register base address), device state, operational commands, etc.  It
 is similar to the ONE_REG API, except that it acts on devices rather
 than vcpus.

 Both device types and individual attributes can be tested without having
 to create the device or get/set the attribute, without the need for
 separately managing enumerated capabilities.

 Signed-off-by: Scott Wood scottw...@freescale.com
 ---
  Documentation/virtual/kvm/api.txt|   76 ++
  Documentation/virtual/kvm/devices/README |1 +
  include/linux/kvm_host.h |   21 +
  include/uapi/linux/kvm.h |   25 ++
  virt/kvm/kvm_main.c  |  127 
 ++
  5 files changed, 250 insertions(+)
  create mode 100644 Documentation/virtual/kvm/devices/README

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index c2534c3..5bcdb42 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2122,6 +2122,82 @@ header; first `n_valid' valid entries with contents 
 from the data
  written, then `n_invalid' invalid entries, invalidating any previously
  valid entries found.

 +4.79 KVM_CREATE_DEVICE
 +
 +Capability: KVM_CAP_DEVICE_CTRL
 +Type: vm ioctl
 +Parameters: struct kvm_create_device (in/out)
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device type is unknown or unsupported
 +  EEXIST: Device already created, and this type of device may not
 +  be instantiated multiple times
 +  ENOSPC: Too many devices have been created
 +
 +  Other error conditions may be defined by individual device types.
 +
 +Creates an emulated device in the kernel.  The returned handle
 +can be used with KVM_SET/GET_DEVICE_ATTR.
 +
 +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
 +device type is supported (not necessarily whether it can be created
 +in the current vm).
 +
 +Individual devices should not define flags.  Attributes should be used
 +for specifying any behavior that is not implied by the device type
 +number.
 +
 +struct kvm_create_device {
 +   __u32   type;   /* in: KVM_DEV_TYPE_xxx */
 +   __u32   id; /* out: device handle */
 +   __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
 +};
 +
 +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
 +
 +Capability: KVM_CAP_DEVICE_CTRL
 +Type: vm ioctl
 +Parameters: struct kvm_device_attr
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device id is invalid
 +  ENXIO:  The group or attribute is unknown/unsupported for this device
 +  EPERM:  The attribute cannot (currently) be accessed this way
 +  (e.g. read-only attribute, or attribute that only makes
 +  sense when the device is in a different state)
 +
 +  Other error conditions may be defined by individual device types.
 +
 +Gets/sets a specified piece of device configuration and/or state.  The
 +semantics are device-specific except for certain global attributes.  See
 +individual device documentation in the devices directory.  As with
 +ONE_REG, the size of the data transferred is defined by the particular
 +attribute.
 +
 +Attributes in group KVM_DEV_ATTR_COMMON are not device-specific:
 +   KVM_DEV_ATTR_TYPE (ro, 32-bit): the device type passed to 
 KVM_CREATE_DEVICE
 +
 +struct kvm_device_attr {
 +   __u32   dev;/* id from KVM_CREATE_DEVICE */
 +   __u32   group;  /* KVM_DEV_ATTR_COMMON or device-defined */
 +   __u64   attr;   /* group-defined */
 +   __u64   addr;   /* userspace address of attr data */
 +};
 +
 +4.81 KVM_HAS_DEVICE_ATTR
 +
 +Capability: KVM_CAP_DEVICE_CTRL
 +Type: vm ioctl
 +Parameters: struct kvm_device_attr
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device id is invalid
 +  ENXIO:  The group or attribute is unknown/unsupported for this device
 +
 +Tests whether a device supports a particular attribute.  A successful
 +return indicates the attribute is implemented.  It does not necessarily
 +indicate that the attribute can be read or written in the device's
 +current state.  addr is ignored.

  5. The kvm_run structure
  
 diff --git a/Documentation/virtual/kvm/devices/README 
 b/Documentation/virtual/kvm/devices/README
 new file mode 100644
 index 

Re: [PATCH 4/4] Add a timer to allow the separation of consigned from steal time.

2013-02-18 Thread Marcelo Tosatti
On Tue, Feb 05, 2013 at 03:49:41PM -0600, Michael Wolf wrote:
 Add a helper routine to scheduler/core.c to allow the kvm module
 to retrieve the cpu hardlimit settings.  The values will be used
 to set up a timer that is used to separate the consigned from the
 steal time.

1) Can you please describe, in english, the mechanics of subtracting cpu
hardlimit values from steal time reported via run_delay supposed to
work?

The period and the quota used to separate the consigned time 
(expected steal) from the steal time are taken
from the cfs bandwidth control settings. Any other steal time
accruing during that period will show as the traditional steal time.

There is no expected steal time over a fixed period of real time.

2) From the description of patch 1: In the case of where you have
a system that is running in a capped or overcommitted environment 
the user may see steal time being reported in accounting tools 
such as top or vmstat. 

This is outdated, right? Because overcommitted environment is exactly
what steal time should report.


Thanks

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d

2013-02-18 Thread Marcelo Tosatti
On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote:
 On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote:
  On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote:
   On 02/12/2013 04:26 PM, Peter Hurley wrote:
With -next-20130204+ in ubuntu 12.10 VM (so the 80x25 VGA
device/console):

[0.666410] udevd[97]: starting version 175
[0.674043] udevd[97]: udevd:[97]: segfault at ff5fd020 ip 
7fff069e277f sp 7fff068c9ef8 error d

and boots to an initramfs prompt.

git bisect (log attached) blames:

commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f
Merge: 3596f5b 949db15
Author: H. Peter Anvin h...@linux.intel.com
Date:   Fri Jan 25 16:31:21 2013 -0800

Merge tag 'v3.8-rc5' into x86/mm

The __pa() fixup series that follows touches KVM code that is not
present in the existing branch based on v3.7-rc5, so merge in the
current upstream from Linus.

Signed-off-by: H. Peter Anvin h...@linux.intel.com


This only happens with the VGA device/console but that is the default
configuration for Ubuntu/KVM because it blacklists pretty much every fb
driver.

   
   I am guessing this is another bad use of __pa()... need to look into that.

Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible
there?

  He is using 64bit guest and on those __pa() happens to be working. Is it
  possible that slow_virt_to_phys() does not work as expected? Peter (the
  bug reporter :)) can you run your guest kernel with loglevel=7 and
  attach send me console output?
 
 Attached.
 
 BTW, this message happens on 'good' boots too:
 
 [0.00] [ cut here ]
 [0.00] WARNING: at 
 /home/peter/src/kernels/next/arch/x86/kernel/pvclock.c:182 
 pvclock_init_vsyscall+0x22/0x60()
 [0.00] Hardware name: Bochs
 [0.00] Modules linked in:
 [0.00] Pid: 0, comm: swapper Not tainted 3.8.0-next-20130204-xeon 
 #20130204
 [0.00] Call Trace:
 [0.00]  [8105812f] warn_slowpath_common+0x7f/0xc0
 [0.00]  [8105818a] warn_slowpath_null+0x1a/0x20
 [0.00]  [81d20521] pvclock_init_vsyscall+0x22/0x60
 [0.00]  [81d20480] kvm_setup_vsyscall_timeinfo+0x74/0xd8
 [0.00]  [81d201d1] kvm_guest_init+0xd0/0xe9
 [0.00]  [81d13f7c] setup_arch+0xbee/0xcaf
 [0.00]  [816cbceb] ? printk+0x61/0x63
 [0.00]  [81d0cbc3] start_kernel+0xd3/0x3f0
 [0.00]  [81d0c5e4] x86_64_start_reservations+0x2a/0x2c
 [0.00]  [81d0c6d7] x86_64_start_kernel+0xf1/0x100
 [0.00] ---[ end trace b47bb564b2d6ec76 ]---
 
 Regards,
 Peter Hurley

Sending a patch for this, thanks for the report.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PULL 00/14] ppc patch queue 2013-02-15

2013-02-18 Thread Marcelo Tosatti
Pulled, thanks.

On Fri, Feb 15, 2013 at 01:16:16AM +0100, Alexander Graf wrote:
 Hi Marcelo / Gleb,
 
 This is my current patch queue for ppc.  Please pull.
 
 Highlights of this queue drop are:
 
   - BookE: Fast mapping support for 4k backed memory
   - BookE: Handle alignment interrupts
 
 Alex
 
 
 The following changes since commit cbd29cb6e38af6119df2cdac0c58acf0e85c177e:
   Jan Kiszka (1):
 KVM: nVMX: Remove redundant get_vmcs12 from 
 nested_vmx_exit_handled_msr
 
 are available in the git repository at:
 
   git://github.com/agraf/linux-2.6.git kvm-ppc-next
 
 Alexander Graf (11):
   KVM: PPC: E500: Move write_stlbe higher
   KVM: PPC: E500: Explicitly mark shadow maps invalid
   KVM: PPC: E500: Propagate errors when shadow mapping
   KVM: PPC: e500: Call kvmppc_mmu_map for initial mapping
   KVM: PPC: E500: Split host and guest MMU parts
   KVM: PPC: e500: Implement TLB1-in-TLB0 mapping
   KVM: PPC: E500: Make clear_tlb_refs and clear_tlb1_bitmap static
   KVM: PPC: E500: Remove kvmppc_e500_tlbil_all usage from guest TLB code
   Merge commit 'origin/next' into kvm-ppc-next
   KVM: PPC: BookE: Handle alignment interrupts
   Merge commit 'origin/next' into kvm-ppc-next
 
 Bharat Bhushan (3):
   KVM: PPC: booke: use vcpu reference from thread_struct
   KVM: PPC: booke: Allow multiple exception types
   booke: Added DBCR4 SPR number
 
  arch/powerpc/include/asm/kvm_ppc.h   |2 -
  arch/powerpc/include/asm/reg.h   |2 -
  arch/powerpc/include/asm/reg_booke.h |1 +
  arch/powerpc/kernel/asm-offsets.c|2 +-
  arch/powerpc/kvm/Makefile|9 +-
  arch/powerpc/kvm/booke.c |   30 +-
  arch/powerpc/kvm/booke.h |1 +
  arch/powerpc/kvm/booke_interrupts.S  |   49 +-
  arch/powerpc/kvm/e500.c  |   16 +-
  arch/powerpc/kvm/e500.h  |1 +
  arch/powerpc/kvm/e500_mmu.c  |  809 +++
  arch/powerpc/kvm/e500_mmu_host.c |  699 +
  arch/powerpc/kvm/e500_mmu_host.h |   18 +
  arch/powerpc/kvm/e500_tlb.c  | 1430 
 --
  14 files changed, 1610 insertions(+), 1459 deletions(-)
  create mode 100644 arch/powerpc/kvm/e500_mmu.c
  create mode 100644 arch/powerpc/kvm/e500_mmu_host.c
  create mode 100644 arch/powerpc/kvm/e500_mmu_host.h
  delete mode 100644 arch/powerpc/kvm/e500_tlb.c
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Scott Wood

On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:
On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com  
wrote:

 index 0350e0d..dbaf012 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -335,6 +335,25 @@ struct kvm_memslots {
 short id_to_index[KVM_MEM_SLOTS_NUM];
  };

 +/*
 + * The worst case number of simultaneous devices will likely be  
very low
 + * (usually zero or one) for the forseeable future.  If the worst  
case
 + * exceeds this, then it can be increased, or we can convert to  
idr.

 + */

This comment is on the heavy side (if at all needed). If you want to
remind people of idr, just put that in a single line. A define is a
define is a define.


OK.

 +#define KVM_CREATE_DEVICE_IOWR(KVMIO,  0xac, struct  
kvm_create_device)
 +#define KVM_SET_DEVICE_ATTR  _IOW(KVMIO,  0xad, struct  
kvm_device_attr)
 +#define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xae, struct  
kvm_device_attr)


_IOWR ?


struct kvm_device_attr itself is write-only, though the data pointed to  
by the addr field goes the other way for GET.  ONE_REG is in the same  
situation and also uses _IOW for both.



 +static int kvm_ioctl_create_device(struct kvm *kvm,
 +  struct kvm_create_device *cd)
 +{
 +   struct kvm_device *dev = NULL;
 +   bool test = cd-flags  KVM_CREATE_DEVICE_TEST;
 +   int id;
 +   int r;
 +
 +   mutex_lock(kvm-lock);
 +
 +   id = kvm-num_devices;
 +   if (id = KVM_MAX_DEVICES  !test) {
 +   r = -ENOSPC;
 +   goto out;
 +   }
 +
 +   switch (cd-type) {
 +   default:
 +   r = -ENODEV;
 +   goto out;
 +   }

do we really believe that there will be any arch-generic recognition
of types? shouldn't this be a call to an arch-specific function
instead. Which makes me wonder whether the device type IDs should be
arch specific as well...


I prefer to look at it from the other direction -- is there any reason  
why this *should* be architecture specific?  What will that make easier?


By doing device recognition here we don't need a separate copy of this  
per arch (including some #ifdef or modifying every arch at once --  
including ARM which I can't modify yet because it's not merged), and  
*if* we should end up with an in-kernel-emulated device that gets used  
across multiple architectures, it would be annoying to have to modify  
all relevant architectures (and worse to deal with per-arch  
numberspaces).


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bisected][-next-20130204+] [x86/kvm] udevd:[97]: segfault at ffffffffff5fd020 ip 00007fff069e277f sp 00007fff068c9ef8 error d

2013-02-18 Thread Peter Hurley
On Mon, 2013-02-18 at 19:59 -0300, Marcelo Tosatti wrote:
 On Wed, Feb 13, 2013 at 06:57:09AM -0500, Peter Hurley wrote:
  On Wed, 2013-02-13 at 12:51 +0200, Gleb Natapov wrote:
   On Tue, Feb 12, 2013 at 04:39:03PM -0800, H. Peter Anvin wrote:
On 02/12/2013 04:26 PM, Peter Hurley wrote:
 With -next-20130204+ in ubuntu 12.10 VM (so the 80x25 VGA
 device/console):
 
 [0.666410] udevd[97]: starting version 175
 [0.674043] udevd[97]: udevd:[97]: segfault at ff5fd020 ip 
 7fff069e277f sp 7fff068c9ef8 error d
 
 and boots to an initramfs prompt.
 
 git bisect (log attached) blames:
 
 commit 7b5c4a65cc27f017c170b025f8d6d75dabb11c6f
 Merge: 3596f5b 949db15
 Author: H. Peter Anvin h...@linux.intel.com
 Date:   Fri Jan 25 16:31:21 2013 -0800
 
 Merge tag 'v3.8-rc5' into x86/mm
 
 The __pa() fixup series that follows touches KVM code that is not
 present in the existing branch based on v3.7-rc5, so merge in the
 current upstream from Linus.
 
 Signed-off-by: H. Peter Anvin h...@linux.intel.com
 
 
 This only happens with the VGA device/console but that is the default
 configuration for Ubuntu/KVM because it blacklists pretty much every 
 fb
 driver.
 

I am guessing this is another bad use of __pa()... need to look into 
that.
 
 Can't find this commit on kvm.git or linux-2.6.git. Is it reproducible
 there?

This is in the linux-next repo (any git tag after 'next-20130204' will
reproduce this). It's a pretty large merge commit.

This doesn't happen on 3.8-rc7.

I'll try to repro this on kvm.git sometime this week. Otherwise, we can
wait for it to show up in 3.9.

Regards,
Peter Hurley

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] Alter steal-time reporting in the guest

2013-02-18 Thread Marcelo Tosatti
On Mon, Feb 18, 2013 at 05:43:47PM +0100, Frederic Weisbecker wrote:
 2013/2/5 Michael Wolf m...@linux.vnet.ibm.com:
  In the case of where you have a system that is running in a
  capped or overcommitted environment the user may see steal time
  being reported in accounting tools such as top or vmstat.  This can
  cause confusion for the end user.
 
 Sorry, I'm no expert in this area. But I don't really understand what
 is confusing for the end user here.

I suppose that what is wanted is to subtract stolen time due to 'known
reasons' from steal time reporting. 'Known reasons' being, for example,
hard caps. So a vcpu executing instructions with no halt, but limited to
80% of available bandwidth, would not have 20% of stolen time reported.

But yes, a description of the scenario that is being dealt with, with
details, is important.

  To ease the confusion this patch set
  adds the idea of consigned (expected steal) time.  The host will separate
  the consigned time from the steal time.  Tthe steal time will only be 
  altered
  if hard limits (cfs bandwidth control) is used.  The period and the quota 
  used
  to separate the consigned time (expected steal) from the steal time are 
  taken
  from the cfs bandwidth control settings. Any other steal time accruing 
  during
  that period will show as the traditional steal time.
 
 I'm also a bit confused here. steal time will then only account the
 cpu time lost due to quotas from cfs bandwidth control? Also what do
 you exactly mean by expected steal time? Is it steal time due to
 overcommitting minus scheduler quotas?
 
 Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


x86: pvclock kvm: align allocation size to page size

2013-02-18 Thread Marcelo Tosatti

To match whats mapped via vsyscalls to userspace.

Reported-by: Peter Hurley pe...@hurleysoftware.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 220a360..5bedbdd 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -218,6 +218,9 @@ static void kvm_shutdown(void)
 void __init kvmclock_init(void)
 {
unsigned long mem;
+   int size;
+
+   size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
 
if (!kvm_para_available())
return;
@@ -231,16 +234,14 @@ void __init kvmclock_init(void)
printk(KERN_INFO kvm-clock: Using msrs %x and %x,
msr_kvm_system_time, msr_kvm_wall_clock);
 
-   mem = memblock_alloc(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS,
-PAGE_SIZE);
+   mem = memblock_alloc(size, PAGE_SIZE);
if (!mem)
return;
hv_clock = __va(mem);
 
if (kvm_register_clock(boot clock)) {
hv_clock = NULL;
-   memblock_free(mem,
-   sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
+   memblock_free(mem, size);
return;
}
pv_time_ops.sched_clock = kvm_clock_read;
@@ -275,7 +276,7 @@ int __init kvm_setup_vsyscall_timeinfo(void)
struct pvclock_vcpu_time_info *vcpu_time;
unsigned int size;
 
-   size = sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS;
+   size = PAGE_ALIGN(sizeof(struct pvclock_vsyscall_time_info)*NR_CPUS);
 
preempt_disable();
cpu = smp_processor_id();


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v6] KVM: nVMX: Improve I/O exit handling

2013-02-18 Thread Marcelo Tosatti
On Mon, Feb 18, 2013 at 12:32:37PM +0200, Gleb Natapov wrote:
 On Mon, Feb 18, 2013 at 11:21:16AM +0100, Jan Kiszka wrote:
  This prevents trapping L2 I/O exits if L1 has neither unconditional nor
  bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
  handling.
  
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 Reviewed-by: Gleb Natapov g...@redhat.com

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: nVMX: Trap unconditionally if msr bitmap access fails

2013-02-18 Thread Marcelo Tosatti
On Sun, Feb 17, 2013 at 10:56:36AM +0200, Gleb Natapov wrote:
 On Thu, Feb 14, 2013 at 07:46:27PM +0100, Jan Kiszka wrote:
  This avoids basing decisions on uninitialized variables, potentially
  leaking kernel data to the L1 guest.
  
  Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 Reviewed-by: Gleb Natapov g...@redhat.com

Applied, thanks.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 11/18] KVM/MIPS32: Routines to handle specific traps/exceptions while executing the guest.

2013-02-18 Thread Sanjay Lal

On Feb 18, 2013, at 1:44 AM, Gleb Natapov wrote:

 On Fri, Feb 15, 2013 at 11:10:46AM -0500, Sanjay Lal wrote:
 
 On Feb 6, 2013, at 8:20 AM, Gleb Natapov wrote:
 
 On Wed, Nov 21, 2012 at 06:34:09PM -0800, Sanjay Lal wrote:
 +static gpa_t kvm_trap_emul_gva_to_gpa_cb(gva_t gva)
 +{
 +  gpa_t gpa;
 +  uint32_t kseg = KSEGX(gva);
 +
 +  if ((kseg == CKSEG0) || (kseg == CKSEG1))
 You seems to be using KVM_GUEST_KSEGX variants on gva in all other
 places. Why not here?
 
 This function is invoked to handle 2 scenarios:
 (1) Parse the boot code config tables setup by QEMU's Malta emulation. The 
 pointers in the tables are actual KSEG0 addresses (unmapped, cached) and not 
 Guest KSEG0 addresses.
 
 Where is it called for that purpose? The only place where gva_to_gpa
 callback is called is in kvm/kvm_mips_emul.c:kvm_mips_emulate_(store|load)
Load/stores from/to KSEG1 generate the Address Error Load/Store exceptions. The 
handler calls kvm_mips_emul.c:kvm_mips_emulate_(store|load) which then call the 
gva_to_gpa callback.

 
 (2) Handle I/O accesses by the guest.  On MIPS platforms, I/O device 
 registers are mapped into the KSEG1 address space (unmapped, uncached).  
 Again like (1) these are actual KSEG1 addresses, which cause an exception 
 and are passed onto QEMU for I/O emulation.
 
 So guest KSEG1 registers is mapped to 0xA000-0xBFFF ranges just
 like on a host? Can you give corresponding segment names to those ranges
 
 Guest User address space:   0x - 0x4000 (useg?)
 Guest Kernel Unmapped:  0x4000 - 0x6000 (kseg0?)
 Guest Kernel Mapped:0x6000 - 0x8000 (?)
 


Yes, now that you mention it :-). I'll add a corresponding Guest Kernel KSEG1 
segment name.

Regards
Sanjay


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 54061] New: guest panic after live migration

2013-02-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=54061

   Summary: guest panic after live migration
   Product: Virtualization
   Version: unspecified
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: yongjie@intel.com
Regression: No


Created an attachment (id=93511)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=93511)
guest panic after migration

Environment:

Host OS (ia32/ia32e/IA64):ia32e
Guest OS (ia32/ia32e/IA64):ia32e
Guest OS Type (Linux/Windows):Linux (e.g. RHEL6.3)
kvm.git next branch Commit:cbd29cb6e38af6119df2cdac0c58acf0e85c177e
qemu-kvm.git Commit:4d9367b76f71c6d938cf8201392abe4bfb1136cb
Hardware:SandyBridge-EP, Westmere-EP

Bug detailed description:
--
After live migration, guest will panic.
This should be a KVM kernel bug.
kvm  + qemu-kvm   =  result
cbd29cb6 + 4d9367b7   = bad
b0da5bec + 4d9367b7   = good

Reproduce steps:

1. start up a host with kvm (commit: cbd29cb6)
2. Start a TCP daemon for migration:
qemu-system-x86_64 -m 1024 -smp 2 -net nic,macaddr=00:12:32:45:12:54 -net tap
/root/rhel6u3.img -incoming tcp:localhost:
3. create a guest 
qemu-system-x86_64 -m 1024 -smp 2 -net nic,macaddr=00:12:32:45:12:54 -net tap
/root/rhel6u3.img
4. ctrl+Alt+2 switch to QEMU monitor
5. in monitor:  migrate tcp:localhost:

Current result:

after live migration, guest panic

Expected result:

after live migration, guest work fine.

Basic root-causing log:
--
WARNING: at lib/list_debug.c:30 __list_add+0x8f/0xa0() (Tainted: GB   W 
---   )

Hardware name: Bochs

list_add corruption. prev-next should be next (88003fae0ac0), but was
8800365c3000. (prev=8800365f9040).

Modules linked in: autofs4 sunrpc ipv6 uinput ppdev parport_pc parport
microcode sg 8139too 8139cp mii i2c_piix4 i2c_core ext4 mbcache jbd2 sr_mod
cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mirror dm_region_hash
dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 12, comm: events/1 Tainted: GB   W  ---   
2.6.32-279.el6.x86_64 #1

Call Trace:

 [8106b747] ? warn_slowpath_common+0x87/0xc0

 [8106b836] ? warn_slowpath_fmt+0x46/0x50

 [8128301f] ? __list_add+0x8f/0xa0

 [81163f64] ? free_block+0x154/0x170

 [811641b1] ? drain_array+0xc1/0x100

 [8116517e] ? cache_reap+0x8e/0x260

 [81137090] ? vmstat_update+0x0/0x40

 [811650f0] ? cache_reap+0x0/0x260

 [8108c760] ? worker_thread+0x170/0x2a0

 [810920d0] ? autoremove_wake_function+0x0/0x40

 [8108c5f0] ? worker_thread+0x0/0x2a0

 [81091d66] ? kthread+0x96/0xa0

 [8100c14a] ? child_rip+0xa/0x20

 [81091cd0] ? kthread+0x0/0xa0

 [8100c140] ? child_rip+0x0/0x20

---[ end trace f17758832a0dcb5e ]---

general protection fault:  [#1] SMP 

last sysfs file: /sys/devices/pci:00/:00:03.0/irq

CPU 1 

Modules linked in: autofs4 sunrpc ipv6 uinput ppdev parport_pc parport
microcode sg 8139too 8139cp mii i2c_piix4 i2c_core ext4 mbcache jbd2 sr_mod
cdrom sd_mod crc_t10dif pata_acpi ata_generic ata_piix dm_mirror dm_region_hash
dm_log dm_mod [last unloaded: speedstep_lib]



Pid: 1173, comm: rs:main Q:Reg Tainted: GB   W  ---   
2.6.32-279.el6.x86_64 #1 Bochs Bochs

RIP: 0010:[81282f00]  [81282f00] list_del+0x10/0xa0

RSP: 0018:880037547a78  EFLAGS: 00010096

RAX: dead00200200 RBX: eaceb940 RCX: 

RDX: 0010 RSI: 88003edd00d0 RDI: eaceb940

RBP: 880037547a88 R08:  R09: 

R10:  R11:  R12: 88003edd00c0

R13: 880116c0 R14: 362e R15: eaceb918

FS:  7fc44b7cc700() GS:88000230() knlGS:

CS:  0010 DS:  ES:  CR0: 80050033

CR2: 7fc44c5aba10 CR3: 3dc44000 CR4: 06e0

DR0:  DR1:  DR2: 

DR3:  DR6: 0ff0 DR7: 0400

Process rs:main Q:Reg (pid: 1173, threadinfo 880037546000, task
880037062ae0)

Stack:

 0282 0001 880037547ba8 811258a8

d 880037547ab8  0001 88003728b400

d 00c7f118 0040  88033c28

Call Trace:

 [811258a8] get_page_from_freelist+0x288/0x820

 [a00869f6] ? jbd2_journal_stop+0x1e6/0x2b0 [jbd2]

 [81126f31] __alloc_pages_nodemask+0x111/0x940

 [81161d62] kmem_getpages+0x62/0x170

 [811623cf] cache_grow+0x2cf/0x320

 [81162622] 

[PATCH] kvm/powerpc/e500mc: fix tlb invalidation on cpu migration

2013-02-18 Thread Scott Wood
The existing check handles the case where we've migrated to a different
core than we last ran on, but it doesn't handle the case where we're
still on the same cpu we last ran on, but some other vcpu has run on
this cpu in the meantime.

Signed-off-by: Scott Wood scottw...@freescale.com
---
This seems to have been the cause of the userspace segfaults I was
seeing (the other TLB patches I posted are still needed as well).

 arch/powerpc/kvm/e500mc.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
index 1f89d26..8637689 100644
--- a/arch/powerpc/kvm/e500mc.c
+++ b/arch/powerpc/kvm/e500mc.c
@@ -111,6 +111,7 @@ void kvmppc_mmu_msr_notify(struct kvm_vcpu *vcpu, u32 
old_msr)
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+   static struct kvm_vcpu *last_vcpu_on_cpu[NR_CPUS];
 
kvmppc_booke_vcpu_load(vcpu, cpu);
 
@@ -136,8 +137,11 @@ void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
mtspr(SPRN_GDEAR, vcpu-arch.shared-dar);
mtspr(SPRN_GESR, vcpu-arch.shared-esr);
 
-   if (vcpu-arch.oldpir != mfspr(SPRN_PIR))
+   if (vcpu-arch.oldpir != mfspr(SPRN_PIR) ||
+   last_vcpu_on_cpu[smp_processor_id()] != vcpu) {
kvmppc_e500_tlbil_all(vcpu_e500);
+   last_vcpu_on_cpu[smp_processor_id()] = vcpu;
+   }
 
kvmppc_load_guest_fp(vcpu);
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


buildbot failure in kvm on next-i386

2013-02-18 Thread kvm
The Buildbot has detected a new failure on builder next-i386 while building kvm.
Full details are available at:
 http://buildbot.b1-systems.de/kvm/builders/next-i386/builds/804

Buildbot URL: http://buildbot.b1-systems.de/kvm/

Buildslave for this Build: b1_kvm_1

Build Reason: The Nightly scheduler named 'nightly_next' triggered this build
Build Source Stamp: [branch next] HEAD
Blamelist: 

BUILD FAILED: failed git

sincerely,
 -The Buildbot



buildbot failure in kvm on next-ppc44x

2013-02-18 Thread kvm
The Buildbot has detected a new failure on builder next-ppc44x while building 
kvm.
Full details are available at:
 http://buildbot.b1-systems.de/kvm/builders/next-ppc44x/builds/804

Buildbot URL: http://buildbot.b1-systems.de/kvm/

Buildslave for this Build: b1_kvm_1

Build Reason: The Nightly scheduler named 'nightly_next' triggered this build
Build Source Stamp: [branch next] HEAD
Blamelist: 

BUILD FAILED: failed git

sincerely,
 -The Buildbot



buildbot failure in kvm on next-ppc64

2013-02-18 Thread kvm
The Buildbot has detected a new failure on builder next-ppc64 while building 
kvm.
Full details are available at:
 http://buildbot.b1-systems.de/kvm/builders/next-ppc64/builds/805

Buildbot URL: http://buildbot.b1-systems.de/kvm/

Buildslave for this Build: b1_kvm_1

Build Reason: The Nightly scheduler named 'nightly_next' triggered this build
Build Source Stamp: [branch next] HEAD
Blamelist: 

BUILD FAILED: failed git

sincerely,
 -The Buildbot

N�r��yb�X��ǧv�^�)޺{.n�+h����ܨ}���Ơz�j:+v���zZ+��+zf���h���~i���z��w���?��)ߢf

buildbot failure in kvm on next-x86_64

2013-02-18 Thread kvm
The Buildbot has detected a new failure on builder next-x86_64 while building 
kvm.
Full details are available at:
 http://buildbot.b1-systems.de/kvm/builders/next-x86_64/builds/804

Buildbot URL: http://buildbot.b1-systems.de/kvm/

Buildslave for this Build: b1_kvm_1

Build Reason: The Nightly scheduler named 'nightly_next' triggered this build
Build Source Stamp: [branch next] HEAD
Blamelist: 

BUILD FAILED: failed git

sincerely,
 -The Buildbot



Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Christoffer Dall
On Mon, Feb 18, 2013 at 4:53 PM, Scott Wood scottw...@freescale.com wrote:
 On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:

 On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com
 wrote:
  index 0350e0d..dbaf012 100644
  --- a/include/linux/kvm_host.h
  +++ b/include/linux/kvm_host.h
  @@ -335,6 +335,25 @@ struct kvm_memslots {
  short id_to_index[KVM_MEM_SLOTS_NUM];
   };
 
  +/*
  + * The worst case number of simultaneous devices will likely be very
  low
  + * (usually zero or one) for the forseeable future.  If the worst case
  + * exceeds this, then it can be increased, or we can convert to idr.
  + */

 This comment is on the heavy side (if at all needed). If you want to
 remind people of idr, just put that in a single line. A define is a
 define is a define.


 OK.


  +#define KVM_CREATE_DEVICE_IOWR(KVMIO,  0xac, struct
  kvm_create_device)
  +#define KVM_SET_DEVICE_ATTR  _IOW(KVMIO,  0xad, struct
  kvm_device_attr)
  +#define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xae, struct
  kvm_device_attr)

 _IOWR ?


 struct kvm_device_attr itself is write-only, though the data pointed to by
 the addr field goes the other way for GET.  ONE_REG is in the same situation
 and also uses _IOW for both.



ok.

Btw., what about the size of the attr? implicitly defined through the attr id?

  +static int kvm_ioctl_create_device(struct kvm *kvm,
  +  struct kvm_create_device *cd)
  +{
  +   struct kvm_device *dev = NULL;
  +   bool test = cd-flags  KVM_CREATE_DEVICE_TEST;
  +   int id;
  +   int r;
  +
  +   mutex_lock(kvm-lock);
  +
  +   id = kvm-num_devices;
  +   if (id = KVM_MAX_DEVICES  !test) {
  +   r = -ENOSPC;
  +   goto out;
  +   }
  +
  +   switch (cd-type) {
  +   default:
  +   r = -ENODEV;
  +   goto out;
  +   }

 do we really believe that there will be any arch-generic recognition
 of types? shouldn't this be a call to an arch-specific function
 instead. Which makes me wonder whether the device type IDs should be
 arch specific as well...


 I prefer to look at it from the other direction -- is there any reason why
 this *should* be architecture specific?  What will that make easier?


The fact that you don't have to create static inlines for the
architectures that don't define the functions that get called or have
to similar #ifdef tricks, and I also think it's easier to read the
arch-specific bits of the code that way, instead of some arbitrary
function that you have to trace through to figure out where it's
called from.

 By doing device recognition here we don't need a separate copy of this per
 arch (including some #ifdef or modifying every arch at once -- including ARM
 which I can't modify yet because it's not merged), and *if* we should end up
 with an in-kernel-emulated device that gets used across multiple
 architectures, it would be annoying to have to modify all relevant
 architectures (and worse to deal with per-arch numberspaces).

I would say that's exactly what you're going to need with your approach:

switch (cd-type) {
case KVM_ARM_VGIC_V2_0:
kvm_arm_vgic_v2_0_create(...);
}


are you going to ifdef here in this function, or? I think it's cleaner
to have the single arch-specific hooks and handle the cases there.

The use case of having a single device which is so central to the
system that we emulate it inside the kernel and is shared across
multiple archs is pretty far fetched to me.

However, this is internal and can always be changed, so if everyone
agrees on the overall API, whichever way you implement it is fine with
me.


 -Scott

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Bug 54071] New: kvm guests stuck (not in io) and cannot be killed using -9

2013-02-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=54071

   Summary: kvm guests stuck (not in io) and cannot be killed
using -9
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.5.0-23.35
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: high
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: i...@corinlangosch.com
Regression: No


I have two stuck kvm guests hanging in state R (for several hours now) on two
different hosts. Actually they are not consuming any CPU and cannot be killed
with kill -9. They are not responding to anything. Both systems are Ubuntu
12.10 Quantal running on a AMD II X4 965 Processor with 16GB RAM.


Stack trace of the first stuck guest:

cat /proc/2647/stack 
[81683e06] retint_careful+0x14/0x32
[] 0x


Stack trace of the second stuck guest:

cat /proc/11932/stack
[81084d2a] __cond_resched+0x2a/0x40
[811544aa] try_to_unmap_file+0x3a/0x6d0
[811553c1] try_to_unmap+0x31/0x70
[811722a8] migrate_pages+0x318/0x500
[811453b4] compact_zone+0x1e4/0x350
[811458bf] try_to_compact_pages+0x16f/0x1d0
[81677a94] __alloc_pages_direct_compact+0xaa/0x191
[8112bee3] __alloc_pages_nodemask+0x473/0x920
[81164b30] alloc_pages_current+0xb0/0x120
[8116bf5d] new_slab+0x22d/0x2c0
[816790ea] __slab_alloc+0x314/0x46e
[811708a1] __kmalloc_node_track_caller+0xa1/0x1b0
[81566e99] __alloc_skb+0x79/0x230
[81560a31] sock_alloc_send_pskb+0x1d1/0x320
[81499ec1] tun_get_user+0x131/0x4a0
[8149a759] tun_chr_aio_write+0x69/0x90
[8118297a] do_sync_readv_writev+0xda/0x120
[81182c64] do_readv_writev+0xd4/0x1e0
[81182da5] vfs_writev+0x35/0x60
[81182f2a] sys_writev+0x4a/0xb0
[8168bb69] system_call_fastpath+0x16/0x1b
[] 0x

Packages:

Kernel: 3.5.0-23-generic #35-Ubuntu SMP Thu Jan 24 13:15:40 UTC 2013 x86_64
x86_64 x86_64 GNU/Linux
KVM: 1.2.0+noroms-0ubuntu2.12.10.2

Command line:

kvm -name vm-16762 -machine pc-1.2 -enable-kvm -cpu kvm64 -smp
sockets=1,cores=2 -m 1024 -vga cirrus -drive
id=drive1245,if=none,cache=writeback,aio=native,format=raw,media=disk,file=rbd:kvm1/vm-16762-disk-1
-device virtio-blk-pci,id=hdrive1245,addr=0x12,drive=drive1245 -netdev
tap,id=netdev1243,ifname=tap1243,script=,downscript= -device
virtio-net-pci,id=nic1243,addr=0x03,mac=02:48:6e:2b:4e:55,netdev=netdev1243
-chardev socket,id=qmp,path=/var/run/kvm/vm-16762/qmp,server,nowait -mon
chardev=qmp,mode=control -monitor unix:/var/run/kvm/vm-16762/mon,server,nowait
-vnc 10.0.0.60:21,password -usbdevice tablet -nodefaults -pidfile
/var/run/kvm/vm-16762/pid -boot menu=on -k de -chroot /var/run/kvm/vm-16762
-runas kvm -daemonize

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH V4 RESEND 09/22] net: multiqueue support

2013-02-18 Thread Jason Wang
On 02/13/2013 05:21 AM, Alexander Graf wrote:
 On 01.02.2013, at 08:39, Jason Wang wrote:

 This patch adds basic multiqueue support for qemu. The idea is simple, an 
 array
 of NetClientStates were introduced in NICState, parse_netdev() were extended 
 to
 find and match all NetClientStates belongs to the backend and place their
 pointers in NICConf. Then qemu_new_nic can setup a N:N mapping between 
 NICStates
 that belongs to a nic and NICStates belongs to the netdev. And a queue_index
 were introduced in NetClientState to track its index. After this, each peers 
 of
 a NICState were abstracted as a queue.

 After this change, all NetClientState that belongs to the same backend/nic 
 has
 the same id. When use want to change the link status, all NetClientStates 
 that
 belongs to the same backend/nic will be also changed. When user want to 
 delete
 a device or netdev, all NetClientStates that belongs to the same backend/nic
 will be deleted also. Changing or deleting an specific queue is not allowed.

 Signed-off-by: Jason Wang jasow...@redhat.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 hw/dp8393x.c|2 +-
 hw/mcf_fec.c|2 +-
 hw/qdev-properties-system.c |   46 +++---
 hw/qdev-properties.h|6 +-
 include/net/net.h   |   18 +--
 net/net.c   |  113 
 +++
 6 files changed, 139 insertions(+), 48 deletions(-)

 diff --git a/hw/dp8393x.c b/hw/dp8393x.c
 index 0273fad..808157b 100644
 --- a/hw/dp8393x.c
 +++ b/hw/dp8393x.c
 @@ -900,7 +900,7 @@ void dp83932_init(NICInfo *nd, hwaddr base, int it_shift,
 s-regs[SONIC_SR] = 0x0004; /* only revision recognized by Linux */

 s-conf.macaddr = nd-macaddr;
 -s-conf.peer = nd-netdev;
 +s-conf.peers.ncs[0] = nd-netdev;

 s-nic = qemu_new_nic(net_dp83932_info, s-conf, nd-model, nd-name, 
 s);

 diff --git a/hw/mcf_fec.c b/hw/mcf_fec.c
 index 909e32b..8e60f09 100644
 --- a/hw/mcf_fec.c
 +++ b/hw/mcf_fec.c
 @@ -472,7 +472,7 @@ void mcf_fec_init(MemoryRegion *sysmem, NICInfo *nd,
 memory_region_add_subregion(sysmem, base, s-iomem);

 s-conf.macaddr = nd-macaddr;
 -s-conf.peer = nd-netdev;
 +s-conf.peers.ncs[0] = nd-netdev;

 s-nic = qemu_new_nic(net_mcf_fec_info, s-conf, nd-model, nd-name, 
 s);

 diff --git a/hw/qdev-properties-system.c b/hw/qdev-properties-system.c
 index ce0f793..ce3af22 100644
 --- a/hw/qdev-properties-system.c
 +++ b/hw/qdev-properties-system.c
 @@ -173,16 +173,47 @@ PropertyInfo qdev_prop_chr = {

 static int parse_netdev(DeviceState *dev, const char *str, void **ptr)
 {
 -NetClientState *netdev = qemu_find_netdev(str);
 +NICPeers *peers_ptr = (NICPeers *)ptr;
 +NICConf *conf = container_of(peers_ptr, NICConf, peers);
 +NetClientState **ncs = peers_ptr-ncs;
 +NetClientState *peers[MAX_QUEUE_NUM];
 +int queues, i = 0;
 +int ret;

 -if (netdev == NULL) {
 -return -ENOENT;
 +queues = qemu_find_net_clients_except(str, peers,
 +  NET_CLIENT_OPTIONS_KIND_NIC,
 +  MAX_QUEUE_NUM);
 +if (queues == 0) {
 +ret = -ENOENT;
 +goto err;
 }
 -if (netdev-peer) {
 -return -EEXIST;
 +
 +if (queues  MAX_QUEUE_NUM) {
 +ret = -E2BIG;
 +goto err;
 +}
 +
 +for (i = 0; i  queues; i++) {
 +if (peers[i] == NULL) {
 +ret = -ENOENT;
 +goto err;
 +}
 +
 +if (peers[i]-peer) {
 +ret = -EEXIST;
 +goto err;
 +}
 +
 +ncs[i] = peers[i];
 +ncs[i]-queue_index = i;
 }
 -*ptr = netdev;
 +
 +conf-queues = queues;
 +
 return 0;
 +
 +err:
 +return ret;
 }

 static const char *print_netdev(void *ptr)
 @@ -249,7 +280,8 @@ static void set_vlan(Object *obj, Visitor *v, void 
 *opaque,
 {
 DeviceState *dev = DEVICE(obj);
 Property *prop = opaque;
 -NetClientState **ptr = qdev_get_prop_ptr(dev, prop);
 +NICPeers *peers_ptr = qdev_get_prop_ptr(dev, prop);
 +NetClientState **ptr = peers_ptr-ncs[0];
 Error *local_err = NULL;
 int32_t id;
 NetClientState *hubport;
 diff --git a/hw/qdev-properties.h b/hw/qdev-properties.h
 index ddcf774..20c67f3 100644
 --- a/hw/qdev-properties.h
 +++ b/hw/qdev-properties.h
 @@ -31,7 +31,7 @@ extern PropertyInfo qdev_prop_pci_host_devaddr;
 .name  = (_name),\
 .info  = (_prop),   \
 .offset= offsetof(_state, _field)\
 -+ type_check(_type,typeof_field(_state, _field)),\
 ++ type_check(_type, typeof_field(_state, _field)),   \
 }
 #define DEFINE_PROP_DEFAULT(_name, _state, _field, _defval, _prop, _type) { \
 .name  = (_name),   \
 @@ -77,9 +77,9 @@ 

[Bug 54071] kvm guests stuck (not in io) and cannot be killed using -9

2013-02-18 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=54071


Gleb g...@redhat.com changed:

   What|Removed |Added

 CC||g...@redhat.com




--- Comment #1 from Gleb g...@redhat.com  2013-02-19 07:55:17 ---
Report this to your OS vendor. Close this bugs unless it is reproducible on
upstream kernel.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] powerpc kvm: added multiple TCEs requests support

2013-02-18 Thread Alexey Kardashevskiy

On 15/02/13 14:24, Paul Mackerras wrote:

On Mon, Feb 11, 2013 at 11:12:41PM +1100, a...@ozlabs.ru wrote:


+static long emulated_h_put_tce(struct kvmppc_spapr_tce_table *stt,
+   unsigned long ioba, unsigned long tce)
+{
+   unsigned long idx = ioba  SPAPR_TCE_SHIFT;
+   struct page *page;
+   u64 *tbl;
+
+   /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p  window_size=0x%x\n, 
*/
+   /*  liobn, stt, stt-window_size); */
+   if (ioba = stt-window_size) {
+   pr_err(%s failed on ioba=%lx\n, __func__, ioba);


Doesn't this give the guest a way to spam the host logs?  And in fact
printk in real mode is potentially problematic.  I would just leave
out this statement.


+   return H_PARAMETER;
+   }
+
+   page = stt-pages[idx / TCES_PER_PAGE];
+   tbl = (u64 *)page_address(page);


I would like to see an explanation of why we are confident that
page_address() will work correctly in real mode, across all the
combinations of config options that we can have for a ppc64 book3s
kernel.


It was there before this patch, I just moved it so I would think it has 
been explained before :)


There is no combination on PPC to get WANT_PAGE_VIRTUAL enabled.
CONFIG_HIGHMEM is supported for PPC32 only so HASHED_PAGE_VIRTUAL is not 
enabled on PPC64 either.


So this definition is supposed to work on PPC64:
#define page_address(page) lowmem_page_address(page)

where lowmem_page_address() is arithmetic operation on a page struct address:
static __always_inline void *lowmem_page_address(const struct page *page)
{
return __va(PFN_PHYS(page_to_pfn(page)));
}

PPC32 will use page_address() from mm/highmem.c, I need some lesson about 
memory layout in 32bit but for now I cannot see how it can possibly fail here.





+
+   /* FIXME: Need to validate the TCE itself */
+   /* udbg_printf(tce @ %p\n, tbl[idx % TCES_PER_PAGE]); */
+   tbl[idx % TCES_PER_PAGE] = tce;
+
+   return H_SUCCESS;
+}
+
+/*
+ * Real mode handlers
   */
  long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
  unsigned long ioba, unsigned long tce)
  {
-   struct kvm *kvm = vcpu-kvm;
struct kvmppc_spapr_tce_table *stt;

-   /* udbg_printf(H_PUT_TCE(): liobn=0x%lx ioba=0x%lx, tce=0x%lx\n, */
-   /*  liobn, ioba, tce); */
+   stt = find_tce_table(vcpu, liobn);
+   /* Didn't find the liobn, put it to userspace */
+   if (!stt)
+   return H_TOO_HARD;
+
+   /* Emulated IO */
+   return emulated_h_put_tce(stt, ioba, tce);
+}
+
+long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_list, unsigned long npages)
+{
+   struct kvmppc_spapr_tce_table *stt;
+   long i, ret = 0;
+   unsigned long *tces;
+
+   stt = find_tce_table(vcpu, liobn);
+   /* Didn't find the liobn, put it to userspace */
+   if (!stt)
+   return H_TOO_HARD;

-   list_for_each_entry(stt, kvm-arch.spapr_tce_tables, list) {
-   if (stt-liobn == liobn) {
-   unsigned long idx = ioba  SPAPR_TCE_SHIFT;
-   struct page *page;
-   u64 *tbl;
+   tces = (void *) get_real_address(vcpu, tce_list, false, NULL, NULL);
+   if (!tces)
+   return H_TOO_HARD;

-   /* udbg_printf(H_PUT_TCE: liobn 0x%lx = stt=%p  
window_size=0x%x\n, */
-   /*  liobn, stt, stt-window_size); */
-   if (ioba = stt-window_size)
-   return H_PARAMETER;
+   /* Emulated IO */
+   for (i = 0; (i  npages)  !ret; ++i, ioba += IOMMU_PAGE_SIZE)
+   ret = emulated_h_put_tce(stt, ioba, tces[i]);


So, tces is a pointer to somewhere inside a real page.  Did we check
somewhere that tces[npages-1] is in the same page as tces[0]?  If so,
I missed it.  If we didn't, then we probably should check and do
something about it.



-   page = stt-pages[idx / TCES_PER_PAGE];
-   tbl = (u64 *)page_address(page);
+   return ret;
+}

-   /* FIXME: Need to validate the TCE itself */
-   /* udbg_printf(tce @ %p\n, tbl[idx % 
TCES_PER_PAGE]); */
-   tbl[idx % TCES_PER_PAGE] = tce;
-   return H_SUCCESS;
-   }
-   }
+long kvmppc_h_stuff_tce(struct kvm_vcpu *vcpu,
+   unsigned long liobn, unsigned long ioba,
+   unsigned long tce_value, unsigned long npages)
+{
+   struct kvmppc_spapr_tce_table *stt;
+   long i, ret = 0;
+
+   stt = find_tce_table(vcpu, liobn);
+   /* Didn't find the liobn, put it to userspace */
+   if (!stt)
+   return H_TOO_HARD;

-   /* Didn't find the liobn, punt it to userspace */
-   return H_TOO_HARD;
+   /* Emulated 

Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip

2013-02-18 Thread Gleb Natapov
Can you tell us why mpic should be in kernel? Is it used often by modern
guests or may be they prefer MSI for interrupt delivery (hmm may be MSIs
are delivered through mpic too)? On x86 we actually would've preferred
to move PIC/IOAPIC form the kernel and leave only LAPIC there (but for
historical reasons creation of PIC/IOAPIC/LAPIC are bundled together)
hence my question.

On Wed, Feb 13, 2013 at 11:49:14PM -0600, Scott Wood wrote:
 Scott Wood (6):
   kvm: add device control API
   kvm/ppc: add a notifier chain for vcpu creation/destruction.
   kvm/ppc/mpic: import hw/openpic.c from QEMU
   kvm/ppc/mpic: remove some obviously unneeded code
   kvm/ppc/mpic: adapt to kernel style and environment
   kvm/ppc/mpic: in-kernel MPIC emulation
 
  Documentation/virtual/kvm/api.txt  |   76 ++
  Documentation/virtual/kvm/devices/README   |1 +
  Documentation/virtual/kvm/devices/mpic.txt |   36 +
  arch/powerpc/include/asm/kvm_host.h|9 +-
  arch/powerpc/include/asm/kvm_ppc.h |4 +
  arch/powerpc/kvm/Kconfig   |5 +
  arch/powerpc/kvm/Makefile  |2 +
  arch/powerpc/kvm/booke.c   |   10 +-
  arch/powerpc/kvm/mpic.c| 1890 
 
  arch/powerpc/kvm/powerpc.c |   31 +-
  include/linux/kvm_host.h   |   44 +-
  include/uapi/linux/kvm.h   |   34 +
  virt/kvm/kvm_main.c|  141 +++
  13 files changed, 2268 insertions(+), 15 deletions(-)
  create mode 100644 Documentation/virtual/kvm/devices/README
  create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
  create mode 100644 arch/powerpc/kvm/mpic.c
 
 -- 
 1.7.9.5
 
 
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Gleb Natapov
Copying Christoffer since ARM has in kernel irq chip too.

On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
 Currently, devices that are emulated inside KVM are configured in a
 hardcoded manner based on an assumption that any given architecture
 only has one way to do it.  If there's any need to access device state,
 it is done through inflexible one-purpose-only IOCTLs (e.g.
 KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
 cumbersome and depletes a limited numberspace.
 
 This API provides a mechanism to instantiate a device of a certain
 type, returning an ID that can be used to set/get attributes of the
 device.  Attributes may include configuration parameters (e.g.
 register base address), device state, operational commands, etc.  It
 is similar to the ONE_REG API, except that it acts on devices rather
 than vcpus.
You are not only provide different way to create in kernel irq chip you
also use an alternate way to trigger interrupt lines. Before going into
interface specifics lets think about whether it is really worth it? x86
obviously support old way and will have to for some, very long, time.
ARM vGIC code, that is ready to go upstream, uses old way too. So it will
be 2 archs against one. Christoffer do you think the proposed way it
better for your needs. Are you willing to make vGIC use it?

Scott, what other devices are you planning to support with this
interface?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-18 Thread Scott Wood

On 02/15/2013 10:51:16 PM, Paul Mackerras wrote:

On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
 On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
 I have no particular objection to the device control API per se,  
but
 I have two objections to using it as the primary interface to the  
XICS

 emulation.
 
 First, I dislike the magical side-effect where creating a device  
of a
 particular type (e.g. MPIC or XICS) automatically attaches it to  
the

 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control.

 OK.  This is device-specific behavior, so you could define it
 differently for XICS than MPIC.  I suppose we could change it for
 MPIC as well, to leave an opening for the unlikely case where we'd
 want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.


OK.


 How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic create device ioctl that could be for any device.


Hooking up to the CPU's interrupt lines is implicit in creating an MPIC  
(and I'm fine with changing that), not in creating any device.  I don't  
see how it's worse than being implicit in calling  
KVM_CREATE_IRQCHIP_ARGS (which doesn't allow for cascaded irqchips).



 The standardized interface doesn't make things any easier (as noted
 above, the caller is already mpic-specific code), and we'd have to
 come up with a scheme for flattening our interrupt numberspace
 (rather than introduce new attribute groups for things like IPI and
 timer interrupts).  It may still be necessary when it comes to
 irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.


They do need to appear in the interrupt source space if we want to  
inject or irqfd them.  Most users won't want to do that, but we have  
had a customer directly assign IPIs (to communicate with an OS running  
on the other CPU in an AMP setup -- host Linux was non-SMP so wasn't  
using them) and MPIC timers to a guest.



  How do MSIs get injected?
 
 Just like other interrupts - from the point of view of the  
interrupt

 controller they're edge-triggered interrupt sources.

 Ah right, I guess this is all set up via hcalls for XICS.

 With MPIC exposing its registers via the device control api,
 everything just works -- the PCI device generates a write to the
 MPIC's memory region, the QEMU MPIC stub sends the write to the
 kernel as for any other MMIO access (this passthrough is also useful
 for debugging), the in-kernel MPIC sees the write to the generate
 an MSI register and does its thing.  Compare that to all special
 the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?


No.  Accesses by the guest get handled in the kernel.  Accesses in  
QEMU, including MSIs generated by virtio, get forwarded to the kernel.



 There are plenty of bits free in the 64 bits per source that I have
 allowed.  We can accommodate those things.

 MPIC vector numbers take up 16 of the bits.  The architected
 interrupt level field is 8 bits, though only a handful of values are
 actually needed.  Add a couple binary flags, and it gets pretty
 tight if a third type of interrupt controller starts wanting
 something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.


I view anything other than passing the actual MPIC register values  
around as overdesign here, given that it is communication between  
hw/kvm/mpic.c on the QEMU side and arch/powerpc/kvm/mpic.c on the  
kernel side.



 interrupts awaiting service (or even in service, if different
 priorities).  We have both current task priority (which is a
 user-set mask-by-priority register) and the priority of the
 highest-prio in-service interrupt -- which would current processor
 priorty be?  Etc.

It would be the current task priority.  I assume MPIC maintains a
16-bit map of the interrupt priorities in service, so that would need
to be added.


We don't maintain such a map in the emulation code.  We have a per-CPU  
bitmap of the actual interrupt sources pending/active, which is another  
attribute that would need to be added in order to support migration on  
MPIC.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Scott Wood

On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:

Copying Christoffer since ARM has in kernel irq chip too.

On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
 Currently, devices that are emulated inside KVM are configured in a
 hardcoded manner based on an assumption that any given architecture
 only has one way to do it.  If there's any need to access device  
state,

 it is done through inflexible one-purpose-only IOCTLs (e.g.
 KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
 cumbersome and depletes a limited numberspace.

 This API provides a mechanism to instantiate a device of a certain
 type, returning an ID that can be used to set/get attributes of the
 device.  Attributes may include configuration parameters (e.g.
 register base address), device state, operational commands, etc.  It
 is similar to the ONE_REG API, except that it acts on devices rather
 than vcpus.
You are not only provide different way to create in kernel irq chip  
you
also use an alternate way to trigger interrupt lines. Before going  
into

interface specifics lets think about whether it is really worth it?


Which it do you mean here?

The ability to set/get attributes is needed.  Sorry, but get or set  
one blob of data, up to 512 bytes, for the entire irqchip is just not  
good enough -- assuming you don't want us to start sticking pointers  
and commands in *that* data. :-)


If you mean the way to inject interrupts, it's simpler this way.  Why  
go out of our way to inject common glue code into a communication path  
between hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM?  Or  
rather, why make that common glue be specific to this one function when  
we could reuse the same communication glue used for other things, such  
as device state?


And that's just for regular interrupts.  MSIs are vastly simpler on  
MPIC than what x86 does.


x86 obviously support old way and will have to for some, very long,  
time.


Sure.

ARM vGIC code, that is ready to go upstream, uses old way too. So it  
will

be 2 archs against one.


I wasn't aware that that's how it worked. :-P

I was trying to be considerate by not making the entire thing  
gratuitously PPC or MPIC specific, as some others seem inclined to do  
(e.g. see irqfd and APIC).  We already had a discussion on ARM's set  
address ioctl and rather than extend *that* interface, they preferred  
to just stick something ARM-specific in ASAP with the understanding  
that it would be replaced (or more accurately, kept around as a thin  
wrapper around the new stuff) later.



Christoffer do you think the proposed way it
better for your needs. Are you willing to make vGIC use it?

Scott, what other devices are you planning to support with this
interface?


At the moment I do not have plans for other devices, though what does  
it hurt for the capability to be there?


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 0/6] kvm/ppc/mpic: in-kernel irqchip

2013-02-18 Thread Scott Wood

On 02/18/2013 06:04:51 AM, Gleb Natapov wrote:
Can you tell us why mpic should be in kernel? Is it used often by  
modern
guests or may be they prefer MSI for interrupt delivery (hmm may be  
MSIs

are delivered through mpic too)?


Yes, MSIs are delivered through the mpic.

Plus, MSIs are only (normally) for PCI(e).  We have embedded system on  
a chips with important non-PCI devices, which cannot use MSIs.



On x86 we actually would've preferred
to move PIC/IOAPIC form the kernel and leave only LAPIC there (but for
historical reasons creation of PIC/IOAPIC/LAPIC are bundled together)
hence my question.


We don't have that same split on this hardware.  MPIC is one device  
that covers all of it.  Some of the functionality is per-CPU, but it is  
not easily extracted from the rest.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Christoffer Dall
On Mon, Feb 18, 2013 at 3:01 PM, Scott Wood scottw...@freescale.com wrote:
 On 02/18/2013 06:21:59 AM, Gleb Natapov wrote:

 Copying Christoffer since ARM has in kernel irq chip too.

 On Wed, Feb 13, 2013 at 11:49:15PM -0600, Scott Wood wrote:
  Currently, devices that are emulated inside KVM are configured in a
  hardcoded manner based on an assumption that any given architecture
  only has one way to do it.  If there's any need to access device state,
  it is done through inflexible one-purpose-only IOCTLs (e.g.
  KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
  cumbersome and depletes a limited numberspace.
 
  This API provides a mechanism to instantiate a device of a certain
  type, returning an ID that can be used to set/get attributes of the
  device.  Attributes may include configuration parameters (e.g.
  register base address), device state, operational commands, etc.  It
  is similar to the ONE_REG API, except that it acts on devices rather
  than vcpus.
 You are not only provide different way to create in kernel irq chip you
 also use an alternate way to trigger interrupt lines. Before going into
 interface specifics lets think about whether it is really worth it?


 Which it do you mean here?

 The ability to set/get attributes is needed.  Sorry, but get or set one
 blob of data, up to 512 bytes, for the entire irqchip is just not good
 enough -- assuming you don't want us to start sticking pointers and commands
 in *that* data. :-)

 If you mean the way to inject interrupts, it's simpler this way.  Why go out
 of our way to inject common glue code into a communication path between
 hw/kvm/mpic.c in QEMU and arch/powerpc/kvm/mpic.c in KVM?  Or rather, why
 make that common glue be specific to this one function when we could reuse
 the same communication glue used for other things, such as device state?

 And that's just for regular interrupts.  MSIs are vastly simpler on MPIC
 than what x86 does.


 x86 obviously support old way and will have to for some, very long, time.


 Sure.


 ARM vGIC code, that is ready to go upstream, uses old way too. So it will
 be 2 archs against one.


 I wasn't aware that that's how it worked. :-P

 I was trying to be considerate by not making the entire thing gratuitously
 PPC or MPIC specific, as some others seem inclined to do (e.g. see irqfd and
 APIC).  We already had a discussion on ARM's set address ioctl and rather
 than extend *that* interface, they preferred to just stick something
 ARM-specific in ASAP with the understanding that it would be replaced (or
 more accurately, kept around as a thin wrapper around the new stuff) later.


 Christoffer do you think the proposed way it
 better for your needs. Are you willing to make vGIC use it?


Well it won't improve functionality much from the current hardware
point of view, but the proposed interface is superior to what we have
now. Adding and coordinating new interfaces is indeed a pain, so a
generic interface which is flexible enough to cater for a certain
group of needs, is welcome imho. And this does seem to fit the bill.

I can imagine that if there's support for a future ARM gic version or
if we add in-kernel support for other stuff on ARM, then this
interface will be useful, and in fact using the current interface to
support two separate, but similar, interrupt controllers could get
messy from a user space point of view.

I am definitely willing to change to use this interface, the agreement
on the KVM_ARM_SET_DEVICE_ADDR ioctl was exactly because of this.

I had some nits on the RFC, which I'll send separately.

 Scott, what other devices are you planning to support with this
 interface?


 At the moment I do not have plans for other devices, though what does it
 hurt for the capability to be there?

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Christoffer Dall
On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com wrote:
 Currently, devices that are emulated inside KVM are configured in a
 hardcoded manner based on an assumption that any given architecture
 only has one way to do it.  If there's any need to access device state,
 it is done through inflexible one-purpose-only IOCTLs (e.g.
 KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
 cumbersome and depletes a limited numberspace.

 This API provides a mechanism to instantiate a device of a certain
 type, returning an ID that can be used to set/get attributes of the
 device.  Attributes may include configuration parameters (e.g.
 register base address), device state, operational commands, etc.  It
 is similar to the ONE_REG API, except that it acts on devices rather
 than vcpus.

 Both device types and individual attributes can be tested without having
 to create the device or get/set the attribute, without the need for
 separately managing enumerated capabilities.

 Signed-off-by: Scott Wood scottw...@freescale.com
 ---
  Documentation/virtual/kvm/api.txt|   76 ++
  Documentation/virtual/kvm/devices/README |1 +
  include/linux/kvm_host.h |   21 +
  include/uapi/linux/kvm.h |   25 ++
  virt/kvm/kvm_main.c  |  127 
 ++
  5 files changed, 250 insertions(+)
  create mode 100644 Documentation/virtual/kvm/devices/README

 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index c2534c3..5bcdb42 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2122,6 +2122,82 @@ header; first `n_valid' valid entries with contents 
 from the data
  written, then `n_invalid' invalid entries, invalidating any previously
  valid entries found.

 +4.79 KVM_CREATE_DEVICE
 +
 +Capability: KVM_CAP_DEVICE_CTRL
 +Type: vm ioctl
 +Parameters: struct kvm_create_device (in/out)
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device type is unknown or unsupported
 +  EEXIST: Device already created, and this type of device may not
 +  be instantiated multiple times
 +  ENOSPC: Too many devices have been created
 +
 +  Other error conditions may be defined by individual device types.
 +
 +Creates an emulated device in the kernel.  The returned handle
 +can be used with KVM_SET/GET_DEVICE_ATTR.
 +
 +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
 +device type is supported (not necessarily whether it can be created
 +in the current vm).
 +
 +Individual devices should not define flags.  Attributes should be used
 +for specifying any behavior that is not implied by the device type
 +number.
 +
 +struct kvm_create_device {
 +   __u32   type;   /* in: KVM_DEV_TYPE_xxx */
 +   __u32   id; /* out: device handle */
 +   __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
 +};
 +
 +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
 +
 +Capability: KVM_CAP_DEVICE_CTRL
 +Type: vm ioctl
 +Parameters: struct kvm_device_attr
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device id is invalid
 +  ENXIO:  The group or attribute is unknown/unsupported for this device
 +  EPERM:  The attribute cannot (currently) be accessed this way
 +  (e.g. read-only attribute, or attribute that only makes
 +  sense when the device is in a different state)
 +
 +  Other error conditions may be defined by individual device types.
 +
 +Gets/sets a specified piece of device configuration and/or state.  The
 +semantics are device-specific except for certain global attributes.  See
 +individual device documentation in the devices directory.  As with
 +ONE_REG, the size of the data transferred is defined by the particular
 +attribute.
 +
 +Attributes in group KVM_DEV_ATTR_COMMON are not device-specific:
 +   KVM_DEV_ATTR_TYPE (ro, 32-bit): the device type passed to 
 KVM_CREATE_DEVICE
 +
 +struct kvm_device_attr {
 +   __u32   dev;/* id from KVM_CREATE_DEVICE */
 +   __u32   group;  /* KVM_DEV_ATTR_COMMON or device-defined */
 +   __u64   attr;   /* group-defined */
 +   __u64   addr;   /* userspace address of attr data */
 +};
 +
 +4.81 KVM_HAS_DEVICE_ATTR
 +
 +Capability: KVM_CAP_DEVICE_CTRL
 +Type: vm ioctl
 +Parameters: struct kvm_device_attr
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device id is invalid
 +  ENXIO:  The group or attribute is unknown/unsupported for this device
 +
 +Tests whether a device supports a particular attribute.  A successful
 +return indicates the attribute is implemented.  It does not necessarily
 +indicate that the attribute can be read or written in the device's
 +current state.  addr is ignored.

  5. The kvm_run structure
  
 diff --git a/Documentation/virtual/kvm/devices/README 
 b/Documentation/virtual/kvm/devices/README
 new file mode 100644
 index 

Re: [PULL 00/14] ppc patch queue 2013-02-15

2013-02-18 Thread Marcelo Tosatti
Pulled, thanks.

On Fri, Feb 15, 2013 at 01:16:16AM +0100, Alexander Graf wrote:
 Hi Marcelo / Gleb,
 
 This is my current patch queue for ppc.  Please pull.
 
 Highlights of this queue drop are:
 
   - BookE: Fast mapping support for 4k backed memory
   - BookE: Handle alignment interrupts
 
 Alex
 
 
 The following changes since commit cbd29cb6e38af6119df2cdac0c58acf0e85c177e:
   Jan Kiszka (1):
 KVM: nVMX: Remove redundant get_vmcs12 from 
 nested_vmx_exit_handled_msr
 
 are available in the git repository at:
 
   git://github.com/agraf/linux-2.6.git kvm-ppc-next
 
 Alexander Graf (11):
   KVM: PPC: E500: Move write_stlbe higher
   KVM: PPC: E500: Explicitly mark shadow maps invalid
   KVM: PPC: E500: Propagate errors when shadow mapping
   KVM: PPC: e500: Call kvmppc_mmu_map for initial mapping
   KVM: PPC: E500: Split host and guest MMU parts
   KVM: PPC: e500: Implement TLB1-in-TLB0 mapping
   KVM: PPC: E500: Make clear_tlb_refs and clear_tlb1_bitmap static
   KVM: PPC: E500: Remove kvmppc_e500_tlbil_all usage from guest TLB code
   Merge commit 'origin/next' into kvm-ppc-next
   KVM: PPC: BookE: Handle alignment interrupts
   Merge commit 'origin/next' into kvm-ppc-next
 
 Bharat Bhushan (3):
   KVM: PPC: booke: use vcpu reference from thread_struct
   KVM: PPC: booke: Allow multiple exception types
   booke: Added DBCR4 SPR number
 
  arch/powerpc/include/asm/kvm_ppc.h   |2 -
  arch/powerpc/include/asm/reg.h   |2 -
  arch/powerpc/include/asm/reg_booke.h |1 +
  arch/powerpc/kernel/asm-offsets.c|2 +-
  arch/powerpc/kvm/Makefile|9 +-
  arch/powerpc/kvm/booke.c |   30 +-
  arch/powerpc/kvm/booke.h |1 +
  arch/powerpc/kvm/booke_interrupts.S  |   49 +-
  arch/powerpc/kvm/e500.c  |   16 +-
  arch/powerpc/kvm/e500.h  |1 +
  arch/powerpc/kvm/e500_mmu.c  |  809 +++
  arch/powerpc/kvm/e500_mmu_host.c |  699 +
  arch/powerpc/kvm/e500_mmu_host.h |   18 +
  arch/powerpc/kvm/e500_tlb.c  | 1430 
 --
  14 files changed, 1610 insertions(+), 1459 deletions(-)
  create mode 100644 arch/powerpc/kvm/e500_mmu.c
  create mode 100644 arch/powerpc/kvm/e500_mmu_host.c
  create mode 100644 arch/powerpc/kvm/e500_mmu_host.h
  delete mode 100644 arch/powerpc/kvm/e500_tlb.c
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Scott Wood

On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:
On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com  
wrote:

 index 0350e0d..dbaf012 100644
 --- a/include/linux/kvm_host.h
 +++ b/include/linux/kvm_host.h
 @@ -335,6 +335,25 @@ struct kvm_memslots {
 short id_to_index[KVM_MEM_SLOTS_NUM];
  };

 +/*
 + * The worst case number of simultaneous devices will likely be  
very low
 + * (usually zero or one) for the forseeable future.  If the worst  
case
 + * exceeds this, then it can be increased, or we can convert to  
idr.

 + */

This comment is on the heavy side (if at all needed). If you want to
remind people of idr, just put that in a single line. A define is a
define is a define.


OK.

 +#define KVM_CREATE_DEVICE_IOWR(KVMIO,  0xac, struct  
kvm_create_device)
 +#define KVM_SET_DEVICE_ATTR  _IOW(KVMIO,  0xad, struct  
kvm_device_attr)
 +#define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xae, struct  
kvm_device_attr)


_IOWR ?


struct kvm_device_attr itself is write-only, though the data pointed to  
by the addr field goes the other way for GET.  ONE_REG is in the same  
situation and also uses _IOW for both.



 +static int kvm_ioctl_create_device(struct kvm *kvm,
 +  struct kvm_create_device *cd)
 +{
 +   struct kvm_device *dev = NULL;
 +   bool test = cd-flags  KVM_CREATE_DEVICE_TEST;
 +   int id;
 +   int r;
 +
 +   mutex_lock(kvm-lock);
 +
 +   id = kvm-num_devices;
 +   if (id = KVM_MAX_DEVICES  !test) {
 +   r = -ENOSPC;
 +   goto out;
 +   }
 +
 +   switch (cd-type) {
 +   default:
 +   r = -ENODEV;
 +   goto out;
 +   }

do we really believe that there will be any arch-generic recognition
of types? shouldn't this be a call to an arch-specific function
instead. Which makes me wonder whether the device type IDs should be
arch specific as well...


I prefer to look at it from the other direction -- is there any reason  
why this *should* be architecture specific?  What will that make easier?


By doing device recognition here we don't need a separate copy of this  
per arch (including some #ifdef or modifying every arch at once --  
including ARM which I can't modify yet because it's not merged), and  
*if* we should end up with an in-kernel-emulated device that gets used  
across multiple architectures, it would be annoying to have to modify  
all relevant architectures (and worse to deal with per-arch  
numberspaces).


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm/powerpc/e500mc: fix tlb invalidation on cpu migration

2013-02-18 Thread Scott Wood
The existing check handles the case where we've migrated to a different
core than we last ran on, but it doesn't handle the case where we're
still on the same cpu we last ran on, but some other vcpu has run on
this cpu in the meantime.

Signed-off-by: Scott Wood scottw...@freescale.com
---
This seems to have been the cause of the userspace segfaults I was
seeing (the other TLB patches I posted are still needed as well).

 arch/powerpc/kvm/e500mc.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
index 1f89d26..8637689 100644
--- a/arch/powerpc/kvm/e500mc.c
+++ b/arch/powerpc/kvm/e500mc.c
@@ -111,6 +111,7 @@ void kvmppc_mmu_msr_notify(struct kvm_vcpu *vcpu, u32 
old_msr)
 void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
struct kvmppc_vcpu_e500 *vcpu_e500 = to_e500(vcpu);
+   static struct kvm_vcpu *last_vcpu_on_cpu[NR_CPUS];
 
kvmppc_booke_vcpu_load(vcpu, cpu);
 
@@ -136,8 +137,11 @@ void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
mtspr(SPRN_GDEAR, vcpu-arch.shared-dar);
mtspr(SPRN_GESR, vcpu-arch.shared-esr);
 
-   if (vcpu-arch.oldpir != mfspr(SPRN_PIR))
+   if (vcpu-arch.oldpir != mfspr(SPRN_PIR) ||
+   last_vcpu_on_cpu[smp_processor_id()] != vcpu) {
kvmppc_e500_tlbil_all(vcpu_e500);
+   last_vcpu_on_cpu[smp_processor_id()] = vcpu;
+   }
 
kvmppc_load_guest_fp(vcpu);
 }
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/6] kvm: add device control API

2013-02-18 Thread Christoffer Dall
On Mon, Feb 18, 2013 at 4:53 PM, Scott Wood scottw...@freescale.com wrote:
 On 02/18/2013 06:44:20 PM, Christoffer Dall wrote:

 On Wed, Feb 13, 2013 at 9:49 PM, Scott Wood scottw...@freescale.com
 wrote:
  index 0350e0d..dbaf012 100644
  --- a/include/linux/kvm_host.h
  +++ b/include/linux/kvm_host.h
  @@ -335,6 +335,25 @@ struct kvm_memslots {
  short id_to_index[KVM_MEM_SLOTS_NUM];
   };
 
  +/*
  + * The worst case number of simultaneous devices will likely be very
  low
  + * (usually zero or one) for the forseeable future.  If the worst case
  + * exceeds this, then it can be increased, or we can convert to idr.
  + */

 This comment is on the heavy side (if at all needed). If you want to
 remind people of idr, just put that in a single line. A define is a
 define is a define.


 OK.


  +#define KVM_CREATE_DEVICE_IOWR(KVMIO,  0xac, struct
  kvm_create_device)
  +#define KVM_SET_DEVICE_ATTR  _IOW(KVMIO,  0xad, struct
  kvm_device_attr)
  +#define KVM_GET_DEVICE_ATTR  _IOW(KVMIO,  0xae, struct
  kvm_device_attr)

 _IOWR ?


 struct kvm_device_attr itself is write-only, though the data pointed to by
 the addr field goes the other way for GET.  ONE_REG is in the same situation
 and also uses _IOW for both.



ok.

Btw., what about the size of the attr? implicitly defined through the attr id?

  +static int kvm_ioctl_create_device(struct kvm *kvm,
  +  struct kvm_create_device *cd)
  +{
  +   struct kvm_device *dev = NULL;
  +   bool test = cd-flags  KVM_CREATE_DEVICE_TEST;
  +   int id;
  +   int r;
  +
  +   mutex_lock(kvm-lock);
  +
  +   id = kvm-num_devices;
  +   if (id = KVM_MAX_DEVICES  !test) {
  +   r = -ENOSPC;
  +   goto out;
  +   }
  +
  +   switch (cd-type) {
  +   default:
  +   r = -ENODEV;
  +   goto out;
  +   }

 do we really believe that there will be any arch-generic recognition
 of types? shouldn't this be a call to an arch-specific function
 instead. Which makes me wonder whether the device type IDs should be
 arch specific as well...


 I prefer to look at it from the other direction -- is there any reason why
 this *should* be architecture specific?  What will that make easier?


The fact that you don't have to create static inlines for the
architectures that don't define the functions that get called or have
to similar #ifdef tricks, and I also think it's easier to read the
arch-specific bits of the code that way, instead of some arbitrary
function that you have to trace through to figure out where it's
called from.

 By doing device recognition here we don't need a separate copy of this per
 arch (including some #ifdef or modifying every arch at once -- including ARM
 which I can't modify yet because it's not merged), and *if* we should end up
 with an in-kernel-emulated device that gets used across multiple
 architectures, it would be annoying to have to modify all relevant
 architectures (and worse to deal with per-arch numberspaces).

I would say that's exactly what you're going to need with your approach:

switch (cd-type) {
case KVM_ARM_VGIC_V2_0:
kvm_arm_vgic_v2_0_create(...);
}


are you going to ifdef here in this function, or? I think it's cleaner
to have the single arch-specific hooks and handle the cases there.

The use case of having a single device which is so central to the
system that we emulate it inside the kernel and is shared across
multiple archs is pretty far fetched to me.

However, this is internal and can always be changed, so if everyone
agrees on the overall API, whichever way you implement it is fine with
me.


 -Scott

 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html