Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread tiejun.chen

On 04/02/2013 06:47 AM, Scott Wood wrote:

Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, without the need for
separately managing enumerated capabilities.

Signed-off-by: Scott Wood scottw...@freescale.com
---
  Documentation/virtual/kvm/api.txt|   70 ++
  Documentation/virtual/kvm/devices/README |1 +
  arch/powerpc/include/asm/kvm_host.h  |6 +++
  arch/powerpc/include/asm/kvm_ppc.h   |2 +
  arch/powerpc/kvm/powerpc.c   |7 +++
  include/uapi/linux/kvm.h |   27 
  virt/kvm/kvm_main.c  |   31 +
  7 files changed, 144 insertions(+)
  create mode 100644 Documentation/virtual/kvm/devices/README

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 976eb65..77328aa 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
  written, then `n_invalid' invalid entries, invalidating any previously
  valid entries found.

+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+  be instantiated multiple times
+  ENOSPC: Too many devices have been created
+
+  Other error conditions may be defined by individual device types.
+
+Creates an emulated device in the kernel.  The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+   __u32   type;   /* in: KVM_DEV_TYPE_xxx */
+   __u32   fd; /* out: device handle */
+   __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+  (e.g. read-only attribute, or attribute that only makes
+  sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific.  See individual device documentation in
+the devices directory.  As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+struct kvm_device_attr {
+   __u32   flags;  /* no flags currently defined */
+   __u32   group;  /* device-defined */
+   __u64   attr;   /* group-defined */
+   __u64   addr;   /* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  addr is ignored.

  4.77 KVM_ARM_VCPU_INIT

diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 000..34a6983
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/README
@@ -0,0 +1 @@
+This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..e0caae2 100644
--- 
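
As an illustration of the "test without creating" behaviour described in the documentation hunk above, a userspace probe might look like the following sketch. It assumes the definitions from the patch are available through <linux/kvm.h> (struct kvm_create_device, KVM_CREATE_DEVICE, KVM_CREATE_DEVICE_TEST, struct kvm_device_attr, KVM_HAS_DEVICE_ATTR); the helper names demo_device_type_supported() and demo_device_has_attr() are hypothetical and not part of the patch.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Ask the kernel whether a device type is known, without creating it.
 * vm_fd is assumed to be a VM file descriptor obtained via KVM_CREATE_VM.
 * Returns 1 if supported, 0 if not (ENODEV), -1 on any other error.
 */
int demo_device_type_supported(int vm_fd, uint32_t type)
{
	struct kvm_create_device cd = {
		.type  = type,
		.flags = KVM_CREATE_DEVICE_TEST,	/* test only, no device is created */
	};

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) == 0)
		return 1;
	return errno == ENODEV ? 0 : -1;
}

/*
 * Ask a created device whether it implements a given attribute.
 * Returns 1 if implemented, 0 if not (ENXIO), -1 on any other error.
 */
int demo_device_has_attr(int dev_fd, uint32_t group, uint64_t attr)
{
	struct kvm_device_attr da = {
		.group = group,
		.attr  = attr,
		/* addr is ignored by KVM_HAS_DEVICE_ATTR */
	};

	if (ioctl(dev_fd, KVM_HAS_DEVICE_ATTR, &da) == 0)
		return 1;
	return errno == ENXIO ? 0 : -1;
}
```

A caller would probe the type first and only then issue KVM_CREATE_DEVICE without the test flag, using the returned fd for the attribute calls.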

Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support

2013-04-02 Thread Alexander Graf

On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote:

 
 
 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Thursday, March 28, 2013 10:06 PM
 To: Bhushan Bharat-R65777
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421; Bhushan
 Bharat-R65777
 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
 
 
 On 21.03.2013, at 07:25, Bharat Bhushan wrote:
 
 From: Bharat Bhushan bharat.bhus...@freescale.com
 
 This patch adds the debug stub support on booke/bookehv.
 Now QEMU debug stub can use hw breakpoint, watchpoint and software
 breakpoint to debug guest.
 
 Debug registers are saved/restored on vcpu_put()/vcpu_get().
 Also the debug registers are saved restored only if guest is using
 debug resources.
 
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 ---
 v2:
 - save/restore in vcpu_get()/vcpu_put()
 - some more minor cleanup based on review comments.
 
 arch/powerpc/include/asm/kvm_host.h |   10 ++
 arch/powerpc/include/uapi/asm/kvm.h |   22 +++-
 arch/powerpc/kvm/booke.c|  252 
 ---
 arch/powerpc/kvm/e500_emulate.c |   10 ++
 4 files changed, 272 insertions(+), 22 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
 index f4ba881..8571952 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -504,7 +504,17 @@ struct kvm_vcpu_arch {
 u32 mmucfg;
 u32 epr;
 u32 crit_save;
 +   /* guest debug registers*/
 struct kvmppc_booke_debug_reg dbg_reg;
 +   /* shadow debug registers */
 +   struct kvmppc_booke_debug_reg shadow_dbg_reg;
 +   /* host debug registers*/
 +   struct kvmppc_booke_debug_reg host_dbg_reg;
 +   /*
 +* Flag indicating that debug registers are used by guest
 +* and requires save restore.
 +   */
 +   bool debug_save_restore;
 #endif
 gpa_t paddr_accessed;
 gva_t vaddr_accessed;
 diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
 index 15f9a00..d7ce449 100644
 --- a/arch/powerpc/include/uapi/asm/kvm.h
 +++ b/arch/powerpc/include/uapi/asm/kvm.h
 @@ -25,6 +25,7 @@
 /* Select powerpc specific features in linux/kvm.h */
 #define __KVM_HAVE_SPAPR_TCE
 #define __KVM_HAVE_PPC_SMT
 +#define __KVM_HAVE_GUEST_DEBUG
 
 struct kvm_regs {
 __u64 pc;
 @@ -267,7 +268,24 @@ struct kvm_fpu {
 __u64 fpr[32];
 };
 
 +/*
 + * Defines for h/w breakpoint, watchpoint (read, write or both) and
 + * software breakpoint.
 + * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status
 + * for KVM_DEBUG_EXIT.
 + */
 +#define KVMPPC_DEBUG_NONE              0x0
 +#define KVMPPC_DEBUG_BREAKPOINT        (1UL << 1)
 +#define KVMPPC_DEBUG_WATCH_WRITE       (1UL << 2)
 +#define KVMPPC_DEBUG_WATCH_READ        (1UL << 3)
 struct kvm_debug_exit_arch {
 +   __u64 address;
 +   /*
 +* exiting to userspace because of h/w breakpoint, watchpoint
 +* (read, write or both) and software breakpoint.
 +*/
 +   __u32 status;
 +   __u32 reserved;
 };
 
 /* for KVM_SET_GUEST_DEBUG */
 @@ -279,10 +297,6 @@ struct kvm_guest_debug_arch {
  * Type denotes h/w breakpoint, read watchpoint, write
  * watchpoint or watchpoint (both read and write).
  */
 -#define KVMPPC_DEBUG_NOTYPE            0x0
 -#define KVMPPC_DEBUG_BREAKPOINT        (1UL << 1)
 -#define KVMPPC_DEBUG_WATCH_WRITE       (1UL << 2)
 -#define KVMPPC_DEBUG_WATCH_READ        (1UL << 3)
 __u32 type;
 __u32 reserved;
 } bp[16];
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index 1de93a8..bf20056 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu)
  #endif
  }
 
 +static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu)
 +{
 +	/* Synchronize guest's desire to get debug interrupts into shadow MSR */
 +#ifndef CONFIG_KVM_BOOKE_HV
 +	vcpu->arch.shadow_msr &= ~MSR_DE;
 +	vcpu->arch.shadow_msr |= vcpu->arch.shared->msr & MSR_DE;
 +#endif
 +
 +	/* Force enable debug interrupts when user space wants to debug */
 +	if (vcpu->guest_debug) {
 +#ifdef CONFIG_KVM_BOOKE_HV
 +		/*
 +		 * Since there is no shadow MSR, sync MSR_DE into the guest
 +		 * visible MSR. Do not allow guest to change MSR[DE].
 +		 */
 +		vcpu->arch.shared->msr |= MSR_DE;
 +		mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP);
 
 This mtspr should really just be a bit OR into shadow_msrp when guest_debug
 gets enabled. It should automatically get synchronized as soon as the next
 vcpu_load() happens.
 
 I think this is not required here as shadow_dbsr already have MSRP_DEP set.
 
 Will setup shadow_msrp when setting guest_debug and clear shadow_msrp when 
 guest_debug is cleared.
 But that will also not be sufficient as it not sure when vcpu_load() will be 
 called after the shadow_msrp is changed. So 
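
The MSR[DE] synchronization in the quoted kvmppc_vcpu_sync_debug() hunk can be sketched in isolation as below. This is only a model: demo_sync_shadow_msr_de() is a hypothetical name, the MSR_DE value is an illustrative stand-in for the kernel's definition, and the forced-enable step for the non-HV case is an assumption, since the quoted hunk is cut off before that branch.

```c
#include <stdint.h>

/* Illustrative stand-in for the kernel's MSR_DE bit definition. */
#define DEMO_MSR_DE	(1UL << 9)

/*
 * Mirror the guest's MSR[DE] into the shadow MSR, leaving all other
 * shadow bits untouched.  While userspace is debugging the guest,
 * debug interrupts are forced on regardless of the guest's setting
 * (assumed behaviour for the non-HV case).
 */
static unsigned long demo_sync_shadow_msr_de(unsigned long shadow_msr,
					     unsigned long guest_msr,
					     int guest_debug)
{
	shadow_msr &= ~DEMO_MSR_DE;		/* drop the stale copy */
	shadow_msr |= guest_msr & DEMO_MSR_DE;	/* take the guest's wish */

	if (guest_debug)
		shadow_msr |= DEMO_MSR_DE;	/* userspace debug wins */

	return shadow_msr;
}
```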

Re: [PATCH 2/4 v2] KVM: PPC: debug stub interface parameter defined

2013-04-02 Thread Alexander Graf

On 29.03.2013, at 04:08, Bhushan Bharat-R65777 wrote:

 
 
 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Friday, March 29, 2013 7:26 AM
 To: Bhushan Bharat-R65777
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421; Bhushan
 Bharat-R65777
 Subject: Re: [PATCH 2/4 v2] KVM: PPC: debug stub interface parameter defined
 
 
 On 21.03.2013, at 07:24, Bharat Bhushan wrote:
 
 From: Bharat Bhushan bharat.bhus...@freescale.com
 
 This patch defines the interface parameter for KVM_SET_GUEST_DEBUG
 ioctl support. Follow up patches will use this for setting up hardware
 breakpoints, watchpoints and software breakpoints.
 
 Also kvm_arch_vcpu_ioctl_set_guest_debug() is brought one level below.
 This is because I am not sure what is required for book3s. So this
 ioctl behaviour will not change for book3s.
 
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 ---
 v2:
 - No Change
 
 arch/powerpc/include/uapi/asm/kvm.h |   23 +++
 arch/powerpc/kvm/book3s.c   |6 ++
 arch/powerpc/kvm/booke.c|6 ++
 arch/powerpc/kvm/powerpc.c  |6 --
 4 files changed, 35 insertions(+), 6 deletions(-)
 
 diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
 index c2ff99c..15f9a00 100644
 --- a/arch/powerpc/include/uapi/asm/kvm.h
 +++ b/arch/powerpc/include/uapi/asm/kvm.h
 @@ -272,8 +272,31 @@ struct kvm_debug_exit_arch {
 
 /* for KVM_SET_GUEST_DEBUG */
 struct kvm_guest_debug_arch {
 +   struct {
 +   /* H/W breakpoint/watchpoint address */
 +   __u64 addr;
 +   /*
 +* Type denotes h/w breakpoint, read watchpoint, write
 +* watchpoint or watchpoint (both read and write).
 +*/
 +#define KVMPPC_DEBUG_NOTYPE            0x0
 +#define KVMPPC_DEBUG_BREAKPOINT        (1UL << 1)
 +#define KVMPPC_DEBUG_WATCH_WRITE       (1UL << 2)
 +#define KVMPPC_DEBUG_WATCH_READ        (1UL << 3)
 
 Are you sure you want to introduce these here, just to remove them again in a
 later patch?
 
 Up to this patch the scope was limited to this structure, so for clarity I
 defined them here. Later the scope expands, so they are moved out of this
 structure. I do not think this really matters; let me know how you want to
 see it.

Well, at least I want to see the names be identical between the patches ;).


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
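
For readers following the type flags discussed in this thread, the sketch below shows how the bit values from the quoted hunk combine, with a read/write watchpoint being both watch bits set. The flag values are copied from the patch; the enum and demo_classify_debug_type() are hypothetical helpers for illustration only.

```c
#include <stdint.h>

/* Values as defined in the quoted patch. */
#define KVMPPC_DEBUG_NOTYPE		0x0
#define KVMPPC_DEBUG_BREAKPOINT		(1UL << 1)
#define KVMPPC_DEBUG_WATCH_WRITE	(1UL << 2)
#define KVMPPC_DEBUG_WATCH_READ		(1UL << 3)

/* Hypothetical classification of a bp[].type value. */
enum demo_debug_kind {
	DEMO_DBG_NONE,
	DEMO_DBG_BREAKPOINT,
	DEMO_DBG_WATCH_WRITE,
	DEMO_DBG_WATCH_READ,
	DEMO_DBG_WATCH_RW,
};

static enum demo_debug_kind demo_classify_debug_type(uint32_t type)
{
	unsigned long watch = type & (KVMPPC_DEBUG_WATCH_WRITE |
				      KVMPPC_DEBUG_WATCH_READ);

	if (type & KVMPPC_DEBUG_BREAKPOINT)
		return DEMO_DBG_BREAKPOINT;
	if (watch == (KVMPPC_DEBUG_WATCH_WRITE | KVMPPC_DEBUG_WATCH_READ))
		return DEMO_DBG_WATCH_RW;	/* both bits: read and write */
	if (watch == KVMPPC_DEBUG_WATCH_WRITE)
		return DEMO_DBG_WATCH_WRITE;
	if (watch == KVMPPC_DEBUG_WATCH_READ)
		return DEMO_DBG_WATCH_READ;
	return DEMO_DBG_NONE;
}
```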


KVM: kvm_set_slave_cpu: Invalid argument when trying direct interrupt delivery

2013-04-02 Thread Yangminqiang
Hi Tomoki

I tried your smart patch for CPU isolation and direct interrupt delivery,
http://article.gmane.org/gmane.linux.kernel/1353803

and got this output when I ran qemu:
kvm_set_slave_cpu: Invalid argument

So I wonder:
* Did I misuse your patches?
* How is the offlined CPU assigned? Or will the guest OS automatically detect
and use it?

details of my trial:
- based on v3.6-rc4 and qemu-kvm-1.0 as you commented
- boot the kernel with intel_iommu=on
  BOOT_IMAGE=(hd0,1)/boot/vmlinuz-3.6.0-rc4+ root=/dev/sda1 rhgb quiet selinux=0 intel_iommu=on
- the offlined cpu
  # cat /sys/devices/system/cpu/offline 
  23
- qemu command line
  qemu-kvm -enable-kvm -m 1024 -cpu qemu64,+x2apic -no-kvm-pit -serial pty -nographic -drive file=/mnt/sdb/vmtest/testfc.qcow2,if=virtio,index=0,format=qcow2 -spice port=12000,addr=186.100.8.171,disable-ticketing,plaintext-channel=main,plaintext-channel=playback,plaintext-channel=record,image-compression=auto_glz


Thanks,
Yang Minqiang


KVM guest memory mapping

2013-04-02 Thread Tony Roberts
Hi list,

I've just started doing some research into VM memory allocation, and
I've got a few questions about how KVM performs memory translations
from guest to host, using Intel-VT extensions.  My questions relate to
the implementation of Intel EPTs.

I've put in a few printk statements within the KVM source,
specifically mmu.c to try to follow what is happening within the VM
and hypervisor, however, I'm a little bit lost at what I'm seeing.

The very first virtual memory access from within my guest triggers a
'handle_ept_violation', this is to be expected as it's the very first,
and no pages will have been allocated as of yet.

The value taken from the guest's CR2 register is: 0xfff0 (which I
am assuming to be a guest physical address).  Upon this ept violation
occurring, the function tdp_page_fault is called, which then in turn
calls __direct_map.  I'm a little confused about exactly what
__direct_map is actually doing.

The input to __direct_map is:

gpa_t v: fff0
gfn_t gfn: f
pfn_t pfn: 35b649
level: 1

Firstly, I'm confused as to why the gpa_t type variable is called 'v'.
This would indicate to me that it's a virtual address, however it is
being stored as a guest physical type.  Could anyone explain why this
is named as such?

After this I can see a lot of different memory addresses being passed
around the system, but I'd still like to better understand how KVM
allocates and finally translates guest addresses into host physical
addresses.  If anyone could help explain how __direct_map functions, I
would appreciate it.

Thanks

Tony
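
For reference, the relation between the gpa_t and gfn_t values printed above (fff0 and f) can be reproduced with a small sketch. It assumes 4 KiB pages (a shift of 12); the demo_ names are user-space stand-ins written for this note, not the kernel's own helpers.

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86. */
#define DEMO_PAGE_SHIFT	12

typedef uint64_t demo_gpa_t;	/* guest physical address */
typedef uint64_t demo_gfn_t;	/* guest frame number */

/* A frame number is the address with the in-page offset shifted out. */
static inline demo_gfn_t demo_gpa_to_gfn(demo_gpa_t gpa)
{
	return gpa >> DEMO_PAGE_SHIFT;
}

/* Base address of the page that a frame number refers to. */
static inline demo_gpa_t demo_gfn_to_gpa(demo_gfn_t gfn)
{
	return gfn << DEMO_PAGE_SHIFT;
}

/* Offset of the address inside its page. */
static inline uint64_t demo_page_offset(demo_gpa_t gpa)
{
	return gpa & ((1ULL << DEMO_PAGE_SHIFT) - 1);
}
```

With the values from the trace, address fff0 falls in frame f at offset ff0, which matches the gfn that __direct_map received.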


RE: [PATCH] ARM: EXYNOS5440: DTS: Add virtual GIC DT bindings

2013-04-02 Thread Kukjin Kim
Giridhar Maruthy wrote:
 
 Exynos5440 has a GIC with virtualization support,
 which is used by KVM.
 
 Signed-off-by: Giridhar Maruthy giridha...@samsung.com
 ---
  arch/arm/boot/dts/exynos5440.dtsi |6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)
 
 diff --git a/arch/arm/boot/dts/exynos5440.dtsi b/arch/arm/boot/dts/exynos5440.dtsi
 index c374a31..25c6134 100644
 --- a/arch/arm/boot/dts/exynos5440.dtsi
 +++ b/arch/arm/boot/dts/exynos5440.dtsi
 @@ -26,7 +26,11 @@
   compatible = "arm,cortex-a15-gic";
   #interrupt-cells = <3>;
   interrupt-controller;
 - reg = <0x2E1000 0x1000>, <0x2E2000 0x1000>;
 + reg =   <0x2E1000 0x1000>,
 +         <0x2E2000 0x1000>,
 +         <0x2E4000 0x2000>,
 +         <0x2E6000 0x2000>;
 + interrupts = <1 9 0xf04>;
   };
 
   cpus {
 --
 1.7.9.5

Looks ok to me, applied.

Thanks.

- Kukjin



RE: [PATCH 3/3] ARM: EXYNOS5250: Register architected timers

2013-04-02 Thread Kukjin Kim
Alexander Graf wrote:
 
 When running on an exynos 5250 SoC, we don't initialize the architected
 timers. The chip however supports architected timers.
 
Yes, exynos5250 supports them, though the MCT (multi-core timer) is what is used.

 When we don't initialize them, KVM will try to access them and run into
 NULL pointer dereferences attempting to do so.
 
Yes, right.

 This patch is really more of a hack than a real fix, but does get me
 working with KVM on Arndale.
 
Hmm, if you think this is _really_ a hack, you need to add some comments
about that for clarity, and since the mct.c file has been moved into
drivers/clocksource/, this should be reworked.

BTW, I discussed this with Thomas and Giridhar just now, and we concluded
that this 3rd patch could be dropped, because the correct way is to add a dts
node for the arch timer, which the 2nd patch already does; that works after
3.9-rc1 because of the CLOCKSOURCE_OF_DECLARE macro.

So if you're OK with the above, let me know so that I can take only the 1st
and 2nd patches to support KVM on exynos5250.

Thanks.

- Kukjin

 Signed-off-by: Alexander Graf ag...@suse.de
 ---
  arch/arm/mach-exynos/mct.c |4 
  1 file changed, 4 insertions(+)
 
 diff --git a/arch/arm/mach-exynos/mct.c b/arch/arm/mach-exynos/mct.c
 index c9d6650..eefb8af 100644
 --- a/arch/arm/mach-exynos/mct.c
 +++ b/arch/arm/mach-exynos/mct.c
 @@ -482,4 +482,8 @@ void __init exynos4_timer_init(void)
   exynos4_timer_resources();
   exynos4_clocksource_init();
   exynos4_clockevent_init();
 +
 + if (soc_is_exynos5250()) {
 + arch_timer_of_register();
 + }
  }
 --
 1.7.10.4



Re: [PATCH 3/3] ARM: EXYNOS5250: Register architected timers

2013-04-02 Thread Alexander Graf
On 04/02/2013 12:44 PM, Kukjin Kim wrote:
 Alexander Graf wrote:
 When running on an exynos 5250 SoC, we don't initialize the architected
 timers. The chip however supports architected timers.

 Yes, exynos5250 supports them, though the MCT (multi-core timer) is what is used.

 When we don't initialize them, KVM will try to access them and run into
 NULL pointer dereferences attempting to do so.

 Yes, right.

 This patch is really more of a hack than a real fix, but does get me
 working with KVM on Arndale.

 Hmm, if you think this is _really_ a hack, you need to add some comments
 about that for clarity, and since the mct.c file has been moved into
 drivers/clocksource/, this should be reworked.

 BTW, I discussed this with Thomas and Giridhar just now, and we concluded
 that this 3rd patch could be dropped, because the correct way is to add a dts
 node for the arch timer, which the 2nd patch already does; that works after
 3.9-rc1 because of the CLOCKSOURCE_OF_DECLARE macro.

 So if you're OK with the above, let me know so that I can take only the 1st
 and 2nd patches to support KVM on exynos5250.

I'd say go ahead and take them and I'll verify whether things work on
your tree :).

What's the git repo of your branch?


Alex



Re: [PATCH] KVM: allow host header to be included even for !CONFIG_KVM

2013-04-02 Thread Gleb Natapov
On Mon, Mar 25, 2013 at 02:14:20PM -0700, Kevin Hilman wrote:
 Gleb Natapov g...@redhat.com writes:
 
  On Sun, Mar 24, 2013 at 02:44:26PM +0100, Frederic Weisbecker wrote:
  2013/3/21 Gleb Natapov g...@redhat.com:
   Isn't it simpler for kernel/context_tracking.c to define empty
   __guest_enter()/__guest_exit() if !CONFIG_KVM.
  
  That doesn't look right. Off-cases are usually handled from the
  headers, right? So that we avoid ifdeffery ugliness in core code.
  Let's put it in linux/context_tracking.h header then.
 
 Here's a version to do that.
 
Frederic, are you OK with this version?


 Kevin
 
 From d9d909394479dd7ff90b7bddb95a564945406719 Mon Sep 17 00:00:00 2001
 From: Kevin Hilman khil...@linaro.org
 Date: Mon, 25 Mar 2013 14:12:41 -0700
 Subject: [PATCH v2] context_tracking: fix !CONFIG_KVM compile: add stub guest
  enter/exit
 
 When KVM is not enabled, or not available on a platform, the KVM
 headers should not be included.  Instead, just define stub
 __guest_[enter|exit] functions.
 
 Cc: Frederic Weisbecker fweis...@gmail.com
 Signed-off-by: Kevin Hilman khil...@linaro.org
 ---
  include/linux/context_tracking.h | 7 +++
  kernel/context_tracking.c| 1 -
  2 files changed, 7 insertions(+), 1 deletion(-)
 
 diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
 index 365f4a6..9d0f242 100644
 --- a/include/linux/context_tracking.h
 +++ b/include/linux/context_tracking.h
 @@ -3,6 +3,13 @@
  
  #include <linux/sched.h>
  #include <linux/percpu.h>
 +#if IS_ENABLED(CONFIG_KVM)
 +#include <linux/kvm_host.h>
 +#else
 +#define __guest_enter()
 +#define __guest_exit()
 +#endif
 +
  #include <asm/ptrace.h>
  
  struct context_tracking {
 diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
 index 65349f0..85bdde1 100644
 --- a/kernel/context_tracking.c
 +++ b/kernel/context_tracking.c
 @@ -15,7 +15,6 @@
   */
  
  #include <linux/context_tracking.h>
 -#include <linux/kvm_host.h>
  #include <linux/rcupdate.h>
  #include <linux/sched.h>
  #include <linux/hardirq.h>
 -- 
 1.8.2

--
Gleb.
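
The "handle the off-case in the header" pattern discussed above can be sketched outside the kernel as follows. CONFIG_KVM_DEMO and the demo_ function names are hypothetical stand-ins; the sketch uses static inline stubs, which is one common alternative to the empty macros in Kevin's patch and keeps argument type checking.

```c
/* CONFIG_KVM_DEMO is a hypothetical stand-in for the kernel's CONFIG_KVM. */
#ifdef CONFIG_KVM_DEMO
/* Real implementations would live in the KVM code. */
void demo_guest_enter(void);
void demo_guest_exit(void);
#else
/* Feature compiled out: the hooks collapse to nothing. */
static inline void demo_guest_enter(void) { }
static inline void demo_guest_exit(void) { }
#endif

/*
 * Core code calls the hooks unconditionally, so no #ifdef is needed
 * at the call site.
 */
static int demo_context_switch(void)
{
	demo_guest_enter();
	/* ... work done while accounted as guest time ... */
	demo_guest_exit();
	return 0;
}
```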


Re: [PATCH] KVM: Call kvm_apic_match_dest() to check destination vcpu

2013-04-02 Thread Gleb Natapov
On Mon, Apr 01, 2013 at 12:42:33AM +, Zhang, Yang Z wrote:
 Zhang, Yang Z wrote on 2013-03-21:
  From: Yang Zhang yang.z.zh...@intel.com
  
  For a given vcpu, kvm_apic_match_dest() will quickly tell you whether
  the vcpu is in the destination list. Drop kvm_calculate_eoi_exitmap()
  and use kvm_apic_match_dest() instead.
  
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
  ---
   arch/x86/kvm/lapic.c |   47 ---
   arch/x86/kvm/lapic.h |4 
   virt/kvm/ioapic.c|9 -
   3 files changed, 4 insertions(+), 56 deletions(-)
  diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
  index a8e9369..e227474 100644
  --- a/arch/x86/kvm/lapic.c
  +++ b/arch/x86/kvm/lapic.c
  @@ -145,53 +145,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
  	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
   }
  -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
  -				struct kvm_lapic_irq *irq,
  -				u64 *eoi_exit_bitmap)
  -{
  -	struct kvm_lapic **dst;
  -	struct kvm_apic_map *map;
  -	unsigned long bitmap = 1;
  -	int i;
  -
  -	rcu_read_lock();
  -	map = rcu_dereference(vcpu->kvm->arch.apic_map);
  -
  -	if (unlikely(!map)) {
  -		__set_bit(irq->vector, (unsigned long *)eoi_exit_bitmap);
  -		goto out;
  -	}
  -
  -	if (irq->dest_mode == 0) { /* physical mode */
  -		if (irq->delivery_mode == APIC_DM_LOWEST ||
  -				irq->dest_id == 0xff) {
  -			__set_bit(irq->vector,
  -				  (unsigned long *)eoi_exit_bitmap);
  -			goto out;
  -		}
  -		dst = &map->phys_map[irq->dest_id & 0xff];
  -	} else {
  -		u32 mda = irq->dest_id << (32 - map->ldr_bits);
  -
  -		dst = map->logical_map[apic_cluster_id(map, mda)];
  -
  -		bitmap = apic_logical_id(map, mda);
  -	}
  -
  -	for_each_set_bit(i, &bitmap, 16) {
  -		if (!dst[i])
  -			continue;
  -		if (dst[i]->vcpu == vcpu) {
  -			__set_bit(irq->vector,
  -				  (unsigned long *)eoi_exit_bitmap);
  -			break;
  -		}
  -	}
  -
  -out:
  -	rcu_read_unlock();
  -}
  -
   static void recalculate_apic_map(struct kvm *kvm)
   {
  struct kvm_apic_map *new, *old = NULL;
  diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
  index 2c721b9..baa20cf 100644
  --- a/arch/x86/kvm/lapic.h
  +++ b/arch/x86/kvm/lapic.h
  @@ -160,10 +160,6 @@ static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
  	return ldr & map->lid_mask;
   }
  -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
  -				struct kvm_lapic_irq *irq,
  -				u64 *eoi_bitmap);
  -
   static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu)
   {
  	return vcpu->arch.apic->pending_events;
  diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
  index ce82b94..b54ddfa 100644
  --- a/virt/kvm/ioapic.c
  +++ b/virt/kvm/ioapic.c
  @@ -132,11 +132,10 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
  			(e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
  			 kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
  				 index))) {
  -			irqe.dest_id = e->fields.dest_id;
  -			irqe.vector = e->fields.vector;
  -			irqe.dest_mode = e->fields.dest_mode;
  -			irqe.delivery_mode = e->fields.delivery_mode << 8;
  -			kvm_calculate_eoi_exitmap(vcpu, &irqe, eoi_exit_bitmap);
  +			if (kvm_apic_match_dest(vcpu, NULL, 0,
  +				e->fields.dest_id, e->fields.dest_mode))
  +				__set_bit(irqe.vector,
  +					(unsigned long *)eoi_exit_bitmap);
  		}
  	}
  	spin_unlock(&ioapic->lock);
  --
  1.7.1
 
 Any comments?
 
You can drop irqe now, since it was only needed for the
kvm_calculate_eoi_exitmap() call.

--
Gleb.
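
The idea of the patch, "if this vcpu matches the entry's destination, mark the vector in its EOI exit bitmap", can be modelled in a few lines. This is a deliberately simplified sketch and not the kernel's kvm_apic_match_dest(): the structure, the demo_ names, and the matching rules (physical mode by APIC ID or broadcast 0xff, flat logical mode by bit overlap) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_APIC_BROADCAST	0xff

/* Hypothetical, minimal view of a vcpu's local APIC identity. */
struct demo_vcpu {
	uint8_t apic_id;	/* used in physical destination mode */
	uint8_t logical_id;	/* bit mask used in flat logical mode */
};

/* Simplified destination match; dest_mode 0 = physical, 1 = logical. */
static bool demo_match_dest(const struct demo_vcpu *v,
			    uint8_t dest_id, int dest_mode)
{
	if (dest_mode == 0)
		return dest_id == DEMO_APIC_BROADCAST ||
		       dest_id == v->apic_id;
	return (dest_id & v->logical_id) != 0;
}

/* Set the vector's bit in a 256-bit EOI exit bitmap if the vcpu matches. */
static void demo_update_eoi_exitmap(const struct demo_vcpu *v,
				    uint8_t dest_id, int dest_mode,
				    uint8_t vector, uint64_t bitmap[4])
{
	if (demo_match_dest(v, dest_id, dest_mode))
		bitmap[vector / 64] |= 1ULL << (vector % 64);
}
```

Note that the vector is passed in directly here, which sidesteps the leftover irqe variable pointed out above.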


Re: [PATCH-v2 0/3] virtio/vhost: Add checks for uninitialized VQs

2013-04-02 Thread Michael S. Tsirkin
On Mon, Apr 01, 2013 at 11:58:21PM +, Nicholas A. Bellinger wrote:
 From: Nicholas Bellinger n...@linux-iscsi.org
 
 Hi folks,
 
 This series adds a virtio_queue_valid() for use by virtio-pci code in
 order to prevent operations upon uninitialized VQs, which is currently
 expected to occur during seabios setup of virtio-scsi with in-flight
 vhost-scsi-pci device code.
 
 On the vhost side, it also adds virtio_queue_valid() sanity checks in
 vhost_virtqueue_[start,stop]() and vhost_verify_ring_mappings() in order
 to skip the same uninitialized VQs.
 
 Changes from v1:
   - Remove now unnecessary virtio_queue_get_num() calls in virtio-pci.c
   - Add virtio_queue_valid() calls in vhost_virtqueue_[start,stop]()
 
 Please review.
 
 --nab

Looks reasonable.
Acked-by: Michael S. Tsirkin m...@redhat.com

So - does this fix the issues you saw with vhost-scsi?

 Michael S. Tsirkin (1):
   virtio: add API to check that ring is setup
 
 Nicholas Bellinger (2):
   virtio-pci: Add virtio_queue_valid checks ahead of
 virtio_queue_get_num
   vhost: Skip uninitialized VQs in vhost_virtqueue_[start,stop]
 
  hw/vhost.c  |   12 
  hw/virtio-pci.c |   34 +++---
  hw/virtio.c |5 +
  hw/virtio.h |1 +
  4 files changed, 33 insertions(+), 19 deletions(-)
 
 -- 
 1.7.2.5
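
The "skip uninitialized VQs" behaviour described in the cover letter can be sketched with a toy model. Everything here is hypothetical: the structure, the demo_ names, and in particular the validity criterion (a non-zero ring size and a programmed descriptor address) are assumptions for illustration, not the actual virtio_queue_valid() from the series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, minimal view of a virtqueue. */
struct demo_virtqueue {
	unsigned int num;	/* ring size, 0 if the queue is unused */
	uint64_t desc_addr;	/* guest address of the ring, 0 if not set up */
	bool started;
};

/* Assumed criterion: the guest has configured a ring for this queue. */
static bool demo_queue_valid(const struct demo_virtqueue *vq)
{
	return vq->num != 0 && vq->desc_addr != 0;
}

/* Start every queue the guest has set up; skip the rest. */
static size_t demo_start_queues(struct demo_virtqueue *vqs, size_t n)
{
	size_t started = 0;

	for (size_t i = 0; i < n; i++) {
		if (!demo_queue_valid(&vqs[i]))
			continue;	/* uninitialized VQ: leave it alone */
		vqs[i].started = true;
		started++;
	}
	return started;
}
```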


Re: [PATCH V2 2/2] tcm_vhost: Use vq-private_data to indicate if the endpoint is setup

2013-04-02 Thread Michael S. Tsirkin
On Mon, Apr 01, 2013 at 10:13:47AM +0800, Asias He wrote:
 On Sun, Mar 31, 2013 at 11:20:24AM +0300, Michael S. Tsirkin wrote:
  On Fri, Mar 29, 2013 at 02:22:52PM +0800, Asias He wrote:
   On Thu, Mar 28, 2013 at 11:18:22AM +0200, Michael S. Tsirkin wrote:
On Thu, Mar 28, 2013 at 04:10:02PM +0800, Asias He wrote:
 On Thu, Mar 28, 2013 at 08:16:59AM +0200, Michael S. Tsirkin wrote:
  On Thu, Mar 28, 2013 at 10:17:28AM +0800, Asias He wrote:
   Currently, vs->vs_endpoint is used to indicate if the endpoint is setup
   or not. It is set or cleared in vhost_scsi_set_endpoint() or
   vhost_scsi_clear_endpoint() under the vs->dev.mutex lock. However, when
   we check it in vhost_scsi_handle_vq(), we ignored the lock.
   
   Instead of using the vs->vs_endpoint and the vs->dev.mutex lock to
   indicate the status of the endpoint, we use per virtqueue
   vq->private_data to indicate it. In this way, we need only take the
   vq->mutex lock, which is per queue, so concurrent multiqueue
   processing has less lock contention. Further, on the read side of
   vq->private_data, we need not even take the lock if it is accessed in
   the vhost worker thread, because it is protected by vhost rcu.
   
   Signed-off-by: Asias He as...@redhat.com
   ---
drivers/vhost/tcm_vhost.c | 38 
   +-
1 file changed, 33 insertions(+), 5 deletions(-)
   
   diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
   index 5e3d4487..0524267 100644
   --- a/drivers/vhost/tcm_vhost.c
   +++ b/drivers/vhost/tcm_vhost.c
   @@ -67,7 +67,6 @@ struct vhost_scsi {
    	/* Protected by vhost_scsi->dev.mutex */
    	struct tcm_vhost_tpg *vs_tpg[VHOST_SCSI_MAX_TARGET];
    	char vs_vhost_wwpn[TRANSPORT_IQN_LEN];
   -	bool vs_endpoint;
    
    	struct vhost_dev dev;
    	struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ];
   @@ -91,6 +90,24 @@ static int iov_num_pages(struct iovec *iov)
    	       ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
    }
    
   +static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq)
   +{
   +	bool ret = false;
   +
   +	/*
   +	 * We can handle the vq only after the endpoint is setup by calling the
   +	 * VHOST_SCSI_SET_ENDPOINT ioctl.
   +	 *
   +	 * TODO: Check that we are running from vhost_worker which acts
   +	 * as read-side critical section for vhost kind of RCU.
   +	 * See the comments in struct vhost_virtqueue in drivers/vhost/vhost.h
   +	 */
   +	if (rcu_dereference_check(vq->private_data, 1))
   +		ret = true;
   +
   +	return ret;
   +}
   +
    static int tcm_vhost_check_true(struct se_portal_group *se_tpg)
    {
    	return 1;
   @@ -581,8 +598,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
    	int head, ret;
    	u8 target;
    
   -	/* Must use ioctl VHOST_SCSI_SET_ENDPOINT */
   -	if (unlikely(!vs->vs_endpoint))
   +	if (!tcm_vhost_check_endpoint(vq))
    		return;
  
  
  I would just move the check to under vq mutex,
  and avoid rcu completely. In vhost-net we are using
  private data outside lock so we can't do this,
  no such issue here.
 
 Are you talking about:
 
    handle_tx:
            /* TODO: check that we are running from vhost_worker? */
            sock = rcu_dereference_check(vq->private_data, 1);
            if (!sock)
                    return;
    
            wmem = atomic_read(&sock->sk->sk_wmem_alloc);
            if (wmem >= sock->sk->sk_sndbuf) {
                    mutex_lock(&vq->mutex);
                    tx_poll_start(net, sock);
                    mutex_unlock(&vq->mutex);
                    return;
            }
            mutex_lock(&vq->mutex);
 
 Why not do the atomic_read and tx_poll_start under the vq->mutex, and thus
 do the check under the lock as well?
    
    handle_rx:
            mutex_lock(&vq->mutex);
    
            /* TODO: check that we are running from vhost_worker? */
            struct socket *sock = rcu_dereference_check(vq->private_data, 1);
    
            if (!sock)
                    return;
    
            mutex_lock(&vq->mutex);
 
 Can't we do the check under the vq->mutex here?
 
 The rcu is still there but it makes the code easier to read. IMO, if we
 want to use rcu, use it explicitly and avoid the vhost rcu completely.
 
    	mutex_lock(&vq->mutex);
   @@ -829,11 +845,12 @@ static int vhost_scsi_set_endpoint(
    		sizeof(vs->vs_vhost_wwpn));
    	for (i = 0; i < VHOST_SCSI_MAX_VQ; i++) {
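
The alternative suggested in review, checking the endpoint under the per-queue mutex and avoiding RCU entirely, can be modelled in user space roughly as below. This is a sketch under stated assumptions: it uses a pthread mutex in place of the kernel mutex, and the structure and demo_ function names are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, minimal view of a vhost virtqueue. */
struct demo_vq {
	pthread_mutex_t mutex;	/* per-queue lock */
	void *private_data;	/* non-NULL once the endpoint is set up */
};

/* Set (or clear, with NULL) the endpoint for one queue. */
static void demo_set_endpoint(struct demo_vq *vq, void *tpg)
{
	pthread_mutex_lock(&vq->mutex);
	vq->private_data = tpg;
	pthread_mutex_unlock(&vq->mutex);
}

/*
 * Handle requests only if the endpoint is set up.  The check and the
 * processing happen under the same per-queue lock, so no other
 * synchronization is needed for private_data.
 */
static bool demo_handle_vq(struct demo_vq *vq)
{
	bool handled = false;

	pthread_mutex_lock(&vq->mutex);
	if (vq->private_data) {
		/* ... process descriptors here ... */
		handled = true;
	}
	pthread_mutex_unlock(&vq->mutex);

	return handled;
}
```

The trade-off is that every handler run takes the lock before it can find out the endpoint is not set up, which is the cost the RCU-style check in the patch avoids.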
 

Re: [PATCH v6 6/6] KVM: Use eoi to track RTC interrupt delivery status

2013-04-02 Thread Gleb Natapov
On Fri, Mar 29, 2013 at 03:25:16AM +, Zhang, Yang Z wrote:
 Paolo Bonzini wrote on 2013-03-26:
  Il 22/03/2013 06:24, Yang Zhang ha scritto:
  +static void rtc_irq_ack_eoi(struct kvm_vcpu *vcpu,
  +			struct rtc_status *rtc_status, int irq)
  +{
  +	if (irq != RTC_GSI)
  +		return;
  +
  +	if (test_and_clear_bit(vcpu->vcpu_id, rtc_status->dest_map))
  +		--rtc_status->pending_eoi;
  +
  +	WARN_ON(rtc_status->pending_eoi < 0);
  +}
  
  This is the only case where you're passing the struct rtc_status instead
  of the struct kvm_ioapic.  Please use the latter, and make it the first
  argument.
 
  @@ -244,7 +268,14 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
 irqe.level = 1;
 irqe.shorthand = 0;
  -  return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe, NULL);
  +  if (irq == RTC_GSI) {
  +  ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe,
  +  ioapic->rtc_status.dest_map);
  +  ioapic->rtc_status.pending_eoi = ret;
  
  I think you should either add a
  
  BUG_ON(ioapic->rtc_status.pending_eoi != 0);
  or use ioapic->rtc_status.pending_eoi += ret (or both).
  
 A malicious guest may write EOI more than once, and pending_eoi would then
 be negative. But that should not be a bug; a WARN_ON is enough. And we
 already do that in ack_eoi, so there is no need to duplicate it here.
 
Since we track vcpus that already called EOI and decrement pending_eoi
only once for each vcpu, a malicious guest cannot trigger it. But we
already do WARN_ON() in rtc_irq_ack_eoi(), so I am not sure we need
another one here. += will be correct (since pending_eoi == 0 here), but
confusing, since it gives the impression that pending_eoi may not be zero.
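
For illustration only, here is a minimal user-space sketch of the
bookkeeping described above. It is not the kernel code: the
rtc_status_model struct, the 64-vcpu limit and the helper names are
assumptions made for the example. Each destination vcpu owns one bit in
a map, and an EOI decrements the counter only if that bit was still set,
so repeated EOIs from one vcpu cannot push the count below zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the rtc_status discussed in this thread. */
struct rtc_status_model {
	int pending_eoi;	/* EOIs still expected */
	uint64_t dest_map;	/* one bit per destination vcpu (64 max here) */
};

/* Record a delivery to the vcpus set in 'dests'; returns how many there are. */
static int rtc_deliver(struct rtc_status_model *s, uint64_t dests)
{
	int n = 0;
	uint64_t m;

	assert(s->pending_eoi == 0);	/* invariant stated above */
	s->dest_map = dests;
	for (m = dests; m; m &= m - 1)	/* count the set bits */
		n++;
	s->pending_eoi = n;
	return n;
}

/* EOI from vcpu_id: test-and-clear its bit, decrement only the first time. */
static void rtc_ack_eoi(struct rtc_status_model *s, unsigned int vcpu_id)
{
	uint64_t bit;

	assert(vcpu_id < 64);
	bit = UINT64_C(1) << vcpu_id;
	if (s->dest_map & bit) {
		s->dest_map &= ~bit;
		--s->pending_eoi;
	}
}
```

Delivering to vcpus 0 and 2 and then acking vcpu 0 twice leaves
pending_eoi at 1, not 0 or -1, which is the property relied on above.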

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2 2/2] tcm_vhost: Use vq-private_data to indicate if the endpoint is setup

2013-04-02 Thread Michael S. Tsirkin
On Tue, Apr 02, 2013 at 09:27:57AM +1030, Rusty Russell wrote:
 Michael S. Tsirkin m...@redhat.com writes:
  Rusty's currently doing some reorgs of -net let's delay
  cleanups there to avoid stepping on each other's toys.
  Let's focus on scsi here.
  E.g. any chance framing assumptions can be fixed in 3.10?
 
 I am waiting for your removal of the dma-complete ordering stuff in
 vhost-net.
 
 Cheers,
 Rusty.

Sure.


Re: [PATCH 00/11] KVM: s390: More patches for kvm-next.

2013-04-02 Thread Gleb Natapov
On Mon, Mar 25, 2013 at 05:22:47PM +0100, Cornelia Huck wrote:
 Hi,
 
 here are some kvm/s390 patches that have accumulated in our queue.
 
 Changes include fixes in the lpsw(e) and stsi handlers, proper
 handling of interrupt injection failures and a gmap optimization.
 
 Also included are patches allowing support for standby memory on
 kvm guests. Standby memory is used for providing hotpluggable
 memory on s390.
 
 Please consider applying.
 
Applied, thanks.

 Christian Borntraeger (1):
   KVM: s390: Dont do a gmap update on minor memslot changes
 
 Heiko Carstens (7):
   KVM: s390: fix 24 bit psw handling in lpsw/lpswe handler
   KVM: s390: fix psw conversion in lpsw handler
   KVM: s390: fix return code handling in lpsw/lpswe handlers
   KVM: s390: make if statements in lpsw/lpswe handlers readable
   KVM: s390: fix and enforce return code handling for irq injections
   KVM: s390: fix stsi exception handling
   KVM: s390: fix compile with !CONFIG_COMPAT
 
 Nick Wang (3):
   KVM: s390: Change the virtual memory mapping location for virtio
 devices
   KVM: s390: Remove the sanity checks for kvm memory slot
   KVM: s390: Enable KVM_CAP_NR_MEMSLOTS on s390
 
  arch/s390/kvm/intercept.c |  12 +--
  arch/s390/kvm/kvm-s390.c  |  32 ---
  arch/s390/kvm/kvm-s390.h  |  12 +--
  arch/s390/kvm/priv.c  | 203 +++---
  drivers/s390/kvm/kvm_virtio.c |  11 +--
  5 files changed, 108 insertions(+), 162 deletions(-)
 
 -- 
 1.7.12.4

--
Gleb.


Re: KVM call agenda for 2013-04-02

2013-04-02 Thread Juan Quintela
Juan Quintela quint...@redhat.com wrote:
 Hi

 Please send in any agenda topics you are interested in.

As there are no items, today's call is cancelled.

Happy hacking.



[PATCH 2/7] ARM: KVM: fix HYP mapping limitations around zero

2013-04-02 Thread Marc Zyngier
The current code for creating HYP mappings doesn't like to wrap
around zero, which prevents mapping anything into the last
page of the virtual address space.

It doesn't take much effort to remove this limitation, making
the code more consistent with the rest of the kernel in the process.
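
As an illustration only (not part of the patch), the difference between
the two loop forms can be reproduced in user space. The sketch below
assumes a 32-bit address space modelled with uint32_t and 4kB pages; the
function names are made up for the example. When the range ends at the
very top of the address space, "end" wraps to 0, so an "addr < end" test
never lets the loop body run, while a do/while loop that stops on
"addr != end" visits every page.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE UINT32_C(0x1000)	/* assumed 4kB pages */

/* Old form: "addr < end" is false immediately when end has wrapped to 0. */
static unsigned int pages_visited_for(uint32_t start, uint32_t end)
{
	unsigned int n = 0;
	uint32_t addr;

	for (addr = start; addr < end; addr += MODEL_PAGE_SIZE)
		n++;
	return n;
}

/* New form: run at least once, stop when addr reaches end (even if end == 0). */
static unsigned int pages_visited_do_while(uint32_t start, uint32_t end)
{
	unsigned int n = 0;
	uint32_t addr = start;

	assert(start != end);	/* an empty range would loop over everything */
	do {
		n++;
	} while (addr += MODEL_PAGE_SIZE, addr != end);
	return n;
}
```

For the last page, start = 0xfffff000 and end = 0: the first function
visits no page at all, the second visits exactly one. The price is that
the do/while form needs page-aligned bounds and a non-empty range.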

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/kvm/mmu.c | 21 ++---
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 24811d1..eb4f8fa 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -131,11 +131,12 @@ static void create_hyp_pte_mappings(pmd_t *pmd, unsigned long start,
pte_t *pte;
unsigned long addr;
 
-   for (addr = start; addr < end; addr += PAGE_SIZE) {
+   addr = start;
+   do {
pte = pte_offset_kernel(pmd, addr);
kvm_set_pte(pte, pfn_pte(pfn, prot));
pfn++;
-   }
+   } while (addr += PAGE_SIZE, addr != end);
 }
 
 static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
@@ -146,7 +147,8 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
pte_t *pte;
unsigned long addr, next;
 
-   for (addr = start; addr < end; addr = next) {
+   addr = start;
+   do {
pmd = pmd_offset(pud, addr);
 
BUG_ON(pmd_sect(*pmd));
@@ -164,7 +166,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 
create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
pfn += (next - addr) >> PAGE_SHIFT;
-   }
+   } while (addr = next, addr != end);
 
return 0;
 }
@@ -179,11 +181,10 @@ static int __create_hyp_mappings(pgd_t *pgdp,
unsigned long addr, next;
int err = 0;
 
-   if (start >= end)
-   return -EINVAL;
-
mutex_lock(&kvm_hyp_pgd_mutex);
-   for (addr = start & PAGE_MASK; addr < end; addr = next) {
+   addr = start & PAGE_MASK;
+   end = PAGE_ALIGN(end);
+   do {
pgd = pgdp + pgd_index(addr);
pud = pud_offset(pgd, addr);
 
@@ -202,7 +203,7 @@ static int __create_hyp_mappings(pgd_t *pgdp,
if (err)
goto out;
pfn += (next - addr) >> PAGE_SHIFT;
-   }
+   } while (addr = next, addr != end);
 out:
mutex_unlock(&kvm_hyp_pgd_mutex);
return err;
@@ -216,8 +217,6 @@ out:
  * The same virtual address as the kernel virtual address is also used
  * in Hyp-mode mapping (modulo HYP_PAGE_OFFSET) to the same underlying
  * physical pages.
- *
- * Note: Wrapping around zero in the to address is not supported.
  */
 int create_hyp_mappings(void *from, void *to)
 {
-- 
1.8.1.4




[PATCH 6/7] ARM: KVM: switch to a dual-step HYP init code

2013-04-02 Thread Marc Zyngier
Our HYP init code suffers from two major design issues:
- it cannot support CPU hotplug, as we tear down the idmap very early
- it cannot perform a TLB invalidation when switching from init to
  runtime mappings, as pages are manipulated from PL1 exclusively

The hotplug problem mandates that we keep two sets of page tables
(boot and runtime). The TLB problem mandates that we're able to
transition from one PGD to another while in HYP, invalidating the TLBs
in the process.

To be able to do this, we need to share a page between the two page
tables. A page that will have the same VA in both configurations. All we
need is a VA that has the following properties:
- This VA can't be used to represent a kernel mapping.
- This VA will not conflict with the physical address of the kernel text

The vectors page seems to satisfy this requirement:
- The kernel never maps anything else there
- The kernel text being copied at the beginning of the physical memory,
  it is unlikely to use the last 64kB (I doubt we'll ever support KVM
  on a system with something like 4MB of RAM, but patches are very
  welcome).

Let's call this VA the trampoline VA.

Now, we map our init page at 3 locations:
- idmap in the boot pgd
- trampoline VA in the boot pgd
- trampoline VA in the runtime pgd

The init scenario is now the following:
- We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
  runtime stack, runtime vectors
- Enable the MMU with the boot pgd
- Jump to a target into the trampoline page (remember, this is the same
  physical page!)
- Now switch to the runtime pgd (same VA, and still the same physical
  page!)
- Invalidate TLBs
- Set stack and vectors
- Profit! (or eret, if you only care about the code).

Note that we keep the boot mapping permanently (it is not strictly an
idmap anymore) to allow for CPU hotplug in later patches.
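
As a side note on the parameter passing, each PGD pointer is a 64-bit
value that travels to HYP as two 32-bit register values. The following
user-space sketch shows that split and the inverse join; the helper
names are hypothetical, and how the receiving side recombines the halves
is only modelled conceptually here.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit PGD pointer into the two 32-bit values passed in registers. */
static void toy_split_pgd(uint64_t pgd_ptr, uint32_t *low, uint32_t *high)
{
	assert(low && high);
	*low = (uint32_t)(pgd_ptr & ((UINT64_C(1) << 32) - 1));
	*high = (uint32_t)(pgd_ptr >> 32);
}

/* Inverse operation: rebuild the 64-bit pointer from its two halves. */
static uint64_t toy_join_pgd(uint32_t low, uint32_t high)
{
	return ((uint64_t)high << 32) | low;
}
```

Because each call carries one PGD in two registers, passing the boot
PGD, the runtime PGD, the stack and the vectors takes two calls.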

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/include/asm/kvm_host.h | 18 ---
 arch/arm/include/asm/kvm_mmu.h  | 21 ++--
 arch/arm/kvm/arm.c  |  9 ++
 arch/arm/kvm/init.S | 29 +++--
 arch/arm/kvm/mmu.c  | 71 ++---
 5 files changed, 101 insertions(+), 47 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index a7a0bb5..3556684 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -190,22 +190,32 @@ int kvm_arm_coproc_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *);
 int handle_exit(struct kvm_vcpu *vcpu, struct kvm_run *run,
int exception_index);
 
-static inline void __cpu_init_hyp_mode(unsigned long long pgd_ptr,
+static inline void __cpu_init_hyp_mode(unsigned long long boot_pgd_ptr,
+  unsigned long long pgd_ptr,
   unsigned long hyp_stack_ptr,
   unsigned long vector_ptr)
 {
unsigned long pgd_low, pgd_high;
 
-   pgd_low = (pgd_ptr & ((1ULL << 32) - 1));
-   pgd_high = (pgd_ptr >> 32ULL);
+   pgd_low = (boot_pgd_ptr & ((1ULL << 32) - 1));
+   pgd_high = (boot_pgd_ptr >> 32ULL);
 
/*
 * Call initialization code, and switch to the full blown
 * HYP code. The init code doesn't need to preserve these registers as
-* r1-r3 and r12 are already callee save according to the AAPCS.
+* r1-r3 and r12 are already callee saved according to the AAPCS.
 * Note that we slightly misuse the prototype by casing the pgd_low to
 * a void *.
+*
+* We don't have enough registers to perform the full init in one go.
+* Install the boot PGD first, and then install the runtime PGD,
+* stack pointer and vectors.
 */
+   kvm_call_hyp((void *)pgd_low, pgd_high, 0, 0);
+
+   pgd_low = (pgd_ptr & ((1ULL << 32) - 1));
+   pgd_high = (pgd_ptr >> 32ULL);
+
kvm_call_hyp((void *)pgd_low, pgd_high, hyp_stack_ptr, vector_ptr);
 }
 
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 92eb20d..3567a49 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -19,17 +19,29 @@
 #ifndef __ARM_KVM_MMU_H__
 #define __ARM_KVM_MMU_H__
 
-#include <asm/cacheflush.h>
-#include <asm/pgalloc.h>
+#include <asm/memory.h>
+#include <asm/page.h>
 
 /*
  * We directly use the kernel VA for the HYP, as we can directly share
  * the mapping (HTTBR covers TTBR1).
  */
-#define HYP_PAGE_OFFSET_MASK   (~0UL)
+#define HYP_PAGE_OFFSET_MASK   UL(~0)
 #define HYP_PAGE_OFFSET   PAGE_OFFSET
 #define KERN_TO_HYP(kva)   (kva)
 
+/*
+ * Our virtual mapping for the boot-time MMU-enable code. Must be
+ * shared across all the page-tables. Conveniently, we use the vectors
+ * page, where no kernel data will ever be shared with HYP.
+ */
+#define TRAMPOLINE_VA  UL(CONFIG_VECTORS_BASE)
+
+#ifndef __ASSEMBLY__
+
+#include 

[PATCH 1/7] ARM: KVM: simplify HYP mapping population

2013-04-02 Thread Marc Zyngier
The way we populate HYP mappings is a bit convoluted, to say the least.
Passing a pointer around to keep track of the current PFN is quite
odd, and we end up having two different PTE accessors for no good
reason.

Simplify the whole thing by unifying the two PTE accessors, passing
a pgprot_t around, and moving the various validity checks to the
upper layers.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/kvm/mmu.c | 100 ++---
 1 file changed, 41 insertions(+), 59 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 2f12e40..24811d1 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -125,54 +125,34 @@ void free_hyp_pmds(void)
 }
 
 static void create_hyp_pte_mappings(pmd_t *pmd, unsigned long start,
-   unsigned long end)
+   unsigned long end, unsigned long pfn,
+   pgprot_t prot)
 {
pte_t *pte;
unsigned long addr;
-   struct page *page;
 
-   for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
-   unsigned long hyp_addr = KERN_TO_HYP(addr);
-
-   pte = pte_offset_kernel(pmd, hyp_addr);
-   BUG_ON(!virt_addr_valid(addr));
-   page = virt_to_page(addr);
-   kvm_set_pte(pte, mk_pte(page, PAGE_HYP));
-   }
-}
-
-static void create_hyp_io_pte_mappings(pmd_t *pmd, unsigned long start,
-  unsigned long end,
-  unsigned long *pfn_base)
-{
-   pte_t *pte;
-   unsigned long addr;
-
-   for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
-   unsigned long hyp_addr = KERN_TO_HYP(addr);
-
-   pte = pte_offset_kernel(pmd, hyp_addr);
-   BUG_ON(pfn_valid(*pfn_base));
-   kvm_set_pte(pte, pfn_pte(*pfn_base, PAGE_HYP_DEVICE));
-   (*pfn_base)++;
+   for (addr = start; addr < end; addr += PAGE_SIZE) {
+   pte = pte_offset_kernel(pmd, addr);
+   kvm_set_pte(pte, pfn_pte(pfn, prot));
+   pfn++;
}
 }
 
 static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
-  unsigned long end, unsigned long *pfn_base)
+  unsigned long end, unsigned long pfn,
+  pgprot_t prot)
 {
pmd_t *pmd;
pte_t *pte;
unsigned long addr, next;
 
for (addr = start; addr < end; addr = next) {
-   unsigned long hyp_addr = KERN_TO_HYP(addr);
-   pmd = pmd_offset(pud, hyp_addr);
+   pmd = pmd_offset(pud, addr);
 
BUG_ON(pmd_sect(*pmd));
 
if (pmd_none(*pmd)) {
-   pte = pte_alloc_one_kernel(NULL, hyp_addr);
+   pte = pte_alloc_one_kernel(NULL, addr);
if (!pte) {
kvm_err("Cannot allocate Hyp pte\n");
return -ENOMEM;
@@ -182,25 +162,17 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 
next = pmd_addr_end(addr, end);
 
-   /*
-* If pfn_base is NULL, we map kernel pages into HYP with the
-* virtual address. Otherwise, this is considered an I/O
-* mapping and we map the physical region starting at
-* *pfn_base to [start, end[.
-*/
-   if (!pfn_base)
-   create_hyp_pte_mappings(pmd, addr, next);
-   else
-   create_hyp_io_pte_mappings(pmd, addr, next, pfn_base);
+   create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
+   pfn += (next - addr) >> PAGE_SHIFT;
}
 
return 0;
 }
 
-static int __create_hyp_mappings(void *from, void *to, unsigned long *pfn_base)
+static int __create_hyp_mappings(pgd_t *pgdp,
+unsigned long start, unsigned long end,
+unsigned long pfn, pgprot_t prot)
 {
-   unsigned long start = (unsigned long)from;
-   unsigned long end = (unsigned long)to;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -209,21 +181,14 @@ static int __create_hyp_mappings(void *from, void *to, unsigned long *pfn_base)
 
if (start >= end)
return -EINVAL;
-   /* Check for a valid kernel memory mapping */
-   if (!pfn_base && (!virt_addr_valid(from) || !virt_addr_valid(to - 1)))
-   return -EINVAL;
-   /* Check for a valid kernel IO mapping */
-   if (pfn_base && (!is_vmalloc_addr(from) || !is_vmalloc_addr(to - 1)))
-   return -EINVAL;
 
mutex_lock(&kvm_hyp_pgd_mutex);
-   for (addr = start; addr < end; addr = next) {
-   unsigned long hyp_addr = KERN_TO_HYP(addr);
-

[PATCH 7/7] ARM: KVM: perform HYP initilization for hotplugged CPUs

2013-04-02 Thread Marc Zyngier
Now that we have the necessary infrastructure to boot a hotplugged CPU
at any point in time, wire a CPU notifier that will perform the HYP
init for the incoming CPU.

Note that this depends on the platform code and/or firmware to boot the
incoming CPU with HYP mode enabled and return to the kernel by following
the normal boot path (HYP stub installed).
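
For readers unfamiliar with the notifier pattern, here is a small
user-space model of it. This is only a sketch: every name is
hypothetical, the chain is a fixed array, and the real kernel notifier
API is richer than this. It shows the one property this patch relies on:
a callback registered once is invoked for each CPU that comes up later,
and it ignores events it does not care about.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are made up; this only models the notifier pattern. */
enum toy_cpu_action { TOY_CPU_STARTING, TOY_CPU_DEAD };

typedef int (*toy_notifier_fn)(enum toy_cpu_action action, unsigned int cpu);

#define TOY_MAX_NOTIFIERS 4

static toy_notifier_fn toy_chain[TOY_MAX_NOTIFIERS];
static size_t toy_chain_len;
static unsigned long toy_hyp_ready_mask;	/* bit n set: CPU n ran HYP init */

/* Add a callback to the chain; returns 0 on success, -1 if the chain is full. */
static int toy_register_cpu_notifier(toy_notifier_fn fn)
{
	if (toy_chain_len == TOY_MAX_NOTIFIERS)
		return -1;
	toy_chain[toy_chain_len++] = fn;
	return 0;
}

/* Deliver a CPU event to every registered callback. */
static void toy_cpu_event(enum toy_cpu_action action, unsigned int cpu)
{
	size_t i;

	for (i = 0; i < toy_chain_len; i++)
		toy_chain[i](action, cpu);
}

/* Stand-in for the HYP init callback: react only to a CPU coming up. */
static int toy_hyp_init_notify(enum toy_cpu_action action, unsigned int cpu)
{
	assert(cpu < 8 * sizeof(toy_hyp_ready_mask));
	if (action == TOY_CPU_STARTING)
		toy_hyp_ready_mask |= 1UL << cpu;
	return 0;
}
```

After registering toy_hyp_init_notify, a TOY_CPU_STARTING event for
CPU 2 sets bit 2 of the mask, and a TOY_CPU_DEAD event changes nothing.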

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/kvm/arm.c | 47 +++
 1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index f0f3290..6cc076e 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -793,8 +793,9 @@ long kvm_arch_vm_ioctl(struct file *filp,
}
 }
 
-static void cpu_init_hyp_mode(void *vector)
+static void cpu_init_hyp_mode(void *dummy)
 {
+   phys_addr_t init_vector_addr = virt_to_phys(__kvm_hyp_init);
unsigned long long boot_pgd_ptr;
unsigned long long pgd_ptr;
unsigned long hyp_stack_ptr;
@@ -802,7 +803,7 @@ static void cpu_init_hyp_mode(void *vector)
unsigned long vector_ptr;
 
/* Switch from the HYP stub to our own HYP init vector */
-   __hyp_set_vectors((unsigned long)vector);
+   __hyp_set_vectors(init_vector_addr);
 
boot_pgd_ptr = (unsigned long long)kvm_mmu_get_boot_httbr();
pgd_ptr = (unsigned long long)kvm_mmu_get_httbr();
@@ -813,12 +814,28 @@ static void cpu_init_hyp_mode(void *vector)
__cpu_init_hyp_mode(boot_pgd_ptr, pgd_ptr, hyp_stack_ptr, vector_ptr);
 }
 
+static int hyp_init_cpu_notify(struct notifier_block *self,
+  unsigned long action, void *cpu)
+{
+   switch (action) {
+   case CPU_STARTING:
+   case CPU_STARTING_FROZEN:
+   cpu_init_hyp_mode(NULL);
+   break;
+   }
+
+   return NOTIFY_OK;
+}
+
+static struct notifier_block hyp_init_cpu_nb = {
+   .notifier_call = hyp_init_cpu_notify,
+};
+
 /**
  * Inits Hyp-mode on all online CPUs
  */
 static int init_hyp_mode(void)
 {
-   phys_addr_t init_phys_addr;
int cpu;
int err = 0;
 
@@ -851,19 +868,6 @@ static int init_hyp_mode(void)
}
 
/*
-* Execute the init code on each CPU.
-*
-* Note: The stack is not mapped yet, so don't do anything else than
-* initializing the hypervisor mode on each CPU using a local stack
-* space for temporary storage.
-*/
-   init_phys_addr = virt_to_phys(__kvm_hyp_init);
-   for_each_online_cpu(cpu) {
-   smp_call_function_single(cpu, cpu_init_hyp_mode,
-(void *)(long)init_phys_addr, 1);
-   }
-
-   /*
 * Map the Hyp-code called directly from the host
 */
err = create_hyp_mappings(__kvm_hyp_code_start, __kvm_hyp_code_end);
@@ -908,6 +912,11 @@ static int init_hyp_mode(void)
}
 
/*
+* Execute the init code on each CPU.
+*/
+   on_each_cpu(cpu_init_hyp_mode, NULL, 1);
+
+   /*
 * Init HYP view of VGIC
 */
err = kvm_vgic_hyp_init();
@@ -963,6 +972,12 @@ int kvm_arch_init(void *opaque)
if (err)
goto out_err;
 
+   err = register_cpu_notifier(&hyp_init_cpu_nb);
+   if (err) {
+   kvm_err("Cannot register HYP init CPU notifier (%d)\n", err);
+   goto out_err;
+   }
+
kvm_coproc_table_init();
return 0;
 out_err:
-- 
1.8.1.4




[PATCH 0/7] ARM: KVM: Revamping the HYP init code for fun and profit

2013-04-02 Thread Marc Zyngier
Over the past few weeks, I've gradually realized how broken our HYP
idmap code is. Badly broken.

The main problem is about supporting CPU hotplug. Imagine a CPU being
initialized normally, running VMs, and then being powered down. So
far, so good. Now mentally bring it back online. The CPU will come
back via the secondary CPU boot path, and then what? We cannot use it
anymore, because we need an idmap which is long gone, and because our
page tables are now live, containing the world-switch code, VM
structures, and other bits and pieces.

Another fun issue is that we don't have any TLB invalidation in the
HYP init code. And guess what? We cannot do it! HYP TLB invalidation
has to occur in HYP, and once we've installed the runtime page tables,
it is already too late. It is actually fairly easy to construct a
scenario where idmap and runtime pages have colliding translations.

The nail in the coffin was provided by Catalin Marinas who told me how
much he disliked the arm64 HYP idmap code, and made me realize that we
already have all the necessary code in arch/arm/kvm/mmu.c. It just
needs a tiny bit of care and affection. With a chainsaw.

The solution to the first two issues is a bit tricky, but doesn't
involve a lot of code. The hotplug problem mandates that we keep two
sets of page tables (boot and runtime). The TLB problem mandates that
we're able to transition from one PGD to another while in HYP,
invalidating the TLBs in the process.

To be able to do this, we need to share a page between the two page
tables. A page that will have the same VA in both configurations. All
we need is a VA that has the following properties:
- This VA can't be used to represent a kernel mapping.
- This VA will not conflict with the physical address of the kernel
  text

The vectors page VA seems to satisfy this requirement:
- The kernel never maps anything else there
- The kernel text being copied at the beginning of the physical
  memory, it is unlikely to use the last 64kB (I doubt we'll ever
  support KVM on a system with something like 4MB of RAM, but patches
  are very welcome).

Let's call this VA the trampoline VA.

Now, we map our init page at 3 locations:
- idmap in the boot pgd
- trampoline VA in the boot pgd
- trampoline VA in the runtime pgd

The init scenario is now the following:
- We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
  runtime stack, runtime vectors
- Enable the MMU with the boot pgd
- Jump to a target into the trampoline page (remember, this is the
  same physical page!)
- Now switch to the runtime pgd (same VA, and still the same physical
  page!)
- Invalidate TLBs
- Set stack and vectors
- Profit! (or eret, if you only care about the code).
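
The mapping scheme above can be modelled with a toy translation table in
user space. This is a sketch under stated assumptions, not the kernel
implementation: the structures, the flat lookup, the trampoline VA of
0xffff0000 and the init page PA of 0x80008000 are all made up for the
example. It shows why the PGD switch is safe: the trampoline VA resolves
to the same physical page in both tables, while the idmap only exists in
the boot tables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_MASK     UINT32_C(0xfffff000)
#define TOY_TRAMPOLINE_VA UINT32_C(0xffff0000)	/* assumed vectors page VA */
#define TOY_INIT_PA       UINT32_C(0x80008000)	/* made-up PA of the init page */

struct toy_mapping {
	uint32_t va;	/* page-aligned virtual address */
	uint32_t pa;	/* page-aligned physical address */
};

struct toy_pgd {
	const struct toy_mapping *maps;
	size_t count;
};

/* Boot tables: idmap of the init page plus the trampoline alias. */
static const struct toy_mapping toy_boot_maps[] = {
	{ TOY_INIT_PA,       TOY_INIT_PA },
	{ TOY_TRAMPOLINE_VA, TOY_INIT_PA },
};

/* Runtime tables: only the trampoline alias of the init page. */
static const struct toy_mapping toy_runtime_maps[] = {
	{ TOY_TRAMPOLINE_VA, TOY_INIT_PA },
};

static const struct toy_pgd toy_boot_pgd = { toy_boot_maps, 2 };
static const struct toy_pgd toy_runtime_pgd = { toy_runtime_maps, 1 };

/* Translate va through pgd; returns 1 and fills *pa on a hit, 0 otherwise. */
static int toy_translate(const struct toy_pgd *pgd, uint32_t va, uint32_t *pa)
{
	size_t i;

	assert(pgd && pa);
	for (i = 0; i < pgd->count; i++) {
		if ((va & TOY_PAGE_MASK) == pgd->maps[i].va) {
			*pa = pgd->maps[i].pa | (va & ~TOY_PAGE_MASK);
			return 1;
		}
	}
	return 0;
}
```

Code running at an offset inside the trampoline page translates to the
same physical address before and after the switch from toy_boot_pgd to
toy_runtime_pgd, whereas the idmap address no longer translates at all.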

Once we have this infrastructure in place, supporting CPU hot-plug is
a piece of cake. Just wire a cpu-notifier in the existing code.

This has been tested on both arm (VE TC2) and arm64 (Foundation Model).

Marc Zyngier (7):
  ARM: KVM: simplify HYP mapping population
  ARM: KVM: fix HYP mapping limitations around zero
  ARM: KVM: move to a KVM provided HYP idmap
  ARM: KVM: enforce page alignment for identity mapped code
  ARM: KVM: parametrize HYP page table freeing
  ARM: KVM: switch to a dual-step HYP init code
  ARM: KVM: perform HYP initilization for hotplugged CPUs

 arch/arm/include/asm/idmap.h|   1 -
 arch/arm/include/asm/kvm_host.h |  18 +++-
 arch/arm/include/asm/kvm_mmu.h  |  24 -
 arch/arm/kernel/vmlinux.lds.S   |   2 +-
 arch/arm/kvm/arm.c  |  58 ++
 arch/arm/kvm/init.S |  36 ++-
 arch/arm/kvm/mmu.c  | 232 +---
 arch/arm/mm/idmap.c |  31 +-
 8 files changed, 227 insertions(+), 175 deletions(-)

-- 
1.8.1.4




[PATCH 5/7] ARM: KVM: parametrize HYP page table freeing

2013-04-02 Thread Marc Zyngier
In order to prepare for having to deal with multiple HYP page tables,
pass the PGD parameter to the function performing the freeing of the
page tables.

Also move the freeing of the PGD itself there, and rename the
free_hyp_pmds to free_hyp_pgds.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/include/asm/kvm_mmu.h |  2 +-
 arch/arm/kvm/arm.c |  2 +-
 arch/arm/kvm/mmu.c | 30 +-
 3 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 3c71a1d..92eb20d 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -32,7 +32,7 @@
 
 int create_hyp_mappings(void *from, void *to);
 int create_hyp_io_mappings(void *from, void *to, phys_addr_t);
-void free_hyp_pmds(void);
+void free_hyp_pgds(void);
 
 int kvm_alloc_stage2_pgd(struct kvm *kvm);
 void kvm_free_stage2_pgd(struct kvm *kvm);
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 2ce90bb..6eba879 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -936,7 +936,7 @@ static int init_hyp_mode(void)
 out_free_context:
free_percpu(kvm_host_cpu_state);
 out_free_mappings:
-   free_hyp_pmds();
+   free_hyp_pgds();
 out_free_stack_pages:
for_each_possible_cpu(cpu)
free_page(per_cpu(kvm_arm_hyp_stack_page, cpu));
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index 7d23480..85b3553 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -86,42 +86,46 @@ static void free_ptes(pmd_t *pmd, unsigned long addr)
}
 }
 
-static void free_hyp_pgd_entry(unsigned long addr)
+static void free_hyp_pgd_entry(pgd_t *pgdp, unsigned long addr)
 {
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-   unsigned long hyp_addr = KERN_TO_HYP(addr);
 
-   pgd = hyp_pgd + pgd_index(hyp_addr);
-   pud = pud_offset(pgd, hyp_addr);
+   pgd = pgdp + pgd_index(addr);
+   pud = pud_offset(pgd, addr);
 
if (pud_none(*pud))
return;
BUG_ON(pud_bad(*pud));
 
-   pmd = pmd_offset(pud, hyp_addr);
+   pmd = pmd_offset(pud, addr);
free_ptes(pmd, addr);
pmd_free(NULL, pmd);
pud_clear(pud);
 }
 
 /**
- * free_hyp_pmds - free a Hyp-mode level-2 tables and child level-3 tables
+ * free_hyp_pgds - free Hyp-mode page tables
  *
- * Assumes this is a page table used strictly in Hyp-mode and therefore contains
+ * Assumes hyp_pgd is a page table used strictly in Hyp-mode and therefore contains
  * either mappings in the kernel memory area (above PAGE_OFFSET), or
  * device mappings in the vmalloc range (from VMALLOC_START to VMALLOC_END).
  */
-void free_hyp_pmds(void)
+void free_hyp_pgds(void)
 {
unsigned long addr;
 
mutex_lock(&kvm_hyp_pgd_mutex);
-   for (addr = PAGE_OFFSET; virt_addr_valid(addr); addr += PGDIR_SIZE)
-   free_hyp_pgd_entry(addr);
-   for (addr = VMALLOC_START; is_vmalloc_addr((void*)addr); addr += PGDIR_SIZE)
-   free_hyp_pgd_entry(addr);
+
+   if (hyp_pgd) {
+   for (addr = PAGE_OFFSET; virt_addr_valid(addr); addr += PGDIR_SIZE)
+   free_hyp_pgd_entry(hyp_pgd, KERN_TO_HYP(addr));
+   for (addr = VMALLOC_START; is_vmalloc_addr((void*)addr); addr += PGDIR_SIZE)
+   free_hyp_pgd_entry(hyp_pgd, KERN_TO_HYP(addr));
+   kfree(hyp_pgd);
+   }
+
mutex_unlock(&kvm_hyp_pgd_mutex);
 }
 
@@ -741,7 +745,7 @@ int kvm_mmu_init(void)
 
return 0;
 out:
-   kfree(hyp_pgd);
+   free_hyp_pgds();
return err;
 }
 
-- 
1.8.1.4




[PATCH 4/7] ARM: KVM: enforce page alignment for identity mapped code

2013-04-02 Thread Marc Zyngier
We're about to move to an init procedure where we rely on the
fact that the init code fits in a single page. Make sure we
align the idmap text on a page boundary, and that the code is
not bigger than a single page.
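
The two conditions being enforced (start on a page boundary, span at
most one page) can be expressed as plain arithmetic. The sketch below is
a user-space illustration with assumed 4kB pages and made-up helper
names; the patch itself enforces them at link and assembly time rather
than with code like this.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE UINT32_C(4096)	/* assumed 4kB pages */

/* Round addr up to the next page boundary (unchanged if already aligned). */
static uint32_t model_page_align(uint32_t addr)
{
	assert(addr <= UINT32_MAX - (MODEL_PAGE_SIZE - 1));	/* no overflow */
	return (addr + MODEL_PAGE_SIZE - 1) & ~(MODEL_PAGE_SIZE - 1);
}

/* 1 if [start, end) starts on a page boundary and is at most one page long. */
static int model_fits_in_one_page(uint32_t start, uint32_t end)
{
	if (start & (MODEL_PAGE_SIZE - 1))
		return 0;
	return end >= start && end - start <= MODEL_PAGE_SIZE;
}
```

With these helpers, a section starting at model_page_align(0x1001), that
is 0x2000, and ending at 0x2100 passes the check, while one that ends at
0x3004 or starts at 0x2010 does not.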

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/kernel/vmlinux.lds.S | 2 +-
 arch/arm/kvm/init.S   | 7 +++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index b571484..d9dd265 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -20,7 +20,7 @@
VMLINUX_SYMBOL(__idmap_text_start) = .; \
*(.idmap.text)  \
VMLINUX_SYMBOL(__idmap_text_end) = .;   \
-   ALIGN_FUNCTION();   \
+   . = ALIGN(PAGE_SIZE);   \
VMLINUX_SYMBOL(__hyp_idmap_text_start) = .; \
*(.hyp.idmap.text)  \
VMLINUX_SYMBOL(__hyp_idmap_text_end) = .;
diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
index 9f37a79..35a463f 100644
--- a/arch/arm/kvm/init.S
+++ b/arch/arm/kvm/init.S
@@ -111,4 +111,11 @@ __do_hyp_init:
.globl __kvm_hyp_init_end
 __kvm_hyp_init_end:
 
+   /*
+* The above code *must* fit in a single page for the trampoline
+* madness to work. Whoever decides to change it must make sure
+* we map the right amount of memory for the trampoline to work.
+* The line below ensures any breakage will get noticed.
+*/
+   .org    __kvm_hyp_init + PAGE_SIZE
.popsection
-- 
1.8.1.4




[PATCH 3/7] ARM: KVM: move to a KVM provided HYP idmap

2013-04-02 Thread Marc Zyngier
After the HYP page table rework, it is pretty easy to let the KVM
code provide its own idmap, rather than expecting the kernel to
provide it. It actually takes less code to do so.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/include/asm/idmap.h   |  1 -
 arch/arm/include/asm/kvm_mmu.h |  1 -
 arch/arm/kvm/mmu.c | 24 +++-
 arch/arm/mm/idmap.c| 31 +--
 4 files changed, 24 insertions(+), 33 deletions(-)

diff --git a/arch/arm/include/asm/idmap.h b/arch/arm/include/asm/idmap.h
index 1a66f907..bf863ed 100644
--- a/arch/arm/include/asm/idmap.h
+++ b/arch/arm/include/asm/idmap.h
@@ -8,7 +8,6 @@
 #define __idmap __section(.idmap.text) noinline notrace
 
 extern pgd_t *idmap_pgd;
-extern pgd_t *hyp_pgd;
 
 void setup_mm_for_reboot(void);
 
diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
index 970f3b5..3c71a1d 100644
--- a/arch/arm/include/asm/kvm_mmu.h
+++ b/arch/arm/include/asm/kvm_mmu.h
@@ -21,7 +21,6 @@
 
 #include <asm/cacheflush.h>
 #include <asm/pgalloc.h>
-#include <asm/idmap.h>
 
 /*
  * We directly use the kernel VA for the HYP, as we can directly share
diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index eb4f8fa..7d23480 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -32,6 +32,7 @@
 
 extern char  __hyp_idmap_text_start[], __hyp_idmap_text_end[];
 
+static pgd_t *hyp_pgd;
 static DEFINE_MUTEX(kvm_hyp_pgd_mutex);
 
 static void kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa)
@@ -715,12 +716,33 @@ phys_addr_t kvm_mmu_get_httbr(void)
 
 int kvm_mmu_init(void)
 {
+   unsigned long hyp_idmap_start = virt_to_phys(__hyp_idmap_text_start);
+   unsigned long hyp_idmap_end = virt_to_phys(__hyp_idmap_text_end);
+   int err;
+
+   hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
if (!hyp_pgd) {
kvm_err("Hyp mode PGD not allocated\n");
-   return -ENOMEM;
+   err = -ENOMEM;
+   goto out;
+   }
+
+   /* Create the idmap in the boot page tables */
+   err =   __create_hyp_mappings(boot_hyp_pgd,
+ hyp_idmap_start, hyp_idmap_end,
+ __phys_to_pfn(hyp_idmap_start),
+ PAGE_HYP);
+
+   if (err) {
+   kvm_err("Failed to idmap %lx-%lx\n",
+   hyp_idmap_start, hyp_idmap_end);
+   goto out;
}
 
return 0;
+out:
+   kfree(hyp_pgd);
+   return err;
 }
 
 /**
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index 5ee505c..9c467d0 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -83,37 +83,10 @@ static void identity_mapping_add(pgd_t *pgd, const char *text_start,
} while (pgd++, addr = next, addr != end);
 }
 
-#if defined(CONFIG_ARM_VIRT_EXT) && defined(CONFIG_ARM_LPAE)
-pgd_t *hyp_pgd;
-
-extern char  __hyp_idmap_text_start[], __hyp_idmap_text_end[];
-
-static int __init init_static_idmap_hyp(void)
-{
-   hyp_pgd = kzalloc(PTRS_PER_PGD * sizeof(pgd_t), GFP_KERNEL);
-   if (!hyp_pgd)
-   return -ENOMEM;
-
-   pr_info("Setting up static HYP identity map for 0x%p - 0x%p\n",
-   __hyp_idmap_text_start, __hyp_idmap_text_end);
-   identity_mapping_add(hyp_pgd, __hyp_idmap_text_start,
-__hyp_idmap_text_end, PMD_SECT_AP1);
-
-   return 0;
-}
-#else
-static int __init init_static_idmap_hyp(void)
-{
-   return 0;
-}
-#endif
-
 extern char  __idmap_text_start[], __idmap_text_end[];
 
 static int __init init_static_idmap(void)
 {
-   int ret;
-
idmap_pgd = pgd_alloc(&init_mm);
if (!idmap_pgd)
return -ENOMEM;
@@ -123,12 +96,10 @@ static int __init init_static_idmap(void)
identity_mapping_add(idmap_pgd, __idmap_text_start,
 __idmap_text_end, 0);
 
-   ret = init_static_idmap_hyp();
-
/* Flush L1 for the hardware to see this page table content */
flush_cache_louis();
 
-   return ret;
+   return 0;
 }
 early_initcall(init_static_idmap);
 
-- 
1.8.1.4




Re: [PATCH V3 WIP 3/3] disable vhost_verify_ring_mappings check

2013-04-02 Thread Michael S. Tsirkin
On Mon, Apr 01, 2013 at 06:05:47PM -0700, Nicholas A. Bellinger wrote:
 On Fri, 2013-03-29 at 09:14 +0100, Paolo Bonzini wrote: 
  Il 29/03/2013 03:53, Nicholas A. Bellinger ha scritto:
   On Thu, 2013-03-28 at 06:13 -0400, Paolo Bonzini wrote:
   I think it's the right thing to do, but maybe not the right place
   to do this, need to reset after all IO is done, before
   ring memory is write protected.
  
   Our emails are crossing each other unfortunately, but I want to
   reinforce this: ring memory is not write protected.
   
   Understood.  However, AFAICT the act of write protecting these ranges
   for ROM generates the offending callbacks to vhost_set_memory().
   
   The part that I'm missing is if ring memory is not being write protected
   by make_bios_readonly_intel(), why are the vhost_set_memory() calls
   being invoked..?
  
  Because mappings change for the region that contains the ring.  vhost
  doesn't know yet that the changes do not affect ring memory,
  vhost_set_memory() is called exactly to ascertain that.
  
 
 Hi Paolo & Co,
 
 Here's a bit more information on what is going on with the same
 cpu_physical_memory_map() failure in vhost_verify_ring_mappings()..
 
 So as before, at the point that seabios is marking memory as readonly
 for ROM in src/shadow.c:make_bios_readonly_intel() with the following
 call:
 
 Calling pci_config_writeb(0x31): bdf: 0x pam: 0x005b
 
 the memory API update hook triggers back into vhost_region_del() code,
 and the following occurs:
 
 Entering vhost_region_del section: 0x7fd30a213b60 offset_within_region: 
 0xc size: 2146697216 readonly: 0
 vhost_region_del: is_rom: 0, rom_device: 0
 vhost_region_del: readable: 1
 vhost_region_del: ram_addr 0x0, addr: 0x0 size: 2147483648
 vhost_region_del: name: pc.ram
 Entering vhost_set_memory, section: 0x7fd30a213b60 add: 0, dev->started: 1
 Entering verify_ring_mappings: start_addr 0x000c size: 2146697216
 verify_ring_mappings: ring_phys 0x0 ring_size: 0
 verify_ring_mappings: ring_phys 0x0 ring_size: 0
 verify_ring_mappings: ring_phys 0xed000 ring_size: 5124
 verify_ring_mappings: calling cpu_physical_memory_map ring_phys: 0xed000 l: 
 5124
 address_space_map: addr: 0xed000, plen: 5124
 address_space_map: l: 4096, len: 5124
 phys_page_find got PHYS_MAP_NODE_NIL ..
 address_space_map: section: 0x7fd30fabaed0 memory_region_is_ram: 0 readonly: 0
 address_space_map: section: 0x7fd30fabaed0 offset_within_region: 0x0 section 
 size: 18446744073709551615
 Unable to map ring buffer for ring 2, l: 4096
 
 So the interesting part is that phys_page_find() is not able to locate
 the corresponding page for vq->ring_phys: 0xed000 from the
 vhost_region_del() callback with section->offset_within_region:
 0xc..
 
 Is there any case where this would not be considered a bug..? 
 
 register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0
 register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0
 register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0
 Entering vhost_region_add section: 0x7fd30a213aa0 offset_within_region: 
 0xc size: 32768 readonly: 1
 vhost_region_add: is_rom: 0, rom_device: 0
 vhost_region_add: readable: 1
 vhost_region_add: ram_addr 0x, addr: 0x   0 size: 
 2147483648
 vhost_region_add: name: pc.ram
 Entering vhost_set_memory, section: 0x7fd30a213aa0 add: 1, dev->started: 1
 Entering verify_ring_mappings: start_addr 0x000c size: 32768
 verify_ring_mappings: ring_phys 0x0 ring_size: 0
 verify_ring_mappings: ring_phys 0x0 ring_size: 0
 verify_ring_mappings: ring_phys 0xed000 ring_size: 5124
 verify_ring_mappings: Got !ranges_overlap, skipping
 register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0
 Entering vhost_region_add section: 0x7fd30a213aa0 offset_within_region: 
 0xc8000 size: 2146664448 readonly: 0
 vhost_region_add: is_rom: 0, rom_device: 0
 vhost_region_add: readable: 1
 vhost_region_add: ram_addr 0x, addr: 0x   0 size: 
 2147483648
 vhost_region_add: name: pc.ram
 Entering vhost_set_memory, section: 0x7fd30a213aa0 add: 1, dev->started: 1
 Entering verify_ring_mappings: start_addr 0x000c8000 size: 2146664448
 verify_ring_mappings: ring_phys 0x0 ring_size: 0
 verify_ring_mappings: ring_phys 0x0 ring_size: 0
 verify_ring_mappings: ring_phys 0xed000 ring_size: 5124
 verify_ring_mappings: calling cpu_physical_memory_map ring_phys: 0xed000 l: 
 5124
 address_space_map: addr: 0xed000, plen: 5124
 address_space_map: l: 4096, len: 5124
 address_space_map: section: 0x7fd30fabb020 memory_region_is_ram: 1 readonly: 0
 address_space_map: section: 0x7fd30fabb020 offset_within_region: 0xc8000 
 section size: 2146664448
 address_space_map: l: 4096, len: 1028
 address_space_map: section: 0x7fd30fabb020 memory_region_is_ram: 1 readonly: 0
 address_space_map: section: 0x7fd30fabb020 offset_within_region: 0xc8000 
 section size: 2146664448
 address_space_map: Calling qemu_ram_ptr_length: raddr: 0x

Re: [PATCH uq/master v2 1/2] kvm: reset state from the CPU's reset method

2013-04-02 Thread Gleb Natapov
On Fri, Mar 22, 2013 at 09:37:16PM +0100, Paolo Bonzini wrote:
 Now that we have a CPU object with a reset method, it is better to
 keep the KVM reset close to the CPU reset.  Using qemu_register_reset
 as we do now keeps them far apart.
 
 As a side effect, a CPU reset (cpu_reset) will reset the KVM state too.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
  include/sysemu/kvm.h   |  2 --
  kvm-all.c  | 11 ---
  target-arm/kvm.c   |  4 
  target-i386/cpu.c  |  5 +
  target-i386/kvm_i386.h |  1 +
  target-ppc/kvm.c   |  4 
  target-s390x/cpu.c |  4 
  target-s390x/cpu.h |  1 +
  8 files changed, 11 insertions(+), 21 deletions(-)
 
 diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
 index f2d97b5..50072c5 100644
 --- a/include/sysemu/kvm.h
 +++ b/include/sysemu/kvm.h
 @@ -199,8 +199,6 @@ int kvm_arch_init_vcpu(CPUState *cpu);
  /* Returns VCPU ID to be used on KVM_CREATE_VCPU ioctl() */
  unsigned long kvm_arch_vcpu_id(CPUState *cpu);
  
 -void kvm_arch_reset_vcpu(CPUState *cpu);
 -
  int kvm_arch_on_sigbus_vcpu(CPUState *cpu, int code, void *addr);
  int kvm_arch_on_sigbus(int code, void *addr);
  
 diff --git a/kvm-all.c b/kvm-all.c
 index 9b433d3..57616ef 100644
 --- a/kvm-all.c
 +++ b/kvm-all.c
 @@ -207,13 +207,6 @@ static int kvm_set_user_memory_region(KVMState *s, KVMSlot *slot)
  return kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
  }
  
 -static void kvm_reset_vcpu(void *opaque)
 -{
 -CPUState *cpu = opaque;
 -
 -kvm_arch_reset_vcpu(cpu);
 -}
 -
  int kvm_init_vcpu(CPUState *cpu)
  {
  KVMState *s = kvm_state;
 @@ -253,10 +246,6 @@ int kvm_init_vcpu(CPUState *cpu)
  }
  
  ret = kvm_arch_init_vcpu(cpu);
 -if (ret == 0) {
 -qemu_register_reset(kvm_reset_vcpu, cpu);
 -kvm_arch_reset_vcpu(cpu);
 -}
  err:
  return ret;
  }
 diff --git a/target-arm/kvm.c b/target-arm/kvm.c
 index 82e2e08..841b85f 100644
 --- a/target-arm/kvm.c
 +++ b/target-arm/kvm.c
 @@ -430,10 +430,6 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run 
 *run)
  return 0;
  }
  
 -void kvm_arch_reset_vcpu(CPUState *cs)
 -{
 -}
 -
  bool kvm_arch_stop_on_emulation_error(CPUState *cs)
  {
  return true;
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index a0640db..a5746cd 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -24,6 +24,7 @@
  #include "cpu.h"
  #include "sysemu/kvm.h"
  #include "sysemu/cpus.h"
 +#include "kvm_i386.h"
  #include "topology.h"
  
  #include "qemu/option.h"
 @@ -2015,6 +2016,10 @@ static void x86_cpu_reset(CPUState *s)
  }
  
  s->halted = !cpu_is_bsp(cpu);
 +
 +if (kvm_enabled()) {
 +kvm_arch_reset_vcpu(s);
 +}
  #endif
  }
  
 diff --git a/target-i386/kvm_i386.h b/target-i386/kvm_i386.h
 index 4392ab4..3accc2d 100644
 --- a/target-i386/kvm_i386.h
 +++ b/target-i386/kvm_i386.h
 @@ -14,6 +14,7 @@
  #include "sysemu/kvm.h"
  
  bool kvm_allows_irq0_override(void);
 +void kvm_arch_reset_vcpu(CPUState *cs);
  
  int kvm_device_pci_assign(KVMState *s, PCIHostDeviceAddress *dev_addr,
uint32_t flags, uint32_t *dev_id);
 diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
 index e663ff0..0adea12 100644
 --- a/target-ppc/kvm.c
 +++ b/target-ppc/kvm.c
 @@ -424,10 +424,6 @@ int kvm_arch_init_vcpu(CPUState *cs)
  return ret;
  }
  
 -void kvm_arch_reset_vcpu(CPUState *cpu)
 -{
 -}
 -
  static void kvm_sw_tlb_put(PowerPCCPU *cpu)
  {
  CPUPPCState *env = &cpu->env;
 diff --git a/target-s390x/cpu.c b/target-s390x/cpu.c
 index 23fe51f..6321384 100644
 --- a/target-s390x/cpu.c
 +++ b/target-s390x/cpu.c
 @@ -84,6 +84,10 @@ static void s390_cpu_reset(CPUState *s)
   * after incrementing the cpu counter */
  #if !defined(CONFIG_USER_ONLY)
  s->halted = 1;
 +
 +if (kvm_enabled()) {
 +kvm_arch_reset_vcpu(s);
Does this compile with kvm support disabled?

 +}
  #endif
  tlb_flush(env, 1);
  }
 diff --git a/target-s390x/cpu.h b/target-s390x/cpu.h
 index e351005..fc84159 100644
 --- a/target-s390x/cpu.h
 +++ b/target-s390x/cpu.h
 @@ -352,6 +352,7 @@ void s390x_cpu_timer(void *opaque);
  int s390_virtio_hypercall(CPUS390XState *env);
  
  #ifdef CONFIG_KVM
 +void kvm_arch_reset_vcpu(CPUState *cs);
  void kvm_s390_interrupt(S390CPU *cpu, int type, uint32_t code);
  void kvm_s390_virtio_irq(S390CPU *cpu, int config_change, uint64_t token);
  void kvm_s390_interrupt_internal(S390CPU *cpu, int type, uint32_t parm,

--
Gleb.
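
On the question of building with KVM support disabled, here is a minimal
sketch (not QEMU's actual headers) of one common way to keep such a call
site compiling. CONFIG_KVM, kvm_enabled() and kvm_arch_reset_vcpu() are the
names used in the patch; the stub body, the CPUState fields and
cpu_reset_sketch() are hypothetical and exist only for illustration. The
idea: when KVM is compiled out, kvm_enabled() becomes a constant 0 and a
stub keeps a declaration in scope, so the guarded call still compiles and
never runs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for QEMU's CPUState; only what the sketch needs. */
typedef struct CPUState {
    bool halted;
    int kvm_resets;   /* counts calls into the KVM reset hook */
} CPUState;

#ifdef CONFIG_KVM
/* KVM build: the hook does real work and KVM may be active at run time. */
static bool kvm_allowed = true;
#define kvm_enabled() (kvm_allowed)
static void kvm_arch_reset_vcpu(CPUState *cs)
{
    cs->kvm_resets++;
}
#else
/* Non-KVM build: constant false, plus a stub so the call site still has
 * a declaration to compile against. */
#define kvm_enabled() (0)
static inline void kvm_arch_reset_vcpu(CPUState *cs)
{
    (void)cs;
}
#endif

/* Models the shape of the reset hooks touched by the patch. */
static void cpu_reset_sketch(CPUState *s)
{
    assert(s);
    s->halted = true;

    if (kvm_enabled()) {
        kvm_arch_reset_vcpu(s);
    }
}
```

Built without -DCONFIG_KVM, cpu_reset_sketch() only sets the halted flag;
built with it, the KVM hook runs as part of every CPU reset.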


RE: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support

2013-04-02 Thread Bhushan Bharat-R65777


 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Tuesday, April 02, 2013 1:57 PM
 To: Bhushan Bharat-R65777
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421
 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
 
 
 On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote:
 
 
 
  -Original Message-
  From: Alexander Graf [mailto:ag...@suse.de]
  Sent: Thursday, March 28, 2013 10:06 PM
  To: Bhushan Bharat-R65777
  Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421;
  Bhushan
  Bharat-R65777
  Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub
  support
 
 
  On 21.03.2013, at 07:25, Bharat Bhushan wrote:
 
  From: Bharat Bhushan bharat.bhus...@freescale.com
 
  This patch adds the debug stub support on booke/bookehv.
  Now QEMU debug stub can use hw breakpoint, watchpoint and software
  breakpoint to debug guest.
 
  Debug registers are saved/restored on vcpu_put()/vcpu_get().
  Also the debug registers are saved restored only if guest is using
  debug resources.
 
  Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
  ---
  v2:
  - save/restore in vcpu_get()/vcpu_put()
  - some more minor cleanup based on review comments.
 
  arch/powerpc/include/asm/kvm_host.h |   10 ++
  arch/powerpc/include/uapi/asm/kvm.h |   22 +++-
  arch/powerpc/kvm/booke.c|  252 
  -
 --
  arch/powerpc/kvm/e500_emulate.c |   10 ++
  4 files changed, 272 insertions(+), 22 deletions(-)
 
  diff --git a/arch/powerpc/include/asm/kvm_host.h
  b/arch/powerpc/include/asm/kvm_host.h
  index f4ba881..8571952 100644
  --- a/arch/powerpc/include/asm/kvm_host.h
  +++ b/arch/powerpc/include/asm/kvm_host.h
  @@ -504,7 +504,17 @@ struct kvm_vcpu_arch {
u32 mmucfg;
u32 epr;
u32 crit_save;
  + /* guest debug registers*/
struct kvmppc_booke_debug_reg dbg_reg;
  + /* shadow debug registers */
  + struct kvmppc_booke_debug_reg shadow_dbg_reg;
  + /* host debug registers*/
  + struct kvmppc_booke_debug_reg host_dbg_reg;
  + /*
  +  * Flag indicating that debug registers are used by guest
  +  * and requires save restore.
  + */
  + bool debug_save_restore;
  #endif
gpa_t paddr_accessed;
gva_t vaddr_accessed;
  diff --git a/arch/powerpc/include/uapi/asm/kvm.h
  b/arch/powerpc/include/uapi/asm/kvm.h
  index 15f9a00..d7ce449 100644
  --- a/arch/powerpc/include/uapi/asm/kvm.h
  +++ b/arch/powerpc/include/uapi/asm/kvm.h
  @@ -25,6 +25,7 @@
  /* Select powerpc specific features in <linux/kvm.h> */
  #define __KVM_HAVE_SPAPR_TCE
  #define __KVM_HAVE_PPC_SMT
  +#define __KVM_HAVE_GUEST_DEBUG
 
  struct kvm_regs {
__u64 pc;
  @@ -267,7 +268,24 @@ struct kvm_fpu {
__u64 fpr[32];
  };
 
  +/*
  + * Defines for h/w breakpoint, watchpoint (read, write or both) and
  + * software breakpoint.
  + * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status
  + * for KVM_DEBUG_EXIT.
  + */
  +#define KVMPPC_DEBUG_NONE         0x0
  +#define KVMPPC_DEBUG_BREAKPOINT   (1UL << 1)
  +#define KVMPPC_DEBUG_WATCH_WRITE  (1UL << 2)
  +#define KVMPPC_DEBUG_WATCH_READ   (1UL << 3)
  struct kvm_debug_exit_arch {
  + __u64 address;
  + /*
  +  * exiting to userspace because of h/w breakpoint, watchpoint
  +  * (read, write or both) and software breakpoint.
  +  */
  + __u32 status;
  + __u32 reserved;
  };
 
  /* for KVM_SET_GUEST_DEBUG */
  @@ -279,10 +297,6 @@ struct kvm_guest_debug_arch {
 * Type denotes h/w breakpoint, read watchpoint, write
 * watchpoint or watchpoint (both read and write).
 */
  -#define KVMPPC_DEBUG_NOTYPE       0x0
  -#define KVMPPC_DEBUG_BREAKPOINT   (1UL << 1)
  -#define KVMPPC_DEBUG_WATCH_WRITE  (1UL << 2)
  -#define KVMPPC_DEBUG_WATCH_READ   (1UL << 3)
__u32 type;
__u32 reserved;
} bp[16];
  diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
  index
  1de93a8..bf20056 100644
  --- a/arch/powerpc/kvm/booke.c
  +++ b/arch/powerpc/kvm/booke.c
  @@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu)
   #endif
   }
 
  +static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu)
  +{
  + /* Synchronize guest's desire to get debug interrupts into shadow MSR */
  +#ifndef CONFIG_KVM_BOOKE_HV
  + vcpu->arch.shadow_msr &= ~MSR_DE;
  + vcpu->arch.shadow_msr |= vcpu->arch.shared->msr & MSR_DE;
  +#endif
  +
  + /* Force enable debug interrupts when user space wants to debug */
  + if (vcpu->guest_debug) {
  +#ifdef CONFIG_KVM_BOOKE_HV
  + /*
  +  * Since there is no shadow MSR, sync MSR_DE into the guest
  +  * visible MSR. Do not allow guest to change MSR[DE].
  +  */
  + vcpu->arch.shared->msr |= MSR_DE;
  + mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP);
 
  This mtspr should really just be a bit or in shadow_mspr when
  guest_debug gets enabled. It should automatically get synchronized as
  soon as the next
  

Re: [PATCH] pmu: prepare for migration support

2013-04-02 Thread Gleb Natapov
On Thu, Mar 28, 2013 at 05:18:35PM +0100, Paolo Bonzini wrote:
 In order to migrate the PMU state correctly, we need to restore the
 values of MSR_CORE_PERF_GLOBAL_STATUS (a read-only register) and
 MSR_CORE_PERF_GLOBAL_OVF_CTRL (which has side effects when written).
 We also need to write the full 40-bit value of the performance counter,
  which would only be possible with a v3 architectural PMU's
  full-width counter MSRs.
 
  To distinguish host-initiated writes from the guest's, pass the
  full struct msr_data to kvm_pmu_set_msr.
 
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
Applied, thanks.

 ---
  arch/x86/include/asm/kvm_host.h |  2 +-
  arch/x86/kvm/pmu.c  | 14 +++---
  arch/x86/kvm/x86.c  |  4 ++--
  3 files changed, 14 insertions(+), 6 deletions(-)
 
 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
 index 36fba01..e2e09f3 100644
 --- a/arch/x86/include/asm/kvm_host.h
 +++ b/arch/x86/include/asm/kvm_host.h
 @@ -1029,7 +1029,7 @@ void kvm_pmu_reset(struct kvm_vcpu *vcpu);
  void kvm_pmu_cpuid_update(struct kvm_vcpu *vcpu);
  bool kvm_pmu_msr(struct kvm_vcpu *vcpu, u32 msr);
  int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *data);
 -int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 +int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
  int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
  void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
  void kvm_deliver_pmi(struct kvm_vcpu *vcpu);
 diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
 index cfc258a..c53e797 100644
 --- a/arch/x86/kvm/pmu.c
 +++ b/arch/x86/kvm/pmu.c
 @@ -360,10 +360,12 @@ int kvm_pmu_get_msr(struct kvm_vcpu *vcpu, u32 index, 
 u64 *data)
   return 1;
  }
  
 -int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data)
 +int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
  {
   struct kvm_pmu *pmu = &vcpu->arch.pmu;
   struct kvm_pmc *pmc;
 + u32 index = msr_info->index;
 + u64 data = msr_info->data;
  
   switch (index) {
   case MSR_CORE_PERF_FIXED_CTR_CTRL:
 @@ -375,6 +377,10 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, 
 u64 data)
   }
   break;
   case MSR_CORE_PERF_GLOBAL_STATUS:
 + if (msr_info->host_initiated) {
 + pmu->global_status = data;
 + return 0;
 + }
   break; /* RO MSR */
   case MSR_CORE_PERF_GLOBAL_CTRL:
   if (pmu->global_ctrl == data)
 @@ -386,7 +392,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 
 data)
   break;
   case MSR_CORE_PERF_GLOBAL_OVF_CTRL:
   if (!(data & (pmu->global_ctrl_mask & ~(3ull<<62)))) {
 - pmu->global_status &= ~data;
 + if (!msr_info->host_initiated)
 + pmu->global_status &= ~data;
   pmu->global_ovf_ctrl = data;
   return 0;
   }
 @@ -394,7 +401,8 @@ int kvm_pmu_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 
 data)
   default:
   if ((pmc = get_gp_pmc(pmu, index, MSR_IA32_PERFCTR0)) ||
   (pmc = get_fixed_pmc(pmu, index))) {
 - data = (s64)(s32)data;
 + if (!msr_info->host_initiated)
 + data = (s64)(s32)data;
   pmc->counter += data - read_pmc(pmc);
   return 0;
   } else if ((pmc = get_gp_pmc(pmu, index, MSR_P6_EVNTSEL0))) {
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 3e0a8ba..1d928af 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2042,7 +2042,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
 msr_data *msr_info)
   case MSR_P6_EVNTSEL0:
   case MSR_P6_EVNTSEL1:
   if (kvm_pmu_msr(vcpu, msr))
 - return kvm_pmu_set_msr(vcpu, msr, data);
 + return kvm_pmu_set_msr(vcpu, msr_info);
  
   if (pr || data != 0)
   vcpu_unimpl(vcpu, "disabled perfctr wrmsr: "
 @@ -2088,7 +2088,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct 
 msr_data *msr_info)
   if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
   return xen_hvm_config(vcpu, data);
   if (kvm_pmu_msr(vcpu, msr))
 - return kvm_pmu_set_msr(vcpu, msr, data);
 + return kvm_pmu_set_msr(vcpu, msr_info);
   if (!ignore_msrs) {
   vcpu_unimpl(vcpu, "unhandled wrmsr: 0x%x data %llx\n",
   msr, data);
 -- 
 1.8.1.4

--
Gleb.
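
For readers following the migration logic, a small userspace sketch of the
host_initiated distinction the patch introduces. This is a simplified model
and not the kernel code: the struct names, the register indices and the
plain counter assignment (the kernel adjusts the counter relative to
read_pmc()) are assumptions made for illustration. Only the three
behaviours come from the patch: the host may write the read-only status
register, a host write to the overflow-control register has no side effect
on the status register, and only guest writes are sign-extended from 32
bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the kernel structures. */
struct msr_data_sketch {
    bool host_initiated;   /* true for writes coming from userspace */
    uint32_t index;
    uint64_t data;
};

struct pmu_sketch {
    uint64_t global_status;
    uint64_t global_ovf_ctrl;
    uint64_t counter;
};

/* Made-up register indices, not real MSR numbers. */
enum { SK_GLOBAL_STATUS = 1, SK_GLOBAL_OVF_CTRL = 2, SK_COUNTER = 3 };

/* Returns 0 if the write was accepted, 1 otherwise. */
static int pmu_set_msr_sketch(struct pmu_sketch *pmu,
                              const struct msr_data_sketch *msr)
{
    uint64_t data;

    assert(pmu && msr);
    data = msr->data;

    switch (msr->index) {
    case SK_GLOBAL_STATUS:
        /* Read-only for the guest; the host may restore it directly. */
        if (msr->host_initiated) {
            pmu->global_status = data;
            return 0;
        }
        return 1;
    case SK_GLOBAL_OVF_CTRL:
        /* Guest writes clear status bits; host writes only store the value. */
        if (!msr->host_initiated)
            pmu->global_status &= ~data;
        pmu->global_ovf_ctrl = data;
        return 0;
    case SK_COUNTER:
        /* Guest writes are sign-extended from 32 bits (the narrowing cast
         * is implementation-defined in C; common compilers truncate).
         * Host writes keep the full value. */
        if (!msr->host_initiated)
            data = (uint64_t)(int64_t)(int32_t)data;
        pmu->counter = data;
        return 0;
    default:
        return 1;
    }
}
```

With this split, restoring saved state on the destination reproduces the
registers exactly, while guest-visible semantics stay unchanged.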


Re: [PATCH V2 2/2] tcm_vhost: Use vq->private_data to indicate if the endpoint is setup

2013-04-02 Thread Asias He
On Tue, Apr 02, 2013 at 03:15:31PM +0300, Michael S. Tsirkin wrote:
 On Mon, Apr 01, 2013 at 10:13:47AM +0800, Asias He wrote:
  On Sun, Mar 31, 2013 at 11:20:24AM +0300, Michael S. Tsirkin wrote:
   On Fri, Mar 29, 2013 at 02:22:52PM +0800, Asias He wrote:
On Thu, Mar 28, 2013 at 11:18:22AM +0200, Michael S. Tsirkin wrote:
 On Thu, Mar 28, 2013 at 04:10:02PM +0800, Asias He wrote:
  On Thu, Mar 28, 2013 at 08:16:59AM +0200, Michael S. Tsirkin wrote:
   On Thu, Mar 28, 2013 at 10:17:28AM +0800, Asias He wrote:
Currently, vs->vs_endpoint is used to indicate whether the endpoint is
set up. It is set or cleared in vhost_scsi_set_endpoint() or
vhost_scsi_clear_endpoint() under the vs->dev.mutex lock. However, when
we check it in vhost_scsi_handle_vq(), we ignore the lock.

Instead of using vs->vs_endpoint and the vs->dev.mutex lock to
indicate the status of the endpoint, we use the per-virtqueue
vq->private_data to indicate it. This way we only need to take the
vq->mutex lock, which is per queue, so concurrent multiqueue
processing sees less lock contention. Further, on the read side of
vq->private_data, we do not even need to take the lock if it is
accessed in the vhost worker thread, because it is protected by vhost
rcu.

Signed-off-by: Asias He as...@redhat.com
---
 drivers/vhost/tcm_vhost.c | 38 
+-
 1 file changed, 33 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/tcm_vhost.c 
b/drivers/vhost/tcm_vhost.c
index 5e3d4487..0524267 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -67,7 +67,6 @@ struct vhost_scsi {
/* Protected by vhost_scsi->dev.mutex */
struct tcm_vhost_tpg *vs_tpg[VHOST_SCSI_MAX_TARGET];
char vs_vhost_wwpn[TRANSPORT_IQN_LEN];
-   bool vs_endpoint;
 
struct vhost_dev dev;
struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ];
@@ -91,6 +90,24 @@ static int iov_num_pages(struct iovec *iov)
   ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
 }
 
+static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq)
+{
+   bool ret = false;
+
+   /*
+* We can handle the vq only after the endpoint is setup by calling the
+* VHOST_SCSI_SET_ENDPOINT ioctl.
+*
+* TODO: Check that we are running from vhost_worker which acts
+* as read-side critical section for vhost kind of RCU.
+* See the comments in struct vhost_virtqueue in drivers/vhost/vhost.h
+*/
+   if (rcu_dereference_check(vq->private_data, 1))
+   ret = true;
+
+   return ret;
+}
+
 static int tcm_vhost_check_true(struct se_portal_group *se_tpg)
 {
return 1;
@@ -581,8 +598,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
int head, ret;
u8 target;
 
-   /* Must use ioctl VHOST_SCSI_SET_ENDPOINT */
-   if (unlikely(!vs->vs_endpoint))
+   if (!tcm_vhost_check_endpoint(vq))
return;
   
   
   I would just move the check to under vq mutex,
   and avoid rcu completely. In vhost-net we are using
   private data outside lock so we can't do this,
   no such issue here.
  
  Are you talking about:
  
 handle_tx:
 /* TODO: check that we are running from vhost_worker? */
 sock = rcu_dereference_check(vq->private_data, 1);
 if (!sock)
 return;
 
 wmem = atomic_read(&sock->sk->sk_wmem_alloc);
 if (wmem >= sock->sk->sk_sndbuf) {
 mutex_lock(&vq->mutex);
 tx_poll_start(net, sock);
 mutex_unlock(&vq->mutex);
 return;
 }
 mutex_lock(&vq->mutex);
  
  Why not do the atomic_read and tx_poll_start under the vq->mutex,
  and thus do the check under the lock as well.
 
 handle_rx:
 mutex_lock(&vq->mutex);
 
 /* TODO: check that we are running from vhost_worker? */
 struct socket *sock = rcu_dereference_check(vq->private_data, 1);
 
 if (!sock)
 return;
 
 mutex_lock(&vq->mutex);
  
  Can't we do the check under the vq->mutex 

Re: [PATCH V2 2/2] tcm_vhost: Use vq->private_data to indicate if the endpoint is setup

2013-04-02 Thread Michael S. Tsirkin
On Tue, Apr 02, 2013 at 11:10:02PM +0800, Asias He wrote:
 On Tue, Apr 02, 2013 at 03:15:31PM +0300, Michael S. Tsirkin wrote:
  On Mon, Apr 01, 2013 at 10:13:47AM +0800, Asias He wrote:
   On Sun, Mar 31, 2013 at 11:20:24AM +0300, Michael S. Tsirkin wrote:
On Fri, Mar 29, 2013 at 02:22:52PM +0800, Asias He wrote:
 On Thu, Mar 28, 2013 at 11:18:22AM +0200, Michael S. Tsirkin wrote:
  On Thu, Mar 28, 2013 at 04:10:02PM +0800, Asias He wrote:
   On Thu, Mar 28, 2013 at 08:16:59AM +0200, Michael S. Tsirkin 
   wrote:
On Thu, Mar 28, 2013 at 10:17:28AM +0800, Asias He wrote:
 Currently, vs->vs_endpoint is used to indicate whether the endpoint is
 set up. It is set or cleared in vhost_scsi_set_endpoint() or
 vhost_scsi_clear_endpoint() under the vs->dev.mutex lock. However, when
 we check it in vhost_scsi_handle_vq(), we ignore the lock.
 
 Instead of using vs->vs_endpoint and the vs->dev.mutex lock to
 indicate the status of the endpoint, we use the per-virtqueue
 vq->private_data to indicate it. This way we only need to take the
 vq->mutex lock, which is per queue, so concurrent multiqueue
 processing sees less lock contention. Further, on the read side of
 vq->private_data, we do not even need to take the lock if it is
 accessed in the vhost worker thread, because it is protected by vhost
 rcu.
 
 Signed-off-by: Asias He as...@redhat.com
 ---
  drivers/vhost/tcm_vhost.c | 38 
 +-
  1 file changed, 33 insertions(+), 5 deletions(-)
 
 diff --git a/drivers/vhost/tcm_vhost.c 
 b/drivers/vhost/tcm_vhost.c
 index 5e3d4487..0524267 100644
 --- a/drivers/vhost/tcm_vhost.c
 +++ b/drivers/vhost/tcm_vhost.c
 @@ -67,7 +67,6 @@ struct vhost_scsi {
   /* Protected by vhost_scsi->dev.mutex */
   struct tcm_vhost_tpg *vs_tpg[VHOST_SCSI_MAX_TARGET];
   char vs_vhost_wwpn[TRANSPORT_IQN_LEN];
 - bool vs_endpoint;
  
   struct vhost_dev dev;
   struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ];
 @@ -91,6 +90,24 @@ static int iov_num_pages(struct iovec *iov)
  ((unsigned long)iov->iov_base & PAGE_MASK)) >> PAGE_SHIFT;
  }
  
 +static bool tcm_vhost_check_endpoint(struct vhost_virtqueue *vq)
 +{
 + bool ret = false;
 +
 + /*
 +  * We can handle the vq only after the endpoint is setup by calling the
 +  * VHOST_SCSI_SET_ENDPOINT ioctl.
 +  *
 +  * TODO: Check that we are running from vhost_worker which acts
 +  * as read-side critical section for vhost kind of RCU.
 +  * See the comments in struct vhost_virtqueue in drivers/vhost/vhost.h
 +  */
 + if (rcu_dereference_check(vq->private_data, 1))
 + ret = true;
 +
 + return ret;
 +}
 +
  static int tcm_vhost_check_true(struct se_portal_group 
 *se_tpg)
  {
   return 1;
 @@ -581,8 +598,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
   int head, ret;
   u8 target;
  
 - /* Must use ioctl VHOST_SCSI_SET_ENDPOINT */
 - if (unlikely(!vs->vs_endpoint))
 + if (!tcm_vhost_check_endpoint(vq))
   return;


I would just move the check to under vq mutex,
and avoid rcu completely. In vhost-net we are using
private data outside lock so we can't do this,
no such issue here.
   
   Are you talking about:
   
  handle_tx:
  /* TODO: check that we are running from vhost_worker? */
  sock = rcu_dereference_check(vq->private_data, 1);
  if (!sock)
  return;
  
  wmem = atomic_read(&sock->sk->sk_wmem_alloc);
  if (wmem >= sock->sk->sk_sndbuf) {
  mutex_lock(&vq->mutex);
  tx_poll_start(net, sock);
  mutex_unlock(&vq->mutex);
  return;
  }
  mutex_lock(&vq->mutex);
   
   Why not do the atomic_read and tx_poll_start under the vq->mutex,
   and thus do the check under the lock as well.
  
  handle_rx:
  mutex_lock(&vq->mutex);
  
  /* TODO: check that we are running from vhost_worker? */
  struct socket *sock = rcu_dereference_check(vq->private_data, 1);
  

[PATCH] tcm_vhost: Use ACCESS_ONCE for vs->vs_tpg[target] access

2013-04-02 Thread Asias He
In vhost_scsi_handle_vq:

  tv_tpg = vs->vs_tpg[target];
  if (!tv_tpg) {
  
  return
  }

  tv_cmd = vhost_scsi_allocate_cmd(tv_tpg, &v_req,

1) vs->vs_tpg[target] might change after the NULL check and 2) the above
line might re-read tv_tpg from vs->vs_tpg[target]. To prevent 2), use
ACCESS_ONCE. Thanks mst for catching this!

Signed-off-by: Asias He as...@redhat.com
---
 drivers/vhost/tcm_vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index 0524267..32d95e3 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -668,7 +668,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
 
/* Extract the tpgt */
target = v_req.lun[1];
-   tv_tpg = vs->vs_tpg[target];
+   tv_tpg = ACCESS_ONCE(vs->vs_tpg[target]);
 
/* Target does not exist, fail the request */
if (unlikely(!tv_tpg)) {
-- 
1.8.1.4
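
For readers unfamiliar with the idiom, a minimal userspace sketch of what
the one-line change buys. The macro body follows the classic volatile-cast
form of the kernel's ACCESS_ONCE (it relies on the GCC/Clang __typeof__
extension); every name ending in _sketch is hypothetical and stands in for
the real vhost structures. The point is that the shared slot is loaded
exactly once into a local, so the NULL check and the later use are
guaranteed to see the same pointer value.

```c
#include <assert.h>
#include <stddef.h>

/* Volatile-cast form of ACCESS_ONCE, renamed to mark it as a sketch. */
#define ACCESS_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical stand-ins for tcm_vhost_tpg and vhost_scsi. */
struct tpg_sketch {
    int id;
};

#define MAX_TARGET_SKETCH 4

struct scsi_sketch {
    struct tpg_sketch *tpg[MAX_TARGET_SKETCH];
};

/* Returns the target's id, or -1 if no target is configured in the slot. */
static int handle_request_sketch(struct scsi_sketch *vs, unsigned int target)
{
    struct tpg_sketch *tv_tpg;

    assert(vs && target < MAX_TARGET_SKETCH);

    /* One load of the shared slot; the compiler may not re-read it. */
    tv_tpg = ACCESS_ONCE_SKETCH(vs->tpg[target]);

    /* Target does not exist, fail the request. */
    if (!tv_tpg)
        return -1;

    /* Every later use goes through the local copy, never the array. */
    return tv_tpg->id;
}
```

Note that this only addresses point 2) of the commit message. It does
nothing about the slot changing, or the object going away, after the
pointer has been read.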



Re: [PATCH] tcm_vhost: Use ACCESS_ONCE for vs->vs_tpg[target] access

2013-04-02 Thread Michael S. Tsirkin
On Tue, Apr 02, 2013 at 11:31:37PM +0800, Asias He wrote:
 In vhost_scsi_handle_vq:
 
   tv_tpg = vs->vs_tpg[target];
   if (!tv_tpg) {
   
   return
   }
 
   tv_cmd = vhost_scsi_allocate_cmd(tv_tpg, &v_req,
 
 1) vs->vs_tpg[target] might change after the NULL check and 2) the above
 line might re-read tv_tpg from vs->vs_tpg[target]. To prevent 2), use
 ACCESS_ONCE. Thanks mst for catching this!
 
 Signed-off-by: Asias He as...@redhat.com

OK this might be ok for 3.9.

Acked-by: Michael S. Tsirkin m...@redhat.com

Nicholas can you pick this up pls?

For 3.10 I still think it's best to get rid of it
and stick vs->vs_tpg in vq->private_data.

 ---
  drivers/vhost/tcm_vhost.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
 index 0524267..32d95e3 100644
 --- a/drivers/vhost/tcm_vhost.c
 +++ b/drivers/vhost/tcm_vhost.c
 @@ -668,7 +668,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
  
   /* Extract the tpgt */
   target = v_req.lun[1];
 - tv_tpg = vs->vs_tpg[target];
 + tv_tpg = ACCESS_ONCE(vs->vs_tpg[target]);
  
   /* Target does not exist, fail the request */
   if (unlikely(!tv_tpg)) {
 -- 
 1.8.1.4
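
A rough userspace model of the direction suggested above for 3.10: keep
the target table in the queue's private_data and check it under the
per-queue mutex. This is a sketch under stated assumptions, not the vhost
code; pthread mutexes stand in for the kernel's struct mutex, and all
names ending in _sketch are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct vhost_virtqueue. */
struct vq_sketch {
    pthread_mutex_t mutex;
    void *private_data;   /* NULL until the endpoint is set up */
};

static void vq_init_sketch(struct vq_sketch *vq)
{
    pthread_mutex_init(&vq->mutex, NULL);
    vq->private_data = NULL;
}

/* Models SET_ENDPOINT (non-NULL table) and CLEAR_ENDPOINT (NULL). */
static void set_endpoint_sketch(struct vq_sketch *vq, void *tpgs)
{
    pthread_mutex_lock(&vq->mutex);
    vq->private_data = tpgs;
    pthread_mutex_unlock(&vq->mutex);
}

/* Returns 1 if the queue was handled, 0 if the endpoint is not set up. */
static int handle_vq_sketch(struct vq_sketch *vq)
{
    int handled = 0;

    pthread_mutex_lock(&vq->mutex);
    if (vq->private_data) {
        /* Requests would be processed here, using the table reached
         * through vq->private_data while the mutex is held. */
        handled = 1;
    }
    pthread_mutex_unlock(&vq->mutex);

    return handled;
}
```

Because both the setter and the handler take the same per-queue mutex, the
check and the use of the table cannot race with endpoint setup or
teardown, and no RCU reasoning is needed on this path.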


Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support

2013-04-02 Thread Alexander Graf

On 04/02/2013 04:09 PM, Bhushan Bharat-R65777 wrote:



-Original Message-
From: Alexander Graf [mailto:ag...@suse.de]
Sent: Tuesday, April 02, 2013 1:57 PM
To: Bhushan Bharat-R65777
Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421
Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support


On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote:




-Original Message-
From: Alexander Graf [mailto:ag...@suse.de]
Sent: Thursday, March 28, 2013 10:06 PM
To: Bhushan Bharat-R65777
Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421;
Bhushan
Bharat-R65777
Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub
support


On 21.03.2013, at 07:25, Bharat Bhushan wrote:


From: Bharat Bhushan bharat.bhus...@freescale.com

This patch adds the debug stub support on booke/bookehv.
Now QEMU debug stub can use hw breakpoint, watchpoint and software
breakpoint to debug guest.

Debug registers are saved/restored on vcpu_put()/vcpu_get().
Also the debug registers are saved restored only if guest is using
debug resources.

Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
---
v2:
- save/restore in vcpu_get()/vcpu_put()
- some more minor cleanup based on review comments.

arch/powerpc/include/asm/kvm_host.h |   10 ++
arch/powerpc/include/uapi/asm/kvm.h |   22 +++-
arch/powerpc/kvm/booke.c|  252 -

--

arch/powerpc/kvm/e500_emulate.c |   10 ++
4 files changed, 272 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h
b/arch/powerpc/include/asm/kvm_host.h
index f4ba881..8571952 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -504,7 +504,17 @@ struct kvm_vcpu_arch {
u32 mmucfg;
u32 epr;
u32 crit_save;
+   /* guest debug registers*/
struct kvmppc_booke_debug_reg dbg_reg;
+   /* shadow debug registers */
+   struct kvmppc_booke_debug_reg shadow_dbg_reg;
+   /* host debug registers*/
+   struct kvmppc_booke_debug_reg host_dbg_reg;
+   /*
+* Flag indicating that debug registers are used by guest
+* and requires save restore.
+   */
+   bool debug_save_restore;
#endif
gpa_t paddr_accessed;
gva_t vaddr_accessed;
diff --git a/arch/powerpc/include/uapi/asm/kvm.h
b/arch/powerpc/include/uapi/asm/kvm.h
index 15f9a00..d7ce449 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -25,6 +25,7 @@
/* Select powerpc specific features in <linux/kvm.h> */
#define __KVM_HAVE_SPAPR_TCE
#define __KVM_HAVE_PPC_SMT
+#define __KVM_HAVE_GUEST_DEBUG

struct kvm_regs {
__u64 pc;
@@ -267,7 +268,24 @@ struct kvm_fpu {
__u64 fpr[32];
};

+/*
+ * Defines for h/w breakpoint, watchpoint (read, write or both) and
+ * software breakpoint.
+ * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status
+ * for KVM_DEBUG_EXIT.
+ */
+#define KVMPPC_DEBUG_NONE         0x0
+#define KVMPPC_DEBUG_BREAKPOINT   (1UL << 1)
+#define KVMPPC_DEBUG_WATCH_WRITE  (1UL << 2)
+#define KVMPPC_DEBUG_WATCH_READ   (1UL << 3)
struct kvm_debug_exit_arch {
+   __u64 address;
+   /*
+* exiting to userspace because of h/w breakpoint, watchpoint
+* (read, write or both) and software breakpoint.
+*/
+   __u32 status;
+   __u32 reserved;
};

/* for KVM_SET_GUEST_DEBUG */
@@ -279,10 +297,6 @@ struct kvm_guest_debug_arch {
 * Type denotes h/w breakpoint, read watchpoint, write
 * watchpoint or watchpoint (both read and write).
 */
-#define KVMPPC_DEBUG_NOTYPE0x0
-#define KVMPPC_DEBUG_BREAKPOINT    (1UL << 1)
-#define KVMPPC_DEBUG_WATCH_WRITE   (1UL << 2)
-#define KVMPPC_DEBUG_WATCH_READ    (1UL << 3)
__u32 type;
__u32 reserved;
} bp[16];
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 1de93a8..bf20056 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu)
#endif
}

+static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu)
+{
+   /* Synchronize guest's desire to get debug interrupts into shadow MSR */
+#ifndef CONFIG_KVM_BOOKE_HV
+   vcpu->arch.shadow_msr &= ~MSR_DE;
+   vcpu->arch.shadow_msr |= vcpu->arch.shared->msr & MSR_DE;
+#endif
+
+   /* Force enable debug interrupts when user space wants to debug */
+   if (vcpu->guest_debug) {
+#ifdef CONFIG_KVM_BOOKE_HV
+   /*
+* Since there is no shadow MSR, sync MSR_DE into the guest
+* visible MSR. Do not allow guest to change MSR[DE].
+*/
+   vcpu->arch.shared->msr |= MSR_DE;
+   mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP);

This mtspr should really just be a bitwise OR into shadow_msrp when
guest_debug gets enabled.

[PATCH v2] ARM: KVM: promote vfp_host pointer to generic host cpu context

2013-04-02 Thread Marc Zyngier
We use the vfp_host pointer to store the host VFP context, should
the guest start using VFP itself.

Actually, we can use this pointer in a more generic way to store
CPU specific data, and arm64 is using it to dump the whole host
state before switching to the guest.

Simply rename the vfp_host field to host_cpu_context, and the
corresponding type to kvm_cpu_context_t. No change in functionality.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/include/asm/kvm_host.h |  8 +---
 arch/arm/kernel/asm-offsets.c   |  2 +-
 arch/arm/kvm/arm.c  | 28 ++--
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index 78813b8..a7a0bb5 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -87,7 +87,7 @@ struct kvm_vcpu_fault_info {
u32 hyp_pc; /* PC when exception was taken from Hyp mode */
 };
 
-typedef struct vfp_hard_struct kvm_kernel_vfp_t;
+typedef struct vfp_hard_struct kvm_cpu_context_t;
 
 struct kvm_vcpu_arch {
struct kvm_regs regs;
@@ -105,8 +105,10 @@ struct kvm_vcpu_arch {
struct kvm_vcpu_fault_info fault;
 
/* Floating point registers (VFP and Advanced SIMD/NEON) */
-   kvm_kernel_vfp_t vfp_guest;
-   kvm_kernel_vfp_t *vfp_host;
+   struct vfp_hard_struct vfp_guest;
+
+   /* Host FP context */
+   kvm_cpu_context_t *host_cpu_context;
 
/* VGIC state */
struct vgic_cpu vgic_cpu;
diff --git a/arch/arm/kernel/asm-offsets.c b/arch/arm/kernel/asm-offsets.c
index ee1ac39..92562a2 100644
--- a/arch/arm/kernel/asm-offsets.c
+++ b/arch/arm/kernel/asm-offsets.c
@@ -154,7 +154,7 @@ int main(void)
   DEFINE(VCPU_MIDR,offsetof(struct kvm_vcpu, arch.midr));
   DEFINE(VCPU_CP15,offsetof(struct kvm_vcpu, arch.cp15));
   DEFINE(VCPU_VFP_GUEST,   offsetof(struct kvm_vcpu, arch.vfp_guest));
-  DEFINE(VCPU_VFP_HOST,offsetof(struct kvm_vcpu, arch.vfp_host));
+  DEFINE(VCPU_VFP_HOST,offsetof(struct kvm_vcpu, arch.host_cpu_context));
   DEFINE(VCPU_REGS,offsetof(struct kvm_vcpu, arch.regs));
   DEFINE(VCPU_USR_REGS,offsetof(struct kvm_vcpu, arch.regs.usr_regs));
   DEFINE(VCPU_SVC_REGS,offsetof(struct kvm_vcpu, arch.regs.svc_regs));
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index e821c37..2ce90bb 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -48,7 +48,7 @@ __asm__(".arch_extension  virt");
 #endif
 
 static DEFINE_PER_CPU(unsigned long, kvm_arm_hyp_stack_page);
-static kvm_kernel_vfp_t __percpu *kvm_host_vfp_state;
+static kvm_cpu_context_t __percpu *kvm_host_cpu_state;
 static unsigned long hyp_default_vectors;
 
 /* Per-CPU variable containing the currently running vcpu. */
@@ -325,7 +325,7 @@ void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
vcpu->cpu = cpu;
-   vcpu->arch.vfp_host = this_cpu_ptr(kvm_host_vfp_state);
+   vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
 
/*
 * Check whether this vcpu requires the cache to be flushed on
@@ -889,24 +889,24 @@ static int init_hyp_mode(void)
}
 
/*
-* Map the host VFP structures
+* Map the host CPU structures
 */
-   kvm_host_vfp_state = alloc_percpu(kvm_kernel_vfp_t);
-   if (!kvm_host_vfp_state) {
+   kvm_host_cpu_state = alloc_percpu(kvm_cpu_context_t);
+   if (!kvm_host_cpu_state) {
err = -ENOMEM;
-   kvm_err("Cannot allocate host VFP state\n");
+   kvm_err("Cannot allocate host CPU state\n");
goto out_free_mappings;
}
 
for_each_possible_cpu(cpu) {
-   kvm_kernel_vfp_t *vfp;
+   kvm_cpu_context_t *cpu_ctxt;
 
-   vfp = per_cpu_ptr(kvm_host_vfp_state, cpu);
-   err = create_hyp_mappings(vfp, vfp + 1);
+   cpu_ctxt = per_cpu_ptr(kvm_host_cpu_state, cpu);
+   err = create_hyp_mappings(cpu_ctxt, cpu_ctxt + 1);
 
if (err) {
-   kvm_err("Cannot map host VFP state: %d\n", err);
-   goto out_free_vfp;
+   kvm_err("Cannot map host CPU state: %d\n", err);
+   goto out_free_context;
}
}
 
@@ -915,7 +915,7 @@ static int init_hyp_mode(void)
 */
err = kvm_vgic_hyp_init();
if (err)
-   goto out_free_vfp;
+   goto out_free_context;
 
 #ifdef CONFIG_KVM_ARM_VGIC
vgic_present = true;
@@ -933,8 +933,8 @@ static int init_hyp_mode(void)
kvm_info("Hyp mode initialized successfully\n");
 
return 0;
-out_free_vfp:
-   free_percpu(kvm_host_vfp_state);
+out_free_context:
+   free_percpu(kvm_host_cpu_state);
 out_free_mappings:

RFC: vfio API changes needed for powerpc

2013-04-02 Thread Yoder Stuart-B08248
Alex,

We are in the process of implementing vfio-pci support for the Freescale
IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
than x86, and will involve creating a 'type 2' vfio implementation.

For each device's DMA mappings, PAMU has an overall aperture and a number
of windows.  All sizes and window counts must be power of 2.  To illustrate,
below is a mapping for a 256MB guest, including guest memory (backed by
64MB huge pages) and some windows for MSIs:

Total aperture: 512MB
# of windows: 8

win                gphys/
#    iova          phys          size
---  ----          ----          ----
0    0x00000000    0xX_XX000000  64MB
1    0x04000000    0xX_XX000000  64MB
2    0x08000000    0xX_XX000000  64MB
3    0x0C000000    0xX_XX000000  64MB
4    0x10000000    0xf_fe044000  4KB    // msi bank 1
5    0x14000000    0xf_fe045000  4KB    // msi bank 2
6    0x18000000    0xf_fe046000  4KB    // msi bank 3
7    -             -             disabled

There are a couple of updates needed to the vfio user-kernel interface
that we would like your feedback on.

1.  IOMMU geometry

   The kernel IOMMU driver now has an interface (see domain_set_attr,
   domain_get_attr) that lets us set the domain geometry using
   attributes.

   We want to expose that to user space, so envision needing a couple
   of new ioctls to do this:
VFIO_IOMMU_SET_ATTR
VFIO_IOMMU_GET_ATTR 

2.   MSI window mappings

   The more problematic question is how to deal with MSIs.  We need to
   create mappings for up to 3 MSI banks that a device may need to target
   to generate interrupts.  The Linux MSI driver can allocate MSIs from
   the 3 banks any way it wants, and currently user space has no way of
   knowing which bank may be used for a given device.   

   There are 3 options we have discussed and would like your direction:

   A.  Implicit mappings -- with this approach user space would not
   explicitly map MSIs.  User space would be required to set the
   geometry so that there are 3 unused windows (the last 3 windows)
   for MSIs, and it would be up to the kernel to create the mappings.
   This approach requires some specific semantics (leaving 3 windows)
   and it potentially gets a little weird-- when should the kernel
   actually create the MSI mappings?  When should they be unmapped?
   Some convention would need to be established.

   B.  Explicit mapping using DMA map flags.  The idea is that a new
   flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
   a mapping is to be created for the supplied iova.  No vaddr
   is given though.  So in the above example there would be
   a dma map at 0x10000000 for 24KB (and no vaddr).   It's
   up to the kernel to determine which bank gets mapped where.
   So, this option puts user space in control of which windows
   are used for MSIs and when MSIs are mapped/unmapped.   There
   would need to be some semantics as to how this is used-- it
   only makes sense

   C.  Explicit mapping using normal DMA map.  The last idea is that
   we would introduce a new ioctl to give user-space an fd to 
   the MSI bank, which could be mmapped.  The flow would be
   something like this:
  -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
  -user space mmaps the fd, getting a vaddr
  -user space does a normal DMA map for desired iova
   This approach makes everything explicit, but adds a new ioctl
   applicable most likely only to the PAMU (type2 iommu).

Any feedback or direction?

Thanks,
Stuart 



--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support

2013-04-02 Thread Scott Wood

On 04/02/2013 09:09:34 AM, Bhushan Bharat-R65777 wrote:



 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Tuesday, April 02, 2013 1:57 PM
 To: Bhushan Bharat-R65777
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood Scott-B07421
 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub  
support



 On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote:

 
 
  -Original Message-
  From: Alexander Graf [mailto:ag...@suse.de]
  Sent: Thursday, March 28, 2013 10:06 PM
  To: Bhushan Bharat-R65777
  Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; Wood  
Scott-B07421;

  Bhushan
  Bharat-R65777
  Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub
  support
 
 
  How does the normal debug register switching code work in Linux?
  Can't we just reuse that? Or rely on it to restore working state when
  another process gets scheduled in?

  Good point, I can see debug registers loading in function
  __switch_to()->switch_booke_debug_regs() in file
  arch/powerpc/kernel/process.c.  So as long as we assume that the host
  will not use debug resources we can rely on this restore. But I am not
  sure that this is a fair assumption. As Scott mentioned earlier, someone
  can use debug resources for kernel debugging also.

 Someone in the kernel can also use floating point registers. But then
 it's his responsibility to clean up the mess he leaves behind.

I am neither convinced by what you said nor do I have much reason to
oppose it :)


Scott,
	I remember you mentioned that the host can use debug resources; can
you comment on this?


I thought the conclusion we reached was that it was OK as long as KVM  
waits until it actually needs the debug resources to mess with the  
registers.


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

Alex,

We are in the process of implementing vfio-pci support for the Freescale
IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
than x86, and will involve creating a 'type 2' vfio implementation.

For each device's DMA mappings, PAMU has an overall aperture and a number
of windows.  All sizes and window counts must be power of 2.  To illustrate,
below is a mapping for a 256MB guest, including guest memory (backed by
64MB huge pages) and some windows for MSIs:

Total aperture: 512MB
# of windows: 8

win                gphys/
#    iova          phys          size
---  ----          ----          ----
0    0x00000000    0xX_XX000000  64MB
1    0x04000000    0xX_XX000000  64MB
2    0x08000000    0xX_XX000000  64MB
3    0x0C000000    0xX_XX000000  64MB
4    0x10000000    0xf_fe044000  4KB    // msi bank 1
5    0x14000000    0xf_fe045000  4KB    // msi bank 2
6    0x18000000    0xf_fe046000  4KB    // msi bank 3
7    -             -             disabled

There are a couple of updates needed to the vfio user-kernel interface
that we would like your feedback on.

1.  IOMMU geometry

   The kernel IOMMU driver now has an interface (see domain_set_attr,
   domain_get_attr) that lets us set the domain geometry using
   attributes.

   We want to expose that to user space, so envision needing a couple
   of new ioctls to do this:
VFIO_IOMMU_SET_ATTR
VFIO_IOMMU_GET_ATTR


Note that this means attributes need to be updated for user-API  
appropriateness, such as using fixed-size types.



2.   MSI window mappings

   The more problematic question is how to deal with MSIs.  We need to
   create mappings for up to 3 MSI banks that a device may need to target
   to generate interrupts.  The Linux MSI driver can allocate MSIs from
   the 3 banks any way it wants, and currently user space has no way of
   knowing which bank may be used for a given device.

   There are 3 options we have discussed and would like your direction:

   A.  Implicit mappings -- with this approach user space would not
   explicitly map MSIs.  User space would be required to set the
   geometry so that there are 3 unused windows (the last 3 windows)

Where does userspace get the number 3 from?  E.g. on newer chips  
there are 4 MSI banks.  Maybe future chips have even more.



   B.  Explicit mapping using DMA map flags.  The idea is that a new
   flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
   a mapping is to be created for the supplied iova.  No vaddr
   is given though.  So in the above example there would be
   a dma map at 0x10000000 for 24KB (and no vaddr).


A single 24 KiB mapping wouldn't work (and why 24KB?  What if only one  
MSI group is involved in this VFIO group?  What if four MSI groups are  
involved?).  You'd need to either have a naturally aligned,  
power-of-two sized mapping that covers exactly the pages you want to  
map and no more, or you'd need to create a separate mapping for each  
MSI bank, and due to PAMU subwindow alignment restrictions these  
mappings could not be contiguous in iova-space.



   C.  Explicit mapping using normal DMA map.  The last idea is that
   we would introduce a new ioctl to give user-space an fd to
   the MSI bank, which could be mmapped.  The flow would be
   something like this:
  -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
  -user space mmaps the fd, getting a vaddr
  -user space does a normal DMA map for desired iova
   This approach makes everything explicit, but adds a new ioctl
   applicable most likely only to the PAMU (type2 iommu).


The new ioctl isn't really specific to PAMU (or whatever type2 is  
supposed to be, which nobody ever explains when I ask), so much as to  
the MSI implementation.  It just exposes the MSI register as another  
device resource (well, technically a groupwide resource, unless we  
expose it on a per-device basis and provide enough information for  
userspace to recognize when it's the same for other devices in the  
group) to be mmapped, which userspace can choose to map in the IOMMU as  
well.


Note that in the explicit case, userspace would have to program the MSI  
iova into the PCI device's config space (or communicate the chosen  
address to the kernel so it can set the config space registers).


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
Hi Stuart,

On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
 Alex,
 
 We are in the process of implementing vfio-pci support for the Freescale
 IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
 than x86, and will involve creating a 'type 2' vfio implementation.
 
 For each device's DMA mappings, PAMU has an overall aperture and a number
 of windows.  All sizes and window counts must be power of 2.  To illustrate,
 below is a mapping for a 256MB guest, including guest memory (backed by
 64MB huge pages) and some windows for MSIs:
 
 Total aperture: 512MB
 # of windows: 8
 
 win                gphys/
 #    iova          phys          size
 ---  ----          ----          ----
 0    0x00000000    0xX_XX000000  64MB
 1    0x04000000    0xX_XX000000  64MB
 2    0x08000000    0xX_XX000000  64MB
 3    0x0C000000    0xX_XX000000  64MB
 4    0x10000000    0xf_fe044000  4KB    // msi bank 1
 5    0x14000000    0xf_fe045000  4KB    // msi bank 2
 6    0x18000000    0xf_fe046000  4KB    // msi bank 3
 7    -             -             disabled
 
 There are a couple of updates needed to the vfio user-kernel interface
 that we would like your feedback on.
 
 1.  IOMMU geometry
 
The kernel IOMMU driver now has an interface (see domain_set_attr,
domain_get_attr) that lets us set the domain geometry using
attributes.
 
We want to expose that to user space, so envision needing a couple
of new ioctls to do this:
 VFIO_IOMMU_SET_ATTR
 VFIO_IOMMU_GET_ATTR 

Any ioctls to the vfiofd (/dev/vfio/vfio) not claimed by vfio-core are
passed to the IOMMU driver.  So you can effectively have your own type2
ioctl extensions.  Alexey has already posted patches to do this for
SPAPR that add VFIO_IOMMU_ENABLE/DISABLE to allow him access to
VFIO_IOMMU_GET_INFO to examine locked page requirements.  As Scott notes
we need to come up with a clean userspace interface for these though.

 2.   MSI window mappings
 
The more problematic question is how to deal with MSIs.  We need to
create mappings for up to 3 MSI banks that a device may need to target
to generate interrupts.  The Linux MSI driver can allocate MSIs from
the 3 banks any way it wants, and currently user space has no way of
knowing which bank may be used for a given device.   
 
There are 3 options we have discussed and would like your direction:
 
A.  Implicit mappings -- with this approach user space would not
explicitly map MSIs.  User space would be required to set the
geometry so that there are 3 unused windows (the last 3 windows)
for MSIs, and it would be up to the kernel to create the mappings.
This approach requires some specific semantics (leaving 3 windows)
and it potentially gets a little weird-- when should the kernel
actually create the MSI mappings?  When should they be unmapped?
Some convention would need to be established.

VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
number exposed to userspace on GET and transparently add MSI entries on
SET.  On x86 the interrupt remapper handles this transparently when MSI
is enabled and userspace never gets direct access to the device MSI
address/data registers.  What kind of restrictions do you have around
adding and removing windows while the aperture is enabled?

B.  Explicit mapping using DMA map flags.  The idea is that a new
flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
a mapping is to be created for the supplied iova.  No vaddr
is given though.  So in the above example there would be
a dma map at 0x10000000 for 24KB (and no vaddr).   It's
up to the kernel to determine which bank gets mapped where.
So, this option puts user space in control of which windows
are used for MSIs and when MSIs are mapped/unmapped.   There
would need to be some semantics as to how this is used-- it
only makes sense

This could also be done as another type2 ioctl extension.  What's the
value to userspace in determining which windows are used by which banks?
It sounds like the case that there are X banks and if userspace wants to
use MSI it needs to leave X windows available for that.  Is this just
buying userspace a few more windows to allow them the choice between MSI
or RAM?

C.  Explicit mapping using normal DMA map.  The last idea is that
we would introduce a new ioctl to give user-space an fd to 
the MSI bank, which could be mmapped.  The flow would be
something like this:
   -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   -user space mmaps the fd, getting a vaddr
   -user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl
applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows 

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Stuart Yoder
On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com wrote:
 On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

 Alex,

 We are in the process of implementing vfio-pci support for the Freescale
 IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
 than x86, and will involve creating a 'type 2' vfio implementation.

 For each device's DMA mappings, PAMU has an overall aperture and a number
 of windows.  All sizes and window counts must be power of 2.  To illustrate,
 below is a mapping for a 256MB guest, including guest memory (backed by
 64MB huge pages) and some windows for MSIs:

 Total aperture: 512MB
 # of windows: 8

 win                gphys/
 #    iova          phys          size
 ---  ----          ----          ----
 0    0x00000000    0xX_XX000000  64MB
 1    0x04000000    0xX_XX000000  64MB
 2    0x08000000    0xX_XX000000  64MB
 3    0x0C000000    0xX_XX000000  64MB
 4    0x10000000    0xf_fe044000  4KB    // msi bank 1
 5    0x14000000    0xf_fe045000  4KB    // msi bank 2
 6    0x18000000    0xf_fe046000  4KB    // msi bank 3
 7    -             -             disabled

 There are a couple of updates needed to the vfio user-kernel interface
 that we would like your feedback on.

 1.  IOMMU geometry

The kernel IOMMU driver now has an interface (see domain_set_attr,
domain_get_attr) that lets us set the domain geometry using
attributes.

We want to expose that to user space, so envision needing a couple
of new ioctls to do this:
 VFIO_IOMMU_SET_ATTR
 VFIO_IOMMU_GET_ATTR


 Note that this means attributes need to be updated for user-API
 appropriateness, such as using fixed-size types.


 2.   MSI window mappings

The more problematic question is how to deal with MSIs.  We need to
create mappings for up to 3 MSI banks that a device may need to target
to generate interrupts.  The Linux MSI driver can allocate MSIs from
the 3 banks any way it wants, and currently user space has no way of
knowing which bank may be used for a given device.

There are 3 options we have discussed and would like your direction:

A.  Implicit mappings -- with this approach user space would not
explicitly map MSIs.  User space would be required to set the
geometry so that there are 3 unused windows (the last 3 windows)


 Where does userspace get the number 3 from?  E.g. on newer chips there are
 4 MSI banks.  Maybe future chips have even more.

Ok, then make the number 4.   The chance of more MSI banks in future chips
is nil, and if it ever happened user space could adjust.  Also,
practically speaking, since memory is typically allocated in powers of
2, you need to approximately double the window geometry anyway.

B.  Explicit mapping using DMA map flags.  The idea is that a new
flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
a mapping is to be created for the supplied iova.  No vaddr
is given though.  So in the above example there would be
a dma map at 0x10000000 for 24KB (and no vaddr).


 A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI
 group is involved in this VFIO group?  What if four MSI groups are
 involved?).  You'd need to either have a naturally aligned, power-of-two
 sized mapping that covers exactly the pages you want to map and no more, or
 you'd need to create a separate mapping for each MSI bank, and due to PAMU
 subwindow alignment restrictions these mappings could not be contiguous in
 iova-space.

You're right, a single 24KB mapping wouldn't work--  in the case of 3 MSI banks
perhaps we could just do one 64MB*3 mapping to identify which windows
are used for MSIs.

If only one MSI bank was involved the kernel could get clever and only enable
the banks actually needed.

Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 03:38:42 PM, Stuart Yoder wrote:
On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com wrote:

 On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:

 Alex,

 We are in the process of implementing vfio-pci support for the Freescale
 IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
 than x86, and will involve creating a 'type 2' vfio implementation.

 For each device's DMA mappings, PAMU has an overall aperture and a number
 of windows.  All sizes and window counts must be power of 2.  To illustrate,
 below is a mapping for a 256MB guest, including guest memory (backed by
 64MB huge pages) and some windows for MSIs:

 Total aperture: 512MB
 # of windows: 8

 win                gphys/
 #    iova          phys          size
 ---  ----          ----          ----
 0    0x00000000    0xX_XX000000  64MB
 1    0x04000000    0xX_XX000000  64MB
 2    0x08000000    0xX_XX000000  64MB
 3    0x0C000000    0xX_XX000000  64MB
 4    0x10000000    0xf_fe044000  4KB    // msi bank 1
 5    0x14000000    0xf_fe045000  4KB    // msi bank 2
 6    0x18000000    0xf_fe046000  4KB    // msi bank 3
 7    -             -             disabled

 There are a couple of updates needed to the vfio user-kernel interface
 that we would like your feedback on.

 1.  IOMMU geometry

The kernel IOMMU driver now has an interface (see domain_set_attr,
domain_get_attr) that lets us set the domain geometry using
attributes.

We want to expose that to user space, so envision needing a couple
of new ioctls to do this:
 VFIO_IOMMU_SET_ATTR
 VFIO_IOMMU_GET_ATTR


 Note that this means attributes need to be updated for user-API
 appropriateness, such as using fixed-size types.


 2.   MSI window mappings

The more problematic question is how to deal with MSIs.  We need to
create mappings for up to 3 MSI banks that a device may need to target
to generate interrupts.  The Linux MSI driver can allocate MSIs from
the 3 banks any way it wants, and currently user space has no way of
knowing which bank may be used for a given device.

There are 3 options we have discussed and would like your direction:

A.  Implicit mappings -- with this approach user space would not
explicitly map MSIs.  User space would be required to set the
geometry so that there are 3 unused windows (the last 3 windows)



 Where does userspace get the number 3 from?  E.g. on newer chips there are
 4 MSI banks.  Maybe future chips have even more.

Ok, then make the number 4.   The chance of more MSI banks in future chips
is nil,


What makes you so sure?  Especially since you seem to be presenting  
this as not specifically an MPIC API.



and if it ever happened user space could adjust.


What bit of API is going to tell it that it needs to adjust?

Also, practically speaking, since memory is typically allocated in powers of
2, you need to approximately double the window geometry anyway.


Only if your existing mapping needs fit exactly in a power of two.

B.  Explicit mapping using DMA map flags.  The idea is that a new
flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
a mapping is to be created for the supplied iova.  No vaddr
is given though.  So in the above example there would be
a dma map at 0x10000000 for 24KB (and no vaddr).


 A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI
 group is involved in this VFIO group?  What if four MSI groups are
 involved?).  You'd need to either have a naturally aligned, power-of-two
 sized mapping that covers exactly the pages you want to map and no more, or
 you'd need to create a separate mapping for each MSI bank, and due to PAMU
 subwindow alignment restrictions these mappings could not be contiguous in
 iova-space.

You're right, a single 24KB mapping wouldn't work--  in the case of 3 MSI banks
perhaps we could just do one 64MB*3 mapping to identify which windows
are used for MSIs.


Where did the assumption of a 64MiB subwindow size come from?

If only one MSI bank was involved the kernel could get clever and only enable
the banks actually needed.


I'd rather see cleverness kept in userspace.

-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Stuart Yoder
On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson
alex.william...@redhat.com wrote:
 2.   MSI window mappings

The more problematic question is how to deal with MSIs.  We need to
create mappings for up to 3 MSI banks that a device may need to target
to generate interrupts.  The Linux MSI driver can allocate MSIs from
the 3 banks any way it wants, and currently user space has no way of
knowing which bank may be used for a given device.

There are 3 options we have discussed and would like your direction:

A.  Implicit mappings -- with this approach user space would not
explicitly map MSIs.  User space would be required to set the
geometry so that there are 3 unused windows (the last 3 windows)
for MSIs, and it would be up to the kernel to create the mappings.
This approach requires some specific semantics (leaving 3 windows)
and it potentially gets a little weird-- when should the kernel
actually create the MSI mappings?  When should they be unmapped?
Some convention would need to be established.

 VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
 number exposed to userspace on GET and transparently add MSI entries on
 SET.

The number of windows is always a power of 2 (and the max is 256).  And to
reduce PAMU cache pressure you want to use the fewest number of windows
you can.  So, I don't see practically how we could transparently steal
entries to add the MSIs.  Either user space knows to leave empty windows
for MSIs and by convention the kernel knows which windows those are (as
in option #A), or user space explicitly tells the kernel which windows
(as in option #B).

 On x86 the interrupt remapper handles this transparently when MSI
 is enabled and userspace never gets direct access to the device MSI
 address/data registers.  What kind of restrictions do you have around
 adding and removing windows while the aperture is enabled?

The windows can be enabled/disabled even while the aperture is
enabled (pretty sure)...

B.  Explicit mapping using DMA map flags.  The idea is that a new
flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
a mapping is to be created for the supplied iova.  No vaddr
is given though.  So in the above example there would be
a dma map at 0x10000000 for 24KB (and no vaddr).   It's
up to the kernel to determine which bank gets mapped where.
So, this option puts user space in control of which windows
are used for MSIs and when MSIs are mapped/unmapped.   There
would need to be some semantics as to how this is used-- it
only makes sense

 This could also be done as another type2 ioctl extension.  What's the
 value to userspace in determining which windows are used by which banks?
 It sounds like the case that there are X banks and if userspace wants to
 use MSI it needs to leave X windows available for that.  Is this just
 buying userspace a few more windows to allow them the choice between MSI
 or RAM?

Yes, it would potentially give user space the flexibility of some more windows.
It also makes more explicit when the MSI mappings are created.  In option
#A the MSI mappings would probably get created at the time of the first
normal DMA map.

So, you're saying with this approach you'd rather see a new type 2
ioctl instead of adding new flags to DMA map, right?

C.  Explicit mapping using normal DMA map.  The last idea is that
we would introduce a new ioctl to give user-space an fd to
the MSI bank, which could be mmapped.  The flow would be
something like this:
   -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   -user space mmaps the fd, getting a vaddr
   -user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl
applicable most likely only to the PAMU (type2 iommu).

 And the DMA_MAP of that mmap then allows userspace to select the window
 used?  This one seems like a lot of overhead, adding a new ioctl, new
 fd, mmap, special mapping path, etc.  It would be less overhead to just
 add an ioctl to enable MSI, maybe letting userspace pick which windows
 get used, but I'm still not sure what the value is to userspace in
 exposing it.  Thanks,

Thanks,
Stuart
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 03:32:17 PM, Alex Williamson wrote:

On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
 2.   MSI window mappings

The more problematic question is how to deal with MSIs.  We need to
create mappings for up to 3 MSI banks that a device may need to target
to generate interrupts.  The Linux MSI driver can allocate MSIs from
the 3 banks any way it wants, and currently user space has no way of
knowing which bank may be used for a given device.

There are 3 options we have discussed and would like your direction:

A.  Implicit mappings -- with this approach user space would not
explicitly map MSIs.  User space would be required to set the
geometry so that there are 3 unused windows (the last 3 windows)
for MSIs, and it would be up to the kernel to create the mappings.
This approach requires some specific semantics (leaving 3 windows)
and it potentially gets a little weird-- when should the kernel
actually create the MSI mappings?  When should they be unmapped?
Some convention would need to be established.

VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
number exposed to userspace on GET and transparently add MSI entries on
SET.


What do you mean by reduce the number exposed?  Userspace decides how
many entries there are, but it must be a power of two between 1 and 256.



On x86 the interrupt remapper handles this transparently when MSI
is enabled and userspace never gets direct access to the device MSI
address/data registers.


x86 has a totally different mechanism here, as far as I understand --  
even before you get into restrictions on mappings.



What kind of restrictions do you have around
adding and removing windows while the aperture is enabled?


Subwindows can be modified while the aperture is enabled, but the  
aperture size and number of subwindows cannot be changed.



B.  Explicit mapping using DMA map flags.  The idea is that a new
flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
a mapping is to be created for the supplied iova.  No vaddr
is given though.  So in the above example there would be
a dma map at 0x10000000 for 24KB (and no vaddr).   It's
up to the kernel to determine which bank gets mapped where.
So, this option puts user space in control of which windows
are used for MSIs and when MSIs are mapped/unmapped.   There
would need to be some semantics as to how this is used-- it
only makes sense

This could also be done as another type2 ioctl extension.


Again, what is type2, specifically?  If someone else is adding their  
own IOMMU that is kind of, sort of like PAMU, how would they know if  
it's close enough?  What assumptions can a user make when they see that  
they're dealing with type2?


What's the value to userspace in determining which windows are used  
by which banks?


That depends on who programs the MSI config space address.  What is  
important is userspace controlling which iovas will be dedicated to  
this, in case it wants to put something else there.


It sounds like the case that there are X banks and if userspace wants to
use MSI it needs to leave X windows available for that.  Is this just
buying userspace a few more windows to allow them the choice between MSI
or RAM?


Well, there could be that.  But also, userspace will generally have a  
much better idea of the type of mappings it's creating, so it's easier  
to keep everything explicit at the kernel/user interface than require  
more complicated code in the kernel to figure things out automatically  
(not just for MSIs but in general).


If the kernel automatically creates the MSI mappings, when does it  
assume that userspace is done creating its own?  What if userspace  
doesn't need any DMA other than the MSIs?  What if userspace wants to  
continue dynamically modifying its other mappings?



C.  Explicit mapping using normal DMA map.  The last idea is that
we would introduce a new ioctl to give user-space an fd to
the MSI bank, which could be mmapped.  The flow would be
something like this:
   -for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
   -user space mmaps the fd, getting a vaddr
   -user space does a normal DMA map for desired iova
This approach makes everything explicit, but adds a new ioctl
applicable most likely only to the PAMU (type2 iommu).

And the DMA_MAP of that mmap then allows userspace to select the window
used?  This one seems like a lot of overhead, adding a new ioctl, new
fd, mmap, special mapping path, etc.


There's going to be special stuff no matter what.  This would keep it  
separated from the IOMMU map code.


I'm not sure what you mean by overhead here... the runtime overhead  
of setting things up is not particularly relevant as long 

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Stuart Yoder
On Tue, Apr 2, 2013 at 3:47 PM, Scott Wood scottw...@freescale.com wrote:
 On 04/02/2013 03:38:42 PM, Stuart Yoder wrote:

 On Tue, Apr 2, 2013 at 2:39 PM, Scott Wood scottw...@freescale.com
 wrote:
  On 04/02/2013 12:32:00 PM, Yoder Stuart-B08248 wrote:
 
  Alex,
 
  We are in the process of implementing vfio-pci support for the Freescale
  IOMMU (PAMU).  It is an aperture/window-based IOMMU and is quite different
  than x86, and will involve creating a 'type 2' vfio implementation.

  For each device's DMA mappings, PAMU has an overall aperture and a number
  of windows.  All sizes and window counts must be a power of 2.  To
  illustrate, below is a mapping for a 256MB guest, including guest memory
  (backed by 64MB huge pages) and some windows for MSIs:
 
  Total aperture: 512MB
  # of windows: 8
 
  win  gphys/
  #    iova        phys          size
  ---  ----------  ------------  --------
  0    0x00000000  0xX_XX000000  64MB
  1    0x04000000  0xX_XX000000  64MB
  2    0x08000000  0xX_XX000000  64MB
  3    0x0C000000  0xX_XX000000  64MB
  4    0x10000000  0xf_fe044000  4KB    // msi bank 1
  5    0x14000000  0xf_fe045000  4KB    // msi bank 2
  6    0x18000000  0xf_fe046000  4KB    // msi bank 3
  7    -           -             disabled
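
The geometry in the table above can be checked with a few lines of arithmetic. This is only an illustrative sketch: it assumes the aperture starts at iova 0, and `window_layout` is a made-up helper, not a kernel or VFIO function.

```python
MB = 1 << 20


def is_power_of_two(value):
    """True for 1, 2, 4, 8, ... and False for everything else."""
    return value > 0 and value & (value - 1) == 0


def window_layout(aperture_size, window_count):
    """Return (window_size, base iova of each window).

    Assumes the aperture starts at iova 0 and is split into equal windows.
    """
    if not is_power_of_two(aperture_size) or not is_power_of_two(window_count):
        raise ValueError("sizes and window counts must be powers of 2")
    window_size = aperture_size // window_count
    return window_size, [i * window_size for i in range(window_count)]


# The example from the table: 512MB aperture, 8 windows.
size, bases = window_layout(512 * MB, 8)
print(size // MB)       # 64
print(hex(bases[4]))    # 0x10000000, the window used for msi bank 1
```

Each window spans 64MB of iova space even when only a 4KB MSI page is mapped inside it, which is why the MSI rows start at 64MB-aligned addresses.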
 
  There are a couple of updates needed to the vfio user-kernel interface
  that we would like your feedback on.
 
  1.  IOMMU geometry
 
 The kernel IOMMU driver now has an interface (see domain_set_attr,
 domain_get_attr) that lets us set the domain geometry using
 attributes.
 
 We want to expose that to user space, so envision needing a couple
 of new ioctls to do this:
  VFIO_IOMMU_SET_ATTR
  VFIO_IOMMU_GET_ATTR
 
 
  Note that this means attributes need to be updated for user-API
  appropriateness, such as using fixed-size types.
 
 
  2.   MSI window mappings
 
 The more problematic question is how to deal with MSIs.  We need to
 create mappings for up to 3 MSI banks that a device may need to target
 to generate interrupts.  The Linux MSI driver can allocate MSIs from
 the 3 banks any way it wants, and currently user space has no way of
 knowing which bank may be used for a given device.
 
 There are 3 options we have discussed and would like your direction:
 
 A.  Implicit mappings -- with this approach user space would not
 explicitly map MSIs.  User space would be required to set the
 geometry so that there are 3 unused windows (the last 3 windows)
 
 
  Where does userspace get the number 3 from?  E.g. on newer chips there are
  4 MSI banks.  Maybe future chips have even more.

 Ok, then make the number 4.   The chance of more MSI banks in future chips
 is nil,


 What makes you so sure?  Especially since you seem to be presenting this as
 not specifically an MPIC API.


 and if it ever happened user space could adjust.


 What bit of API is going to tell it that it needs to adjust?

Haven't thought through that completely, but I guess we could add an API
to return the number of MSI banks for type 2 iommus.

 Also, practically speaking, since memory is typically allocated in powers of
 2, you need to approximately double the window geometry anyway.


 Only if your existing mapping needs fit exactly in a power of two.


 B.  Explicit mapping using DMA map flags.  The idea is that a new
 flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
 a mapping is to be created for the supplied iova.  No vaddr
 is given though.  So in the above example there would be
 a dma map at 0x10000000 for 24KB (and no vaddr).
 
 
  A single 24 KiB mapping wouldn't work (and why 24KB? What if only one MSI
  group is involved in this VFIO group?  What if four MSI groups are
  involved?).  You'd need to either have a naturally aligned, power-of-two
  sized mapping that covers exactly the pages you want to map and no more, or
  you'd need to create a separate mapping for each MSI bank, and due to PAMU
  subwindow alignment restrictions these mappings could not be contiguous in
  iova-space.
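
To make the alignment point concrete, here is a small sketch that finds the smallest naturally aligned power-of-two block covering a set of 4K pages. The bank addresses come from the example table (reading 0xf_fe044000 as the 36-bit address 0xffe044000, which is an assumption), and `smallest_aligned_block` is a hypothetical helper, not an existing API.

```python
PAGE = 4096


def smallest_aligned_block(pages):
    """Smallest naturally aligned power-of-two block covering every page.

    Returns (base, size), where base is a multiple of size.
    """
    lo = min(pages)
    hi = max(pages) + PAGE          # exclusive end of the last page
    size = PAGE
    # Grow the block until the aligned block containing lo also reaches hi.
    while (lo & ~(size - 1)) + size < hi:
        size *= 2
    return lo & ~(size - 1), size


# The three MSI bank pages from the example table.
banks = [0xFFE044000, 0xFFE045000, 0xFFE046000]
base, size = smallest_aligned_block(banks)

# Pages that a single mapping would expose even though nobody asked for them.
extra = [page for page in range(base, base + size, PAGE) if page not in banks]
print(hex(base), size // 1024, [hex(p) for p in extra])
```

For these three banks the covering block is 16KB, so one mapping would also expose a fourth page that was never requested. That is the "and no more" problem, and it is why separate per-bank mappings are the alternative.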

 You're right, a single 24KB mapping wouldn't work--  in the case of 3 MSI
 banks perhaps we could just do one 64MB*3 mapping to identify which windows
 are used for MSIs.


 Where did the assumption of a 64MiB subwindow size come from?

The example I was using.   User space would need to create a
mapping for window_size * msi_bank_count.

Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Stuart Yoder
On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
 This could also be done as another type2 ioctl extension.


 Again, what is type2, specifically?  If someone else is adding their own
 IOMMU that is kind of, sort of like PAMU, how would they know if it's close
 enough?  What assumptions can a user make when they see that they're dealing
 with type2?

We will define that as part of the type2 implementation.   Highly unlikely
anything but a PAMU will comply.

 What's the value to userspace in determining which windows are used by
 which banks?


 That depends on who programs the MSI config space address.  What is
 important is userspace controlling which iovas will be dedicated to this, in
 case it wants to put something else there.


 It sounds like the case that there are X banks and if userspace wants to
 use MSI it needs to leave X windows available for that.  Is this just
 buying userspace a few more windows to allow them the choice between MSI
 or RAM?


 Well, there could be that.  But also, userspace will generally have a much
 better idea of the type of mappings it's creating, so it's easier to keep
 everything explicit at the kernel/user interface than require more
 complicated code in the kernel to figure things out automatically (not just
 for MSIs but in general).

 If the kernel automatically creates the MSI mappings, when does it assume
 that userspace is done creating its own?  What if userspace doesn't need any
 DMA other than the MSIs?  What if userspace wants to continue dynamically
 modifying its other mappings?


 C.  Explicit mapping using normal DMA map.  The last idea is that
 we would introduce a new ioctl to give user-space an fd to
 the MSI bank, which could be mmapped.  The flow would be
 something like this:
-for each group user space calls new ioctl
  VFIO_GROUP_GET_MSI_FD
-user space mmaps the fd, getting a vaddr
-user space does a normal DMA map for desired iova
 This approach makes everything explicit, but adds a new ioctl
 applicable most likely only to the PAMU (type2 iommu).

 And the DMA_MAP of that mmap then allows userspace to select the window
 used?  This one seems like a lot of overhead, adding a new ioctl, new
 fd, mmap, special mapping path, etc.


 There's going to be special stuff no matter what.  This would keep it
 separated from the IOMMU map code.

 I'm not sure what you mean by overhead here... the runtime overhead of
 setting things up is not particularly relevant as long as it's reasonable.
 If you mean development and maintenance effort, keeping things well
 separated should help.

We don't need to change DMA_MAP.  If we can simply add a new type 2
ioctl that allows user space to set which windows are MSIs, it seems vastly
less complex than an ioctl to supply a new fd, mmap of it, etc.

So maybe 2 ioctls:
VFIO_IOMMU_GET_MSI_COUNT
VFIO_IOMMU_MAP_MSI(iova, size)
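
As a rough way to reason about what these two calls would have to do, here is a toy userspace model. Nothing in it is a real VFIO binding: `Type2IommuModel`, `get_msi_count` and `map_msi` are hypothetical stand-ins for the proposed ioctls, and the rule that the size must equal window_size * msi_bank_count follows the suggestion made elsewhere in this thread.

```python
MB = 1 << 20


class Type2IommuModel:
    """Toy model of the proposed GET_MSI_COUNT / MAP_MSI calls (hypothetical)."""

    def __init__(self, aperture_size, window_count, msi_bank_count):
        self.window_size = aperture_size // window_count
        self.windows = [None] * window_count    # None means "unused"
        self.msi_bank_count = msi_bank_count

    def get_msi_count(self):
        """Stand-in for VFIO_IOMMU_GET_MSI_COUNT."""
        return self.msi_bank_count

    def map_msi(self, iova, size):
        """Stand-in for VFIO_IOMMU_MAP_MSI(iova, size)."""
        expected = self.window_size * self.msi_bank_count
        if iova % self.window_size or size != expected:
            raise ValueError("need a window-aligned iova and one window per bank")
        first = iova // self.window_size
        chosen = range(first, first + self.msi_bank_count)
        if chosen.stop > len(self.windows):
            raise ValueError("MSI windows fall outside the aperture")
        if any(self.windows[i] is not None for i in chosen):
            raise ValueError("window already in use")
        # Which bank lands in which window is the kernel's choice; the model
        # simply assigns them in order.
        for bank, index in enumerate(chosen):
            self.windows[index] = f"msi bank {bank}"


iommu = Type2IommuModel(512 * MB, 8, msi_bank_count=3)
for index in range(4):
    iommu.windows[index] = "ram"                # the 256MB of guest memory
count = iommu.get_msi_count()
iommu.map_msi(0x10000000, count * iommu.window_size)
```

In this model user space picks the iova range and the timing, while the bank-to-window assignment stays with the kernel side.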

Stuart


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:
 On Tue, Apr 2, 2013 at 3:32 PM, Alex Williamson
 alex.william...@redhat.com wrote:
  2.   MSI window mappings
 
 The more problematic question is how to deal with MSIs.  We need to
 create mappings for up to 3 MSI banks that a device may need to target
 to generate interrupts.  The Linux MSI driver can allocate MSIs from
 the 3 banks any way it wants, and currently user space has no way of
 knowing which bank may be used for a given device.
 
 There are 3 options we have discussed and would like your direction:
 
 A.  Implicit mappings -- with this approach user space would not
 explicitly map MSIs.  User space would be required to set the
 geometry so that there are 3 unused windows (the last 3 windows)
 for MSIs, and it would be up to the kernel to create the mappings.
 This approach requires some specific semantics (leaving 3 windows)
 and it potentially gets a little weird-- when should the kernel
 actually create the MSI mappings?  When should they be unmapped?
 Some convention would need to be established.
 
  VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
  number exposed to userspace on GET and transparently add MSI entries on
  SET.
 
 The number of windows is always a power of 2 (and the max is 256).  And to
 reduce PAMU cache pressure you want to use the fewest windows you can.
 So I don't see, practically, how we could transparently steal entries to
 add the MSIs.  Either user space knows to leave empty windows for
 MSIs and by convention the kernel knows which windows those are (as
 in option #A), or user space explicitly tells the kernel which windows
 (as in option #B).

Ok, apparently I don't understand the API.  Is it something like
userspace calls GET_ATTR and finds out that there are 256 available
windows, userspace determines that it needs 8 for RAM and then it has an
MSI device, so it needs to call SET_ATTR and ask for 16?  That seems
prone to exploitation by the first userspace to allocate its aperture,
but I'm also not sure why userspace couldn't specify the (non-power of 2)
number of windows it needs for RAM, then VFIO would see that the devices
attached have MSI and add those windows and align to a power of 2.

  On x86 the interrupt remapper handles this transparently when MSI
  is enabled and userspace never gets direct access to the device MSI
  address/data registers.  What kind of restrictions do you have around
  adding and removing windows while the aperture is enabled?
 
 The windows can be enabled/disabled even while the aperture is
 enabled (pretty sure)...
 
 B.  Explicit mapping using DMA map flags.  The idea is that a new
 flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
 a mapping is to be created for the supplied iova.  No vaddr
 is given though.  So in the above example there would be
 a dma map at 0x10000000 for 24KB (and no vaddr).   It's
 up to the kernel to determine which bank gets mapped where.
 So, this option puts user space in control of which windows
 are used for MSIs and when MSIs are mapped/unmapped.   There
 would need to be some semantics as to how this is used-- it
 only makes sense
 
  This could also be done as another type2 ioctl extension.  What's the
  value to userspace in determining which windows are used by which banks?
  It sounds like the case that there are X banks and if userspace wants to
  use MSI it needs to leave X windows available for that.  Is this just
  buying userspace a few more windows to allow them the choice between MSI
  or RAM?
 
 Yes, it would potentially give user space the flexibility of some more windows.
 It also makes more explicit when the MSI mappings are created.  In option
 #A the MSI mappings would probably get created at the time of the first
 normal DMA map.
 
 So, you're saying with this approach you'd rather see a new type 2
 ioctl instead of adding new flags to DMA map, right?

I'm not sure I know enough yet to have a suggestion.  What would be the
purpose of userspace specifying the iova and size here?  If userspace
just needs to know that it needs X additional windows for MSI and can tell
the kernel to use banks 0 through (X-1) for MSI, that sounds more like
an ioctl interface than a DMA_MAP flag.  Thanks,

Alex

 C.  Explicit mapping using normal DMA map.  The last idea is that
 we would introduce a new ioctl to give user-space an fd to
 the MSI bank, which could be mmapped.  The flow would be
 something like this:
-for each group user space calls new ioctl VFIO_GROUP_GET_MSI_FD
-user space mmaps the fd, getting a vaddr
-user space does a normal DMA map for desired iova
 This approach makes everything explicit, but adds a new ioctl
 applicable most likely only to the PAMU (type2 iommu).
 

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
 On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 17:32 +0000, Yoder Stuart-B08248 wrote:
   2.   MSI window mappings
  
  The more problematic question is how to deal with MSIs.  We need to
  create mappings for up to 3 MSI banks that a device may need to target
  to generate interrupts.  The Linux MSI driver can allocate MSIs from
  the 3 banks any way it wants, and currently user space has no way of
  knowing which bank may be used for a given device.

  There are 3 options we have discussed and would like your direction:

  A.  Implicit mappings -- with this approach user space would not
  explicitly map MSIs.  User space would be required to set the
  geometry so that there are 3 unused windows (the last 3 windows)
  for MSIs, and it would be up to the kernel to create the mappings.
  This approach requires some specific semantics (leaving 3 windows)
  and it potentially gets a little weird-- when should the kernel
  actually create the MSI mappings?  When should they be unmapped?
  Some convention would need to be established.
  
  VFIO would have control of SET/GET_ATTR, right?  So we could reduce the
  number exposed to userspace on GET and transparently add MSI entries on
  SET.
 
 What do you mean by reduce the number exposed?  Userspace decides how
 many entries there are, but it must be a power of two between 1 and 256.

I didn't understand the API.

  On x86 the interrupt remapper handles this transparently when MSI
  is enabled and userspace never gets direct access to the device MSI
  address/data registers.
 
 x86 has a totally different mechanism here, as far as I understand --  
 even before you get into restrictions on mappings.

So what control will userspace have over programming the actual MSI
vectors on PAMU?

  What kind of restrictions do you have around
  adding and removing windows while the aperture is enabled?
 
 Subwindows can be modified while the aperture is enabled, but the  
 aperture size and number of subwindows cannot be changed.
 
  B.  Explicit mapping using DMA map flags.  The idea is that a new
  flag to DMA map (VFIO_DMA_MAP_FLAG_MSI) would mean that
  a mapping is to be created for the supplied iova.  No vaddr
  is given though.  So in the above example there would be
  a dma map at 0x10000000 for 24KB (and no vaddr).   It's
  up to the kernel to determine which bank gets mapped where.
  So, this option puts user space in control of which windows
  are used for MSIs and when MSIs are mapped/unmapped.   There
  would need to be some semantics as to how this is used-- it
  only makes sense
  
  This could also be done as another type2 ioctl extension.
 
 Again, what is type2, specifically?  If someone else is adding their  
 own IOMMU that is kind of, sort of like PAMU, how would they know if  
 it's close enough?  What assumptions can a user make when they see that  
 they're dealing with type2?

Naming always has and always will be a problem.  I assume this is named
type2 rather than PAMU because it's trying to expose a generic windowed
IOMMU fitting the IOMMU API.  Like type1, it doesn't really make sense
to name it IOMMU API because that's a kernel internal interface and
we're designing a userspace interface that just happens to use that.
Tagging it to a piece of hardware makes it less reusable.  Type1 is
arbitrary.  It might as well be named brown and this one can be
blue.

  What's the value to userspace in determining which windows are used  
  by which banks?
 
 That depends on who programs the MSI config space address.  What is  
 important is userspace controlling which iovas will be dedicated to  
 this, in case it wants to put something else there.

So userspace is programming the MSI vectors, targeting a user programmed
iova?  But an iova selects a window and I thought there were some number
of MSI banks and we don't really know which ones we'll need...  still
confused.

  It sounds like the case that there are X banks and if userspace wants to
  use MSI it needs to leave X windows available for that.  Is this just
  buying userspace a few more windows to allow them the choice between MSI
  or RAM?
 
 Well, there could be that.  But also, userspace will generally have a  
 much better idea of the type of mappings it's creating, so it's easier  
 to keep everything explicit at the kernel/user interface than require  
 more complicated code in the kernel to figure things out automatically  
 (not just for MSIs but in general).
 
 If the kernel automatically creates the MSI mappings, when does it  
 assume that userspace is done creating its own?  What if userspace  
 doesn't need any DMA other than the MSIs?  What if userspace wants to  
 continue 

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
 On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
  C.  Explicit mapping using normal DMA map.  The last idea is that
  we would introduce a new ioctl to give user-space an fd to
  the MSI bank, which could be mmapped.  The flow would be
  something like this:
 -for each group user space calls new ioctl
   VFIO_GROUP_GET_MSI_FD
 -user space mmaps the fd, getting a vaddr
 -user space does a normal DMA map for desired iova
  This approach makes everything explicit, but adds a new ioctl
  applicable most likely only to the PAMU (type2 iommu).
 
  And the DMA_MAP of that mmap then allows userspace to select the window
  used?  This one seems like a lot of overhead, adding a new ioctl, new
  fd, mmap, special mapping path, etc.
 
 
  There's going to be special stuff no matter what.  This would keep it
  separated from the IOMMU map code.
 
  I'm not sure what you mean by overhead here... the runtime overhead of
  setting things up is not particularly relevant as long as it's reasonable.
  If you mean development and maintenance effort, keeping things well
  separated should help.
 
 We don't need to change DMA_MAP.  If we can simply add a new type 2
 ioctl that allows user space to set which windows are MSIs, it seems vastly
 less complex than an ioctl to supply a new fd, mmap of it, etc.
 
 So maybe 2 ioctls:
 VFIO_IOMMU_GET_MSI_COUNT
 VFIO_IOMMU_MAP_MSI(iova, size)
 

How are MSIs related to devices on PAMU?  On x86 MSI count is very
device specific, which means it would be a VFIO_DEVICE_* ioctl (actually
VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with it
being a device ioctl is that you need to get the device FD, but the
IOMMU protection needs to be established before you can get that... so
there's an ordering problem if you need it from the device before
configuring the IOMMU.  Thanks,

Alex



Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 04:08:27 PM, Stuart Yoder wrote:
On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:

 This could also be done as another type2 ioctl extension.


 Again, what is type2, specifically?  If someone else is adding their own
 IOMMU that is kind of, sort of like PAMU, how would they know if it's close
 enough?  What assumptions can a user make when they see that they're dealing
 with type2?

We will define that as part of the type2 implementation.   Highly unlikely
anything but a PAMU will comply.


So then why not just call it pamu instead of being obfuscatory?

 There's going to be special stuff no matter what.  This would keep it
 separated from the IOMMU map code.

 I'm not sure what you mean by overhead here... the runtime overhead of
 setting things up is not particularly relevant as long as it's reasonable.
 If you mean development and maintenance effort, keeping things well
 separated should help.

We don't need to change DMA_MAP.  If we can simply add a new type 2
ioctl that allows user space to set which windows are MSIs,


And what specifically does that ioctl do?  It causes new mappings to be  
created, right?  So you're changing (or at least adding to) the DMA map  
mechanism.


it seems vastly less complex than an ioctl to supply a new fd, mmap  
of it, etc.


I don't see enough complexity in the mmap approach for anything to be
vastly less complex in comparison.  I think you're building the mmap
approach up in your head to be a lot worse than it would actually be.


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 04:16:11 PM, Alex Williamson wrote:

On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:
 The number of windows is always a power of 2 (and the max is 256).  And to
 reduce PAMU cache pressure you want to use the fewest windows you can.
 So I don't see, practically, how we could transparently steal entries to
 add the MSIs.  Either user space knows to leave empty windows for
 MSIs and by convention the kernel knows which windows those are (as
 in option #A), or user space explicitly tells the kernel which windows
 (as in option #B).


Ok, apparently I don't understand the API.  Is it something like
userspace calls GET_ATTR and finds out that there are 256 available
windows, userspace determines that it needs 8 for RAM and then it has an
MSI device, so it needs to call SET_ATTR and ask for 16?  That seems
prone to exploitation by the first userspace to allocate its aperture,

What exploitation?

It's not as if there is a pool of 256 global windows that users  
allocate from.  The subwindow count is just how finely divided the  
aperture is.  The only way one user will affect another is through  
cache contention (which is why we want the minimum number of subwindows  
that we can get away with).



but I'm also not sure why userspace couldn't specify the (non-power of 2)
number of windows it needs for RAM, then VFIO would see that the devices
attached have MSI and add those windows and align to a power of 2.


If you double the subwindow count without userspace knowing, you have  
to double the aperture as well (and you may need to grow up or down  
depending on alignment).  This means you also need to halve the maximum  
aperture that userspace can request.  And you need to expose a  
different number of maximum subwindows in the IOMMU API based on  
whether we might have MSIs of this type.  It's ugly and awkward, and  
removes the possibility for userspace to place the MSIs in some unused  
slot in the middle, or not use MSIs at all.
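
The cost of silently rounding up can be put in numbers. The sketch below is only an illustration: `geometry_with_msi` is a hypothetical helper, and it assumes every window keeps the same size, so the aperture grows in step with the subwindow count.

```python
MB = 1 << 20


def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def geometry_with_msi(ram_windows, msi_banks, window_size):
    """Return (subwindow_count, aperture_size) after adding MSI windows.

    The count is rounded up to a power of two; the aperture is the count
    times the (unchanged) window size.
    """
    count = next_pow2(ram_windows + msi_banks)
    return count, count * window_size


# 4 RAM windows plus 3 MSI banks still fit in 8 subwindows.
print(geometry_with_msi(4, 3, 64 * MB))   # (8, 536870912)

# 8 RAM windows plus a single MSI bank force a jump to 16 subwindows,
# doubling the aperture from 512MB to 1GB.
print(geometry_with_msi(8, 1, 64 * MB))   # (16, 1073741824)
```

The second case is the one described above: user space asked for a full power of two, so adding even one MSI window behind its back doubles both the subwindow count and the aperture.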


-Scott


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 04:32:04 PM, Alex Williamson wrote:

On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
 On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
  On x86 the interrupt remapper handles this transparently when MSI
  is enabled and userspace never gets direct access to the device MSI
  address/data registers.

 x86 has a totally different mechanism here, as far as I understand --
 even before you get into restrictions on mappings.

So what control will userspace have over programming the actual MSI
vectors on PAMU?


Not sure what you mean -- PAMU doesn't get explicitly involved in  
MSIs.  It's just another 4K page mapping (per relevant MSI bank).  If  
you want isolation, you need to make sure that an MSI group is only  
used by one VFIO group, and that you're on a chip that has alias pages  
with just one MSI bank register each (newer chips do, but the first  
chip to have a PAMU didn't).
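
The first isolation condition, that an MSI bank is used by only one VFIO group, is easy to state as a check. This is a sketch over made-up data: the group names, bank numbers and the `shared_msi_banks` helper are all hypothetical, and it says nothing about the second condition (alias pages with one bank register each), which depends on the chip.

```python
from collections import defaultdict


def shared_msi_banks(assignments):
    """Return {bank: sorted groups} for every bank used by more than one group.

    `assignments` maps a VFIO group name to the set of MSI banks it uses.
    An empty result means no bank is shared between groups.
    """
    users = defaultdict(set)
    for group, banks in assignments.items():
        for bank in banks:
            users[bank].add(group)
    return {bank: sorted(groups) for bank, groups in users.items() if len(groups) > 1}


isolated = shared_msi_banks({"group0": {0}, "group1": {1, 2}})
conflict = shared_msi_banks({"group0": {0, 1}, "group1": {1}})
print(isolated)   # {}
print(conflict)   # {1: ['group0', 'group1']}
```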



  This could also be done as another type2 ioctl extension.

 Again, what is type2, specifically?  If someone else is adding their
 own IOMMU that is kind of, sort of like PAMU, how would they know if
 it's close enough?  What assumptions can a user make when they see that
 they're dealing with type2?

Naming always has and always will be a problem.  I assume this is named
type2 rather than PAMU because it's trying to expose a generic windowed
IOMMU fitting the IOMMU API.


But how closely is the MSI situation related to a generic windowed  
IOMMU, then?  We could just as well have a highly flexible IOMMU in  
terms of arbitrary 4K page mappings, but still handle MSIs as pages to  
be mapped rather than a translation table.  Or we could have a windowed  
IOMMU that has an MSI translation table.



Like type1, it doesn't really make sense
to name it IOMMU API because that's a kernel internal interface and
we're designing a userspace interface that just happens to use that.
Tagging it to a piece of hardware makes it less reusable.


Well, that's my point.  Is it reusable at all, anyway?  If not, then  
giving it a more obscure name won't change that.  If it is reusable,  
then where is the line drawn between things that are PAMU-specific or  
MPIC-specific and things that are part of the generic windowed IOMMU  
abstraction?


Type1 is arbitrary.  It might as well be named brown and this one can be
blue.


The difference is that type1 seems to refer to hardware that can do  
arbitrary 4K page mappings, possibly constrained by an aperture but  
nothing else.  More than one IOMMU can reasonably fit that.  The odds  
that another IOMMU would have exactly the same restrictions as PAMU  
seem smaller in comparison.


In any case, if you had to deal with some Intel-only quirk, would it  
make sense to call it a type1 attribute?  I'm not advocating one way  
or the other on whether an abstraction is viable here (though Stuart  
seems to think it's highly unlikely anything but a PAMU will comply),  
just that if it is to be abstracted rather than a hardware-specific  
interface, we need to document what is and is not part of the  
abstraction.  Otherwise a non-PAMU-specific user won't know what they  
can rely on, and someone adding support for a new windowed IOMMU won't  
know if theirs is close enough, or they need to introduce a type3.


  What's the value to userspace in determining which windows are used
  by which banks?

 That depends on who programs the MSI config space address.  What is
 important is userspace controlling which iovas will be dedicated to
 this, in case it wants to put something else there.

So userspace is programming the MSI vectors, targeting a user programmed
iova?  But an iova selects a window and I thought there were some number
of MSI banks and we don't really know which ones we'll need...  still
confused.


Userspace would also need a way to find out the page offset and data  
value.  That may be an argument in favor of having the two ioctls  
Stuart later suggested (get MSI count, and map MSI).  Would there be  
any complication in the VFIO code from tracking a mapping that doesn't  
have a userspace virtual address associated with it?


 There's going to be special stuff no matter what.  This would keep it
 separated from the IOMMU map code.

 I'm not sure what you mean by overhead here... the runtime overhead
 of setting things up is not particularly relevant as long as it's
 reasonable.  If you mean development and maintenance effort, keeping
 things well separated should help.

Overhead in terms of code required and complexity.  More things to
reference count and shut down in the proper order on userspace exit.
Thanks,


That didn't stop others from having me convert the KVM device control  
API to use file descriptors instead of something more ad-hoc with a  
better-defined destruction order. :-)


I don't know if it necessarily needs to be a separate fd -- it could be  
just another device resource like BARs, with some way for userspace to  

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Scott Wood

On 04/02/2013 04:38:45 PM, Alex Williamson wrote:

On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
 On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood scottw...@freescale.com wrote:
  C.  Explicit mapping using normal DMA map.  The last idea is that
  we would introduce a new ioctl to give user-space an fd to
  the MSI bank, which could be mmapped.  The flow would be
  something like this:
 -for each group user space calls new ioctl
   VFIO_GROUP_GET_MSI_FD
 -user space mmaps the fd, getting a vaddr
 -user space does a normal DMA map for desired iova
  This approach makes everything explicit, but adds a new ioctl
  applicable most likely only to the PAMU (type2 iommu).
 
  And the DMA_MAP of that mmap then allows userspace to select the window
  used?  This one seems like a lot of overhead, adding a new ioctl, new
  fd, mmap, special mapping path, etc.
 
  There's going to be special stuff no matter what.  This would keep it
  separated from the IOMMU map code.
 
  I'm not sure what you mean by overhead here... the runtime overhead of
  setting things up is not particularly relevant as long as it's reasonable.
  If you mean development and maintenance effort, keeping things well
  separated should help.

 We don't need to change DMA_MAP.  If we can simply add a new type 2
 ioctl that allows user space to set which windows are MSIs, it seems vastly
 less complex than an ioctl to supply a new fd, mmap of it, etc.

 So maybe 2 ioctls:
 VFIO_IOMMU_GET_MSI_COUNT


Do you mean a count of actual MSIs or a count of MSI banks used by the  
whole VFIO group?



 VFIO_IOMMU_MAP_MSI(iova, size)


Not sure how you mean size to be used -- for MPIC it would be 4K per  
bank, and you can only map one bank at a time (which bank you're  
mapping should be a parameter, if only so that the kernel doesn't have  
to keep iteration state for you).
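
To make the per-bank idea concrete, here is a minimal userspace sketch of
how a caller might lay out one 4K iova page per MSI bank and name the bank
explicitly in each request.  Everything in it (struct msi_bank_map,
plan_msi_bank_maps(), MSI_BANK_SIZE) is a hypothetical illustration under
the assumptions stated in the comments, not an existing or proposed VFIO
interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MSI_BANK_SIZE 4096u	/* assumption: one 4K page per MPIC MSI bank */

/* Hypothetical argument for a per-bank "map MSI" call: the bank is named
 * explicitly, so the kernel keeps no iteration state for the caller. */
struct msi_bank_map {
	uint32_t bank;	/* which MSI bank to map */
	uint64_t iova;	/* 4K-aligned iova chosen by userspace */
};

/* Fill out[0..nbanks-1] with consecutive 4K pages starting at base_iova.
 * Returns 0 on success, -EINVAL if base_iova is not 4K-aligned. */
static int plan_msi_bank_maps(uint64_t base_iova, uint32_t nbanks,
			      struct msi_bank_map *out)
{
	uint32_t i;

	assert(out != NULL || nbanks == 0);
	if (base_iova % MSI_BANK_SIZE)
		return -EINVAL;

	for (i = 0; i < nbanks; i++) {
		out[i].bank = i;
		out[i].iova = base_iova + (uint64_t)i * MSI_BANK_SIZE;
	}
	return 0;
}
```

A caller would first ask how many banks the group uses, then issue one map
request per entry of the filled array.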



How are MSIs related to devices on PAMU?


PAMU doesn't care about MSIs.  The relation of individual MSIs to a  
device is standard PCI stuff.  Each MSI bank (which is part of the  
MPIC, not PAMU) can hold numerous MSIs.  The VFIO user would want to  
map all MSI banks that are in use by any of the devices in the group.   
Ideally we'd let the VFIO grouping influence the allocation of MSIs.



On x86 MSI count is very
device specific, which means it would be a VFIO_DEVICE_* ioctl (actually
VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with it
being a device ioctl is that you need to get the device FD, but the
IOMMU protection needs to be established before you can get that... so
there's an ordering problem if you need it from the device before
configuring the IOMMU.  Thanks,


What do you mean by IOMMU protection needs to be established?   
Wouldn't we just start with no mappings in place?


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v2 0/3] virtio/vhost: Add checks for uninitialized VQs

2013-04-02 Thread Nicholas A. Bellinger
On Tue, 2013-04-02 at 15:01 +0300, Michael S. Tsirkin wrote:
 On Mon, Apr 01, 2013 at 11:58:21PM +, Nicholas A. Bellinger wrote:
  From: Nicholas Bellinger n...@linux-iscsi.org
  
  Hi folks,
  
  This series adds a virtio_queue_valid() for use by virtio-pci code in
  order to prevent operations upon uninitialized VQs, which is currently
  expected to occur during seabios setup of virtio-scsi with in-flight
  vhost-scsi-pci device code.
  
  On the vhost side, it also adds virtio_queue_valid() sanity checks in
  vhost_virtqueue_[start,stop]() and vhost_verify_ring_mappings() in order
  to skip the same uninitialized VQs.
  
  Changes from v1:
- Remove now unnecessary virtio_queue_get_num() calls in virtio-pci.c
- Add virtio_queue_valid() calls in vhost_virtqueue_[start,stop]()
  
  Please review.
  
  --nab
 
 Looks reasonable.
 Acked-by: Michael S. Tsirkin m...@redhat.com
 

Thanks MST!

Anthony, do you want to pick these up now..?  Or shall I include in the
next vhost-scsi-pci PATCH-v3 series..?

--nab

 So - does this fix the issues you saw with vhost-scsi?
 
  Michael S. Tsirkin (1):
virtio: add API to check that ring is setup
  
  Nicholas Bellinger (2):
virtio-pci: Add virtio_queue_valid checks ahead of
  virtio_queue_get_num
vhost: Skip uninitialized VQs in vhost_virtqueue_[start,stop]
  
   hw/vhost.c  |   12 
   hw/virtio-pci.c |   34 +++---
   hw/virtio.c |5 +
   hw/virtio.h |1 +
   4 files changed, 33 insertions(+), 19 deletions(-)
  
  -- 
  1.7.2.5


Re: [PATCH] tcm_vhost: Use ACCESS_ONCE for vs->vs_tpg[target] access

2013-04-02 Thread Nicholas A. Bellinger
On Tue, 2013-04-02 at 18:39 +0300, Michael S. Tsirkin wrote:
 On Tue, Apr 02, 2013 at 11:31:37PM +0800, Asias He wrote:
  In vhost_scsi_handle_vq:
  
        tv_tpg = vs->vs_tpg[target];
        if (!tv_tpg) {

                return
        }
  
        tv_cmd = vhost_scsi_allocate_cmd(tv_tpg, v_req,
  
  1) vs->vs_tpg[target] might change after the NULL check and 2) the above
  line might access tv_tpg from vs->vs_tpg[target]. To prevent 2), use
  ACCESS_ONCE. Thanks mst for catching this up!
  
  Signed-off-by: Asias He as...@redhat.com
 
 OK this might be ok for 3.9.
 
 Acked-by: Michael S. Tsirkin m...@redhat.com
 
 Nicholas can you pick this up pls?
 

Applying to target-pending/master now.
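
For reference, the pattern the patch relies on can be sketched in plain
userspace C.  This is only an illustration: the ACCESS_ONCE definition
below is a stand-in for the kernel macro of the same name, and struct tpg,
struct scsi and lookup_tpg_id() are hypothetical simplifications, not the
tcm_vhost code.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel macro: the volatile cast makes the
 * compiler perform exactly one load of x here (GCC/Clang __typeof__). */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

#define MAX_TARGET 4

struct tpg { int id; };				/* hypothetical */
struct scsi { struct tpg *vs_tpg[MAX_TARGET]; };	/* hypothetical */

/* Returns the tpg id for target, or -1 if the slot is empty.  The slot is
 * read once into a local, so the NULL check and the later dereference use
 * the same pointer value even if another thread rewrites the slot. */
static int lookup_tpg_id(struct scsi *vs, unsigned int target)
{
	struct tpg *tv_tpg;

	assert(target < MAX_TARGET);
	tv_tpg = ACCESS_ONCE(vs->vs_tpg[target]);
	if (!tv_tpg)
		return -1;
	return tv_tpg->id;
}
```

Note that this only pins down the value that was read; it does nothing
about the lifetime of the object the pointer refers to.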

 For 3.10 I still think it's best to get rid of it
 and stick vs->vs_tpg in vq->private_data.
 

Your call here.  Given that vhost-scsi-pci code + Seabios w/ virtio-scsi
enabled will be broken without Asias's two extra vq->private_data and
initialize vq->last_used_idx changes on the list, they will certainly
need to hit 3.9.x code once you're happy to ACK for v3.10.

Asias, I assume you'll be updating this soon..?

--nab



RE: [PATCH v6 6/6] KVM: Use eoi to track RTC interrupt delivery status

2013-04-02 Thread Zhang, Yang Z
Gleb Natapov wrote on 2013-04-02:
 On Fri, Mar 29, 2013 at 03:25:16AM +, Zhang, Yang Z wrote:
 Paolo Bonzini wrote on 2013-03-26:
 Il 22/03/2013 06:24, Yang Zhang ha scritto:
 +static void rtc_irq_ack_eoi(struct kvm_vcpu *vcpu,
 +  struct rtc_status *rtc_status, int irq)
 +{
 +  if (irq != RTC_GSI)
 +  return;
 +
 +  if (test_and_clear_bit(vcpu->vcpu_id, rtc_status->dest_map))
 +  --rtc_status->pending_eoi;
 +
 +  WARN_ON(rtc_status->pending_eoi < 0);
 +}
 
 This is the only case where you're passing the struct rtc_status instead
 of the struct kvm_ioapic.  Please use the latter, and make it the first
 argument.
 
 @@ -244,7 +268,14 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
irqe.level = 1;
irqe.shorthand = 0;
 -  return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe, NULL);
 +  if (irq == RTC_GSI) {
 +  ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe,
 +  ioapic->rtc_status.dest_map);
 +  ioapic->rtc_status.pending_eoi = ret;
 
 I think you should either add a
 
 BUG_ON(ioapic->rtc_status.pending_eoi != 0);
 or use ioapic->rtc_status.pending_eoi += ret (or both).
 
 A malicious guest may write EOI more than once, and then pending_eoi
 would go negative. But that should not be a bug; a WARN_ON is enough, and
 we already do that in ack_eoi, so there is no need to duplicate it here.
 
 Since we track vcpus that already called EOI and decrement pending_eoi
 only once for each vcpu malicious guest cannot trigger it, but we
 already do WARN_ON() in rtc_irq_ack_eoi(), so I am not sure we need
 another one here. += will be correct (since pending_eoi == 0 here), but
 confusing since it makes an impression that pending_eoi may not be zero.
Yes, I also got the wrong impression.
With the previous implementation, pending_eoi could be nonzero here: the
destination vcpus were calculated by parsing the IOAPIC entry, and with
lowest-priority delivery mode all possible vcpus were set in dest_map, even
those that did not receive the interrupt in the end. At the same time, a
malicious guest could send an IPI with the same vector as the RTC to those
vcpus that were in dest_map but had not received the RTC interrupt. Then
pending_eoi would go negative.
Now we set dest_map only for the vcpus that really received the interrupt, so
the above case cannot happen. So, as you and Paolo suggested, it is better to
use +=.
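
The accounting described above can be modelled in a few lines of userspace
C.  This is a simplified sketch, not the patch itself: struct rtc_status
here uses a single 64-bit word for dest_map, and rtc_deliver() and
rtc_ack_eoi() are hypothetical stand-ins for the delivery path and for
rtc_irq_ack_eoi().

```c
#include <assert.h>
#include <stdint.h>

#define MAX_VCPUS 64

/* Simplified stand-in for the rtc_status in the patch. */
struct rtc_status {
	int pending_eoi;
	uint64_t dest_map;	/* bit n set: vcpu n received the RTC interrupt */
};

/* Record a delivery.  'received' holds only the vcpus that really got the
 * interrupt; pending_eoi is expected to be 0 on entry, hence "+=". */
static void rtc_deliver(struct rtc_status *s, uint64_t received)
{
	uint64_t m;

	s->dest_map |= received;
	for (m = received; m; m &= m - 1)	/* one step per set bit */
		s->pending_eoi++;
}

/* An EOI counts only once per vcpu, and only for vcpus in dest_map, so a
 * repeated or unrelated EOI cannot drive pending_eoi negative. */
static void rtc_ack_eoi(struct rtc_status *s, unsigned int vcpu_id)
{
	uint64_t bit;

	assert(vcpu_id < MAX_VCPUS);
	bit = UINT64_C(1) << vcpu_id;
	if (s->dest_map & bit) {
		s->dest_map &= ~bit;
		--s->pending_eoi;
	}
}
```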

Best regards,
Yang



Re: [PATCH v4 3/6] KVM: Initialize irqfd from kvm_init().

2013-04-02 Thread Sasha Levin
On 02/28/2013 04:22 AM, Cornelia Huck wrote:
 Currently, eventfd introduces module_init/module_exit functions
 to initialize/cleanup the irqfd workqueue. This only works, however,
 if no other module_init/module_exit functions are built into the
 same module.
 
 Let's just move the initialization and cleanup to kvm_init and kvm_exit.
 This way, it is also clearer where kvm startup may fail.
 
 Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com

I'm seeing this during boot:

[6.763302] [ cut here ]
[6.763763] WARNING: at kernel/workqueue.c:4204 destroy_workqueue+0x1df/0x3d0()
[6.764507] Modules linked in:
[6.764792] Pid: 1, comm: swapper/0 Tainted: GW 3.9.0-rc5-next-20130402-sasha-00015-g3522ec5 #324
[6.765654] Call Trace:
[6.765875]  [811074fb] warn_slowpath_common+0x8b/0xc0
[6.766436]  [81107545] warn_slowpath_null+0x15/0x20
[6.766947]  [8112ca7f] destroy_workqueue+0x1df/0x3d0
[6.768631]  [8100d880] kvm_irqfd_exit+0x10/0x20
[6.77]  [81004dbb] kvm_init+0x2ab/0x310
[6.770607]  [86183dc0] ? cpu_has_kvm_support+0x4d/0x4d
[6.771241]  [86183fb4] vmx_init+0x1f4/0x437
[6.771709]  [86183dc0] ? cpu_has_kvm_support+0x4d/0x4d
[6.772266]  [810020f2] do_one_initcall+0xb2/0x1b0
[6.772995]  [86180021] kernel_init_freeable+0x15d/0x1ef
[6.773857]  [8617f801] ? loglevel+0x31/0x31
[6.774609]  [83d51230] ? rest_init+0x140/0x140
[6.775551]  [83d51239] kernel_init+0x9/0xf0
[6.776162]  [83dbf37c] ret_from_fork+0x7c/0xb0
[6.776662]  [83d51230] ? rest_init+0x140/0x140
[6.777241] ---[ end trace 10bba684ced4346a ]---

And I think it has something to do with this patch.


Thanks,
Sasha


[PATCH v2] KVM: Call kvm_apic_match_dest() to check destination vcpu

2013-04-02 Thread Yang Zhang
From: Yang Zhang yang.z.zh...@intel.com

For a given vcpu, kvm_apic_match_dest() quickly tells you whether the
vcpu is in the destination list. Drop kvm_calculate_eoi_exitmap()
and use kvm_apic_match_dest() instead.

Signed-off-by: Yang Zhang yang.z.zh...@intel.com
---
 arch/x86/kvm/lapic.c |   47 ---
 arch/x86/kvm/lapic.h |4 
 virt/kvm/ioapic.c|9 +++--
 3 files changed, 3 insertions(+), 57 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index a8e9369..e227474 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -145,53 +145,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
-void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
-   struct kvm_lapic_irq *irq,
-   u64 *eoi_exit_bitmap)
-{
-   struct kvm_lapic **dst;
-   struct kvm_apic_map *map;
-   unsigned long bitmap = 1;
-   int i;
-
-   rcu_read_lock();
-   map = rcu_dereference(vcpu->kvm->arch.apic_map);
-
-   if (unlikely(!map)) {
-   __set_bit(irq->vector, (unsigned long *)eoi_exit_bitmap);
-   goto out;
-   }
-
-   if (irq->dest_mode == 0) { /* physical mode */
-   if (irq->delivery_mode == APIC_DM_LOWEST ||
-   irq->dest_id == 0xff) {
-   __set_bit(irq->vector,
- (unsigned long *)eoi_exit_bitmap);
-   goto out;
-   }
-   dst = &map->phys_map[irq->dest_id & 0xff];
-   } else {
-   u32 mda = irq->dest_id << (32 - map->ldr_bits);
-
-   dst = map->logical_map[apic_cluster_id(map, mda)];
-
-   bitmap = apic_logical_id(map, mda);
-   }
-
-   for_each_set_bit(i, &bitmap, 16) {
-   if (!dst[i])
-   continue;
-   if (dst[i]->vcpu == vcpu) {
-   __set_bit(irq->vector,
- (unsigned long *)eoi_exit_bitmap);
-   break;
-   }
-   }
-
-out:
-   rcu_read_unlock();
-}
-
 static void recalculate_apic_map(struct kvm *kvm)
 {
struct kvm_apic_map *new, *old = NULL;
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 2c721b9..baa20cf 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -160,10 +160,6 @@ static inline u16 apic_logical_id(struct kvm_apic_map *map, u32 ldr)
return ldr & map->lid_mask;
 }
 
-void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
-   struct kvm_lapic_irq *irq,
-   u64 *eoi_bitmap);
-
 static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu)
 {
return vcpu->arch.apic->pending_events;
diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
index 5ba005c..bb3906d 100644
--- a/virt/kvm/ioapic.c
+++ b/virt/kvm/ioapic.c
@@ -124,7 +124,6 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
 {
struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
union kvm_ioapic_redirect_entry *e;
-   struct kvm_lapic_irq irqe;
int index;
 
spin_lock(&ioapic->lock);
@@ -135,11 +134,9 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
(e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
 kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
 index))) {
-   irqe.dest_id = e->fields.dest_id;
-   irqe.vector = e->fields.vector;
-   irqe.dest_mode = e->fields.dest_mode;
-   irqe.delivery_mode = e->fields.delivery_mode << 8;
-   kvm_calculate_eoi_exitmap(vcpu, &irqe, eoi_exit_bitmap);
+   if (kvm_apic_match_dest(vcpu, NULL, 0,
+   e->fields.dest_id, e->fields.dest_mode))
+   __set_bit(e->fields.vector, (unsigned long *)eoi_exit_bitmap);
}
}
spin_unlock(&ioapic->lock);
-- 
1.7.1



Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread Paul Mackerras
On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
 Currently, devices that are emulated inside KVM are configured in a
 hardcoded manner based on an assumption that any given architecture
 only has one way to do it.  If there's any need to access device state,
 it is done through inflexible one-purpose-only IOCTLs (e.g.
 KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
 cumbersome and depletes a limited numberspace.
 
 This API provides a mechanism to instantiate a device of a certain
 type, returning an ID that can be used to set/get attributes of the
 device.  Attributes may include configuration parameters (e.g.
 register base address), device state, operational commands, etc.  It
 is similar to the ONE_REG API, except that it acts on devices rather
 than vcpus.
 
 Both device types and individual attributes can be tested without having
 to create the device or get/set the attribute, without the need for
 separately managing enumerated capabilities.
 
 Signed-off-by: Scott Wood scottw...@freescale.com

Some comments below...

 diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
 index 976eb65..77328aa 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
  written, then `n_invalid' invalid entries, invalidating any previously
  valid entries found.
  
 +4.79 KVM_CREATE_DEVICE
 +
 +Capability: KVM_CAP_DEVICE_CTRL

I notice this patch doesn't add this capability; you add it in a later
patch.  Since this patch adds the KVM_CREATE_DEVICE ioctl, it probably
should add the KVM_CAP_DEVICE_CTRL capability too.


 +Type: vm ioctl
 +Parameters: struct kvm_create_device (in/out)
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device type is unknown or unsupported
 +  EEXIST: Device already created, and this type of device may not
 +  be instantiated multiple times
 +  ENOSPC: Too many devices have been created

Is this still a possible error code?

 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
   u64 dac[KVMPPC_BOOKE_MAX_DAC];
  };
  
 +#define KVMPPC_IRQCHIP_NONE  0
 +#define KVMPPC_IRQCHIP_MPIC  1

This define should go in the patch that adds the MPIC device.

  struct kvm_vcpu_arch {
   ulong host_stack;
   u32 host_pid;
 @@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
   unsigned long magic_page_pa; /* phys addr to map the magic page to */
   unsigned long magic_page_ea; /* effect. addr to map the magic page to */
  
 + int irqchip_type;
 + void *irqchip_priv;

Since you add this (irqchip_priv) only to remove it in a later patch
and replace it by a device-specific pointer, why bother adding it
here?  And why not give irqchip_type the name it ultimately ends up
with?

 diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
 index 16b4595..bdfa526 100644
 --- a/arch/powerpc/kvm/powerpc.c
 +++ b/arch/powerpc/kvm/powerpc.c
 @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
   tasklet_kill(&vcpu->arch.tasklet);
  
   kvmppc_remove_vcpu_debugfs(vcpu);
 +
 + switch (vcpu->arch.irqchip_type) {
 + case KVMPPC_IRQCHIP_MPIC:
 + mpic_put(vcpu->arch.irqchip_priv);
 + break;
 + }

This is going to break bisection, since you don't define mpic_put() in
this patch.

 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 index 74d0ff3..20ce2d2 100644
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_PPC_EPR 86
  #define KVM_CAP_ARM_PSCI 87
  #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
 +#define KVM_CAP_DEVICE_CTRL 89
  
  #ifdef KVM_CAP_IRQ_ROUTING
  
 @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
  #define KVM_ARM_SET_DEVICE_ADDR   _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
  
  /*
 + * Device control API, available with KVM_CAP_DEVICE_CTRL
 + */
 +#define KVM_CREATE_DEVICE_TEST   1
 +
 +struct kvm_create_device {
 + __u32   type;   /* in: KVM_DEV_TYPE_xxx */
 + __u32   fd; /* out: device handle */
 + __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
 +};
 +
 +struct kvm_device_attr {
 + __u32   flags;  /* no flags currently defined */
 + __u32   group;  /* device-defined */
 + __u64   attr;   /* group-defined */
 + __u64   addr;   /* userspace address of attr data */
 +};
 +
 +/* ioctl for vm fd */
 +#define KVM_CREATE_DEVICE  _IOWR(KVMIO,  0xe0, struct kvm_create_device)

This define should go with the other VM ioctls, otherwise the next
person to add a VM ioctl will probably miss it and reuse the 0xe0
code.

Paul.

Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread Scott Wood

On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:

On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
 Currently, devices that are emulated inside KVM are configured in a
 hardcoded manner based on an assumption that any given architecture
 only has one way to do it.  If there's any need to access device state,
 it is done through inflexible one-purpose-only IOCTLs (e.g.
 KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
 cumbersome and depletes a limited numberspace.

 This API provides a mechanism to instantiate a device of a certain
 type, returning an ID that can be used to set/get attributes of the
 device.  Attributes may include configuration parameters (e.g.
 register base address), device state, operational commands, etc.  It
 is similar to the ONE_REG API, except that it acts on devices rather
 than vcpus.

 Both device types and individual attributes can be tested without having
 to create the device or get/set the attribute, without the need for
 separately managing enumerated capabilities.

 Signed-off-by: Scott Wood scottw...@freescale.com

Some comments below...

 diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
 index 976eb65..77328aa 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
  written, then `n_invalid' invalid entries, invalidating any previously
  valid entries found.

 +4.79 KVM_CREATE_DEVICE
 +
 +Capability: KVM_CAP_DEVICE_CTRL

I notice this patch doesn't add this capability;


Yes, it does (see below).


you add it in a later patch.


Maybe you're thinking of KVM_CAP_IRQ_MPIC?


 +Type: vm ioctl
 +Parameters: struct kvm_create_device (in/out)
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device type is unknown or unsupported
 +  EEXIST: Device already created, and this type of device may not
 +  be instantiated multiple times
 +  ENOSPC: Too many devices have been created

Is this still a possible error code?


If you mean ENOSPC, probably not -- it'd be replaced with whatever  
errors can come out of creating a file descriptor.



 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
u64 dac[KVMPPC_BOOKE_MAX_DAC];
  };

 +#define KVMPPC_IRQCHIP_NONE   0
 +#define KVMPPC_IRQCHIP_MPIC   1

This define should go in the patch that adds the MPIC device.

  struct kvm_vcpu_arch {
ulong host_stack;
u32 host_pid;
 @@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
  	unsigned long magic_page_pa; /* phys addr to map the magic page to */
  	unsigned long magic_page_ea; /* effect. addr to map the magic page to */


 +  int irqchip_type;
 +  void *irqchip_priv;

Since you add this (irqchip_priv) only to remove it in a later patch
and replace it by a device-specific pointer, why bother adding it
here?  And why not give irqchip_type the name it ultimately ends up
with?


Oops... These were patch shuffling accidents and will be removed from  
the next iteration.



 diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
 index 16b4595..bdfa526 100644
 --- a/arch/powerpc/kvm/powerpc.c
 +++ b/arch/powerpc/kvm/powerpc.c
 @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
tasklet_kill(&vcpu->arch.tasklet);

kvmppc_remove_vcpu_debugfs(vcpu);
 +
 +  switch (vcpu->arch.irqchip_type) {
 +  case KVMPPC_IRQCHIP_MPIC:
 +  mpic_put(vcpu->arch.irqchip_priv);
 +  break;
 +  }

This is going to break bisection, since you don't define mpic_put() in
this patch.


Sigh.  Something got messed up; I'll try to sort it out and resubmit.


 diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
 index 74d0ff3..20ce2d2 100644
 --- a/include/uapi/linux/kvm.h
 +++ b/include/uapi/linux/kvm.h
 @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
  #define KVM_CAP_PPC_EPR 86
  #define KVM_CAP_ARM_PSCI 87
  #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
 +#define KVM_CAP_DEVICE_CTRL 89


See, here's the capability. :-)


  /*
 + * Device control API, available with KVM_CAP_DEVICE_CTRL
 + */
 +#define KVM_CREATE_DEVICE_TEST1
 +
 +struct kvm_create_device {
 +  __u32   type;   /* in: KVM_DEV_TYPE_xxx */
 +  __u32   fd; /* out: device handle */
 +  __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
 +};
 +
 +struct kvm_device_attr {
 +  __u32   flags;  /* no flags currently defined */
 +  __u32   group;  /* device-defined */
 +  __u64   attr;   /* group-defined */
 +  __u64   addr;   /* userspace address of attr data */
 +};
 +
 +/* ioctl for vm fd */
 +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)


This define should go with the other VM ioctls, otherwise the next
person to add a VM ioctl will probably miss it and reuse the 0xe0
code.


That's actually why I moved it to a new 

Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread tiejun.chen

On 04/03/2013 01:30 AM, Scott Wood wrote:

On 04/02/2013 01:59:57 AM, tiejun.chen wrote:

On 04/02/2013 06:47 AM, Scott Wood wrote:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff71541..ed033c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2158,6 +2158,17 @@ out:
  }
  #endif

+static int kvm_ioctl_create_device(struct kvm *kvm,
+   struct kvm_create_device *cd)
+{
+bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
+
+switch (cd->type) {
+default:
+return -ENODEV;
+}


Even after applying patch 5, it looks like something is still missing here, like:

if (test)
WARN_ON_ONCE(!cd->type);


Why?  How does userspace passing in a bad type value mean the kernel needs to
report internal badness, why is a value of zero worse than any other bad value,
and why only when the test flag is set?


I just mean we need to do something here, since it looks like the 'test'
variable is defined but unused, right? But please correct this as you see fit :)


And if userspace can't guarantee that cd->type is never zero, we should return
-ENODEV after that switch() as well.


Tiejun


Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread tiejun.chen

On 04/03/2013 09:34 AM, Scott Wood wrote:

On 04/02/2013 08:28:01 PM, tiejun.chen wrote:

On 04/03/2013 01:30 AM, Scott Wood wrote:

On 04/02/2013 01:59:57 AM, tiejun.chen wrote:

On 04/02/2013 06:47 AM, Scott Wood wrote:

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff71541..ed033c0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2158,6 +2158,17 @@ out:
  }
  #endif

+static int kvm_ioctl_create_device(struct kvm *kvm,
+   struct kvm_create_device *cd)
+{
+bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
+
+switch (cd->type) {
+default:
+return -ENODEV;
+}


Even after applying patch 5, it looks like something is still missing here, like:

if (test)
WARN_ON_ONCE(!cd->type);


Why?  How does userspace passing in a bad type value mean the kernel needs to
report internal badness, why is a value of zero worse than any other bad value,
and why only when the test flag is set?


I just mean we need to do something here, since it looks like the 'test'
variable is defined but unused, right? But please correct this as you see fit :)


Yes, it's unused in this patch, but is used after patch 5 is applied.  I didn't
think it was worth adding a temporary unused annotation, since this part of the
kernel doesn't use -Werror.


Yes, it's acceptable in the !-Werror case if we shouldn't warn about anything, as you said.
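
For illustration, here is one way the test flag might be consumed once a
device type is handled in the switch.  This is a userspace sketch under
stated assumptions, not the code from patch 5: DEV_TYPE_EXAMPLE, the
devices_created counter and the placeholder handle value are hypothetical,
and struct create_device only mirrors the layout of kvm_create_device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CREATE_DEVICE_TEST 1u	/* mirrors KVM_CREATE_DEVICE_TEST */
#define DEV_TYPE_EXAMPLE   1u	/* hypothetical device type */

struct create_device {		/* same fields as kvm_create_device */
	uint32_t type;		/* in */
	uint32_t fd;		/* out: device handle */
	uint32_t flags;		/* in */
};

static int devices_created;	/* stands in for real per-VM state */

static int ioctl_create_device(struct create_device *cd)
{
	bool test;

	assert(cd != NULL);
	test = cd->flags & CREATE_DEVICE_TEST;

	switch (cd->type) {
	case DEV_TYPE_EXAMPLE:
		if (test)
			return 0;	/* type is supported; create nothing */
		cd->fd = 100 + devices_created++;	/* placeholder handle */
		return 0;
	default:
		/* Unknown type, zero included, with or without the flag. */
		return -ENODEV;
	}
}
```

With this shape a zero type needs no special case: it falls into the
default branch like any other unsupported value.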

Tiejun



[RFC PATCH v3 0/6] device control and in-kernel MPIC

2013-04-02 Thread Scott Wood
Fixed some patch shuffling errors and some minor issues.

Scott Wood (6):
  kvm: add device control API
  kvm/ppc/mpic: import hw/openpic.c from QEMU
  kvm/ppc/mpic: remove some obviously unneeded code
  kvm/ppc/mpic: adapt to kernel style and environment
  kvm/ppc/mpic: in-kernel MPIC emulation
  kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

 Documentation/virtual/kvm/api.txt  |   78 ++
 Documentation/virtual/kvm/devices/README   |1 +
 Documentation/virtual/kvm/devices/mpic.txt |   37 +
 arch/powerpc/include/asm/kvm_host.h|   16 +-
 arch/powerpc/include/asm/kvm_ppc.h |9 +
 arch/powerpc/kvm/Kconfig   |5 +
 arch/powerpc/kvm/Makefile  |2 +
 arch/powerpc/kvm/booke.c   |   12 +-
 arch/powerpc/kvm/mpic.c| 1784 
 arch/powerpc/kvm/powerpc.c |   38 +-
 include/linux/kvm_host.h   |2 +
 include/uapi/linux/kvm.h   |   37 +
 virt/kvm/kvm_main.c|   40 +
 13 files changed, 2051 insertions(+), 10 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/README
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt
 create mode 100644 arch/powerpc/kvm/mpic.c

-- 
1.7.9.5




[RFC PATCH v3 3/6] kvm/ppc/mpic: remove some obviously unneeded code

2013-04-02 Thread Scott Wood
Remove some parts of the code that are obviously QEMU or Raven specific
before fixing style issues, to reduce the style issues that need to be
fixed.

Signed-off-by: Scott Wood scottw...@freescale.com
---
 arch/powerpc/kvm/mpic.c |  344 ---
 1 file changed, 344 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 57655b9..d6d70a4 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -22,39 +22,6 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
-/*
- *
- * Based on OpenPic implementations:
- * - Intel GW80314 I/O companion chip developer's manual
- * - Motorola MPC8245 & MPC8540 user manuals.
- * - Motorola MCP750 (aka Raven) programmer manual.
- * - Motorola Harrier programmer manuel
- *
- * Serial interrupts, as implemented in Raven chipset are not supported yet.
- *
- */
-#include "hw.h"
-#include "ppc/mac.h"
-#include "pci/pci.h"
-#include "openpic.h"
-#include "sysbus.h"
-#include "pci/msi.h"
-#include "qemu/bitops.h"
-#include "ppc.h"
-
-//#define DEBUG_OPENPIC
-
-#ifdef DEBUG_OPENPIC
-static const int debug_openpic = 1;
-#else
-static const int debug_openpic = 0;
-#endif
-
-#define DPRINTF(fmt, ...) do { \
-if (debug_openpic) { \
-printf(fmt , ## __VA_ARGS__); \
-} \
-} while (0)
 
 #define MAX_CPU 32
 #define MAX_SRC 256
@@ -82,21 +49,6 @@ static const int debug_openpic = 0;
 #define OPENPIC_CPU_REG_START0x2
 #define OPENPIC_CPU_REG_SIZE 0x100 + ((MAX_CPU - 1) * 0x1000)
 
-/* Raven */
-#define RAVEN_MAX_CPU  2
-#define RAVEN_MAX_EXT 48
-#define RAVEN_MAX_IRQ 64
-#define RAVEN_MAX_TMR  MAX_TMR
-#define RAVEN_MAX_IPI  MAX_IPI
-
-/* Interrupt definitions */
-#define RAVEN_FE_IRQ (RAVEN_MAX_EXT)   /* Internal functional IRQ */
-#define RAVEN_ERR_IRQ (RAVEN_MAX_EXT + 1)   /* Error IRQ */
-#define RAVEN_TMR_IRQ (RAVEN_MAX_EXT + 2)   /* First timer IRQ */
-#define RAVEN_IPI_IRQ (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)   /* First IPI IRQ */
-/* First doorbell IRQ */
-#define RAVEN_DBL_IRQ (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
-
 typedef struct FslMpicInfo {
int max_ext;
 } FslMpicInfo;
@@ -138,44 +90,6 @@ static FslMpicInfo fsl_mpic_42 = {
 #define ILR_INTTGT_CINT   0x01 /* critical */
 #define ILR_INTTGT_MCP0x02 /* machine check */
 
-/* The currently supported INTTGT values happen to be the same as QEMU's
- * openpic output codes, but don't depend on this.  The output codes
- * could change (unlikely, but...) or support could be added for
- * more INTTGT values.
- */
-static const int inttgt_output[][2] = {
-   {ILR_INTTGT_INT, OPENPIC_OUTPUT_INT},
-   {ILR_INTTGT_CINT, OPENPIC_OUTPUT_CINT},
-   {ILR_INTTGT_MCP, OPENPIC_OUTPUT_MCK},
-};
-
-static int inttgt_to_output(int inttgt)
-{
-   int i;
-
-   for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-   if (inttgt_output[i][0] == inttgt) {
-   return inttgt_output[i][1];
-   }
-   }
-
-   fprintf(stderr, "%s: unsupported inttgt %d\n", __func__, inttgt);
-   return OPENPIC_OUTPUT_INT;
-}
-
-static int output_to_inttgt(int output)
-{
-   int i;
-
-   for (i = 0; i < ARRAY_SIZE(inttgt_output); i++) {
-   if (inttgt_output[i][1] == output) {
-   return inttgt_output[i][0];
-   }
-   }
-
-   abort();
-}
-
 #define MSIIR_OFFSET   0x140
 #define MSIIR_SRS_SHIFT29
 #define MSIIR_SRS_MASK (0x7 << MSIIR_SRS_SHIFT)
@@ -1265,228 +1179,36 @@ static uint64_t openpic_cpu_read(void *opaque, hwaddr addr, unsigned len)
return openpic_cpu_read_internal(opaque, addr, (addr & 0x1f000) >> 12);
 }
 
-static const MemoryRegionOps openpic_glb_ops_le = {
-   .write = openpic_gbl_write,
-   .read = openpic_gbl_read,
-   .endianness = DEVICE_LITTLE_ENDIAN,
-   .impl = {
-.min_access_size = 4,
-.max_access_size = 4,
-},
-};
-
 static const MemoryRegionOps openpic_glb_ops_be = {
.write = openpic_gbl_write,
.read = openpic_gbl_read,
-   .endianness = DEVICE_BIG_ENDIAN,
-   .impl = {
-.min_access_size = 4,
-.max_access_size = 4,
-},
-};
-
-static const MemoryRegionOps openpic_tmr_ops_le = {
-   .write = openpic_tmr_write,
-   .read = openpic_tmr_read,
-   .endianness = DEVICE_LITTLE_ENDIAN,
-   .impl = {
-.min_access_size = 4,
-.max_access_size = 4,
-},
 };
 
 static const MemoryRegionOps openpic_tmr_ops_be = {
.write = openpic_tmr_write,
.read = openpic_tmr_read,
-   .endianness = DEVICE_BIG_ENDIAN,
-   .impl = {
-.min_access_size = 4,
-.max_access_size = 4,
-},
-};
-
-static const MemoryRegionOps openpic_cpu_ops_le = {
-   

[RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-02 Thread Scott Wood
Enabling this capability connects the vcpu to the designated in-kernel
MPIC.  Using explicit connections between vcpus and irqchips allows
for flexibility, but the main benefit at the moment is that it
simplifies the code -- KVM doesn't need vm-global state to remember
which MPIC object is associated with this vm, and it doesn't need to
care about ordering between irqchip creation and vcpu creation.

Signed-off-by: Scott Wood <scottw...@freescale.com>
---
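Illustration only, not part of the patch: a minimal sketch of how
userspace might fill the KVM_ENABLE_CAP argument for this capability,
following the args[] layout documented in the api.txt hunk of this
patch.  The struct is a local stand-in modeled on struct kvm_enable_cap,
the capability number is supplied by the caller, and the names
example_enable_cap and example_fill_mpic_cap are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in modeled on struct kvm_enable_cap; real code would use
 * the definition from <linux/kvm.h> instead. */
struct example_enable_cap {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t pad[64];
};

/* args[0] is the MPIC device fd, args[1] is the MPIC CPU number that
 * this vcpu should be connected to. */
static void example_fill_mpic_cap(struct example_enable_cap *ec,
				  uint32_t cap_nr, int mpic_fd,
				  uint32_t mpic_cpu)
{
	assert(ec && mpic_fd >= 0);
	memset(ec, 0, sizeof(*ec));
	ec->cap = cap_nr;
	ec->args[0] = (uint64_t)mpic_fd;
	ec->args[1] = mpic_cpu;
}
```

The filled structure would then be handed to the KVM_ENABLE_CAP ioctl
on the fd of the vcpu being connected.
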
 Documentation/virtual/kvm/api.txt   |8 ++
 arch/powerpc/include/asm/kvm_host.h |8 ++
 arch/powerpc/include/asm/kvm_ppc.h  |2 ++
 arch/powerpc/kvm/booke.c|4 ++-
 arch/powerpc/kvm/mpic.c |   49 +++
 arch/powerpc/kvm/powerpc.c  |   26 +++
 include/uapi/linux/kvm.h|1 +
 7 files changed, 92 insertions(+), 6 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d52f3f9..4c326ae 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector.
 When disabled (args[0] == 0), behavior is as if this facility is unsupported.
 
 When this capability is enabled, KVM_EXIT_EPR can occur.
+
+6.6 KVM_CAP_IRQ_MPIC
+
+Architectures: ppc
+Parameters: args[0] is the MPIC device fd
+args[1] is the MPIC CPU number for this vcpu
+
+This capability connects the vcpu to an in-kernel MPIC device.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 7e7aef9..2a2e235 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg {
u64 dac[KVMPPC_BOOKE_MAX_DAC];
 };
 
+#define KVMPPC_IRQ_DEFAULT 0
+#define KVMPPC_IRQ_MPIC1
+
+struct openpic;
+
 struct kvm_vcpu_arch {
ulong host_stack;
u32 host_pid;
@@ -554,6 +559,9 @@ struct kvm_vcpu_arch {
unsigned long magic_page_pa; /* phys addr to map the magic page to */
unsigned long magic_page_ea; /* effect. addr to map the magic page to */
 
+   int irq_type;   /* one of KVM_IRQ_* */
+   struct openpic *mpic;   /* KVM_IRQ_MPIC */
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
struct kvm_vcpu_arch_shared shregs;
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 3b63b97..f54707f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -276,6 +276,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 }
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+u32 cpu);
 
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
  struct kvm_config_tlb *cfg);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index cddc6b3..7d00222 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
if (update_epr == true) {
if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
-   else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+   else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL) {
+   BUG_ON(vcpu->arch.irq_type != KVMPPC_IRQ_MPIC);
kvmppc_mpic_set_epr(vcpu);
+   }
}
 
new_msr = msr_mask;
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 8cda2fa..caffe3b 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct irq_dest *dst,
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
 {
-   struct openpic *opp = vcpu->arch.irqchip_priv;
+   struct openpic *opp = vcpu->arch.mpic;
int cpu = vcpu->vcpu_id;
unsigned long flags;
 
@@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp)
 
 static void unmap_mmio(struct openpic *opp)
 {
-   BUG_ON(opp->mmio_mapped);
-   opp->mmio_mapped = false;
-
-   kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+   if (opp->mmio_mapped) {
+   opp->mmio_mapped = false;
+   kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+   }
 }
 
 static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
@@ -1681,6 +1681,45 @@ static const struct file_operations kvm_mpic_fops = {
.release = kvm_mpic_release,
 };
 
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+u32 cpu)
+{
+   struct openpic *opp = mpic_filp->private_data;
+   int ret = 0;

[RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation

2013-04-02 Thread Scott Wood
Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
support

Signed-off-by: Scott Wood <scottw...@freescale.com>
---
v3: mpic_put -> kvmppc_mpic_put
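
Illustration only, not part of the patch: mpic.txt in this patch says
that the attr value for KVM_DEV_MPIC_GRP_IRQ_ACTIVE is the byte offset
of the relevant IVPR from EIVPR0, divided by 32.  A minimal sketch of
that arithmetic, assuming both offsets are given relative to the start
of the MPIC register space; the helper name and the sample offsets are
made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_IVPR_STRIDE 32u	/* bytes between consecutive IVPRs, per mpic.txt */

/* Returns the IRQ number for a standard source, or -1 if the offset lies
 * below EIVPR0 or does not fall on a 32-byte boundary. */
static int example_ivpr_offset_to_irq(uint32_t ivpr_off, uint32_t eivpr0_off)
{
	uint32_t delta;

	if (ivpr_off < eivpr0_off)
		return -1;
	delta = ivpr_off - eivpr0_off;
	if (delta % EXAMPLE_IVPR_STRIDE)
		return -1;
	return (int)(delta / EXAMPLE_IVPR_STRIDE);
}

/* Usage with made-up offsets: the IVPR 0x40 bytes past EIVPR0 is IRQ 2. */
static void example_usage(void)
{
	assert(example_ivpr_offset_to_irq(0x10040, 0x10000) == 2);
}
```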

 Documentation/virtual/kvm/devices/mpic.txt |   37 ++
 arch/powerpc/include/asm/kvm_host.h|8 +-
 arch/powerpc/include/asm/kvm_ppc.h |7 +
 arch/powerpc/kvm/Kconfig   |5 +
 arch/powerpc/kvm/Makefile  |2 +
 arch/powerpc/kvm/booke.c   |   10 +-
 arch/powerpc/kvm/mpic.c|  814 +---
 arch/powerpc/kvm/powerpc.c |   12 +-
 include/linux/kvm_host.h   |2 +
 include/uapi/linux/kvm.h   |9 +
 virt/kvm/kvm_main.c|9 +
 11 files changed, 714 insertions(+), 201 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt

diff --git a/Documentation/virtual/kvm/devices/mpic.txt b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 000..79e000a
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,37 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20 Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42 Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+  Base address of the 256 KiB MPIC register space.  Must be
+  naturally aligned.  A value of zero disables the mapping.
+  Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+Access an MPIC register, as if the access were made from the guest. 
+attr is the byte offset into the MPIC register space.  Accesses
+must be 4-byte aligned.
+
+MSIs may be signaled by using this attribute group to write
+to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+IRQ input line for each standard openpic source.  0 is inactive and 1
+is active, regardless of interrupt sense.
+
+For edge-triggered interrupts:  Writing 1 is considered an activating
+edge, and writing 0 is ignored.  Reading returns 1 if a previously
+signaled edge has not been acknowledged, and 0 otherwise.
+
+attr is the IRQ number.  IRQ numbers for standard sources are the
+byte offset of the relevant IVPR from EIVPR0, divided by 32.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..7e7aef9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -359,6 +359,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC   4
 #define KVMPPC_BOOKE_MAX_DAC   2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE0 /* EPR not supported */
+#define KVMPPC_EPR_USER1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL  2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
u32 dbcr0;
u32 dbcr1;
@@ -522,7 +527,7 @@ struct kvm_vcpu_arch {
u8 sane;
u8 cpu_type;
u8 hcall_needed;
-   u8 epr_enabled;
+   u8 epr_flags; /* KVMPPC_EPR_xxx */
u8 epr_needed;
 
u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -589,5 +594,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR  0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index f589307..3b63b97 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -245,6 +247,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union kvmppc_one_reg *);
 
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
+struct openpic;
+void kvmppc_mpic_put(struct openpic *opp);
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 {
@@ -270,6 +275,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
  struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/Kconfig 

[RFC PATCH v3 4/6] kvm/ppc/mpic: adapt to kernel style and environment

2013-04-02 Thread Scott Wood
Remove braces that Linux style doesn't permit, remove space after
'*' that Lindent added, keep error/debug strings contiguous, etc.

Substitute type names, debug prints, etc.

Signed-off-by: Scott Wood <scottw...@freescale.com>
---
 arch/powerpc/kvm/mpic.c |  445 ++-
 1 file changed, 208 insertions(+), 237 deletions(-)

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index d6d70a4..1df67ae 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -42,22 +42,22 @@
 #define OPENPIC_TMR_REG_SIZE 0x220
 #define OPENPIC_MSI_REG_START0x1600
 #define OPENPIC_MSI_REG_SIZE 0x200
-#define OPENPIC_SUMMARY_REG_START   0x3800
-#define OPENPIC_SUMMARY_REG_SIZE0x800
+#define OPENPIC_SUMMARY_REG_START0x3800
+#define OPENPIC_SUMMARY_REG_SIZE 0x800
 #define OPENPIC_SRC_REG_START0x1
 #define OPENPIC_SRC_REG_SIZE (MAX_SRC * 0x20)
 #define OPENPIC_CPU_REG_START0x2
-#define OPENPIC_CPU_REG_SIZE 0x100 + ((MAX_CPU - 1) * 0x1000)
+#define OPENPIC_CPU_REG_SIZE (0x100 + ((MAX_CPU - 1) * 0x1000))
 
-typedef struct FslMpicInfo {
+struct fsl_mpic_info {
int max_ext;
-} FslMpicInfo;
+};
 
-static FslMpicInfo fsl_mpic_20 = {
+static struct fsl_mpic_info fsl_mpic_20 = {
.max_ext = 12,
 };
 
-static FslMpicInfo fsl_mpic_42 = {
+static struct fsl_mpic_info fsl_mpic_42 = {
.max_ext = 12,
 };
 
@@ -100,44 +100,43 @@ static int get_current_cpu(void)
 {
CPUState *cpu_single_cpu;
 
-   if (!cpu_single_env) {
+   if (!cpu_single_env)
return -1;
-   }
 
cpu_single_cpu = ENV_GET_CPU(cpu_single_env);
return cpu_single_cpu->cpu_index;
 }
 
-static uint32_t openpic_cpu_read_internal(void *opaque, hwaddr addr, int idx);
-static void openpic_cpu_write_internal(void *opaque, hwaddr addr,
+static uint32_t openpic_cpu_read_internal(void *opaque, gpa_t addr, int idx);
+static void openpic_cpu_write_internal(void *opaque, gpa_t addr,
   uint32_t val, int idx);
 
-typedef enum IRQType {
+enum irq_type {
IRQ_TYPE_NORMAL = 0,
IRQ_TYPE_FSLINT,/* FSL internal interrupt -- level only */
IRQ_TYPE_FSLSPECIAL,/* FSL timer/IPI interrupt, edge, no polarity */
-} IRQType;
+};
 
-typedef struct IRQQueue {
+struct irq_queue {
/* Round up to the nearest 64 IRQs so that the queue length
 * won't change when moving between 32 and 64 bit hosts.
 */
unsigned long queue[BITS_TO_LONGS((MAX_IRQ + 63) & ~63)];
int next;
int priority;
-} IRQQueue;
+};
 
-typedef struct IRQSource {
+struct irq_source {
uint32_t ivpr;  /* IRQ vector/priority register */
uint32_t idr;   /* IRQ destination register */
uint32_t destmask;  /* bitmap of CPU destinations */
int last_cpu;
int output; /* IRQ level, e.g. OPENPIC_OUTPUT_INT */
int pending;/* TRUE if IRQ is pending */
-   IRQType type;
+   enum irq_type type;
bool level:1;   /* level-triggered */
-   bool nomask:1;  /* critical interrupts ignore mask on some FSL MPICs */
-} IRQSource;
+   bool nomask:1;  /* critical interrupts ignore mask on some FSL MPICs */
+};
 
 #define IVPR_MASK_SHIFT   31
 #define IVPR_MASK_MASK (1 << IVPR_MASK_SHIFT)
@@ -158,22 +157,19 @@ typedef struct IRQSource {
 #define IDR_EP  0x8000 /* external pin */
 #define IDR_CI  0x4000 /* critical interrupt */
 
-typedef struct IRQDest {
+struct irq_dest {
int32_t ctpr;   /* CPU current task priority */
-   IRQQueue raised;
-   IRQQueue servicing;
+   struct irq_queue raised;
+   struct irq_queue servicing;
qemu_irq *irqs;
 
/* Count of IRQ sources asserting on non-INT outputs */
uint32_t outputs_active[OPENPIC_OUTPUT_NB];
-} IRQDest;
-
-typedef struct OpenPICState {
-   SysBusDevice busdev;
-   MemoryRegion mem;
+};
 
+struct openpic {
/* Behavior control */
-   FslMpicInfo *fsl;
+   struct fsl_mpic_info *fsl;
uint32_t model;
uint32_t flags;
uint32_t nb_irqs;
@@ -186,9 +182,6 @@ typedef struct OpenPICState {
uint32_t brr1;
uint32_t mpic_mode_mask;
 
-   /* Sub-regions */
-   MemoryRegion sub_io_mem[6];
-
/* Global registers */
uint32_t frr;   /* Feature reporting register */
uint32_t gcr;   /* Global configuration register  */
@@ -196,9 +189,9 @@ typedef struct OpenPICState {
uint32_t spve;  /* Spurious vector register */
uint32_t tfrr;  /* Timer frequency reporting register */
/* Source registers */
-   IRQSource src[MAX_IRQ];
+   struct irq_source src[MAX_IRQ];
/* Local registers per output pin */
-   IRQDest dst[MAX_CPU];
+   struct irq_dest 

[RFC PATCH v3 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU

2013-04-02 Thread Scott Wood
This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e (Update version for
1.4.0-rc0), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottw...@freescale.com>
---
 arch/powerpc/kvm/mpic.c | 1686 +++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *   2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manuel
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+if (debug_openpic) { \
+printf(fmt , ## __VA_ARGS__); \
+} \
+} while (0)
+
+#define MAX_CPU 32
+#define MAX_SRC 256
+#define MAX_TMR 4
+#define MAX_IPI 4
+#define MAX_MSI 8
+#define MAX_IRQ (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID 0x03   /* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT (1 << 0)
+#define OPENPIC_FLAG_ILR  (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START0x0
+#define OPENPIC_GLB_REG_SIZE 0x10F0
+#define OPENPIC_TMR_REG_START0x10F0
+#define OPENPIC_TMR_REG_SIZE 0x220
+#define OPENPIC_MSI_REG_START0x1600
+#define OPENPIC_MSI_REG_SIZE 0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE0x800
+#define OPENPIC_SRC_REG_START0x1
+#define OPENPIC_SRC_REG_SIZE (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START0x2
+#define OPENPIC_CPU_REG_SIZE 0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU  2
+#define RAVEN_MAX_EXT 48
+#define RAVEN_MAX_IRQ 64
+#define RAVEN_MAX_TMR  MAX_TMR
+#define RAVEN_MAX_IPI  MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ (RAVEN_MAX_EXT)   /* Internal functional IRQ */
+#define RAVEN_ERR_IRQ (RAVEN_MAX_EXT + 1)   /* Error IRQ */
+#define RAVEN_TMR_IRQ (RAVEN_MAX_EXT + 2)   /* First timer IRQ */
+#define RAVEN_IPI_IRQ (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)   /* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+   int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+   .max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+   .max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT16
+#define FRR_NCPU_SHIFT 8
+#define FRR_VID_SHIFT  0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC  0x/* Generic Vendor ID */
+
+#define GCR_RESET0x8000
+#define GCR_MODE_PASS0x
+#define GCR_MODE_MIXED   0x2000
+#define GCR_MODE_PROXY   0x6000
+
+#define TBCR_CI   0x8000   /* count inhibit */
+#define TCCR_TOG  0x8000   /* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT  31
+#define IDR_EP_MASK   (1 << IDR_EP_SHIFT)

[RFC PATCH v3 1/6] kvm: add device control API

2013-04-02 Thread Scott Wood
Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, without the need for
separately managing enumerated capabilities.

Signed-off-by: Scott Wood <scottw...@freescale.com>
---
v3: remove some changes that were merged into this patch by accident,
and fix the error documentation for KVM_CREATE_DEVICE.

NOTE: I had some difficulty figuring out what ioctl numbers I should
assign...  it seems that at one point care was taken to keep vcpu and
vm ioctls separate, but some overlap exists now (despite not exhausting
the ioctl space).  Some of that was my fault, but not all of it. :-)
I moved to a new ioctl range for device control -- please let me know
if there's something else you'd prefer I do.
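
Illustration only, not part of the patch: a minimal sketch of how a
caller might describe one attribute before passing it to
KVM_SET_DEVICE_ATTR or KVM_GET_DEVICE_ATTR on the device fd.  The
struct is a local mirror of struct kvm_device_attr as documented in the
api.txt hunk of this patch; example_device_attr and example_fill_attr
are hypothetical names, and the group/attr numbers are left to the
caller because they are device-defined.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct kvm_device_attr as documented in this patch;
 * a real program would use the uapi header instead. */
struct example_device_attr {
	uint32_t flags;	/* no flags currently defined */
	uint32_t group;	/* device-defined */
	uint64_t attr;	/* group-defined */
	uint64_t addr;	/* userspace address of attr data */
};

/* Point the descriptor at caller-owned storage holding the value.  The
 * size of that storage is implied by the attribute, as with ONE_REG. */
static void example_fill_attr(struct example_device_attr *a, uint32_t group,
			      uint64_t attr, void *data)
{
	assert(a && data);
	memset(a, 0, sizeof(*a));
	a->group = group;
	a->attr = attr;
	a->addr = (uint64_t)(uintptr_t)data;
}
```

Because the kernel reads or writes the value through addr, the storage
has to stay valid until the ioctl returns.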
---
 Documentation/virtual/kvm/api.txt|   70 ++
 Documentation/virtual/kvm/devices/README |1 +
 include/uapi/linux/kvm.h |   27 
 virt/kvm/kvm_main.c  |   31 +
 4 files changed, 129 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/README

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 976eb65..d52f3f9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+  be instantiated multiple times
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Creates an emulated device in the kernel.  The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+   __u32   type;   /* in: KVM_DEV_TYPE_xxx */
+   __u32   fd; /* out: device handle */
+   __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+  (e.g. read-only attribute, or attribute that only makes
+  sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific.  See individual device documentation in
+the devices directory.  As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+struct kvm_device_attr {
+   __u32   flags;  /* no flags currently defined */
+   __u32   group;  /* device-defined */
+   __u64   attr;   /* group-defined */
+   __u64   addr;   /* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  addr is ignored.
 
 4.77 KVM_ARM_VCPU_INIT
 
diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 000..34a6983

Re: KVM: kvm_set_slave_cpu: Invalid argument when trying direct interrupt delivery

2013-04-02 Thread Tomoki Sekiyama
Hi,

Thank you for testing the patch.

Yangminqiang <yangminqi...@huawei.com> wrote:
> Hi Tomoki
>
> I tried your smart patch cpu isolation and direct interrupt delivery,
>   http://article.gmane.org/gmane.linux.kernel/1353803
>
> got output when I run qemu
>   kvm_set_slave_cpu: Invalid argument
>
> So I wonder
> * Did I misuse your patches?
> * How is the offlined CPU assigned? or the Guest OS will automatically detect
>   and use it?

Currently it is hard-coded in the patch for qemu-kvm just for testing:

diff -Narup a/qemu-kvm-1.0/qemu-kvm-x86.c b/qemu-kvm-1.0/qemu-kvm-x86.c
--- a/qemu-kvm-1.0/qemu-kvm-x86.c   2011-12-04 19:38:06.0 +0900
+++ b/qemu-kvm-1.0/qemu-kvm-x86.c   2012-09-06 20:19:44.828163734 +0900
@@ -139,12 +139,28 @@ static int kvm_enable_tpr_access_reporti
 return kvm_vcpu_ioctl(env, KVM_TPR_ACCESS_REPORTING, tac);
 }

+static int kvm_set_slave_cpu(CPUState *env)
+{
+int r, slave = env->cpu_index == 0 ? 2 : env->cpu_index == 1 ? 3 : -1;

`slave' is the host CPU ID of the offlined CPU assigned to the guest,
and `env->cpu_index' is the virtual CPU ID. You need to modify this line
and recompile qemu-kvm (or just offline CPUs 2 and 3 for a 2-vcpu guest ;) ).
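
If you want a different assignment, the hard-coded expression could be
replaced with a small lookup table.  This is only a sketch of that idea,
not code from the patch; example_slave_cpus and example_slave_for_vcpu
are hypothetical names, and the table contents simply mirror the
current 0 -> 2, 1 -> 3 mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table: the index is the vcpu number, the value is the
 * offlined host CPU to dedicate to that vcpu. */
static const int example_slave_cpus[] = { 2, 3 };

/* Returns the host CPU for the given vcpu, or -1 if none is assigned. */
static int example_slave_for_vcpu(int vcpu_index)
{
	size_t n = sizeof(example_slave_cpus) / sizeof(example_slave_cpus[0]);

	if (vcpu_index < 0 || (size_t)vcpu_index >= n)
		return -1;
	return example_slave_cpus[vcpu_index];
}

/* Usage: vcpu 1 maps to host CPU 3, and a third vcpu has no slave CPU. */
static void example_usage(void)
{
	assert(example_slave_for_vcpu(1) == 3);
	assert(example_slave_for_vcpu(2) == -1);
}
```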

Thanks,
Tomoki Sekiyama


Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread Paul Mackerras
On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> > On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> > > +4.79 KVM_CREATE_DEVICE
> > > +
> > > +Capability: KVM_CAP_DEVICE_CTRL
> >
> > I notice this patch doesn't add this capability;
>
> Yes, it does (see below).
>
> > you add it in a later patch.
>
> Maybe you're thinking of KVM_CAP_IRQ_MPIC?

No, I was referring to the addition to kvm_dev_ioctl_check_extension()
of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
userspace queries the KVM_CAP_DEVICE_CTRL capability.
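
To make the point concrete, here is a rough sketch of the shape I mean.
It is a standalone illustration rather than the actual kvm_main.c code:
the function name and the capability numbers are placeholders.

```c
#include <assert.h>

/* Placeholder numbers; the real values live in include/uapi/linux/kvm.h. */
enum {
	EXAMPLE_CAP_DEVICE_CTRL = 1,
	EXAMPLE_CAP_UNRELATED = 2,
};

/* Returns 1 if the queried capability is supported, 0 otherwise. */
static int example_check_extension(long ext)
{
	switch (ext) {
	case EXAMPLE_CAP_DEVICE_CTRL:
		return 1;	/* added by the same patch as the ioctl */
	default:
		return 0;
	}
}

/* Usage: userspace probes the capability before calling the new ioctl. */
static void example_usage(void)
{
	assert(example_check_extension(EXAMPLE_CAP_DEVICE_CTRL) == 1);
}
```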

> > > +/* ioctl for vm fd */
> > > +#define KVM_CREATE_DEVICE   _IOWR(KVMIO,  0xe0, struct kvm_create_device)
> >
> > This define should go with the other VM ioctls, otherwise the next
> > person to add a VM ioctl will probably miss it and reuse the 0xe0
> > code.
>
> That's actually why I moved it to a new section, with device control
> ioctls getting their own range, as the legacy device model and
> some other things did.  0xe0 is not the next ioctl that would be
> used for either vm or vcpu.  The ioctl numbering is actually already
> a mess, with sometimes care being taken to keep vcpu and vm ioctls
> from overlapping, but on other places overlapping does happen.  I'm
> not sure what exactly I should do here.

Well, even if you are using a new range, I still think that
KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
ioctls.  I guess it's ultimately up to the maintainers.

Paul.


Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 17:13 -0500, Scott Wood wrote:
 On 04/02/2013 04:16:11 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 15:54 -0500, Stuart Yoder wrote:
   The number of windows is always a power of 2 (and max is 256).  And to
   reduce PAMU cache pressure you want to use the fewest number of windows
   you can.  So, I don't see practically how we could transparently
   steal entries to add the MSIs.  Either user space knows to leave empty
   windows for MSIs and by convention the kernel knows which windows those
   are (as in option #A) or explicitly tell the kernel which windows (as in
   option #B).
  
  Ok, apparently I don't understand the API.  Is it something like
  userspace calls GET_ATTR and finds out that there are 256 available
  windows, userspace determines that it needs 8 for RAM and then it has an
  MSI device, so it needs to call SET_ATTR and ask for 16?  That seems
  prone to exploitation by the first userspace to allocate its aperture,
 
 What exploitation?
 
 It's not as if there is a pool of 256 global windows that users  
 allocate from.  The subwindow count is just how finely divided the  
 aperture is.  The only way one user will affect another is through  
 cache contention (which is why we want the minimum number of subwindows  
 that we can get away with).
 
  but I'm also not sure why userspace couldn't specify the (non-power of 2)
  number of windows it needs for RAM, then VFIO would see that the devices
  attached have MSI and add those windows and align to a power of 2.
 
 If you double the subwindow count without userspace knowing, you have  
 to double the aperture as well (and you may need to grow up or down  
 depending on alignment).  This means you also need to halve the maximum  
 aperture that userspace can request.  And you need to expose a  
 different number of maximum subwindows in the IOMMU API based on  
 whether we might have MSIs of this type.  It's ugly and awkward, and  
 removes the possibility for userspace to place the MSIs in some unused  
 slot in the middle, or not use MSIs at all.

Ok, I missed this in Stuart's example:

Total aperture: 512MB
# of windows: 8

win  gphys/
#    iova        phys          size
---  ----------  ------------  ----
0    0x00000000  0xX_XX00      64MB
1    0x04000000  0xX_XX00      64MB
2    0x08000000  0xX_XX00      64MB
3    0x0C000000  0xX_XX00      64MB
4    0x10000000  0xf_fe044000  4KB    // msi bank 1
                               ^^^
5    0x14000000  0xf_fe045000  4KB    // msi bank 2
                               ^^^
6    0x18000000  0xf_fe046000  4KB    // msi bank 3
                               ^^^
7    -           -             disabled

So even though the MSI banks are 4k in this example, they're still on
64MB boundaries.  If userspace were to leave this as 256 windows, each
would be 2MB and we'd use 128 of them to map the same memory as these
4x64MB windows and thrash the iotlb harder.  The picture is becoming
clearer.  Thanks,

Alex
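
The arithmetic behind that observation can be written down as a small
sketch. It only assumes that an aperture is divided into equally sized
subwindows, as described in this thread; the helper names are invented
for illustration and do not correspond to any PAMU or VFIO interface.

```c
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Size of each subwindow when an aperture is split into `count` equal parts. */
static uint64_t subwindow_size(uint64_t aperture, unsigned int count)
{
	return aperture / count;
}

/* iova at which subwindow `index` starts, for an aperture based at `base`. */
static uint64_t subwindow_iova(uint64_t base, uint64_t win_size,
			       unsigned int index)
{
	return base + (uint64_t)index * win_size;
}

/* Number of subwindows needed to cover `bytes`, rounding up. */
static uint64_t subwindows_needed(uint64_t bytes, uint64_t win_size)
{
	return (bytes + win_size - 1) / win_size;
}
```

With a 512MB aperture, 8 subwindows are 64MB each and 256 subwindows are
2MB each, so the same 4x64MB of RAM takes 4 entries in the first case
and 128 in the second.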



Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 17:44 -0500, Scott Wood wrote:
 On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
   On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
On x86 the interrupt remapper handles this transparently when MSI
is enabled and userspace never gets direct access to the device  
  MSI
address/data registers.
  
   x86 has a totally different mechanism here, as far as I understand --
   even before you get into restrictions on mappings.
  
  So what control will userspace have over programming the actual MSI
  vectors on PAMU?
 
 Not sure what you mean -- PAMU doesn't get explicitly involved in  
 MSIs.  It's just another 4K page mapping (per relevant MSI bank).  If  
 you want isolation, you need to make sure that an MSI group is only  
 used by one VFIO group, and that you're on a chip that has alias pages  
 with just one MSI bank register each (newer chips do, but the first  
 chip to have a PAMU didn't).

How does a user figure this out?

This could also be done as another type2 ioctl extension.
  
   Again, what is type2, specifically?  If someone else is adding their
   own IOMMU that is kind of, sort of like PAMU, how would they know if
   it's close enough?  What assumptions can a user make when they see that
   they're dealing with type2?
  
  Naming always has and always will be a problem.  I assume this is named
  type2 rather than PAMU because it's trying to expose a generic windowed
  IOMMU fitting the IOMMU API.
 
 But how closely is the MSI situation related to a generic windowed  
 IOMMU, then?  We could just as well have a highly flexible IOMMU in  
 terms of arbitrary 4K page mappings, but still handle MSIs as pages to  
 be mapped rather than a translation table.  Or we could have a windowed  
 IOMMU that has an MSI translation table.
 
  Like type1, it doesn't really make sense
  to name it IOMMU API because that's a kernel internal interface and
  we're designing a userspace interface that just happens to use that.
  Tagging it to a piece of hardware makes it less reusable.
 
 Well, that's my point.  Is it reusable at all, anyway?  If not, then  
 giving it a more obscure name won't change that.  If it is reusable,  
 then where is the line drawn between things that are PAMU-specific or  
 MPIC-specific and things that are part of the generic windowed IOMMU  
 abstraction?
 
  Type1 is arbitrary.  It might as well be named brown and this one can
  be blue.
 
 The difference is that type1 seems to refer to hardware that can do  
 arbitrary 4K page mappings, possibly constrained by an aperture but  
 nothing else.  More than one IOMMU can reasonably fit that.  The odds  
 that another IOMMU would have exactly the same restrictions as PAMU  
 seem smaller in comparison.
 
 In any case, if you had to deal with some Intel-only quirk, would it  
 make sense to call it a type1 attribute?  I'm not advocating one way  
 or the other on whether an abstraction is viable here (though Stuart  
 seems to think it's highly unlikely anything but a PAMU will comply),  
 just that if it is to be abstracted rather than a hardware-specific  
 interface, we need to document what is and is not part of the  
 abstraction.  Otherwise a non-PAMU-specific user won't know what they  
 can rely on, and someone adding support for a new windowed IOMMU won't  
 know if theirs is close enough, or they need to introduce a type3.

So Alexey named the SPAPR IOMMU something related to spapr...
surprisingly enough.  I'm fine with that.  If you think it's unique
enough, name it something appropriate.  I haven't seen the code and
don't know the architecture sufficiently to have an opinion.

What's the value to userspace in determining which windows are used
by which banks?
  
   That depends on who programs the MSI config space address.  What is
   important is userspace controlling which iovas will be dedicated to
   this, in case it wants to put something else there.
  
  So userspace is programming the MSI vectors, targeting a user programmed
  iova?  But an iova selects a window and I thought there were some number
  of MSI banks and we don't really know which ones we'll need...  still
  confused.
 
 Userspace would also need a way to find out the page offset and data  
 value.  That may be an argument in favor of having the two ioctls  
 Stuart later suggested (get MSI count, and map MSI).

Connecting the user set iova and host kernel assigned irq number is
where I'm still lost, but I'll follow-up with that question in the other
thread.

 Would there be  
 any complication in the VFIO code from tracking a mapping that doesn't  
 have a userspace virtual address associated with it?

Only the VFIO iommu driver tracks mappings, the QEMU userspace component
doesn't (relies on the memory API for type1), nor does any of the
kernel framework code.

   There's going to be special stuff no matter 

Re: RFC: vfio API changes needed for powerpc

2013-04-02 Thread Alex Williamson
On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote:
 On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
  On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
   On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood  
  scottw...@freescale.com wrote:
C.  Explicit mapping using normal DMA map.  The last idea is that
    we would introduce a new ioctl to give user-space an fd to
    the MSI bank, which could be mmapped.  The flow would be
    something like this:
       -for each group user space calls new ioctl
        VFIO_GROUP_GET_MSI_FD
       -user space mmaps the fd, getting a vaddr
       -user space does a normal DMA map for desired iova
    This approach makes everything explicit, but adds a new ioctl
    applicable most likely only to the PAMU (type2 iommu).
   
And the DMA_MAP of that mmap then allows userspace to select the window
used?  This one seems like a lot of overhead, adding a new ioctl, new
fd, mmap, special mapping path, etc.
   
   
There's going to be special stuff no matter what.  This would keep it
separated from the IOMMU map code.

I'm not sure what you mean by overhead here... the runtime overhead of
setting things up is not particularly relevant as long as it's reasonable.
If you mean development and maintenance effort, keeping things well
separated should help.
  
   We don't need to change DMA_MAP.  If we can simply add a new type 2
   ioctl that allows user space to set which windows are MSIs, it seems vastly
   less complex than an ioctl to supply a new fd, mmap of it, etc.
  
   So maybe 2 ioctls:
   VFIO_IOMMU_GET_MSI_COUNT
 
 Do you mean a count of actual MSIs or a count of MSI banks used by the  
 whole VFIO group?

I hope the latter, which would clarify how this is distinct from
DEVICE_GET_IRQ_INFO.  Is hotplug even on the table?  Presumably
dynamically adding a device could bring along additional MSI banks?

   VFIO_IOMMU_MAP_MSI(iova, size)
 
 Not sure how you mean size to be used -- for MPIC it would be 4K per  
 bank, and you can only map one bank at a time (which bank you're  
 mapping should be a parameter, if only so that the kernel doesn't have  
 to keep iteration state for you).
 
  How are MSIs related to devices on PAMU?
 
 PAMU doesn't care about MSIs.  The relation of individual MSIs to a  
 device is standard PCI stuff.  Each MSI bank (which is part of the  
 MPIC, not PAMU) can hold numerous MSIs.  The VFIO user would want to  
 map all MSI banks that are in use by any of the devices in the group.   
 Ideally we'd let the VFIO grouping influence the allocation of MSIs.

The current VFIO MSI support has the host handling everything about MSI.
The user never programs an MSI vector to the physical device, they set
up everything through ioctl.  On interrupt, we simply trigger an eventfd
and leave it to things like KVM irqfd or QEMU to do the right thing in a
virtual machine.

Here the MSI vector has to go through a PAMU window to hit the correct
MSI bank.  So that means it has some component of the iova involved,
which we're proposing here is controlled by userspace (whether that
vector uses an offset from 0x1000 or 0x depending on which
window slot is used to make the MSI bank).  I assume we're still working
in a model where the physical interrupt fires into the host and a
host-based interrupt handler triggers an eventfd, right?  So that means
the vector also has host components so we trigger the correct ISR.  How
is that coordinated?

Would it be possible for userspace to simply leave room for MSI bank
mapping (how much room could be determined by something like
VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
DMA_MAP starting at the 0x0 address of the aperture, growing up, and
VFIO will map banks on demand at the top of the aperture, growing down?
Wouldn't that avoid a lot of issues with userspace needing to know
anything about MSI banks (other than count) and coordinating irq numbers
and enabling handlers?

  On x86 MSI count is very
  device specific, which means it would be a VFIO_DEVICE_* ioctl (actually
  VFIO_DEVICE_GET_IRQ_INFO does this for us on x86).  The trouble with it
  being a device ioctl is that you need to get the device FD, but the
  IOMMU protection needs to be established before you can get that... so
  there's an ordering problem if you need it from the device before
  configuring the IOMMU.  Thanks,
 
 What do you mean by IOMMU protection needs to be established?   
 Wouldn't we just start with no mappings in place?

If no mappings blocks all DMA, sure, that's fine.  Once the VFIO device
FD is accessible by userspace we have to protect the host against DMA.
If any IOMMU_SET_ATTR calls temporarily disable DMA protection, that
could be exploitable.  Thanks,

Alex


Re: [PATCH v6 6/6] KVM: Use eoi to track RTC interrupt delivery status

2013-04-02 Thread Gleb Natapov
On Wed, Apr 03, 2013 at 12:21:05AM +, Zhang, Yang Z wrote:
 Gleb Natapov wrote on 2013-04-02:
  On Fri, Mar 29, 2013 at 03:25:16AM +, Zhang, Yang Z wrote:
  Paolo Bonzini wrote on 2013-03-26:
  Il 22/03/2013 06:24, Yang Zhang ha scritto:
  +static void rtc_irq_ack_eoi(struct kvm_vcpu *vcpu,
  +			struct rtc_status *rtc_status, int irq)
  +{
  +	if (irq != RTC_GSI)
  +		return;
  +
  +	if (test_and_clear_bit(vcpu->vcpu_id, rtc_status->dest_map))
  +		--rtc_status->pending_eoi;
  +
  +	WARN_ON(rtc_status->pending_eoi < 0);
  +}
  
  This is the only case where you're passing the struct rtc_status instead
  of the struct kvm_ioapic.  Please use the latter, and make it the first
  argument.
  
  @@ -244,7 +268,14 @@ static int ioapic_deliver(struct kvm_ioapic *ioapic, int irq)
   	irqe.level = 1;
   	irqe.shorthand = 0;
  -	return kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe, NULL);
  +	if (irq == RTC_GSI) {
  +		ret = kvm_irq_delivery_to_apic(ioapic->kvm, NULL, &irqe,
  +				ioapic->rtc_status.dest_map);
  +		ioapic->rtc_status.pending_eoi = ret;
  
  I think you should either add a
  
  BUG_ON(ioapic->rtc_status.pending_eoi != 0);
  or use ioapic->rtc_status.pending_eoi += ret (or both).
  
  A malicious guest may write EOI more than once, and then pending_eoi
  will be negative. But that should not be treated as a bug; a WARN_ON is
  enough, and we already do that in ack_eoi, so there is no need to
  duplicate it here.
  
  Since we track vcpus that already called EOI and decrement pending_eoi
  only once for each vcpu malicious guest cannot trigger it, but we
  already do WARN_ON() in rtc_irq_ack_eoi(), so I am not sure we need
  another one here. += will be correct (since pending_eoi == 0 here), but
  confusing since it makes an impression that pending_eoi may not be zero.
 Yes, I also got the wrong impression.
 With the previous implementation, pending_eoi may not be zero: we calculated
 the destination vcpus by parsing the IOAPIC entry and, when using lowest
 priority delivery mode, set all possible vcpus in dest_map even if they did
 not finally receive the interrupt. At the same time, a malicious guest can
 send an IPI with the same vector as the RTC to those vcpus that are in
 dest_map but did not get the RTC interrupt. Then pending_eoi will be negative.
 Now we set dest_map with the vcpus that really received the interrupt, so
 the above case cannot happen. So as you and Paolo suggested, it is better to
 use +=.
 
I am not suggesting that it is better to use +=. We can add
BUG_ON(ioapic->rtc_status.pending_eoi != 0); but no need to resend
patches just for that.
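
To make the invariant concrete, here is a small userspace model of the
bookkeeping being discussed. It is a simplified sketch, not the kernel
code: dest_map is reduced to a single 64-bit word, the struct and
function names are invented, and the delivered set is passed in directly
rather than coming from kvm_irq_delivery_to_apic().

```c
#include <stdint.h>

#define MODEL_MAX_VCPUS 64	/* hypothetical limit so one word is enough */

/* Simplified stand-in for the rtc_status fields discussed in the thread. */
struct rtc_status_model {
	int pending_eoi;	/* vcpus that still owe an EOI */
	uint64_t dest_map;	/* bit n set: vcpu n received the interrupt */
};

/* Count the set bits in v. */
static int model_count_bits(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Record delivery: `delivered` has one bit per vcpu that got the interrupt. */
static void rtc_model_deliver(struct rtc_status_model *s, uint64_t delivered)
{
	s->dest_map = delivered;
	s->pending_eoi = model_count_bits(delivered);
}

/* Handle an EOI from `vcpu_id`; only the first EOI per vcpu is counted. */
static void rtc_model_ack_eoi(struct rtc_status_model *s, unsigned int vcpu_id)
{
	uint64_t bit;

	if (vcpu_id >= MODEL_MAX_VCPUS)
		return;
	bit = 1ULL << vcpu_id;
	if (s->dest_map & bit) {
		s->dest_map &= ~bit;	/* test-and-clear */
		--s->pending_eoi;
	}
}
```

Because each vcpu's bit is cleared on its first EOI, repeated EOIs or
EOIs from vcpus outside dest_map leave pending_eoi untouched, so in
this model it cannot go below zero.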

--
Gleb.


Re: [PATCH V3 WIP 3/3] disable vhost_verify_ring_mappings check

2013-04-02 Thread Nicholas A. Bellinger
On Tue, 2013-04-02 at 16:27 +0300, Michael S. Tsirkin wrote:
 On Mon, Apr 01, 2013 at 06:05:47PM -0700, Nicholas A. Bellinger wrote:
  On Fri, 2013-03-29 at 09:14 +0100, Paolo Bonzini wrote: 
   Il 29/03/2013 03:53, Nicholas A. Bellinger ha scritto:
On Thu, 2013-03-28 at 06:13 -0400, Paolo Bonzini wrote:
I think it's the right thing to do, but maybe not the right place
to do this, need to reset after all IO is done, before
ring memory is write protected.
   
Our emails are crossing each other unfortunately, but I want to
reinforce this: ring memory is not write protected.

Understood.  However, AFAICT the act of write protecting these ranges
for ROM generates the offending callbacks to vhost_set_memory().

The part that I'm missing is if ring memory is not being write protected
by make_bios_readonly_intel(), why are the vhost_set_memory() calls
being invoked..?
   
   Because mappings change for the region that contains the ring.  vhost
   doesn't know yet that the changes do not affect ring memory,
   vhost_set_memory() is called exactly to ascertain that.
   
  
  Hi Paolo & Co,
  
  Here's a bit more information on what is going on with the same
  cpu_physical_memory_map() failure in vhost_verify_ring_mappings()..
  
  So as before, at the point that seabios is marking memory as readonly
  for ROM in src/shadow.c:make_bios_readonly_intel() with the following
  call:
  
  Calling pci_config_writeb(0x31): bdf: 0x pam: 0x005b
  
  the memory API update hook triggers back into vhost_region_del() code,
  and the following occurs:
  
  Entering vhost_region_del section: 0x7fd30a213b60 offset_within_region: 
  0xc size: 2146697216 readonly: 0
  vhost_region_del: is_rom: 0, rom_device: 0
  vhost_region_del: readable: 1
  vhost_region_del: ram_addr 0x0, addr: 0x0 size: 2147483648
  vhost_region_del: name: pc.ram
  Entering vhost_set_memory, section: 0x7fd30a213b60 add: 0, dev->started: 1
  Entering verify_ring_mappings: start_addr 0x000c size: 
  2146697216
  verify_ring_mappings: ring_phys 0x0 ring_size: 0
  verify_ring_mappings: ring_phys 0x0 ring_size: 0
  verify_ring_mappings: ring_phys 0xed000 ring_size: 5124
  verify_ring_mappings: calling cpu_physical_memory_map ring_phys: 0xed000 l: 
  5124
  address_space_map: addr: 0xed000, plen: 5124
  address_space_map: l: 4096, len: 5124
  phys_page_find got PHYS_MAP_NODE_NIL ..
  address_space_map: section: 0x7fd30fabaed0 memory_region_is_ram: 0 
  readonly: 0
  address_space_map: section: 0x7fd30fabaed0 offset_within_region: 0x0 
  section size: 18446744073709551615
  Unable to map ring buffer for ring 2, l: 4096
  
  So the interesting part is that phys_page_find() is not able to locate
  the corresponding page for vq->ring_phys: 0xed000 from the
  vhost_region_del() callback with section->offset_within_region:
  0xc..
  
  Is there any case where this would not be considered a bug..? 
  
  register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0
  register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0
  register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0
  Entering vhost_region_add section: 0x7fd30a213aa0 offset_within_region: 
  0xc size: 32768 readonly: 1
  vhost_region_add: is_rom: 0, rom_device: 0
  vhost_region_add: readable: 1
  vhost_region_add: ram_addr 0x, addr: 0x   0 
  size: 2147483648
  vhost_region_add: name: pc.ram
  Entering vhost_set_memory, section: 0x7fd30a213aa0 add: 1, dev->started: 1
  Entering verify_ring_mappings: start_addr 0x000c size: 32768
  verify_ring_mappings: ring_phys 0x0 ring_size: 0
  verify_ring_mappings: ring_phys 0x0 ring_size: 0
  verify_ring_mappings: ring_phys 0xed000 ring_size: 5124
  verify_ring_mappings: Got !ranges_overlap, skipping
  register_multipage : d: 0x7fd30f7d0ed0 section: 0x7fd30a2139b0
  Entering vhost_region_add section: 0x7fd30a213aa0 offset_within_region: 
  0xc8000 size: 2146664448 readonly: 0
  vhost_region_add: is_rom: 0, rom_device: 0
  vhost_region_add: readable: 1
  vhost_region_add: ram_addr 0x, addr: 0x   0 
  size: 2147483648
  vhost_region_add: name: pc.ram
  Entering vhost_set_memory, section: 0x7fd30a213aa0 add: 1, dev->started: 1
  Entering verify_ring_mappings: start_addr 0x000c8000 size: 
  2146664448
  verify_ring_mappings: ring_phys 0x0 ring_size: 0
  verify_ring_mappings: ring_phys 0x0 ring_size: 0
  verify_ring_mappings: ring_phys 0xed000 ring_size: 5124
  verify_ring_mappings: calling cpu_physical_memory_map ring_phys: 0xed000 l: 
  5124
  address_space_map: addr: 0xed000, plen: 5124
  address_space_map: l: 4096, len: 5124
  address_space_map: section: 0x7fd30fabb020 memory_region_is_ram: 1 
  readonly: 0
  address_space_map: section: 0x7fd30fabb020 offset_within_region: 0xc8000 
  section size: 2146664448
  address_space_map: l: 4096, len: 1028
  address_space_map: section: 0x7fd30fabb020 memory_region_is_ram: 

RE: [PATCH 3/3] ARM: EXYNOS5250: Register architected timers

2013-04-02 Thread Kukjin Kim
Alexander Graf wrote:
 
 On 04/02/2013 12:44 PM, Kukjin Kim wrote:
  Alexander Graf wrote:
  When running on an exynos 5250 SoC, we don't initialize the architected
  timers. The chip however supports architected timers.
 
  Yes, exynos5250 supports them, though the MCT (multi core timer) is used.
 
  When we don't initialize them, KVM will try to access them and run into
  NULL pointer dereferences attempting to do so.
 
  Yes, right.
 
  This patch is really more of a hack than a real fix, but does get me
  working with KVM on Arndale.
 
  Hmm, if you think this is _really_ a hack, you need to add some comments
  about that for clarity, and since the mct.c file has been moved into
  drivers/clocksource/, this should be re-worked.
 
  BTW, I discussed this with Thomas and Giridhar just now, and we concluded
  that this 3rd patch could be dropped, because the correct way is to add a
  dts node for the arch timer, which the 2nd patch already does; that works
  after 3.9-rc1 because of the CLOCKSOURCE_OF_DECLARE macro.
 
  So if you're OK with the above, let me know so that I can take only the 1st
  and 2nd patches to support KVM on exynos5250.
 
 I'd say go ahead and take them and I'll verify whether things work on
 your tree :).
 
OK, I will.

 What's the git repo of your branch?
 
You can test with my for-next branch, but this series will only be visible
in my public tree tomorrow night (KST).

If there are any problems, please let me know.

Thanks.

- Kukjin



Re: [PATCH V3 WIP 3/3] disable vhost_verify_ring_mappings check

2013-04-02 Thread Nicholas A. Bellinger
On Tue, 2013-04-02 at 21:04 -0700, Nicholas A. Bellinger wrote:
 On Tue, 2013-04-02 at 16:27 +0300, Michael S. Tsirkin wrote:
  On Mon, Apr 01, 2013 at 06:05:47PM -0700, Nicholas A. Bellinger wrote:
   On Fri, 2013-03-29 at 09:14 +0100, Paolo Bonzini wrote: 
Il 29/03/2013 03:53, Nicholas A. Bellinger ha scritto:
 On Thu, 2013-03-28 at 06:13 -0400, Paolo Bonzini wrote:
 I think it's the right thing to do, but maybe not the right place
 to do this, need to reset after all IO is done, before
 ring memory is write protected.

 Our emails are crossing each other unfortunately, but I want to
 reinforce this: ring memory is not write protected.
 
 Understood.  However, AFAICT the act of write protecting these ranges
 for ROM generates the offending callbacks to vhost_set_memory().
 
 The part that I'm missing is if ring memory is not being write 
 protected
 by make_bios_readonly_intel(), why are the vhost_set_memory() calls
 being invoked..?

Because mappings change for the region that contains the ring.  vhost
doesn't know yet that the changes do not affect ring memory,
vhost_set_memory() is called exactly to ascertain that.


SNIP

  
  Is it possible that what is going on here,
  is that we had a region at address 0x0 size 0x8000,
  and now a chunk from it is being made readonly,
  and to this end the whole old region is removed
  then new ones are added?
 
 Yes, I believe this is exactly what is happening..
 
  
  If yes maybe the problem is that we don't use the atomic
  begin/commit ops in the memory API.
  Maybe the following will help?
  Completely untested, posting just to give you the idea:
  
 
 Mmmm, one question on how vhost_region_del() + vhost_region_add() +
 vhost_commit() should work..
 
 Considering the following when the same seabios code snippet:
 
pci_config_writeb(0x31): bdf: 0x pam: 0x005b
 
 is executed to mark a pc.ram area 0xc as readonly:
 
 Entering vhost_begin 
 Entering vhost_region_del section: 0x7fd037a4bb60 offset_within_region: 
 0xc size: 2146697216 readonly: 0
 vhost_region_del: is_rom: 0, rom_device: 0
 vhost_region_del: readable: 1
 vhost_region_del: ram_addr 0x0, addr: 0x0 size: 2147483648
 vhost_region_del: name: pc.ram
 Entering vhost_set_memory, section: 0x7fd037a4bb60 add: 0, dev->started: 1
 vhost_set_memory: Setting dev->memory_changed = true for start_addr: 0xc
 Entering vhost_region_add section: 0x7fd037a4baa0 offset_within_region: 
 0xc size: 32768 readonly: 1
 vhost_region_add is readonly !!!
 vhost_region_add: is_rom: 0, rom_device: 0
 vhost_region_add: readable: 1
 vhost_region_add: ram_addr 0x, addr: 0x   0 size: 
 2147483648
 vhost_region_add: name: pc.ram
 Entering vhost_set_memory, section: 0x7fd037a4baa0 add: 1, dev->started: 1
 vhost_dev_assign_memory();  reg->guest_phys_addr: 
 0xc
 vhost_set_memory: Setting dev->memory_changed = true for start_addr: 0xc
 Entering vhost_region_add section: 0x7fd037a4baa0 offset_within_region: 
 0xc8000 size: 2146664448 readonly: 0
 vhost_region_add: is_rom: 0, rom_device: 0
 vhost_region_add: readable: 1
 vhost_region_add: ram_addr 0x, addr: 0x   0 size: 
 2147483648
 vhost_region_add: name: pc.ram
 Entering vhost_set_memory, section: 0x7fd037a4baa0 add: 1, dev->started: 1
 vhost_set_memory: Setting dev->memory_changed = true for start_addr: 0xc8000
 phys_page_find got PHYS_MAP_NODE_NIL ..
 Entering vhost_commit 
 
 Note that originally we'd see the cpu_physical_memory_map() failure in
 vhost_verify_ring_mappings() after the first ->region_del() above.

 Adding a hardcoded cpu_physical_memory_map() testcase in vhost_commit()
 for phys_addr=0xed000, len=5124 (vq ring) does locate the correct
 *section from address_space_map(), which correctly points to the section
 generated by the last vhost_region_add() above:
 
 Entering vhost_commit 
 address_space_map: addr: 0xed000, plen: 5124
 address_space_map: l: 4096, len: 5124
 address_space_map: section: 0x7f41b325f020 memory_region_is_ram: 1 readonly: 0
 address_space_map: section: 0x7f41b325f020 offset_within_region: 0xc8000 
 section size: 2146664448
 address_space_map: l: 4096, len: 1028
 address_space_map: section: 0x7f41b325f020 memory_region_is_ram: 1 readonly: 0
 address_space_map: section: 0x7f41b325f020 offset_within_region: 0xc8000 
 section size: 2146664448
 address_space_map: Calling qemu_ram_ptr_length: raddr: 0x   ed000 
 rlen: 5124
 address_space_map: After qemu_ram_ptr_length: raddr: 0x   ed000 rlen: 
 5124
 cpu_physical_memory_map(0xed000) got l: 5124
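
The pattern those logs point at, recording changes in the per-region
callbacks and checking the rings once in the commit callback, can be
modelled in a few lines. This is only a sketch of the control flow with
invented names (struct vhost_model, model_begin() and so on); it does
not use QEMU's MemoryListener types or touch any real memory map.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the vhost device state discussed above. */
struct vhost_model {
	bool in_transaction;	/* between begin() and commit() */
	bool memory_changed;	/* some region was added or removed */
	unsigned int verify_calls;
};

static void model_begin(struct vhost_model *d)
{
	d->in_transaction = true;
	d->memory_changed = false;
}

/* Called for every region_del/region_add; only records the change. */
static void model_set_memory(struct vhost_model *d)
{
	d->memory_changed = true;
}

/* Placeholder for the ring check; the real code would try to map each ring. */
static void model_verify_ring_mappings(struct vhost_model *d)
{
	d->verify_calls++;
}

/* The memory map is consistent again at commit time, so verify once here. */
static void model_commit(struct vhost_model *d)
{
	if (d->memory_changed)
		model_verify_ring_mappings(d);
	d->memory_changed = false;
	d->in_transaction = false;
}
```

In this model the del + add + add sequence from the log triggers a single
verification, after the last region has been added, instead of one
attempt in the middle of the update when the ring's page is not mapped.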
 
 So, does using a ->commit callback for MemoryListener mean that
 vhost_verify_ring_mappings() is OK to be called only from the final
 ->commit callback, and not from each ->region_del + ->region_add
 callback..?   Eg: I seem to recall something about
 vhost_verify_ring_mappings() being called during each 

Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread tiejun.chen

On 04/02/2013 06:47 AM, Scott Wood wrote:

Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, without the need for
separately managing enumerated capabilities.
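
A minimal userspace sketch of that flow follows. It assumes a
<linux/kvm.h> that provides KVM_CREATE_DEVICE, KVM_CREATE_DEVICE_TEST,
KVM_HAS_DEVICE_ATTR and the two structs documented in this patch; the
three helper functions are illustrative names, not part of the API, and
the details of this RFC revision may differ from what such a header
defines.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Ask whether a device type is supported without instantiating it.
 * Returns 1 if supported, 0 if not (or on any ioctl error).
 */
static int kvm_device_type_supported(int vm_fd, uint32_t type)
{
	struct kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;
	cd.flags = KVM_CREATE_DEVICE_TEST;

	return ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) == 0;
}

/*
 * Create the device and return its file descriptor, or -1 on error
 * with errno set by the ioctl.
 */
static int kvm_device_create(int vm_fd, uint32_t type)
{
	struct kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;

	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;
	return (int)cd.fd;
}

/* Test for an attribute on a device fd without reading or writing it. */
static int kvm_device_has_attr(int dev_fd, uint32_t group, uint64_t attr)
{
	struct kvm_device_attr da;

	memset(&da, 0, sizeof(da));
	da.group = group;
	da.attr = attr;

	return ioctl(dev_fd, KVM_HAS_DEVICE_ATTR, &da) == 0;
}
```

The point of the test paths is that userspace can probe for a device
type or attribute up front, without a separate capability number for
each one.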

Signed-off-by: Scott Wood scottw...@freescale.com
---
  Documentation/virtual/kvm/api.txt|   70 ++
  Documentation/virtual/kvm/devices/README |1 +
  arch/powerpc/include/asm/kvm_host.h  |6 +++
  arch/powerpc/include/asm/kvm_ppc.h   |2 +
  arch/powerpc/kvm/powerpc.c   |7 +++
  include/uapi/linux/kvm.h |   27 
  virt/kvm/kvm_main.c  |   31 +
  7 files changed, 144 insertions(+)
  create mode 100644 Documentation/virtual/kvm/devices/README

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 976eb65..77328aa 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
  written, then `n_invalid' invalid entries, invalidating any previously
  valid entries found.

+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+  be instantiated multiple times
+  ENOSPC: Too many devices have been created
+
+  Other error conditions may be defined by individual device types.
+
+Creates an emulated device in the kernel.  The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+   __u32   type;   /* in: KVM_DEV_TYPE_xxx */
+   __u32   fd; /* out: device handle */
+   __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+  (e.g. read-only attribute, or attribute that only makes
+  sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific.  See individual device documentation in
+the devices directory.  As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+struct kvm_device_attr {
+   __u32   flags;  /* no flags currently defined */
+   __u32   group;  /* device-defined */
+   __u64   attr;   /* group-defined */
+   __u64   addr;   /* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  addr is ignored.

  4.77 KVM_ARM_VCPU_INIT

diff --git a/Documentation/virtual/kvm/devices/README b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 000..34a6983
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/README
@@ -0,0 +1 @@
+This directory contains specific device bindings for KVM_CAP_DEVICE_CTRL.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..e0caae2 100644
--- 

Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support

2013-04-02 Thread Alexander Graf

On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote:

 
 
 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Thursday, March 28, 2013 10:06 PM
 To: Bhushan Bharat-R65777
 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421; Bhushan
 Bharat-R65777
 Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
 
 
 On 21.03.2013, at 07:25, Bharat Bhushan wrote:
 
 From: Bharat Bhushan bharat.bhus...@freescale.com
 
 This patch adds debug stub support on booke/bookehv.
 The QEMU debug stub can now use hardware breakpoints, watchpoints and
 software breakpoints to debug the guest.
 
 Debug registers are saved/restored on vcpu_put()/vcpu_get(), and only
 if the guest is using debug resources.
 
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 ---
 v2:
 - save/restore in vcpu_get()/vcpu_put()
 - some more minor cleanup based on review comments.
 
 arch/powerpc/include/asm/kvm_host.h |   10 ++
 arch/powerpc/include/uapi/asm/kvm.h |   22 +++-
 arch/powerpc/kvm/booke.c            |  252 ---
 arch/powerpc/kvm/e500_emulate.c     |   10 ++
 4 files changed, 272 insertions(+), 22 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
 index f4ba881..8571952 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -504,7 +504,17 @@ struct kvm_vcpu_arch {
 u32 mmucfg;
 u32 epr;
 u32 crit_save;
 +   /* guest debug registers*/
 struct kvmppc_booke_debug_reg dbg_reg;
 +   /* shadow debug registers */
 +   struct kvmppc_booke_debug_reg shadow_dbg_reg;
 +   /* host debug registers*/
 +   struct kvmppc_booke_debug_reg host_dbg_reg;
 +   /*
 +* Flag indicating that debug registers are used by guest
 +* and requires save restore.
 +   */
 +   bool debug_save_restore;
 #endif
 gpa_t paddr_accessed;
 gva_t vaddr_accessed;
 diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
 index 15f9a00..d7ce449 100644
 --- a/arch/powerpc/include/uapi/asm/kvm.h
 +++ b/arch/powerpc/include/uapi/asm/kvm.h
 @@ -25,6 +25,7 @@
 /* Select powerpc specific features in <linux/kvm.h> */
 #define __KVM_HAVE_SPAPR_TCE
 #define __KVM_HAVE_PPC_SMT
 +#define __KVM_HAVE_GUEST_DEBUG
 
 struct kvm_regs {
 __u64 pc;
 @@ -267,7 +268,24 @@ struct kvm_fpu {
 __u64 fpr[32];
 };
 
 +/*
 + * Defines for h/w breakpoint, watchpoint (read, write or both) and
 + * software breakpoint.
 + * These are used as type in KVM_SET_GUEST_DEBUG ioctl and status
 + * for KVM_DEBUG_EXIT.
 + */
 +#define KVMPPC_DEBUG_NONE		0x0
 +#define KVMPPC_DEBUG_BREAKPOINT	(1UL << 1)
 +#define KVMPPC_DEBUG_WATCH_WRITE	(1UL << 2)
 +#define KVMPPC_DEBUG_WATCH_READ	(1UL << 3)
 struct kvm_debug_exit_arch {
 +   __u64 address;
 +   /*
 +* exiting to userspace because of h/w breakpoint, watchpoint
 +* (read, write or both) and software breakpoint.
 +*/
 +   __u32 status;
 +   __u32 reserved;
 };
 
 /* for KVM_SET_GUEST_DEBUG */
 @@ -279,10 +297,6 @@ struct kvm_guest_debug_arch {
  * Type denotes h/w breakpoint, read watchpoint, write
  * watchpoint or watchpoint (both read and write).
  */
 -#define KVMPPC_DEBUG_NOTYPE		0x0
 -#define KVMPPC_DEBUG_BREAKPOINT	(1UL << 1)
 -#define KVMPPC_DEBUG_WATCH_WRITE	(1UL << 2)
 -#define KVMPPC_DEBUG_WATCH_READ	(1UL << 3)
 __u32 type;
 __u32 reserved;
 } bp[16];
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index 1de93a8..bf20056 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -133,6 +133,30 @@ static void kvmppc_vcpu_sync_fpu(struct kvm_vcpu *vcpu)
 #endif
 }
 
 +static void kvmppc_vcpu_sync_debug(struct kvm_vcpu *vcpu)
 +{
 +	/* Synchronize guest's desire to get debug interrupts into shadow MSR */
 +#ifndef CONFIG_KVM_BOOKE_HV
 +	vcpu->arch.shadow_msr &= ~MSR_DE;
 +	vcpu->arch.shadow_msr |= vcpu->arch.shared->msr & MSR_DE;
 +#endif
 +
 +	/* Force enable debug interrupts when user space wants to debug */
 +	if (vcpu->guest_debug) {
 +#ifdef CONFIG_KVM_BOOKE_HV
 +		/*
 +		 * Since there is no shadow MSR, sync MSR_DE into the guest
 +		 * visible MSR. Do not allow guest to change MSR[DE].
 +		 */
 +		vcpu->arch.shared->msr |= MSR_DE;
 +		mtspr(SPRN_MSRP, mfspr(SPRN_MSRP) | MSRP_DEP);
 
 This mtspr should really just be a bit OR into shadow_msrp when guest_debug
 gets enabled. It should automatically get synchronized as soon as the next
 vcpu_load() happens.
 
 I think this is not required here, as shadow_dbsr already has MSRP_DEP set.
 
 Will set up shadow_msrp when setting guest_debug and clear shadow_msrp when
 guest_debug is cleared.
 But that will also not be sufficient, as it is not certain when vcpu_load()
 will be called after shadow_msrp is changed. So 

RE: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support

2013-04-02 Thread Bhushan Bharat-R65777



Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support

2013-04-02 Thread Alexander Graf

On 04/02/2013 04:09 PM, Bhushan Bharat-R65777 wrote:




Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support

2013-04-02 Thread Scott Wood

On 04/02/2013 09:09:34 AM, Bhushan Bharat-R65777 wrote:



> > -Original Message-
> > From: Alexander Graf [mailto:ag...@suse.de]
> > Sent: Tuesday, April 02, 2013 1:57 PM
> > To: Bhushan Bharat-R65777
> > Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421
> > Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
> >
> > On 29.03.2013, at 07:04, Bhushan Bharat-R65777 wrote:
> >
> > > > -Original Message-
> > > > From: Alexander Graf [mailto:ag...@suse.de]
> > > > Sent: Thursday, March 28, 2013 10:06 PM
> > > > To: Bhushan Bharat-R65777
> > > > Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; Wood Scott-B07421;
> > > > Bhushan Bharat-R65777
> > > > Subject: Re: [PATCH 4/4 v2] KVM: PPC: Add userspace debug stub support
> > > >
> > > > How does the normal debug register switching code work in Linux?
> > > > Can't we just reuse that? Or rely on it to restore working state when
> > > > another process gets scheduled in?
> > >
> > > Good point, I can see debug registers loading in function
> > > __switch_to() -> switch_booke_debug_regs() in file
> > > arch/powerpc/kernel/process.c.
> > > So as long as we assume that the host will not use debug resources, we
> > > can rely on this restore. But I am not sure that this is a fair
> > > assumption. As Scott mentioned earlier, someone can use debug resources
> > > for kernel debugging also.
> >
> > Someone in the kernel can also use floating point registers. But then it's
> > his responsibility to clean up the mess he leaves behind.
>
> I am neither convinced by what you said, nor do I have much reason to
> oppose it :)
>
> Scott,
> 	I remember you mentioned that the host can use debug resources; can you
> comment on this?

I thought the conclusion we reached was that it was OK as long as KVM
waits until it actually needs the debug resources to mess with the
registers.


-Scott
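
The conclusion described above, touching the debug registers only once KVM actually needs them, can be sketched as a toy userspace model. This is a sketch under stated assumptions, not the booke.c code: hw_dbg stands in for the physical debug registers, and toy_vcpu, toy_vcpu_load() and toy_vcpu_put() are hypothetical names. Only the debug_save_restore flag and the shadow/host register split are borrowed from the quoted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a couple of physical debug registers. */
struct dbg_regs {
	uint32_t dbcr0;
	uint64_t iac1;
};

static struct dbg_regs hw_dbg;		/* "hardware" state in this toy model */
static unsigned int hw_dbg_writes;	/* counts simulated register writes */

struct toy_vcpu {
	struct dbg_regs shadow_dbg_reg;	/* what the guest debug session wants */
	struct dbg_regs host_dbg_reg;	/* host values saved while guest runs */
	bool debug_save_restore;	/* set only when guest debug is in use */
};

static void hw_write(const struct dbg_regs *r)
{
	hw_dbg = *r;
	hw_dbg_writes++;
}

/* Called when the vcpu is scheduled in. */
static void toy_vcpu_load(struct toy_vcpu *v)
{
	if (!v->debug_save_restore)
		return;			/* guest not debugging: leave hw alone */
	v->host_dbg_reg = hw_dbg;	/* save host state */
	hw_write(&v->shadow_dbg_reg);	/* install guest debug state */
}

/* Called when the vcpu is scheduled out. */
static void toy_vcpu_put(struct toy_vcpu *v)
{
	if (!v->debug_save_restore)
		return;
	hw_write(&v->host_dbg_reg);	/* hand the registers back to the host */
}
```

With the flag clear, load and put never write the registers, so whatever the host (or a kernel debugger) programmed stays in place.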
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread Paul Mackerras
On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
>
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
>
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
>
> Signed-off-by: Scott Wood scottw...@freescale.com

Some comments below...

> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> index 976eb65..77328aa 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
>  written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL

I notice this patch doesn't add this capability; you add it in a later
patch.  Since this patch adds the KVM_CREATE_DEVICE ioctl, it probably
should add the KVM_CAP_DEVICE_CTRL capability too.

> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +          be instantiated multiple times
> +  ENOSPC: Too many devices have been created

Is this still a possible error code?

> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
>  	u64 dac[KVMPPC_BOOKE_MAX_DAC];
>  };
>
> +#define KVMPPC_IRQCHIP_NONE	0
> +#define KVMPPC_IRQCHIP_MPIC	1

This define should go in the patch that adds the MPIC device.

>  struct kvm_vcpu_arch {
>  	ulong host_stack;
>  	u32 host_pid;
> @@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
>  	unsigned long magic_page_pa; /* phys addr to map the magic page to */
>  	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
>
> +	int irqchip_type;
> +	void *irqchip_priv;

Since you add this (irqchip_priv) only to remove it in a later patch
and replace it by a device-specific pointer, why bother adding it
here?  And why not give irqchip_type the name it ultimately ends up
with?

> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 16b4595..bdfa526 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>  	tasklet_kill(&vcpu->arch.tasklet);
>
>  	kvmppc_remove_vcpu_debugfs(vcpu);
> +
> +	switch (vcpu->arch.irqchip_type) {
> +	case KVMPPC_IRQCHIP_MPIC:
> +		mpic_put(vcpu->arch.irqchip_priv);
> +		break;
> +	}

This is going to break bisection, since you don't define mpic_put() in
this patch.

> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 74d0ff3..20ce2d2 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
>  #define KVM_CAP_PPC_EPR 86
>  #define KVM_CAP_ARM_PSCI 87
>  #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> +#define KVM_CAP_DEVICE_CTRL 89
>
>  #ifdef KVM_CAP_IRQ_ROUTING
>
> @@ -909,6 +910,32 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_ARM_SET_DEVICE_ADDR	  _IOW(KVMIO,  0xab, struct kvm_arm_device_addr)
>
>  /*
> + * Device control API, available with KVM_CAP_DEVICE_CTRL
> + */
> +#define KVM_CREATE_DEVICE_TEST		1
> +
> +struct kvm_create_device {
> +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> +	__u32	fd;	/* out: device handle */
> +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +struct kvm_device_attr {
> +	__u32	flags;	/* no flags currently defined */
> +	__u32	group;	/* device-defined */
> +	__u64	attr;	/* group-defined */
> +	__u64	addr;	/* userspace address of attr data */
> +};
> +
> +/* ioctl for vm fd */
> +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)

This define should go with the other VM ioctls, otherwise the next
person to add a VM ioctl will probably miss it and reuse the 0xe0
code.

Paul.
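
The last point concerns the ioctl number space: a request code packs a type byte, a sequence number, a direction and the argument size, so two definitions that share KVMIO and 0xe0 stay distinct only if their direction or argument size happens to differ. A small sketch that decodes the proposed number, assuming a Linux build host that provides <linux/ioctl.h>. The value 0xAE is KVMIO from the existing uapi header; the _SKETCH names and the ioctl_*() helpers are local stand-ins, not part of the patch.

```c
#include <linux/ioctl.h>
#include <stdint.h>

#define KVMIO_SKETCH 0xAE	/* same value as KVMIO in <linux/kvm.h> */

/* Local copy of the struct from the patch, used only for its size. */
struct kvm_create_device_sketch {
	uint32_t type;	/* in: KVM_DEV_TYPE_xxx */
	uint32_t fd;	/* out: device handle */
	uint32_t flags;	/* in: KVM_CREATE_DEVICE_xxx */
};

#define KVM_CREATE_DEVICE_SKETCH \
	_IOWR(KVMIO_SKETCH, 0xe0, struct kvm_create_device_sketch)

/* Sequence number within the KVMIO space: the part that must stay unique. */
static unsigned int ioctl_nr(unsigned long request)
{
	return _IOC_NR(request);
}

/* Type ("magic") byte shared by all KVM ioctls. */
static unsigned int ioctl_type(unsigned long request)
{
	return _IOC_TYPE(request);
}

/* Size of the argument struct encoded into the request. */
static unsigned int ioctl_size(unsigned long request)
{
	return _IOC_SIZE(request);
}
```

Since all KVM ioctls share the type byte, uniqueness rests on whoever assigns the sequence number checking every range already in use, which is why the placement of the define in the header matters.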

Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread Scott Wood

On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:

> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> > Currently, devices that are emulated inside KVM are configured in a
> > hardcoded manner based on an assumption that any given architecture
> > only has one way to do it.  If there's any need to access device state,
> > it is done through inflexible one-purpose-only IOCTLs (e.g.
> > KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> > cumbersome and depletes a limited numberspace.
> >
> > This API provides a mechanism to instantiate a device of a certain
> > type, returning an ID that can be used to set/get attributes of the
> > device.  Attributes may include configuration parameters (e.g.
> > register base address), device state, operational commands, etc.  It
> > is similar to the ONE_REG API, except that it acts on devices rather
> > than vcpus.
> >
> > Both device types and individual attributes can be tested without having
> > to create the device or get/set the attribute, without the need for
> > separately managing enumerated capabilities.
> >
> > Signed-off-by: Scott Wood scottw...@freescale.com
>
> Some comments below...
>
> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > index 976eb65..77328aa 100644
> > --- a/Documentation/virtual/kvm/api.txt
> > +++ b/Documentation/virtual/kvm/api.txt
> > @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from the data
> >  written, then `n_invalid' invalid entries, invalidating any previously
> >  valid entries found.
> >
> > +4.79 KVM_CREATE_DEVICE
> > +
> > +Capability: KVM_CAP_DEVICE_CTRL
>
> I notice this patch doesn't add this capability;

Yes, it does (see below).

> you add it in a later patch.

Maybe you're thinking of KVM_CAP_IRQ_MPIC?

> > +Type: vm ioctl
> > +Parameters: struct kvm_create_device (in/out)
> > +Returns: 0 on success, -1 on error
> > +Errors:
> > +  ENODEV: The device type is unknown or unsupported
> > +  EEXIST: Device already created, and this type of device may not
> > +          be instantiated multiple times
> > +  ENOSPC: Too many devices have been created
>
> Is this still a possible error code?

If you mean ENOSPC, probably not -- it'd be replaced with whatever
errors can come out of creating a file descriptor.

> > --- a/arch/powerpc/include/asm/kvm_host.h
> > +++ b/arch/powerpc/include/asm/kvm_host.h
> > @@ -370,6 +370,9 @@ struct kvmppc_booke_debug_reg {
> >  	u64 dac[KVMPPC_BOOKE_MAX_DAC];
> >  };
> >
> > +#define KVMPPC_IRQCHIP_NONE	0
> > +#define KVMPPC_IRQCHIP_MPIC	1
>
> This define should go in the patch that adds the MPIC device.
>
> >  struct kvm_vcpu_arch {
> >  	ulong host_stack;
> >  	u32 host_pid;
> > @@ -549,6 +552,9 @@ struct kvm_vcpu_arch {
> >  	unsigned long magic_page_pa; /* phys addr to map the magic page to */
> >  	unsigned long magic_page_ea; /* effect. addr to map the magic page to */
> >
> > +	int irqchip_type;
> > +	void *irqchip_priv;
>
> Since you add this (irqchip_priv) only to remove it in a later patch
> and replace it by a device-specific pointer, why bother adding it
> here?  And why not give irqchip_type the name it ultimately ends up
> with?

Oops... These were patch shuffling accidents and will be removed from
the next iteration.

> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 16b4595..bdfa526 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -459,6 +459,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> >  	tasklet_kill(&vcpu->arch.tasklet);
> >
> >  	kvmppc_remove_vcpu_debugfs(vcpu);
> > +
> > +	switch (vcpu->arch.irqchip_type) {
> > +	case KVMPPC_IRQCHIP_MPIC:
> > +		mpic_put(vcpu->arch.irqchip_priv);
> > +		break;
> > +	}
>
> This is going to break bisection, since you don't define mpic_put() in
> this patch.

Sigh.  Something got messed up; I'll try to sort it out and resubmit.

> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 74d0ff3..20ce2d2 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -668,6 +668,7 @@ struct kvm_ppc_smmu_info {
> >  #define KVM_CAP_PPC_EPR 86
> >  #define KVM_CAP_ARM_PSCI 87
> >  #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
> > +#define KVM_CAP_DEVICE_CTRL 89

See, here's the capability. :-)

> >  /*
> > + * Device control API, available with KVM_CAP_DEVICE_CTRL
> > + */
> > +#define KVM_CREATE_DEVICE_TEST		1
> > +
> > +struct kvm_create_device {
> > +	__u32	type;	/* in: KVM_DEV_TYPE_xxx */
> > +	__u32	fd;	/* out: device handle */
> > +	__u32	flags;	/* in: KVM_CREATE_DEVICE_xxx */
> > +};
> > +
> > +struct kvm_device_attr {
> > +	__u32	flags;	/* no flags currently defined */
> > +	__u32	group;	/* device-defined */
> > +	__u64	attr;	/* group-defined */
> > +	__u64	addr;	/* userspace address of attr data */
> > +};
> > +
> > +/* ioctl for vm fd */
> > +#define KVM_CREATE_DEVICE	  _IOWR(KVMIO,  0xe0, struct kvm_create_device)
>
> This define should go with the other VM ioctls, otherwise the next
> person to add a VM ioctl will probably miss it and reuse the 0xe0
> code.

That's actually why I moved it to a new 

Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread tiejun.chen

On 04/03/2013 01:30 AM, Scott Wood wrote:

> On 04/02/2013 01:59:57 AM, tiejun.chen wrote:
> > On 04/02/2013 06:47 AM, Scott Wood wrote:
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index ff71541..ed033c0 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2158,6 +2158,17 @@ out:
> > >  }
> > >  #endif
> > >
> > > +static int kvm_ioctl_create_device(struct kvm *kvm,
> > > +				   struct kvm_create_device *cd)
> > > +{
> > > +	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> > > +
> > > +	switch (cd->type) {
> > > +	default:
> > > +		return -ENODEV;
> > > +	}
> >
> > Even after applying patch 5, it looks like something is still missing
> > here, such as:
> >
> > 	if (test)
> > 		WARN_ON_ONCE(!cd->type);
>
> Why?  How does userspace passing in a bad type value mean the kernel needs to
> report internal badness, why is a value of zero worse than any other bad value,
> and why only when the test flag is set?

I just mean we need to do something here, since it looks like the 'test'
variable is defined but unused, right? But please correct this as you see fit :)

And if userspace can't guarantee that cd->type is never zero, we should return
-ENODEV as well after that switch().


Tiejun


Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread tiejun.chen

On 04/03/2013 09:34 AM, Scott Wood wrote:

> On 04/02/2013 08:28:01 PM, tiejun.chen wrote:
> > On 04/03/2013 01:30 AM, Scott Wood wrote:
> > > On 04/02/2013 01:59:57 AM, tiejun.chen wrote:
> > > > On 04/02/2013 06:47 AM, Scott Wood wrote:
> > > > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > > > index ff71541..ed033c0 100644
> > > > > --- a/virt/kvm/kvm_main.c
> > > > > +++ b/virt/kvm/kvm_main.c
> > > > > @@ -2158,6 +2158,17 @@ out:
> > > > >  }
> > > > >  #endif
> > > > >
> > > > > +static int kvm_ioctl_create_device(struct kvm *kvm,
> > > > > +				   struct kvm_create_device *cd)
> > > > > +{
> > > > > +	bool test = cd->flags & KVM_CREATE_DEVICE_TEST;
> > > > > +
> > > > > +	switch (cd->type) {
> > > > > +	default:
> > > > > +		return -ENODEV;
> > > > > +	}
> > > >
> > > > Even after applying patch 5, it looks like something is still missing
> > > > here, such as:
> > > >
> > > > 	if (test)
> > > > 		WARN_ON_ONCE(!cd->type);
> > >
> > > Why?  How does userspace passing in a bad type value mean the kernel needs to
> > > report internal badness, why is a value of zero worse than any other bad value,
> > > and why only when the test flag is set?
> >
> > I just mean we need to do something here, since it looks like the 'test'
> > variable is defined but unused, right? But please correct this as you see fit :)
>
> Yes, it's unused in this patch, but is used after patch 5 is applied.  I didn't
> think it was worth adding a temporary unused annotation, since this part of the
> kernel doesn't use -Werror.

Yes, it's acceptable in the !-Werror case if we shouldn't warn about anything, as you said.

Tiejun
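
The role of the test flag discussed above can be illustrated with a toy dispatcher. This is a sketch under stated assumptions rather than the code from patch 5: TOY_DEV_TYPE_MPIC, toy_create_device() and the instance counter are hypothetical, and only the flag name and the -ENODEV default come from the quoted hunk.

```c
#include <errno.h>
#include <stdint.h>

#define KVM_CREATE_DEVICE_TEST	1	/* flag value from the quoted patch */
#define TOY_DEV_TYPE_MPIC	1	/* hypothetical device type number */

struct toy_create_device {
	uint32_t type;	/* in: device type */
	uint32_t fd;	/* out: device handle */
	uint32_t flags;	/* in: KVM_CREATE_DEVICE_xxx */
};

static unsigned int toy_devices_created;

/* Returns 0 on success or a negative errno, like the kernel helper. */
static int toy_create_device(struct toy_create_device *cd)
{
	int test = cd->flags & KVM_CREATE_DEVICE_TEST;

	switch (cd->type) {
	case TOY_DEV_TYPE_MPIC:
		if (test)
			return 0;	/* type is supported; create nothing */
		cd->fd = 100 + toy_devices_created++;	/* pretend handle */
		return 0;
	default:
		/* Unknown types fail the same way with or without the flag. */
		return -ENODEV;
	}
}
```

In this model the flag only matters once at least one device type exists, which matches the observation that the variable is unused until a later patch adds a case to the switch.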



[RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-02 Thread Scott Wood
Enabling this capability connects the vcpu to the designated in-kernel
MPIC.  Using explicit connections between vcpus and irqchips allows
for flexibility, but the main benefit at the moment is that it
simplifies the code -- KVM doesn't need vm-global state to remember
which MPIC object is associated with this vm, and it doesn't need to
care about ordering between irqchip creation and vcpu creation.

Signed-off-by: Scott Wood scottw...@freescale.com
---
 Documentation/virtual/kvm/api.txt   |8 ++
 arch/powerpc/include/asm/kvm_host.h |8 ++
 arch/powerpc/include/asm/kvm_ppc.h  |2 ++
 arch/powerpc/kvm/booke.c|4 ++-
 arch/powerpc/kvm/mpic.c |   49 +++
 arch/powerpc/kvm/powerpc.c  |   26 +++
 include/uapi/linux/kvm.h|1 +
 7 files changed, 92 insertions(+), 6 deletions(-)
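
From userspace, KVM_CAP_IRQ_MPIC would be enabled per vcpu through the existing KVM_ENABLE_CAP vcpu ioctl. A minimal sketch of how the argument block might be filled in, assuming the layout of struct kvm_enable_cap from the uapi header (mirrored locally so the fragment builds without this series). mpic_connect_request() is a hypothetical helper, and the capability number is passed in rather than assumed.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of struct kvm_enable_cap from <linux/kvm.h>. */
struct kvm_enable_cap_sketch {
	uint32_t cap;
	uint32_t flags;
	uint64_t args[4];
	uint8_t pad[64];
};

/*
 * Hypothetical helper. cap_nr is whatever number KVM_CAP_IRQ_MPIC has
 * in the header; it is a parameter here rather than a guess.
 */
static struct kvm_enable_cap_sketch
mpic_connect_request(uint32_t cap_nr, int mpic_fd, uint32_t mpic_cpu)
{
	struct kvm_enable_cap_sketch req;

	memset(&req, 0, sizeof(req));
	req.cap = cap_nr;
	req.args[0] = (uint64_t)mpic_fd;	/* args[0]: MPIC device fd */
	req.args[1] = mpic_cpu;			/* args[1]: MPIC CPU number */
	return req;
}

/*
 * Intended use on a vcpu fd, once per vcpu:
 *	ioctl(vcpu_fd, KVM_ENABLE_CAP, &req);
 */
```

Because the connection is made explicitly per vcpu, the MPIC device and the vcpus can be created in either order before this call.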

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index d52f3f9..4c326ae 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector.
 When disabled (args[0] == 0), behavior is as if this facility is unsupported.
 
 When this capability is enabled, KVM_EXIT_EPR can occur.
+
+6.6 KVM_CAP_IRQ_MPIC
+
+Architectures: ppc
+Parameters: args[0] is the MPIC device fd
+args[1] is the MPIC CPU number for this vcpu
+
+This capability connects the vcpu to an in-kernel MPIC device.
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 7e7aef9..2a2e235 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg {
u64 dac[KVMPPC_BOOKE_MAX_DAC];
 };
 
+#define KVMPPC_IRQ_DEFAULT	0
+#define KVMPPC_IRQ_MPIC		1
+
+struct openpic;
+
 struct kvm_vcpu_arch {
ulong host_stack;
u32 host_pid;
@@ -554,6 +559,9 @@ struct kvm_vcpu_arch {
unsigned long magic_page_pa; /* phys addr to map the magic page to */
unsigned long magic_page_ea; /* effect. addr to map the magic page to */
 
+   int irq_type;   /* one of KVM_IRQ_* */
+   struct openpic *mpic;   /* KVM_IRQ_MPIC */
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
struct kvm_vcpu_arch_shared shregs;
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 3b63b97..f54707f 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -276,6 +276,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, u32 epr)
 }
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+u32 cpu);
 
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
  struct kvm_config_tlb *cfg);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index cddc6b3..7d00222 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu *vcpu,
 		if (update_epr == true) {
 			if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
 				kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
-			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
+			else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL) {
+				BUG_ON(vcpu->arch.irq_type != KVMPPC_IRQ_MPIC);
 				kvmppc_mpic_set_epr(vcpu);
+			}
 		}
 
new_msr = msr_mask;
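
The branch order in the hunk above is what gives KVMPPC_EPR_USER precedence over KVMPPC_EPR_KERNEL. A standalone model of that decision, as a sketch only: the flag values are the ones defined in patch 5/6, while enum epr_action and epr_dispatch() are hypothetical names used here for illustration (the kernel code acts directly rather than returning a value).

```c
#include <stdint.h>

/* Flag values as defined in patch 5/6 of this series. */
#define KVMPPC_EPR_NONE   0	/* EPR not supported */
#define KVMPPC_EPR_USER   1	/* exit to userspace to fill EPR */
#define KVMPPC_EPR_KERNEL 2	/* in-kernel irqchip */

/* Hypothetical result type, for illustration only. */
enum epr_action { EPR_DO_NOTHING, EPR_EXIT_TO_USER, EPR_ASK_MPIC };

static enum epr_action epr_dispatch(uint8_t epr_flags)
{
	if (epr_flags & KVMPPC_EPR_USER)
		return EPR_EXIT_TO_USER;	/* wins even if KERNEL is also set */
	if (epr_flags & KVMPPC_EPR_KERNEL)
		return EPR_ASK_MPIC;		/* in-kernel MPIC supplies EPR */
	return EPR_DO_NOTHING;
}
```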
diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
index 8cda2fa..caffe3b 100644
--- a/arch/powerpc/kvm/mpic.c
+++ b/arch/powerpc/kvm/mpic.c
@@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, struct 
irq_dest *dst,
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
 {
-   struct openpic *opp = vcpu->arch.irqchip_priv;
+   struct openpic *opp = vcpu->arch.mpic;
int cpu = vcpu->vcpu_id;
unsigned long flags;
 
@@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp)
 
 static void unmap_mmio(struct openpic *opp)
 {
-   BUG_ON(opp->mmio_mapped);
-   opp->mmio_mapped = false;
-
-   kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+   if (opp->mmio_mapped) {
+   opp->mmio_mapped = false;
+   kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
+   }
 }
 
 static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
@@ -1681,6 +1681,45 @@ static const struct file_operations kvm_mpic_fops = {
.release = kvm_mpic_release,
 };
 
+int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
+u32 cpu)
+{
+   struct openpic *opp = mpic_filp->private_data;
+   int ret = 0;

[RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation

2013-04-02 Thread Scott Wood
Hook the MPIC code up to the KVM interfaces, add locking, etc.

TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
support

Signed-off-by: Scott Wood <scottw...@freescale.com>
---
v3: mpic_put -> kvmppc_mpic_put

 Documentation/virtual/kvm/devices/mpic.txt |   37 ++
 arch/powerpc/include/asm/kvm_host.h|8 +-
 arch/powerpc/include/asm/kvm_ppc.h |7 +
 arch/powerpc/kvm/Kconfig   |5 +
 arch/powerpc/kvm/Makefile  |2 +
 arch/powerpc/kvm/booke.c   |   10 +-
 arch/powerpc/kvm/mpic.c|  814 +---
 arch/powerpc/kvm/powerpc.c |   12 +-
 include/linux/kvm_host.h   |2 +
 include/uapi/linux/kvm.h   |9 +
 virt/kvm/kvm_main.c|9 +
 11 files changed, 714 insertions(+), 201 deletions(-)
 create mode 100644 Documentation/virtual/kvm/devices/mpic.txt

diff --git a/Documentation/virtual/kvm/devices/mpic.txt 
b/Documentation/virtual/kvm/devices/mpic.txt
new file mode 100644
index 0000000..79e000a
--- /dev/null
+++ b/Documentation/virtual/kvm/devices/mpic.txt
@@ -0,0 +1,37 @@
+MPIC interrupt controller
+=========================
+
+Device types supported:
+  KVM_DEV_TYPE_FSL_MPIC_20 Freescale MPIC v2.0
+  KVM_DEV_TYPE_FSL_MPIC_42 Freescale MPIC v4.2
+
+Only one MPIC instance, of any type, may be instantiated.  The created
+MPIC will act as the system interrupt controller, connecting to each
+vcpu's interrupt inputs.
+
+Groups:
+  KVM_DEV_MPIC_GRP_MISC
+  Attributes:
+KVM_DEV_MPIC_BASE_ADDR (rw, 64-bit)
+  Base address of the 256 KiB MPIC register space.  Must be
+  naturally aligned.  A value of zero disables the mapping.
+  Reset value is zero.
+
+  KVM_DEV_MPIC_GRP_REGISTER (rw, 32-bit)
+Access an MPIC register, as if the access were made from the guest. 
+attr is the byte offset into the MPIC register space.  Accesses
+must be 4-byte aligned.
+
+MSIs may be signaled by using this attribute group to write
+to the relevant MSIIR.
+
+  KVM_DEV_MPIC_GRP_IRQ_ACTIVE (rw, 32-bit)
+IRQ input line for each standard openpic source.  0 is inactive and 1
+is active, regardless of interrupt sense.
+
+For edge-triggered interrupts:  Writing 1 is considered an activating
+edge, and writing 0 is ignored.  Reading returns 1 if a previously
+signaled edge has not been acknowledged, and 0 otherwise.
+
+attr is the IRQ number.  IRQ numbers for standard sources are the
+byte offset of the relevant IVPR from EIVPR0, divided by 32.
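
As a worked example of the IRQ numbering rule at the end of mpic.txt, the conversion from an IVPR register offset to an IRQ number could look like this. It is a sketch under stated assumptions: EIVPR0 is taken to sit at byte offset 0x10000 in the MPIC register space (the start of the source register window in the QEMU-derived address map), and the helper name mpic_irq_from_ivpr() is hypothetical.

```c
#include <stdint.h>

/*
 * Assumptions for this sketch: EIVPR0 is at byte offset 0x10000 of the
 * MPIC register space, and each source occupies 32 bytes.
 */
#define EIVPR0_OFFSET	0x10000u
#define SRC_STRIDE	32u

/* Returns the IRQ number, or -1 if the offset cannot be an IVPR address. */
static int mpic_irq_from_ivpr(uint32_t ivpr_offset)
{
	if (ivpr_offset < EIVPR0_OFFSET)
		return -1;
	if ((ivpr_offset - EIVPR0_OFFSET) % SRC_STRIDE != 0)
		return -1;
	return (int)((ivpr_offset - EIVPR0_OFFSET) / SRC_STRIDE);
}
```

The result is what would go in the attr field when accessing KVM_DEV_MPIC_GRP_IRQ_ACTIVE. The sketch does not check an upper bound on the source number.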
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index e34f8fe..7e7aef9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -359,6 +359,11 @@ struct kvmppc_slb {
 #define KVMPPC_BOOKE_MAX_IAC   4
 #define KVMPPC_BOOKE_MAX_DAC   2
 
+/* KVMPPC_EPR_USER takes precedence over KVMPPC_EPR_KERNEL */
+#define KVMPPC_EPR_NONE0 /* EPR not supported */
+#define KVMPPC_EPR_USER1 /* exit to userspace to fill EPR */
+#define KVMPPC_EPR_KERNEL  2 /* in-kernel irqchip */
+
 struct kvmppc_booke_debug_reg {
u32 dbcr0;
u32 dbcr1;
@@ -522,7 +527,7 @@ struct kvm_vcpu_arch {
u8 sane;
u8 cpu_type;
u8 hcall_needed;
-   u8 epr_enabled;
+   u8 epr_flags; /* KVMPPC_EPR_xxx */
u8 epr_needed;
 
u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
@@ -589,5 +594,6 @@ struct kvm_vcpu_arch {
 #define KVM_MMIO_REG_FQPR  0x0060
 
 #define __KVM_HAVE_ARCH_WQP
+#define __KVM_HAVE_CREATE_DEVICE
 
 #endif /* __POWERPC_KVM_HOST_H__ */
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index f589307..3b63b97 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -164,6 +164,8 @@ extern int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu);
 
 extern int kvm_vm_ioctl_get_htab_fd(struct kvm *kvm, struct kvm_get_htab_fd *);
 
+int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq);
+
 /*
  * Cuts out inst bits with ordering according to spec.
  * That means the leftmost bit is zero. All given bits are included.
@@ -245,6 +247,9 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id, union 
kvmppc_one_reg *);
 
 void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 pid);
 
+struct openpic;
+void kvmppc_mpic_put(struct openpic *opp);
+
 #ifdef CONFIG_KVM_BOOK3S_64_HV
 static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 {
@@ -270,6 +275,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, 
u32 epr)
 #endif
 }
 
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
  struct kvm_config_tlb *cfg);
 int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/Kconfig 

[RFC PATCH v3 2/6] kvm/ppc/mpic: import hw/openpic.c from QEMU

2013-04-02 Thread Scott Wood
This is QEMU's hw/openpic.c from commit
abd8d4a4d6dfea7ddea72f095f993e1de941614e (Update version for
1.4.0-rc0), run through Lindent with no other changes to ease merging
future changes between Linux and QEMU.  Remaining style issues
(including those introduced by Lindent) will be fixed in a later patch.

Signed-off-by: Scott Wood <scottw...@freescale.com>
---
 arch/powerpc/kvm/mpic.c | 1686 +++
 1 file changed, 1686 insertions(+)
 create mode 100644 arch/powerpc/kvm/mpic.c

diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
new file mode 100644
index 0000000..57655b9
--- /dev/null
+++ b/arch/powerpc/kvm/mpic.c
@@ -0,0 +1,1686 @@
+/*
+ * OpenPIC emulation
+ *
+ * Copyright (c) 2004 Jocelyn Mayer
+ *   2011 Alexander Graf
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+/*
+ *
+ * Based on OpenPic implementations:
+ * - Intel GW80314 I/O companion chip developer's manual
+ * - Motorola MPC8245 & MPC8540 user manuals.
+ * - Motorola MCP750 (aka Raven) programmer manual.
+ * - Motorola Harrier programmer manuel
+ *
+ * Serial interrupts, as implemented in Raven chipset are not supported yet.
+ *
+ */
+#include "hw.h"
+#include "ppc/mac.h"
+#include "pci/pci.h"
+#include "openpic.h"
+#include "sysbus.h"
+#include "pci/msi.h"
+#include "qemu/bitops.h"
+#include "ppc.h"
+
+//#define DEBUG_OPENPIC
+
+#ifdef DEBUG_OPENPIC
+static const int debug_openpic = 1;
+#else
+static const int debug_openpic = 0;
+#endif
+
+#define DPRINTF(fmt, ...) do { \
+if (debug_openpic) { \
+printf(fmt , ## __VA_ARGS__); \
+} \
+} while (0)
+
+#define MAX_CPU 32
+#define MAX_SRC 256
+#define MAX_TMR 4
+#define MAX_IPI 4
+#define MAX_MSI 8
+#define MAX_IRQ (MAX_SRC + MAX_IPI + MAX_TMR)
+#define VID 0x03   /* MPIC version ID */
+
+/* OpenPIC capability flags */
+#define OPENPIC_FLAG_IDR_CRIT (1 << 0)
+#define OPENPIC_FLAG_ILR  (2 << 0)
+
+/* OpenPIC address map */
+#define OPENPIC_GLB_REG_START        0x0
+#define OPENPIC_GLB_REG_SIZE         0x10F0
+#define OPENPIC_TMR_REG_START        0x10F0
+#define OPENPIC_TMR_REG_SIZE         0x220
+#define OPENPIC_MSI_REG_START        0x1600
+#define OPENPIC_MSI_REG_SIZE         0x200
+#define OPENPIC_SUMMARY_REG_START   0x3800
+#define OPENPIC_SUMMARY_REG_SIZE    0x800
+#define OPENPIC_SRC_REG_START        0x10000
+#define OPENPIC_SRC_REG_SIZE         (MAX_SRC * 0x20)
+#define OPENPIC_CPU_REG_START        0x20000
+#define OPENPIC_CPU_REG_SIZE         0x100 + ((MAX_CPU - 1) * 0x1000)
+
+/* Raven */
+#define RAVEN_MAX_CPU  2
+#define RAVEN_MAX_EXT 48
+#define RAVEN_MAX_IRQ 64
+#define RAVEN_MAX_TMR  MAX_TMR
+#define RAVEN_MAX_IPI  MAX_IPI
+
+/* Interrupt definitions */
+#define RAVEN_FE_IRQ     (RAVEN_MAX_EXT)   /* Internal functional IRQ */
+#define RAVEN_ERR_IRQ    (RAVEN_MAX_EXT + 1)   /* Error IRQ */
+#define RAVEN_TMR_IRQ    (RAVEN_MAX_EXT + 2)   /* First timer IRQ */
+#define RAVEN_IPI_IRQ    (RAVEN_TMR_IRQ + RAVEN_MAX_TMR)   /* First IPI IRQ */
+/* First doorbell IRQ */
+#define RAVEN_DBL_IRQ    (RAVEN_IPI_IRQ + (RAVEN_MAX_CPU * RAVEN_MAX_IPI))
+
+typedef struct FslMpicInfo {
+   int max_ext;
+} FslMpicInfo;
+
+static FslMpicInfo fsl_mpic_20 = {
+   .max_ext = 12,
+};
+
+static FslMpicInfo fsl_mpic_42 = {
+   .max_ext = 12,
+};
+
+#define FRR_NIRQ_SHIFT    16
+#define FRR_NCPU_SHIFT 8
+#define FRR_VID_SHIFT  0
+
+#define VID_REVISION_1_2   2
+#define VID_REVISION_1_3   3
+
+#define VIR_GENERIC      0x00000000 /* Generic Vendor ID */
+
+#define GCR_RESET        0x80000000
+#define GCR_MODE_PASS    0x00000000
+#define GCR_MODE_MIXED   0x20000000
+#define GCR_MODE_PROXY   0x60000000
+
+#define TBCR_CI           0x80000000   /* count inhibit */
+#define TCCR_TOG          0x80000000   /* toggles when decrement to zero */
+
+#define IDR_EP_SHIFT  31
+#define IDR_EP_MASK   (1 << IDR_EP_SHIFT)
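
To make the OpenPIC address map in this file easier to follow, here is a small standalone decoder that classifies a byte offset into the register window it falls in. It is a sketch, not code from the patch: the window bounds assume the QEMU hw/openpic.c values (source window at 0x10000, per-CPU window at 0x20000), and enum mpic_region and mpic_region_of() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CPU 32
#define MAX_SRC 256

/* Hypothetical labels for the register windows. */
enum mpic_region {
	MPIC_NONE, MPIC_GLB, MPIC_TMR, MPIC_MSI, MPIC_SUMMARY, MPIC_SRC, MPIC_CPU
};

struct mpic_window {
	uint32_t start;
	uint32_t size;
	enum mpic_region region;
};

/* Window bounds, assuming the QEMU-derived OpenPIC address map. */
static const struct mpic_window mpic_windows[] = {
	{ 0x0,     0x10F0,                           MPIC_GLB },
	{ 0x10F0,  0x220,                            MPIC_TMR },
	{ 0x1600,  0x200,                            MPIC_MSI },
	{ 0x3800,  0x800,                            MPIC_SUMMARY },
	{ 0x10000, MAX_SRC * 0x20,                   MPIC_SRC },
	{ 0x20000, 0x100 + ((MAX_CPU - 1) * 0x1000), MPIC_CPU },
};

/* Returns the window containing off, or MPIC_NONE for an unmapped gap. */
static enum mpic_region mpic_region_of(uint32_t off)
{
	size_t i;

	for (i = 0; i < sizeof(mpic_windows) / sizeof(mpic_windows[0]); i++) {
		const struct mpic_window *w = &mpic_windows[i];

		if (off >= w->start && off - w->start < w->size)
			return w->region;
	}
	return MPIC_NONE;
}
```

Note that the sketch parenthesizes the per-CPU window size; OPENPIC_CPU_REG_SIZE as defined in the file is an unparenthesized expression and needs care when used inside a larger expression.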

[RFC PATCH v3 1/6] kvm: add device control API

2013-04-02 Thread Scott Wood
Currently, devices that are emulated inside KVM are configured in a
hardcoded manner based on an assumption that any given architecture
only has one way to do it.  If there's any need to access device state,
it is done through inflexible one-purpose-only IOCTLs (e.g.
KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
cumbersome and depletes a limited numberspace.

This API provides a mechanism to instantiate a device of a certain
type, returning an ID that can be used to set/get attributes of the
device.  Attributes may include configuration parameters (e.g.
register base address), device state, operational commands, etc.  It
is similar to the ONE_REG API, except that it acts on devices rather
than vcpus.

Both device types and individual attributes can be tested without having
to create the device or get/set the attribute, without the need for
separately managing enumerated capabilities.

Signed-off-by: Scott Wood <scottw...@freescale.com>
---
v3: remove some changes that were merged into this patch by accident,
and fix the error documentation for KVM_CREATE_DEVICE.

NOTE: I had some difficulty figuring out what ioctl numbers I should
assign...  it seems that at one point care was taken to keep vcpu and
vm ioctls separate, but some overlap exists now (despite not exhausting
the ioctl space).  Some of that was my fault, but not all of it. :-)
I moved to a new ioctl range for device control -- please let me know
if there's something else you'd prefer I do.
---
 Documentation/virtual/kvm/api.txt|   70 ++
 Documentation/virtual/kvm/devices/README |1 +
 include/uapi/linux/kvm.h |   27 
 virt/kvm/kvm_main.c  |   31 +
 4 files changed, 129 insertions(+)
 create mode 100644 Documentation/virtual/kvm/devices/README

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 976eb65..d52f3f9 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents from 
the data
 written, then `n_invalid' invalid entries, invalidating any previously
 valid entries found.
 
+4.79 KVM_CREATE_DEVICE
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: vm ioctl
+Parameters: struct kvm_create_device (in/out)
+Returns: 0 on success, -1 on error
+Errors:
+  ENODEV: The device type is unknown or unsupported
+  EEXIST: Device already created, and this type of device may not
+  be instantiated multiple times
+
+  Other error conditions may be defined by individual device types or
+  have their standard meanings.
+
+Creates an emulated device in the kernel.  The file descriptor returned
+in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
+
+If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
+device type is supported (not necessarily whether it can be created
+in the current vm).
+
+Individual devices should not define flags.  Attributes should be used
+for specifying any behavior that is not implied by the device type
+number.
+
+struct kvm_create_device {
+   __u32   type;   /* in: KVM_DEV_TYPE_xxx */
+   __u32   fd; /* out: device handle */
+   __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
+};
+
+4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+  EPERM:  The attribute cannot (currently) be accessed this way
+  (e.g. read-only attribute, or attribute that only makes
+  sense when the device is in a different state)
+
+  Other error conditions may be defined by individual device types.
+
+Gets/sets a specified piece of device configuration and/or state.  The
+semantics are device-specific.  See individual device documentation in
+the devices directory.  As with ONE_REG, the size of the data
+transferred is defined by the particular attribute.
+
+struct kvm_device_attr {
+   __u32   flags;  /* no flags currently defined */
+   __u32   group;  /* device-defined */
+   __u64   attr;   /* group-defined */
+   __u64   addr;   /* userspace address of attr data */
+};
+
+4.81 KVM_HAS_DEVICE_ATTR
+
+Capability: KVM_CAP_DEVICE_CTRL
+Type: device ioctl
+Parameters: struct kvm_device_attr
+Returns: 0 on success, -1 on error
+Errors:
+  ENXIO:  The group or attribute is unknown/unsupported for this device
+
+Tests whether a device supports a particular attribute.  A successful
+return indicates the attribute is implemented.  It does not necessarily
+indicate that the attribute can be read or written in the device's
+current state.  addr is ignored.
 
 4.77 KVM_ARM_VCPU_INIT
 
diff --git a/Documentation/virtual/kvm/devices/README 
b/Documentation/virtual/kvm/devices/README
new file mode 100644
index 000..34a6983
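
For readers who want to see how the three ioctls documented above fit together from userspace, a rough sketch follows. It assumes the interface as it appears in mainline <linux/kvm.h> (which may differ in detail from this RFC, e.g. in ioctl numbers); the helper names device_type_supported(), create_device() and device_has_attr() are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns 1 if the kernel supports the device type, 0 otherwise. */
static int device_type_supported(int vm_fd, uint32_t type)
{
	struct kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;
	cd.flags = KVM_CREATE_DEVICE_TEST;	/* test only, nothing is created */
	return ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) == 0;
}

/* Returns the new device fd, or -1 with errno set (e.g. ENODEV, EEXIST). */
static int create_device(int vm_fd, uint32_t type)
{
	struct kvm_create_device cd;

	memset(&cd, 0, sizeof(cd));
	cd.type = type;
	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd) < 0)
		return -1;
	return (int)cd.fd;			/* out: device handle */
}

/* Returns 1 if the device implements (group, attr), 0 otherwise. */
static int device_has_attr(int dev_fd, uint32_t group, uint64_t attr)
{
	struct kvm_device_attr da;

	memset(&da, 0, sizeof(da));
	da.group = group;
	da.attr = attr;				/* addr is ignored for HAS */
	return ioctl(dev_fd, KVM_HAS_DEVICE_ATTR, &da) == 0;
}
```

A caller would typically probe with device_type_supported() first, then create the device and use the returned fd for KVM_HAS_DEVICE_ATTR and KVM_SET/GET_DEVICE_ATTR.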

Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-02 Thread Paul Mackerras
On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> > On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> > > +4.79 KVM_CREATE_DEVICE
> > > +
> > > +Capability: KVM_CAP_DEVICE_CTRL
> >
> > I notice this patch doesn't add this capability;
>
> Yes, it does (see below).
>
> > you add it in a later patch.
>
> Maybe you're thinking of KVM_CAP_IRQ_MPIC?

No, I was referring to the addition to kvm_dev_ioctl_check_extension()
of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
userspace queries the KVM_CAP_DEVICE_CTRL capability.
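
For context, the userspace side of that query would look something like the following. This is only a sketch, assuming the standard KVM_CHECK_EXTENSION ioctl on the /dev/kvm fd and the KVM_CAP_DEVICE_CTRL constant from mainline <linux/kvm.h>; have_device_ctrl() is a hypothetical helper name.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns 1 if KVM advertises the device control API, 0 otherwise. */
static int have_device_ctrl(int kvm_fd)
{
	/* KVM_CHECK_EXTENSION returns > 0 when the capability is present. */
	return ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DEVICE_CTRL) > 0;
}
```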

> > > +/* ioctl for vm fd */
> > > +#define KVM_CREATE_DEVICE   _IOWR(KVMIO,  0xe0, struct kvm_create_device)
> >
> > This define should go with the other VM ioctls, otherwise the next
> > person to add a VM ioctl will probably miss it and reuse the 0xe0
> > code.
>
> That's actually why I moved it to a new section, with device control
> ioctls getting their own range, as the legacy device model and
> some other things did.  0xe0 is not the next ioctl that would be
> used for either vm or vcpu.  The ioctl numbering is actually already
> a mess, with sometimes care being taken to keep vcpu and vm ioctls
> from overlapping, but on other places overlapping does happen.  I'm
> not sure what exactly I should do here.

Well, even if you are using a new range, I still think that
KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
ioctls.  I guess it's ultimately up to the maintainers.

Paul.