Re: qemu help documentation

2013-02-15 Thread Stefan Hajnoczi
On Thu, Feb 14, 2013 at 02:22:51PM +0100, Paolo Pedaletti wrote:
 I have trouble to get full list of the output of
 qemu help
 inside kvm when I switch to second console CTRL-ALT-2
 
 I can't find the full list even inside source code (apt-get source
 qemu-kvm) and neither inside binary file (grep blockarg qemu-*)
 
 Is it possible to redirect the output of  help on an external
 file? Or paging it?
 
 This because (the main problem is) I'm trying to get full kernel
 message at boot, but inside KVM window it's not possible to scroll
 up ( goal: 
 http://pedalinux.blogspot.it/2013/02/physical-to-virtual-step-by-step.html
 ) or to dump outside terminal output.

Try Ctrl+PageUp.

If that doesn't work you can put the monitor on stdio like this:

  $ qemu-system-x86_64 -monitor stdio ...

Then you can interact from your shell and scroll back up as usual.

Stefan
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 11/18] KVM/MIPS32: Routines to handle specific traps/exceptions while executing the guest.

2013-02-15 Thread Sanjay Lal

On Feb 6, 2013, at 8:20 AM, Gleb Natapov wrote:

 On Wed, Nov 21, 2012 at 06:34:09PM -0800, Sanjay Lal wrote:
 +static gpa_t kvm_trap_emul_gva_to_gpa_cb(gva_t gva)
 +{
 +gpa_t gpa;
 +uint32_t kseg = KSEGX(gva);
 +
 +if ((kseg == CKSEG0) || (kseg == CKSEG1))
 You seems to be using KVM_GUEST_KSEGX variants on gva in all other
 places. Why not here?

This function is invoked to handle 2 scenarios:
(1) Parse the boot code config tables setup by QEMU's Malta emulation. The 
pointers in the tables are actual KSEG0 addresses (unmapped, cached) and not 
Guest KSEG0 addresses.

(2) Handle I/O accesses by the guest.  On MIPS platforms, I/O device registers 
are mapped into the KSEG1 address space (unmapped, uncached).  Again like (1) 
these are actual KSEG1 addresses, which cause an exception and are passed onto 
QEMU for I/O emulation.

Regards
Sanjay

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH qom-cpu-next 0/6] QOM CPUState, part 8: CPU_COMMON continued

2013-02-15 Thread Andreas Färber
Am 14.02.2013 18:49, schrieb Andreas Färber:
 Am 01.02.2013 13:38, schrieb Andreas Färber:
 Hello,

 This series moves more fields from CPU_COMMON / CPU*State to CPUState,
 allowing access from target-independent code.

 The final patch in this series will help solve some issues (in particular
 avoid a dependency on CPU_COMMON TLB refactoring for now) but opens a can
 of worms: Since it is initialized in derived instance_init functions,
 functions cannot randomly be changed to operate on CPUState and be called
 from CPUState's instance_init or they will crash due to NULL env_ptr.
 
 The questionable patch in this series has been acked by rth, so if
 nobody objects, I'll queue it on qom-cpu-next tonight, [...]

Applied:
https://github.com/afaerber/qemu-cpu/commits/qom-cpu-next

Andreas


 For those of you that may have been following the CPU refactorings closely,
 I have now split off part of former qom-cpu-8 branch into qom-cpu-9.
 This series thereby applies directly to qom-cpu-next,
 whereas qom-cpu-9 depends on the pending s390x pull, my m68k cleanups and
 may be changed for VMState changes cooking elsewhere to keep i386 v5 compat.

 Available for testing at:
 git://github.com/afaerber/qemu-cpu.git qom-cpu-8.v1
 https://github.com/afaerber/qemu-cpu/commits/qom-cpu-8.v1

 Regards,
 Andreas

 Changes from previews:
 * Drop #ifdefs for user-only CPUState fields.
 * Defer interrupt-related changes to part 9.

 Andreas Färber (6):
   cpu: Move host_tid field to CPUState
   cpu: Move running field to CPUState
   cpu: Move exit_request field to CPUState
   cpu: Move current_tb field to CPUState
   cputlb: Pass CPUState to cpu_unlink_tb()
   cpu: Add CPUArchState pointer to CPUState

  cpu-exec.c  |   21 -
  cputlb.c|6 --
  dump.c  |8 ++--
  exec.c  |6 --
  gdbstub.c   |   14 +-
  hw/apic_common.c|2 +-
  hw/apic_internal.h  |2 +-
  hw/kvmvapic.c   |   13 -
  hw/spapr_hcall.c|5 +++--
  include/exec/cpu-defs.h |5 -
  include/exec/exec-all.h |4 +++-
  include/exec/gdbstub.h  |5 ++---
  include/qom/cpu.h   |   11 +++
  kvm-all.c   |6 +++---
  linux-user/main.c   |   37 ++---
  linux-user/syscall.c|4 +++-
  qom/cpu.c   |2 ++
  target-alpha/cpu.c  |2 ++
  target-arm/cpu.c|2 ++
  target-cris/cpu.c   |2 ++
  target-i386/cpu.c   |1 +
  target-i386/kvm.c   |4 ++--
  target-lm32/cpu.c   |2 ++
  target-m68k/cpu.c   |2 ++
  target-microblaze/cpu.c |2 ++
  target-mips/cpu.c   |2 ++
  target-openrisc/cpu.c   |2 ++
  target-ppc/translate_init.c |2 ++
  target-s390x/cpu.c  |2 ++
  target-sh4/cpu.c|2 ++
  target-sparc/cpu.c  |2 ++
  target-unicore32/cpu.c  |2 ++
  target-xtensa/cpu.c |2 ++
  translate-all.c |   36 +++-
  translate-all.h |2 +-
  35 Dateien geändert, 149 Zeilen hinzugefügt(+), 73 Zeilen entfernt(-)
 


-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


PIO/MMIO access from guest

2013-02-15 Thread Gábor PÉK
Hi all,

I would like to access the configuration registers of my passthrough
device from an unmodified guest under KVM via MMIO/PIO directly. Is
there any option to configure the guest in this way? I know that you can
do this with the ioports=[''] option in case of Xen, but I could not
find any such option until now for KVM. I also read that

[+] PIO can be passed directly via VMCS I/O bitmaps
[+] MMIO can be passed directly via mapping device BARs to guest
[+] Some PIO/MMIO accesses must be trapped (PCI config space)

Can I do this with some configurations without recompiling the kernel?

Thank you!
-gabor
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/9] virtio: new API for addition of buffers, scatterlist changes

2013-02-15 Thread Paolo Bonzini
Il 14/02/2013 10:23, Paolo Bonzini ha scritto:
  How about this as a first step?
  
  virtio_ring: virtqueue_add_sgs, to add multiple sgs.
  
  virtio_scsi and virtio_blk can really use these, to avoid their current
  hack of copying the whole sg array.
  
  Signed-off-by: Ruty Russell ru...@rustcorp.com.au 
 It's much better than the other prototype you had posted, but I still
 dislike this...  You pay for additional counting of scatterlists when
 the caller knows the number of buffers; and the nested loops aren't
 free, either.

Another problem is that you cannot pass truncated scatterlists.  You
must ensure there is an end marker on the last item.  I'm not sure if
the kernel ensures that, given that for_each_sg takes explicitly the
number of scatterlist elements; and it is not as trivial as
sg_mark_end(foo + nsg - 1); if the upper layers hand you a chained
scatterlist.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 07/18] KVM/MIPS32: MMU/TLB operations for the Guest.

2013-02-15 Thread Sanjay Lal

On Feb 6, 2013, at 7:08 AM, Gleb Natapov wrote:

 
 +static void kvm_mips_map_page(struct kvm *kvm, gfn_t gfn)
 +{
 +pfn_t pfn;
 +
 +if (kvm-arch.guest_pmap[gfn] != KVM_INVALID_PAGE)
 +return;
 +
 +pfn =kvm_mips_gfn_to_pfn(kvm, gfn);
 This call should be in srcu read section since it access memory slots which
 are srcu protected. You should test with RCU debug enabled.

kvm_mips_gfn_to_pfn just maps to gfn_to_pfn. I don't see an instance where 
gfn_to_pfn is in a scru read section?

 
 
 +
 +uint32_t kvm_get_inst(uint32_t *opc, struct kvm_vcpu *vcpu)
 +{
 +uint32_t inst;
 +struct mips_coproc *cop0 __unused = vcpu-arch.cop0;
 +int index;
 +ulong paddr, flags;
 +
 +if (KVM_GUEST_KSEGX((ulong) opc)  KVM_GUEST_KSEG0 ||
 +KVM_GUEST_KSEGX((ulong) opc) == KVM_GUEST_KSEG23) {
 +local_irq_save(flags);
 +index = kvm_mips_host_tlb_lookup(vcpu, (ulong) opc);
 +if (index = 0) {
 +inst = *(opc);
 Here and in some more places below you access __user memory. Shouldn't you
 use get_user() to access it? What prevents the kernel crash by access fault 
 here
 if userspace remaps the memory to be non-readable? Hmm, may be it uses
 guest translation here so it cannot happen, but still, sparse will not
 be happy and kvm_mips_translate_guest_kseg0_to_hpa() case below uses
 host translation anyway.
 
Actually, I don't need the __user declaration in most cases, since KVM/MIPS 
handles mapping the page (if needed) and does not rely on the usual kernel 
mechanisms.

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 09/18] KVM/MIPS32: COP0 accesses profiling.

2013-02-15 Thread Sanjay Lal

On Feb 6, 2013, at 8:17 AM, Gleb Natapov wrote:

 On Wed, Nov 21, 2012 at 06:34:07PM -0800, Sanjay Lal wrote:
 
 +int kvm_mips_dump_stats(struct kvm_vcpu *vcpu)
 +{
 +int i, j __unused;
 +#ifdef CONFIG_KVM_MIPS_DEBUG_COP0_COUNTERS
 +printk(\nKVM VCPU[%d] COP0 Access Profile:\n, vcpu-vcpu_id);
 +for (i = 0; i  N_MIPS_COPROC_REGS; i++) {
 +for (j = 0; j  N_MIPS_COPROC_SEL; j++) {
 +if (vcpu-arch.cop0-stat[i][j])
 +printk(%s[%d]: %lu\n, kvm_cop0_str[i], j,
 +   vcpu-arch.cop0-stat[i][j]);
 +}
 +}
 +#endif
 +
 +return 0;
 +}
 You need to use ftrace event for that. Much more flexible with perf
 integration and no need to recompile to enabled/disable.
 
 --
   Gleb.

Agreed, I'll start using trace for keeping track of COP0 accesses.

Regards
Sanjay

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 07/18] KVM/MIPS32: MMU/TLB operations for the Guest.

2013-02-15 Thread Gleb Natapov
On Fri, Feb 15, 2013 at 01:19:29PM -0500, Sanjay Lal wrote:
 
 On Feb 6, 2013, at 7:08 AM, Gleb Natapov wrote:
 
  
  +static void kvm_mips_map_page(struct kvm *kvm, gfn_t gfn)
  +{
  +  pfn_t pfn;
  +
  +  if (kvm-arch.guest_pmap[gfn] != KVM_INVALID_PAGE)
  +  return;
  +
  +  pfn =kvm_mips_gfn_to_pfn(kvm, gfn);
  This call should be in srcu read section since it access memory slots which
  are srcu protected. You should test with RCU debug enabled.
 
 kvm_mips_gfn_to_pfn just maps to gfn_to_pfn. I don't see an instance where 
 gfn_to_pfn is in a scru read section?
 
Where are you looking?  On x86 if vcpu is not in a guest mode it is in
srcu read section. The lock is taken immediately after exit and released
before entry. This is because x86 access memory slots a lot. Other
arches, that do not access memory slots as much, take the lock around
each individual access. As far as I see this is the only place in MIPS
kvm where this is needed.

  
  
  +
  +uint32_t kvm_get_inst(uint32_t *opc, struct kvm_vcpu *vcpu)
  +{
  +  uint32_t inst;
  +  struct mips_coproc *cop0 __unused = vcpu-arch.cop0;
  +  int index;
  +  ulong paddr, flags;
  +
  +  if (KVM_GUEST_KSEGX((ulong) opc)  KVM_GUEST_KSEG0 ||
  +  KVM_GUEST_KSEGX((ulong) opc) == KVM_GUEST_KSEG23) {
  +  local_irq_save(flags);
  +  index = kvm_mips_host_tlb_lookup(vcpu, (ulong) opc);
  +  if (index = 0) {
  +  inst = *(opc);
  Here and in some more places below you access __user memory. Shouldn't you
  use get_user() to access it? What prevents the kernel crash by access fault 
  here
  if userspace remaps the memory to be non-readable? Hmm, may be it uses
  guest translation here so it cannot happen, but still, sparse will not
  be happy and kvm_mips_translate_guest_kseg0_to_hpa() case below uses
  host translation anyway.
  
 Actually, I don't need the __user declaration in most cases, since KVM/MIPS 
 handles mapping the page (if needed) and does not rely on the usual kernel 
 mechanisms.
Yes I see that you check/populate tlb for non KVM_GUEST_KSEG0 access,
for kvm_mips_translate_guest_kseg0_to_hpa() you do not, but now I notice
that you are not using the address directly but uses CKSEG0ADDR() on it,
which, as far as I can tell, gives you kernel mapping for the page,
correct? Why this is better than using get_user()? To save tlb entries?

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vfio: Protect vfio_dev_present against device_del

2013-02-15 Thread Alex Williamson
vfio_dev_present is meant to give us a wait_event callback so that we
can block removing a device from vfio until it becomes unused.  The
root of this check depends on being able to get the iommu group from
the device.  Unfortunately if the BUS_NOTIFY_DEL_DEVICE notifier has
fired then the device-group reference is no longer searchable and we
fail the lookup.

We don't need to go to such extents for this though.  We have a
reference to the device, from which we can acquire a reference to the
group.  We can then use the group reference to search for the device
and properly block removal.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 drivers/vfio/vfio.c |   33 -
 1 file changed, 12 insertions(+), 21 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 12c264d..8e6dcec 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -642,33 +642,16 @@ int vfio_add_group_dev(struct device *dev,
 }
 EXPORT_SYMBOL_GPL(vfio_add_group_dev);
 
-/* Test whether a struct device is present in our tracking */
-static bool vfio_dev_present(struct device *dev)
+/* Given a referenced group, check if it contains the device */
+static bool vfio_dev_present(struct vfio_group *group, struct device *dev)
 {
-   struct iommu_group *iommu_group;
-   struct vfio_group *group;
struct vfio_device *device;
 
-   iommu_group = iommu_group_get(dev);
-   if (!iommu_group)
-   return false;
-
-   group = vfio_group_get_from_iommu(iommu_group);
-   if (!group) {
-   iommu_group_put(iommu_group);
-   return false;
-   }
-
device = vfio_group_get_device(group, dev);
-   if (!device) {
-   vfio_group_put(group);
-   iommu_group_put(iommu_group);
+   if (!device)
return false;
-   }
 
vfio_device_put(device);
-   vfio_group_put(group);
-   iommu_group_put(iommu_group);
return true;
 }
 
@@ -682,10 +665,18 @@ void *vfio_del_group_dev(struct device *dev)
struct iommu_group *iommu_group = group-iommu_group;
void *device_data = device-device_data;
 
+   /*
+* The group exists so long as we have a device reference.  Get
+* a group reference and use it to scan for the device going away.
+*/
+   vfio_group_get(group);
+
vfio_device_put(device);
 
/* TODO send a signal to encourage this to be released */
-   wait_event(vfio.release_q, !vfio_dev_present(dev));
+   wait_event(vfio.release_q, !vfio_dev_present(group, dev));
+
+   vfio_group_put(group);
 
iommu_group_put(iommu_group);
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vfio: whitelist pcieport

2013-02-15 Thread Alex Williamson
pcieport does nice things like manage AER and we know it doesn't do
DMA or expose any user accessible devices on the host.  It also keeps
the Memory, I/O, and Busmaster bits enabled, which is pretty handy
when trying to use anyting below it.  Devices owned by pcieport cannot
be given to users via vfio, but we can tolerate them not being owned
by vfio-pci.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 drivers/vfio/vfio.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 8e6dcec..28e2d5b 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -442,7 +442,7 @@ static struct vfio_device *vfio_group_get_device(struct 
vfio_group *group,
  * a device.  It's not always practical to leave a device within a group
  * driverless as it could get re-bound to something unsafe.
  */
-static const char * const vfio_driver_whitelist[] = { pci-stub };
+static const char * const vfio_driver_whitelist[] = { pci-stub, pcieport };
 
 static bool vfio_whitelisted_driver(struct device_driver *drv)
 {

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vfio-pci: Manage user power state transitions

2013-02-15 Thread Alex Williamson
We give the user access to change the power state of the device but
certain transitions result in an uninitialized state which the user
cannot resolve.  To fix this we need to mark the PowerState field of
the PMCSR register read-only and effect the requested change on behalf
of the user.  This has the added benefit that pdev-current_state
remains accurate while controlled by the user.

The primary example of this bug is a QEMU guest doing a reboot where
the device it put into D3 on shutdown and becomes unusable on the next
boot because the device did a soft reset on D3-D0 (NoSoftRst-).

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 drivers/vfio/pci/vfio_pci_config.c |   25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index f1dde2c..81567f8 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -587,12 +587,30 @@ static int __init init_pci_cap_basic_perm(struct 
perm_bits *perm)
return 0;
 }
 
+static int vfio_pm_config_write(struct vfio_pci_device *vdev, int pos,
+   int count, struct perm_bits *perm,
+   int offset, __le32 val)
+{
+   count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+   if (count  0)
+   return count;
+
+   if (offset == PCI_PM_CTRL) {
+   uint8_t state = le32_to_cpu(val)  PCI_PM_CTRL_STATE_MASK;
+   pci_set_power_state(vdev-pdev, state);
+   }
+
+   return count;
+}
+
 /* Permissions for the Power Management capability */
 static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 {
if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_PM]))
return -ENOMEM;
 
+   perm-writefn = vfio_pm_config_write;
+
/*
 * We always virtualize the next field so we can remove
 * capabilities from the chain if we want to.
@@ -600,10 +618,11 @@ static int __init init_pci_cap_pm_perm(struct perm_bits 
*perm)
p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
 
/*
-* Power management is defined *per function*,
-* so we let the user write this
+* Power management is defined *per function*, so we can let
+* the user change power state, but we trap and initiate the
+* change ourselves, so the state bits are read-only.
 */
-   p_setd(perm, PCI_PM_CTRL, NO_VIRT, ALL_WRITE);
+   p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK);
return 0;
 }
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3.9-rc1 0/3] KVM/ARM core fixes for 3.9-rc1

2013-02-15 Thread Marc Zyngier
Here's a number of patches to cope with the churn that occured in
kvm-next, and breaks the compilation in next-20130215.

No big deal, just regular breakage...

Tested on TC2.

Marc Zyngier (3):
  ARM: KVM: fix kvm_arch_{prepare,commit}_memory_region
  ARM: KVM: Rename KVM_MEMORY_SLOTS - KVM_USER_MEM_SLOTS
  ARM: KVM: fix compilation after removal of user_alloc from struct
kvm_memory_slot

 arch/arm/include/asm/kvm_host.h | 2 +-
 arch/arm/kvm/arm.c  | 4 ++--
 arch/arm/kvm/mmu.c  | 5 -
 3 files changed, 3 insertions(+), 8 deletions(-)

-- 
1.8.1.2


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3.9-rc1 3/3] ARM: KVM: fix compilation after removal of user_alloc from struct kvm_memory_slot

2013-02-15 Thread Marc Zyngier
Commit 7a905b1 (KVM: Remove user_alloc from struct kvm_memory_slot)
broke KVM/ARM by removing the user_alloc field from a public structure.

As we only used this field to alert the user that we didn't support
this operation mode, there is no harm in discarding this bit of code
without any remorse.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/kvm/mmu.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
index f30e131..99e07c7 100644
--- a/arch/arm/kvm/mmu.c
+++ b/arch/arm/kvm/mmu.c
@@ -633,11 +633,6 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct 
kvm_run *run)
}
 
memslot = gfn_to_memslot(vcpu-kvm, gfn);
-   if (!memslot-user_alloc) {
-   kvm_err(non user-alloc memslots not supported\n);
-   ret = -EINVAL;
-   goto out_unlock;
-   }
 
ret = user_mem_abort(vcpu, fault_ipa, gfn, memslot, fault_status);
if (ret == 0)
-- 
1.8.1.2


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3.9-rc1 2/3] ARM: KVM: Rename KVM_MEMORY_SLOTS - KVM_USER_MEM_SLOTS

2013-02-15 Thread Marc Zyngier
Commit bbacc0c (KVM: Rename KVM_MEMORY_SLOTS - KVM_USER_MEM_SLOTS)
broke KVM/ARM by changing a global #define.

Apply the same change to fix the compilation breakage.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/include/asm/kvm_host.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index dfe9886..d1736a5 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -26,7 +26,7 @@
 #include asm/kvm_arch_timer.h
 
 #define KVM_MAX_VCPUS CONFIG_KVM_ARM_MAX_VCPUS
-#define KVM_MEMORY_SLOTS 32
+#define KVM_USER_MEM_SLOTS 32
 #define KVM_PRIVATE_MEM_SLOTS 4
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
 #define KVM_HAVE_ONE_REG
-- 
1.8.1.2


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3.9-rc1 1/3] ARM: KVM: fix kvm_arch_{prepare,commit}_memory_region

2013-02-15 Thread Marc Zyngier
Commit f82a8cfe9 (KVM: struct kvm_memory_slot.user_alloc - bool)
broke the ARM KVM port by changing the prototype of two global
functions.

Apply the same change to fix the compilation breakage.

Signed-off-by: Marc Zyngier marc.zyng...@arm.com
---
 arch/arm/kvm/arm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 9ada554..5a93698 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -232,7 +232,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
   struct kvm_memory_slot *memslot,
   struct kvm_memory_slot old,
   struct kvm_userspace_memory_region *mem,
-  int user_alloc)
+  bool user_alloc)
 {
return 0;
 }
@@ -240,7 +240,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 void kvm_arch_commit_memory_region(struct kvm *kvm,
   struct kvm_userspace_memory_region *mem,
   struct kvm_memory_slot old,
-  int user_alloc)
+  bool user_alloc)
 {
 }
 
-- 
1.8.1.2


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: randconfig errors

2013-02-15 Thread Christoffer Dall
On Fri, Feb 15, 2013 at 2:22 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 15/02/13 19:16, Christoffer Dall wrote:
 On Fri, Feb 15, 2013 at 1:47 PM, Marc Zyngier marc.zyng...@arm.com wrote:
 On Fri, 15 Feb 2013 19:22:37 +0100, Marc Zyngier marc.zyng...@arm.com
 wrote:
 On Fri, 15 Feb 2013 11:25:56 -0600, Rob Herring robherri...@gmail.com
 wrote:
 Errors from randconfig builds restricted to multi-platform configs on
 the current linux-next. I know I've seen a fix on the first one from
 Arnd, but have not investigated any of these.




 /var/lib/jenkins/jobs/linux-randconfig/workspace/include/linux/kvm_host.h:333:34:
 error: 'KVM_USER_MEM_SLOTS' undeclared here (not in a function)

 This one comes from bbacc0c (KVM: Rename KVM_MEMORY_SLOTS -
 KVM_USER_MEM_SLOTS).
 I'll queue a patch to fix the breakage...

 Looks like there's a few more, and what was brewing in kvm-next had some
 side effects.

 Christoffer, I'll send the patches separately.

 I was kind of hoping that the KVM guys would have pulled the KVM/ARM
 stuff into kvm/next and it could have been fixed up there The
 patches are really trivial though iirc.

 Indeed they are. Probably an oversight from the KVM guys. Sent them
 anyway...

I saw them, it all looks good.

I'm cc'ing the kvm list here just for info.

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Helping with KVM development

2013-02-15 Thread Brandon Dwiel

Hi,

I'm a computer engineering PhD student and I have the opportunity to
contribute something to KVM as a class project. I found a few
interesting items on the KVM TODO list (http://www.linux-kvm.org/page/TODO)
that I could do. If I choose to do
this I would start immediately and have about 2 months to finish. I
wouldn't choose this project if I thought I would have to rely on the
help of you guys, but I think my biggest challenge will be diving in and
getting started. Hopefully you can help with that.

My focus is on computer architecture so the x86 topics really interested
me. Here are the specific ones that I think I could do. I would start by
implementing one, and if that ends up not being substantial enough for a
final project, I'd do another.

1. Improve mmu page eviction algorithm (currently FIFO, change to
approximate LRU).
2. On-demand register access, really, copying all registers all the
time is gross.
3. Implement mmx and sse memory move instructions; useful for guests
that use multimedia extensions for accessing vga (partially done)
4. Implement an operation queue for the emulator. The emulator often
calls userspace to perform a read or a write, but due to inversion
of control it actually restarts instead of continuing. The queue
would allow it to replay all previous operations until it reaches
the point it last stopped.
5. convert more instructions to direct dispatch (function pointer in
decode table)
6. move init_emulate_ctxt() into x86_decode_insn() and other emulator
entry points

Have any of these already been implemented? It seems number 2 and
possibly 1 already have been. I think this list is quite outdated, so is
there something else along these lines of these updates that I could
help with?

Thanks,
Brandon
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 02/12] KVM: PPC: E500: Explicitly mark shadow maps invalid

2013-02-15 Thread Scott Wood

On 02/14/2013 06:16:18 PM, Alexander Graf wrote:

When we invalidate shadow TLB maps on the host, we don't mark them
as not valid. But we should.

Fix this by removing the E500_TLB_VALID from their flags when
invalidating.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/e500_tlb.c |   13 ++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index d38ad63..8efb2ac 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -204,9 +204,13 @@ static void inval_gtlbe_on_host(struct  
kvmppc_vcpu_e500 *vcpu_e500,

 {
struct kvm_book3e_206_tlb_entry *gtlbe =
get_entry(vcpu_e500, tlbsel, esel);
+   struct tlbe_ref *ref = vcpu_e500-gtlb_priv[tlbsel][esel].ref;

-   if (tlbsel == 1 
-   vcpu_e500-gtlb_priv[1][esel].ref.flags  E500_TLB_BITMAP) {
+   /* Don't bother with unmapped entries */
+   if (!(ref-flags  E500_TLB_VALID))
+   return;


This is broken as pointed out here:
http://patchwork.ozlabs.org/patch/220356/

...and note that I'm still seeing problems even after that fix, which  
I'll try to debug today.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:

From: Benjamin Herrenschmidt b...@kernel.crashing.org

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
should use in-kernel interrupt controller emulation, but also takes an
argument struct that contains the type of interrupt controller
architecture and an optional parameter.  Currently only one type value
is defined, that which indicates the XICS architecture.


Would the device config API I posted a couple days ago work for you?


The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs.  Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Two other new ioctls allow userspace to control the interrupt
sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
destination cpu, level/edge sensitivity and pending state of a range
of interrupt sources, creating them if they don't already exist.  The
KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
interrupt sources (they are required to already exist).


Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING
for interrupt injection, what if there's a race with the user changing
other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
being presented as a generic API.


+4.80 KVM_CREATE_IRQCHIP_ARGS
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc


Why just ppc?


+struct kvm_irq_sources {
+   __u32 irq;
+   __u32 nr_irqs;
+   __u64 __user *irqbuf;
+};


Please no pointers in UAPI -- this would require a compat wrapper with
32-bit user and 64-bit kernel.


+/* irqbuf entries are laid out like this: */
+#define KVM_IRQ_SERVER_SHIFT   0
+#define KVM_IRQ_SERVER_MASK0xULL
+#define KVM_IRQ_PRIORITY_SHIFT 32
+#define KVM_IRQ_PRIORITY_MASK  0xff
+#define KVM_IRQ_LEVEL_SENSITIVE(1ULL  40)
+#define KVM_IRQ_MASKED (1ULL  41)
+#define KVM_IRQ_PENDING(1ULL  42)


What does server mean?  Do you mean laid out like this for XICS?
Let's please have a clean separation between what is generic and what is
implementation-specific.

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] vfio-pci: Manage user power state transitions

2013-02-15 Thread Alex Williamson
We give the user access to change the power state of the device but
certain transitions result in an uninitialized state which the user
cannot resolve.  To fix this we need to mark the PowerState field of
the PMCSR register read-only and effect the requested change on behalf
of the user.  This has the added benefit that pdev-current_state
remains accurate while controlled by the user.

The primary example of this bug is a QEMU guest doing a reboot where
the device it put into D3 on shutdown and becomes unusable on the next
boot because the device did a soft reset on D3-D0 (NoSoftRst-).

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

v2: sparse found a type error when calling pci_set_power_state.  Fixed here

 drivers/vfio/pci/vfio_pci_config.c |   25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index f1dde2c..1fd1895 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -587,12 +587,30 @@ static int __init init_pci_cap_basic_perm(struct 
perm_bits *perm)
return 0;
 }
 
+static int vfio_pm_config_write(struct vfio_pci_device *vdev, int pos,
+   int count, struct perm_bits *perm,
+   int offset, __le32 val)
+{
+   count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+   if (count  0)
+   return count;
+
+   if (offset == PCI_PM_CTRL) {
+   pci_power_t state = le32_to_cpu(val)  PCI_PM_CTRL_STATE_MASK;
+   pci_set_power_state(vdev-pdev, state);
+   }
+
+   return count;
+}
+
 /* Permissions for the Power Management capability */
 static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
 {
if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_PM]))
return -ENOMEM;
 
+   perm-writefn = vfio_pm_config_write;
+
/*
 * We always virtualize the next field so we can remove
 * capabilities from the chain if we want to.
@@ -600,10 +618,11 @@ static int __init init_pci_cap_pm_perm(struct perm_bits 
*perm)
p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
 
/*
-* Power management is defined *per function*,
-* so we let the user write this
+* Power management is defined *per function*, so we can let
+* the user change power state, but we trap and initiate the
+* change ourselves, so the state bits are read-only.
 */
-   p_setd(perm, PCI_PM_CTRL, NO_VIRT, ALL_WRITE);
+   p_setd(perm, PCI_PM_CTRL, NO_VIRT, ~PCI_PM_CTRL_STATE_MASK);
return 0;
 }
 

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
 On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 
 This adds in-kernel emulation of the XICS (eXternal Interrupt
 Controller Specification) interrupt controller specified by PAPR, for
 both HV and PR KVM guests.
 
 This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
 KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
 should use in-kernel interrupt controller emulation, but also takes an
 argument struct that contains the type of interrupt controller
 architecture and an optional parameter.  Currently only one type value
 is defined, that which indicates the XICS architecture.
 
 Would the device config API I posted a couple days ago work for you?

I suppose it could be made to work.  It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required.  The XICS is operated from
the guest side via hcalls, not via emulated MMIO.

Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting).  Your device API seems to be quite different to that.

 The XICS emulation supports up to 1048560 interrupt sources.
 Interrupt source numbers below 16 are reserved; 0 is used to mean no
 interrupt and 2 is used for IPIs.  Internally these are represented in
 blocks of 1024, called ICS (interrupt controller source) entities, but
 that is not visible to userspace.
 
 Two other new ioctls allow userspace to control the interrupt
 sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
 destination cpu, level/edge sensitivity and pending state of a range
 of interrupt sources, creating them if they don't already exist.  The
 KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
 interrupt sources (they are required to already exist).
 
 Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING

These are primarily there to support live migration.  For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.  We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration, the only time userspace needs to
use these is at initialization, when it does a SET_SOURCES to create
the interrupt sources it wants the VM to have.

 for interrupt injection, what if there's a race with the user changing
 other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
 being presented as a generic API.

They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.

 +4.80 KVM_CREATE_IRQCHIP_ARGS
 +
 +Capability: KVM_CAP_IRQCHIP_ARGS
 +Architectures: ppc
 
 Why just ppc?

Just because only PPC has code to handle it at this point.  I would
hope that ARM and others could pick it up.  Maybe I should make it:

+Architectures: all (but so far only implemented on ppc)

or something.

 +struct kvm_irq_sources {
 +__u32 irq;
 +__u32 nr_irqs;
 +__u64 __user *irqbuf;
 +};
 
 Please no pointers in UAPI -- this would require a compat wrapper with
 32-bit user and 64-bit kernel.

Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,
which is generally less efficient, but since it's mainly for migration
that probably doesn't matter.  In fact, I could probably use the
existing KVM_GET_IRQCHIP and KVM_SET_IRQCHIP ioctls, if I use the
`chip_id' field for `irq' and the `pad' field for `nr_irqs'.

 +/* irqbuf entries are laid out like this: */
 +#define KVM_IRQ_SERVER_SHIFT0
 +#define KVM_IRQ_SERVER_MASK 0xULL
 +#define KVM_IRQ_PRIORITY_SHIFT  32
 +#define KVM_IRQ_PRIORITY_MASK   0xff
 +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL  40)
 +#define KVM_IRQ_MASKED  (1ULL  41)
 +#define KVM_IRQ_PENDING (1ULL  42)
 
 What does server mean?  Do you mean laid out like this for XICS?

Sorry, I should have made that destination rather than server.
You're right, server is confusing, but it just means where the
interrupt is sent to be handled.  It has nothing particularly to do
with server computers.

 Let's please have a clean separation between what is generic and what is
 implementation-specific.

I believe that the interface is pretty cleanly generic - the model is
a set of interrupt sources and some per-vcpu state, with priorities to
decide which interrupts get delivered when.  That describes the basics
of just about any SMP-capable interrupt controller, including MPIC.

MPIC would still need an extra interface for userspace to save and
restore 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:

On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
 On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 
 This adds in-kernel emulation of the XICS (eXternal Interrupt
 Controller Specification) interrupt controller specified by PAPR,  
for

 both HV and PR KVM guests.
 
 This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
 KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
 should use in-kernel interrupt controller emulation, but also  
takes an

 argument struct that contains the type of interrupt controller
 architecture and an optional parameter.  Currently only one type  
value

 is defined, that which indicates the XICS architecture.

 Would the device config API I posted a couple days ago work for you?

I suppose it could be made to work.  It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required.  The XICS is operated from
the guest side via hcalls, not via emulated MMIO.


I don't think it makes such an assumption.  The MPIC device has  
physical registers, so it exposes them, but it also exposes things that  
are not physical registers (e.g. the per-IRQ input state).  The generic  
device control layer leaves interpretation of attributes up to the  
device.


I think it would be easier to fit XICS into the device control api  
model than to fit MPIC into this model, not to mention what would  
happen if we later want to emulate some other type of device -- x86  
already has at least one non-irqchip emulated device (i8254).



Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting).  Your device API seems to be quite different to that.


I wasn't there either.  It's fine to have discussions at such events,  
but it should not preclude others from having input, or better ideas  
from being considered afterward.



 The XICS emulation supports up to 1048560 interrupt sources.
 Interrupt source numbers below 16 are reserved; 0 is used to mean  
no
 interrupt and 2 is used for IPIs.  Internally these are  
represented in
 blocks of 1024, called ICS (interrupt controller source) entities,  
but

 that is not visible to userspace.
 
 Two other new ioctls allow userspace to control the interrupt
 sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
 destination cpu, level/edge sensitivity and pending state of a  
range
 of interrupt sources, creating them if they don't already exist.   
The
 KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range  
of

 interrupt sources (they are required to already exist).

 Why is it userspace's job to control these?  If you use  
KVM_IRQ_PENDING


These are primarily there to support live migration.  For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.


In that case, the state it describes is insufficient for MPIC.


We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration,


We don't yet, but would prefer not to assume that it'll never happen.

 for interrupt injection, what if there's a race with the user  
changing
 other flags via MMIO?  Maybe this isn't an issue with XICS, but  
this is

 being presented as a generic API.

They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.


OK, KVM_IRQ_LINE is still used for interrupt injection.  I was hoping  
to avoid going through a standardized interface that forces a global  
interrupt numberspace.


How do MSIs get injected?

BTW, do you have any plans regarding irqfd?


 +struct kvm_irq_sources {
 +  __u32 irq;
 +  __u32 nr_irqs;
 +  __u64 __user *irqbuf;
 +};

 Please no pointers in UAPI -- this would require a compat wrapper  
with

 32-bit user and 64-bit kernel.

Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,


It doesn't need to be a fixed size buffer.  You can still have  
pointers, but they need to be represented as a plain __u64 with users  
casting the pointer.



 +/* irqbuf entries are laid out like this: */
 +#define KVM_IRQ_SERVER_SHIFT  0
 +#define KVM_IRQ_SERVER_MASK   0xULL
 +#define KVM_IRQ_PRIORITY_SHIFT32
 +#define KVM_IRQ_PRIORITY_MASK 0xff
 +#define KVM_IRQ_LEVEL_SENSITIVE   (1ULL  40)
 +#define KVM_IRQ_MASKED(1ULL  41)
 +#define KVM_IRQ_PENDING   (1ULL  42)

 What does server mean?  Do you mean laid out like this for XICS?

Sorry, I should have made that 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote:
 On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
 On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
  On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
  From: Benjamin Herrenschmidt b...@kernel.crashing.org
  
  This adds in-kernel emulation of the XICS (eXternal Interrupt
  Controller Specification) interrupt controller specified by
 PAPR, for
  both HV and PR KVM guests.
  
  This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
  KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
  should use in-kernel interrupt controller emulation, but also
 takes an
  argument struct that contains the type of interrupt controller
  architecture and an optional parameter.  Currently only one
 type value
  is defined, that which indicates the XICS architecture.
 
  Would the device config API I posted a couple days ago work for you?
 
 I suppose it could be made to work.  It doesn't feel like a natural
 fit though, because your API seems to assume (AFAICT) that a device is
 manipulated via registers at specific physical addresses, so I would
 have to invent an artificial set of registers with addresses and bit
 layouts, that aren't otherwise required.  The XICS is operated from
 the guest side via hcalls, not via emulated MMIO.
 
 I don't think it makes such an assumption.  The MPIC device has
 physical registers, so it exposes them, but it also exposes things
 that are not physical registers (e.g. the per-IRQ input state).  The
 generic device control layer leaves interpretation of attributes up
 to the device.
 
 I think it would be easier to fit XICS into the device control api
 model than to fit MPIC into this model, not to mention what would
 happen if we later want to emulate some other type of device -- x86
 already has at least one non-irqchip emulated device (i8254).

I have no particular objection to the device control API per se, but
I have two objections to using it as the primary interface to the XICS
emulation.

First, I dislike the magical side-effect where creating a device of a
particular type (e.g. MPIC or XICS) automatically attaches it to the
interrupt lines of the vcpus.  I prefer an explicit request to do
in-kernel interrupt control.  Further, the magic means that you can
only have one instance of the device, whereas you might want to model
the interrupt controller architecture as several devices.  You could
do that using several device types, but then the interconnections
between them would also be magic.

Secondly, it means that we are completely abandoning any attempt to
define an abstract or generic interface to in-kernel interrupt
controller emulations.  Each device will have its own unique set of
attribute groups and its own unique userspace code to drive it, with
no commonality between them.

 Part of the reason I went with the API that I did is that that was
 what was agreed on at KVM Forum (as far as I can tell, not having been
 at the meeting).  Your device API seems to be quite different to that.
 
 I wasn't there either.  It's fine to have discussions at such
 events, but it should not preclude others from having input, or
 better ideas from being considered afterward.

Sure - I was just trying to fit in with the expressed wish of the
maintainer.

  The XICS emulation supports up to 1048560 interrupt sources.
  Interrupt source numbers below 16 are reserved; 0 is used to
 mean no
  interrupt and 2 is used for IPIs.  Internally these are
 represented in
  blocks of 1024, called ICS (interrupt controller source)
 entities, but
  that is not visible to userspace.
  
  Two other new ioctls allow userspace to control the interrupt
  sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
  destination cpu, level/edge sensitivity and pending state of a
 range
  of interrupt sources, creating them if they don't already
 exist.  The
  KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a
 range of
  interrupt sources (they are required to already exist).
 
  Why is it userspace's job to control these?  If you use
 KVM_IRQ_PENDING
 
 These are primarily there to support live migration.  For live
 migration, userspace needs to be able to extract the entire state of
 the virtual machine from the old guest, and then set the new guest to
 that exact same state.
 
 In that case, the state it describes is insufficient for MPIC.

Yes, MPIC has other random stuff in it besides interrupt control, so
that's not surprising.

 We have live migration working in qemu for
 pSeries guests with in-kernel XICS emulation using this interface.
 If you're not doing live migration,
 
 We don't yet, but would prefer not to assume that it'll never happen.
 
  for interrupt injection, what if there's a race with the user
 changing
  other flags via MMIO?  Maybe this isn't an issue with XICS, but
 this is
  being presented as a generic API.
 
 They're not used while the guest is running, as I said, but even if
 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:

I have no particular objection to the device control API per se, but
I have two objections to using it as the primary interface to the XICS
emulation.

First, I dislike the magical side-effect where creating a device of a
particular type (e.g. MPIC or XICS) automatically attaches it to the
interrupt lines of the vcpus.  I prefer an explicit request to do
in-kernel interrupt control.


OK.  This is device-specific behavior, so you could define it  
differently for XICS than MPIC.  I suppose we could change it for MPIC  
as well, to leave an opening for the unlikely case where we'd want to  
model an MPIC that isn't directly connected to the CPUs.


How is the explicit request made in this patchset?


Secondly, it means that we are completely abandoning any attempt to
define an abstract or generic interface to in-kernel interrupt
controller emulations.  Each device will have its own unique set of
attribute groups and its own unique userspace code to drive it, with
no commonality between them.


Yes.  I am unconvinced that such an abstraction is well-advised  
(especially after seeing existing generic interfaces that are clearly  
APIC-oriented).  This isn't like normal driver interfaces where we're  
abstracting away hardware differences to let generic code use a  
device.  Userspace knows what kind of device it wants, and how it wants  
it to integrate with the rest of the emulated system.  We'd have to go  
out of our way to apply the abstraction on *both* ends.  What do we get  
from that other than a chance that the abstraction leaks?  What  
significant code actually becomes common?  kvm_set_irq() is just a thin  
wrapper around the ioctl.



 We have live migration working in qemu for
 pSeries guests with in-kernel XICS emulation using this interface.
 If you're not doing live migration,

 We don't yet, but would prefer not to assume that it'll never  
happen.


  for interrupt injection, what if there's a race with the user
 changing
  other flags via MMIO?  Maybe this isn't an issue with XICS, but
 this is
  being presented as a generic API.
 
 They're not used while the guest is running, as I said, but even if
 they were, there is appropriate locking in there to handle any  
races.


 OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
 hoping to avoid going through a standardized interface that forces a
 global interrupt numberspace.

Why?


The standardized interface doesn't make things any easier (as noted  
above, the caller is already mpic-specific code), and we'd have to come  
up with a scheme for flattening our interrupt numberspace (rather than  
introduce new attribute groups for things like IPI and timer  
interrupts).  It may still be necessary when it comes to irqfd,  
though...



 How do MSIs get injected?

Just like other interrupts - from the point of view of the interrupt
controller they're edge-triggered interrupt sources.


Ah right, I guess this is all set up via hcalls for XICS.

With MPIC exposing its registers via the device control api, everything  
just works -- the PCI device generates a write to the MPIC's memory  
region, the QEMU MPIC stub sends the write to the kernel as for any  
other MMIO access (this passthrough is also useful for debugging), the  
in-kernel MPIC sees the write to the generate an MSI register and  
does its thing.  Compare that to all special the MSI code for APIC...  
:-)



 BTW, do you have any plans regarding irqfd?

I'm going to look at that next.


Likewise...  We should probably coordinate our efforts so that at least  
the de-APICization of the code only has to get done once.



 What about interrupt controllers that allow multiple destinations?

The destination can be an identifier for a group of vcpus, or even a
bitmap -- that's why I made it 32 bits.


So you can have single delivery, or be limited to 32 vcpus, or have to  
implement some destination ID allocation scheme (which is more state  
that needs to be accessible somehow).



 More than 256 priorities?  Different levels of output (normal,
 critical, machine check)?  Programmable vector numbers?  Active
 high/low control?

There are plenty of bits free in the 64 bits per source that I have
allowed.  We can accommodate those things.


MPIC vector numbers take up 16 of the bits.  The architected interrupt  
level field is 8 bits, though only a handful of values are actually  
needed.  Add a couple binary flags, and it gets pretty tight if a third  
type of interrupt controller starts wanting something new.



 (BTW, I think having more
than 256 priorities would be insane - do you know of any actual
example that does?)


No, but hardware designers have been known to do insane things.


 The per-vcpu state isn't even part of this AFAICT.  It's an
 XICS-specific ONE_REG -- which is fine, but all that's left of the
 generic API is the get/set sources which is an imperfect match to
 our per-IRQ state and it's not clear how an 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
 On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
 I have no particular objection to the device control API per se, but
 I have two objections to using it as the primary interface to the XICS
 emulation.
 
 First, I dislike the magical side-effect where creating a device of a
 particular type (e.g. MPIC or XICS) automatically attaches it to the
 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control.
 
 OK.  This is device-specific behavior, so you could define it
 differently for XICS than MPIC.  I suppose we could change it for
 MPIC as well, to leave an opening for the unlikely case where we'd
 want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.

 How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic create device ioctl that could be for any device.

 Secondly, it means that we are completely abandoning any attempt to
 define an abstract or generic interface to in-kernel interrupt
 controller emulations.  Each device will have its own unique set of
 attribute groups and its own unique userspace code to drive it, with
 no commonality between them.
 
 Yes.  I am unconvinced that such an abstraction is well-advised
 (especially after seeing existing generic interfaces that are
 clearly APIC-oriented).  This isn't like normal driver interfaces
 where we're abstracting away hardware differences to let generic
 code use a device.  Userspace knows what kind of device it wants,
 and how it wants it to integrate with the rest of the emulated
 system.  We'd have to go out of our way to apply the abstraction on
 *both* ends.  What do we get from that other than a chance that the
 abstraction leaks?  What significant code actually becomes common?
 kvm_set_irq() is just a thin wrapper around the ioctl.

I'd think there could be some code reduction in the live migration
code, but I'd need a qemu hacker to chime in here.

  OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
  hoping to avoid going through a standardized interface that forces a
  global interrupt numberspace.
 
 Why?
 
 The standardized interface doesn't make things any easier (as noted
 above, the caller is already mpic-specific code), and we'd have to
 come up with a scheme for flattening our interrupt numberspace
 (rather than introduce new attribute groups for things like IPI and
 timer interrupts).  It may still be necessary when it comes to
 irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.

  How do MSIs get injected?
 
 Just like other interrupts - from the point of view of the interrupt
 controller they're edge-triggered interrupt sources.
 
 Ah right, I guess this is all set up via hcalls for XICS.
 
 With MPIC exposing its registers via the device control api,
 everything just works -- the PCI device generates a write to the
 MPIC's memory region, the QEMU MPIC stub sends the write to the
 kernel as for any other MMIO access (this passthrough is also useful
 for debugging), the in-kernel MPIC sees the write to the generate
 an MSI register and does its thing.  Compare that to all special
 the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?  If that's so, then why bother with
in-kernel emulation at all?  Once you're back in userspace, it's just
as fast to do the emulation there as in the kernel.

 There are plenty of bits free in the 64 bits per source that I have
 allowed.  We can accommodate those things.
 
 MPIC vector numbers take up 16 of the bits.  The architected
 interrupt level field is 8 bits, though only a handful of values are
 actually needed.  Add a couple binary flags, and it gets pretty
 tight if a third type of interrupt controller starts wanting
 something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.

 Yes, the names of the bitfields in the ICP state word are
 XICS-specific, but the concepts are pretty generic - current processor
 priority, identifier for interrupt awaiting service, pending IPI
 request priority, pending interrupt request priority.
 
 We don't have separate concepts of pending IPI request priority
 and pending interrupt request priority.  There can be multiple

Sorry, I meant pending interrupt request.  You do have that, it's
what you read from the interrupt acknowledge register.

 interrupts awaiting service (or even in service, if different
 priorities).  We have both current task priority (which is a
 user-set 

Re: [PATCH 02/12] KVM: PPC: E500: Explicitly mark shadow maps invalid

2013-02-15 Thread Scott Wood

On 02/14/2013 06:16:18 PM, Alexander Graf wrote:

When we invalidate shadow TLB maps on the host, we don't mark them
as not valid. But we should.

Fix this by removing the E500_TLB_VALID from their flags when
invalidating.

Signed-off-by: Alexander Graf ag...@suse.de
---
 arch/powerpc/kvm/e500_tlb.c |   13 ++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/e500_tlb.c b/arch/powerpc/kvm/e500_tlb.c
index d38ad63..8efb2ac 100644
--- a/arch/powerpc/kvm/e500_tlb.c
+++ b/arch/powerpc/kvm/e500_tlb.c
@@ -204,9 +204,13 @@ static void inval_gtlbe_on_host(struct  
kvmppc_vcpu_e500 *vcpu_e500,

 {
struct kvm_book3e_206_tlb_entry *gtlbe =
get_entry(vcpu_e500, tlbsel, esel);
+   struct tlbe_ref *ref = vcpu_e500-gtlb_priv[tlbsel][esel].ref;

-   if (tlbsel == 1 
-   vcpu_e500-gtlb_priv[1][esel].ref.flags  E500_TLB_BITMAP) {
+   /* Don't bother with unmapped entries */
+   if (!(ref-flags  E500_TLB_VALID))
+   return;


This is broken as pointed out here:
http://patchwork.ozlabs.org/patch/220356/

...and note that I'm still seeing problems even after that fix, which  
I'll try to debug today.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:

From: Benjamin Herrenschmidt b...@kernel.crashing.org

This adds in-kernel emulation of the XICS (eXternal Interrupt
Controller Specification) interrupt controller specified by PAPR, for
both HV and PR KVM guests.

This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
should use in-kernel interrupt controller emulation, but also takes an
argument struct that contains the type of interrupt controller
architecture and an optional parameter.  Currently only one type value
is defined, that which indicates the XICS architecture.


Would the device config API I posted a couple days ago work for you?


The XICS emulation supports up to 1048560 interrupt sources.
Interrupt source numbers below 16 are reserved; 0 is used to mean no
interrupt and 2 is used for IPIs.  Internally these are represented in
blocks of 1024, called ICS (interrupt controller source) entities, but
that is not visible to userspace.

Two other new ioctls allow userspace to control the interrupt
sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
destination cpu, level/edge sensitivity and pending state of a range
of interrupt sources, creating them if they don't already exist.  The
KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
interrupt sources (they are required to already exist).


Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING
for interrupt injection, what if there's a race with the user changing
other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
being presented as a generic API.


+4.80 KVM_CREATE_IRQCHIP_ARGS
+
+Capability: KVM_CAP_IRQCHIP_ARGS
+Architectures: ppc


Why just ppc?


+struct kvm_irq_sources {
+   __u32 irq;
+   __u32 nr_irqs;
+   __u64 __user *irqbuf;
+};


Please no pointers in UAPI -- this would require a compat wrapper with
32-bit user and 64-bit kernel.


+/* irqbuf entries are laid out like this: */
+#define KVM_IRQ_SERVER_SHIFT   0
+#define KVM_IRQ_SERVER_MASK0xULL
+#define KVM_IRQ_PRIORITY_SHIFT 32
+#define KVM_IRQ_PRIORITY_MASK  0xff
+#define KVM_IRQ_LEVEL_SENSITIVE(1ULL  40)
+#define KVM_IRQ_MASKED (1ULL  41)
+#define KVM_IRQ_PENDING(1ULL  42)


What does server mean?  Do you mean laid out like this for XICS?
Let's please have a clean separation between what is generic and what is
implementation-specific.

-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
 On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 
 This adds in-kernel emulation of the XICS (eXternal Interrupt
 Controller Specification) interrupt controller specified by PAPR, for
 both HV and PR KVM guests.
 
 This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
 KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
 should use in-kernel interrupt controller emulation, but also takes an
 argument struct that contains the type of interrupt controller
 architecture and an optional parameter.  Currently only one type value
 is defined, that which indicates the XICS architecture.
 
 Would the device config API I posted a couple days ago work for you?

I suppose it could be made to work.  It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required.  The XICS is operated from
the guest side via hcalls, not via emulated MMIO.

Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting).  Your device API seems to be quite different to that.

 The XICS emulation supports up to 1048560 interrupt sources.
 Interrupt source numbers below 16 are reserved; 0 is used to mean no
 interrupt and 2 is used for IPIs.  Internally these are represented in
 blocks of 1024, called ICS (interrupt controller source) entities, but
 that is not visible to userspace.
 
 Two other new ioctls allow userspace to control the interrupt
 sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
 destination cpu, level/edge sensitivity and pending state of a range
 of interrupt sources, creating them if they don't already exist.  The
 KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range of
 interrupt sources (they are required to already exist).
 
 Why is it userspace's job to control these?  If you use KVM_IRQ_PENDING

These are primarily there to support live migration.  For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.  We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration, the only time userspace needs to
use these is at initialization, when it does a SET_SOURCES to create
the interrupt sources it wants the VM to have.

 for interrupt injection, what if there's a race with the user changing
 other flags via MMIO?  Maybe this isn't an issue with XICS, but this is
 being presented as a generic API.

They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.

 +4.80 KVM_CREATE_IRQCHIP_ARGS
 +
 +Capability: KVM_CAP_IRQCHIP_ARGS
 +Architectures: ppc
 
 Why just ppc?

Just because only PPC has code to handle it at this point.  I would
hope that ARM and others could pick it up.  Maybe I should make it:

+Architectures: all (but so far only implemented on ppc)

or something.

 +struct kvm_irq_sources {
 +__u32 irq;
 +__u32 nr_irqs;
 +__u64 __user *irqbuf;
 +};
 
 Please no pointers in UAPI -- this would require a compat wrapper with
 32-bit user and 64-bit kernel.

Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,
which is generally less efficient, but since it's mainly for migration
that probably doesn't matter.  In fact, I could probably use the
existing KVM_GET_IRQCHIP and KVM_SET_IRQCHIP ioctls, if I use the
`chip_id' field for `irq' and the `pad' field for `nr_irqs'.

 +/* irqbuf entries are laid out like this: */
 +#define KVM_IRQ_SERVER_SHIFT0
 +#define KVM_IRQ_SERVER_MASK 0xULL
 +#define KVM_IRQ_PRIORITY_SHIFT  32
 +#define KVM_IRQ_PRIORITY_MASK   0xff
 +#define KVM_IRQ_LEVEL_SENSITIVE (1ULL  40)
 +#define KVM_IRQ_MASKED  (1ULL  41)
 +#define KVM_IRQ_PENDING (1ULL  42)
 
 What does server mean?  Do you mean laid out like this for XICS?

Sorry, I should have made that destination rather than server.
You're right, server is confusing, but it just means where the
interrupt is sent to be handled.  It has nothing particularly to do
with server computers.

 Let's please have a clean separation between what is generic and what is
 implementation-specific.

I believe that the interface is pretty cleanly generic - the model is
a set of interrupt sources and some per-vcpu state, with priorities to
decide which interrupts get delivered when.  That describes the basics
of just about any SMP-capable interrupt controller, including MPIC.

MPIC would still need an extra interface for userspace to save and
restore 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:

On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
 On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 
 This adds in-kernel emulation of the XICS (eXternal Interrupt
 Controller Specification) interrupt controller specified by PAPR,  
for

 both HV and PR KVM guests.
 
 This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
 KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
 should use in-kernel interrupt controller emulation, but also  
takes an

 argument struct that contains the type of interrupt controller
 architecture and an optional parameter.  Currently only one type  
value

 is defined, that which indicates the XICS architecture.

 Would the device config API I posted a couple days ago work for you?

I suppose it could be made to work.  It doesn't feel like a natural
fit though, because your API seems to assume (AFAICT) that a device is
manipulated via registers at specific physical addresses, so I would
have to invent an artificial set of registers with addresses and bit
layouts, that aren't otherwise required.  The XICS is operated from
the guest side via hcalls, not via emulated MMIO.


I don't think it makes such an assumption.  The MPIC device has  
physical registers, so it exposes them, but it also exposes things that  
are not physical registers (e.g. the per-IRQ input state).  The generic  
device control layer leaves interpretation of attributes up to the  
device.


I think it would be easier to fit XICS into the device control api  
model than to fit MPIC into this model, not to mention what would  
happen if we later want to emulate some other type of device -- x86  
already has at least one non-irqchip emulated device (i8254).



Part of the reason I went with the API that I did is that that was
what was agreed on at KVM Forum (as far as I can tell, not having been
at the meeting).  Your device API seems to be quite different to that.


I wasn't there either.  It's fine to have discussions at such events,  
but it should not preclude others from having input, or better ideas  
from being considered afterward.



 The XICS emulation supports up to 1048560 interrupt sources.
 Interrupt source numbers below 16 are reserved; 0 is used to mean  
no
 interrupt and 2 is used for IPIs.  Internally these are  
represented in
 blocks of 1024, called ICS (interrupt controller source) entities,  
but

 that is not visible to userspace.
 
 Two other new ioctls allow userspace to control the interrupt
 sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
 destination cpu, level/edge sensitivity and pending state of a  
range
 of interrupt sources, creating them if they don't already exist.   
The
 KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a range  
of

 interrupt sources (they are required to already exist).

 Why is it userspace's job to control these?  If you use  
KVM_IRQ_PENDING


These are primarily there to support live migration.  For live
migration, userspace needs to be able to extract the entire state of
the virtual machine from the old guest, and then set the new guest to
that exact same state.


In that case, the state it describes is insufficient for MPIC.


We have live migration working in qemu for
pSeries guests with in-kernel XICS emulation using this interface.
If you're not doing live migration,


We don't yet, but would prefer not to assume that it'll never happen.

 for interrupt injection, what if there's a race with the user  
changing
 other flags via MMIO?  Maybe this isn't an issue with XICS, but  
this is

 being presented as a generic API.

They're not used while the guest is running, as I said, but even if
they were, there is appropriate locking in there to handle any races.


OK, KVM_IRQ_LINE is still used for interrupt injection.  I was hoping  
to avoid going through a standardized interface that forces a global  
interrupt numberspace.


How do MSIs get injected?

BTW, do you have any plans regarding irqfd?


 +struct kvm_irq_sources {
 +  __u32 irq;
 +  __u32 nr_irqs;
 +  __u64 __user *irqbuf;
 +};

 Please no pointers in UAPI -- this would require a compat wrapper  
with

 32-bit user and 64-bit kernel.

Hmmm, you're right.  I suppose it will have to be a fixed-size buffer,


It doesn't need to be a fixed size buffer.  You can still have  
pointers, but they need to be represented as a plain __u64 with users  
casting the pointer.



 +/* irqbuf entries are laid out like this: */
 +#define KVM_IRQ_SERVER_SHIFT  0
 +#define KVM_IRQ_SERVER_MASK   0xULL
 +#define KVM_IRQ_PRIORITY_SHIFT32
 +#define KVM_IRQ_PRIORITY_MASK 0xff
 +#define KVM_IRQ_LEVEL_SENSITIVE   (1ULL  40)
 +#define KVM_IRQ_MASKED(1ULL  41)
 +#define KVM_IRQ_PENDING   (1ULL  42)

 What does server mean?  Do you mean laid out like this for XICS?

Sorry, I should have made that 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 05:59:11PM -0600, Scott Wood wrote:
 On 02/15/2013 05:18:31 PM, Paul Mackerras wrote:
 On Fri, Feb 15, 2013 at 02:05:41PM -0600, Scott Wood wrote:
  On 02/14/2013 06:01:08 PM, Paul Mackerras wrote:
  From: Benjamin Herrenschmidt b...@kernel.crashing.org
  
  This adds in-kernel emulation of the XICS (eXternal Interrupt
  Controller Specification) interrupt controller specified by
 PAPR, for
  both HV and PR KVM guests.
  
  This adds a new KVM_CREATE_IRQCHIP_ARGS ioctl, which is like
  KVM_CREATE_IRQCHIP in that it indicates that the virtual machine
  should use in-kernel interrupt controller emulation, but also
 takes an
  argument struct that contains the type of interrupt controller
  architecture and an optional parameter.  Currently only one
 type value
  is defined, that which indicates the XICS architecture.
 
  Would the device config API I posted a couple days ago work for you?
 
 I suppose it could be made to work.  It doesn't feel like a natural
 fit though, because your API seems to assume (AFAICT) that a device is
 manipulated via registers at specific physical addresses, so I would
 have to invent an artificial set of registers with addresses and bit
 layouts, that aren't otherwise required.  The XICS is operated from
 the guest side via hcalls, not via emulated MMIO.
 
 I don't think it makes such an assumption.  The MPIC device has
 physical registers, so it exposes them, but it also exposes things
 that are not physical registers (e.g. the per-IRQ input state).  The
 generic device control layer leaves interpretation of attributes up
 to the device.
 
 I think it would be easier to fit XICS into the device control api
 model than to fit MPIC into this model, not to mention what would
 happen if we later want to emulate some other type of device -- x86
 already has at least one non-irqchip emulated device (i8254).

I have no particular objection to the device control API per se, but
I have two objections to using it as the primary interface to the XICS
emulation.

First, I dislike the magical side-effect where creating a device of a
particular type (e.g. MPIC or XICS) automatically attaches it to the
interrupt lines of the vcpus.  I prefer an explicit request to do
in-kernel interrupt control.  Further, the magic means that you can
only have one instance of the device, whereas you might want to model
the interrupt controller architecture as several devices.  You could
do that using several device types, but then the interconnections
between them would also be magic.

Secondly, it means that we are completely abandoning any attempt to
define an abstract or generic interface to in-kernel interrupt
controller emulations.  Each device will have its own unique set of
attribute groups and its own unique userspace code to drive it, with
no commonality between them.

 Part of the reason I went with the API that I did is that that was
 what was agreed on at KVM Forum (as far as I can tell, not having been
 at the meeting).  Your device API seems to be quite different to that.
 
 I wasn't there either.  It's fine to have discussions at such
 events, but it should not preclude others from having input, or
 better ideas from being considered afterward.

Sure - I was just trying to fit in with the expressed wish of the
maintainer.

  The XICS emulation supports up to 1048560 interrupt sources.
  Interrupt source numbers below 16 are reserved; 0 is used to
 mean no
  interrupt and 2 is used for IPIs.  Internally these are
 represented in
  blocks of 1024, called ICS (interrupt controller source)
 entities, but
  that is not visible to userspace.
  
  Two other new ioctls allow userspace to control the interrupt
  sources.  The KVM_IRQCHIP_SET_SOURCES ioctl sets the priority,
  destination cpu, level/edge sensitivity and pending state of a
 range
  of interrupt sources, creating them if they don't already
 exist.  The
  KVM_IRQCHIP_GET_SOURCES ioctl returns that information for a
 range of
  interrupt sources (they are required to already exist).
 
  Why is it userspace's job to control these?  If you use
 KVM_IRQ_PENDING
 
 These are primarily there to support live migration.  For live
 migration, userspace needs to be able to extract the entire state of
 the virtual machine from the old guest, and then set the new guest to
 that exact same state.
 
 In that case, the state it describes is insufficient for MPIC.

Yes, MPIC has other random stuff in it besides interrupt control, so
that's not surprising.

 We have live migration working in qemu for
 pSeries guests with in-kernel XICS emulation using this interface.
 If you're not doing live migration,
 
 We don't yet, but would prefer not to assume that it'll never happen.
 
  for interrupt injection, what if there's a race with the user
 changing
  other flags via MMIO?  Maybe this isn't an issue with XICS, but
 this is
  being presented as a generic API.
 
 They're not used while the guest is running, as I said, but even if
 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Scott Wood

On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:

I have no particular objection to the device control API per se, but
I have two objections to using it as the primary interface to the XICS
emulation.

First, I dislike the magical side-effect where creating a device of a
particular type (e.g. MPIC or XICS) automatically attaches it to the
interrupt lines of the vcpus.  I prefer an explicit request to do
in-kernel interrupt control.


OK.  This is device-specific behavior, so you could define it  
differently for XICS than MPIC.  I suppose we could change it for MPIC  
as well, to leave an opening for the unlikely case where we'd want to  
model an MPIC that isn't directly connected to the CPUs.


How is the explicit request made in this patchset?


Secondly, it means that we are completely abandoning any attempt to
define an abstract or generic interface to in-kernel interrupt
controller emulations.  Each device will have its own unique set of
attribute groups and its own unique userspace code to drive it, with
no commonality between them.


Yes.  I am unconvinced that such an abstraction is well-advised  
(especially after seeing existing generic interfaces that are clearly  
APIC-oriented).  This isn't like normal driver interfaces where we're  
abstracting away hardware differences to let generic code use a  
device.  Userspace knows what kind of device it wants, and how it wants  
it to integrate with the rest of the emulated system.  We'd have to go  
out of our way to apply the abstraction on *both* ends.  What do we get  
from that other than a chance that the abstraction leaks?  What  
significant code actually becomes common?  kvm_set_irq() is just a thin  
wrapper around the ioctl.



 We have live migration working in qemu for
 pSeries guests with in-kernel XICS emulation using this interface.
 If you're not doing live migration,

 We don't yet, but would prefer not to assume that it'll never  
happen.


  for interrupt injection, what if there's a race with the user
 changing
  other flags via MMIO?  Maybe this isn't an issue with XICS, but
 this is
  being presented as a generic API.
 
 They're not used while the guest is running, as I said, but even if
 they were, there is appropriate locking in there to handle any  
races.


 OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
 hoping to avoid going through a standardized interface that forces a
 global interrupt numberspace.

Why?


The standardized interface doesn't make things any easier (as noted  
above, the caller is already mpic-specific code), and we'd have to come  
up with a scheme for flattening our interrupt numberspace (rather than  
introduce new attribute groups for things like IPI and timer  
interrupts).  It may still be necessary when it comes to irqfd,  
though...



 How do MSIs get injected?

Just like other interrupts - from the point of view of the interrupt
controller they're edge-triggered interrupt sources.


Ah right, I guess this is all set up via hcalls for XICS.

With MPIC exposing its registers via the device control api, everything  
just works -- the PCI device generates a write to the MPIC's memory  
region, the QEMU MPIC stub sends the write to the kernel as for any  
other MMIO access (this passthrough is also useful for debugging), the  
in-kernel MPIC sees the write to the generate an MSI register and  
does its thing.  Compare that to all special the MSI code for APIC...  
:-)



 BTW, do you have any plans regarding irqfd?

I'm going to look at that next.


Likewise...  We should probably coordinate our efforts so that at least  
the de-APICization of the code only has to get done once.



 What about interrupt controllers that allow multiple destinations?

The destination can be an identifier for a group of vcpus, or even a
bitmap -- that's why I made it 32 bits.


So you can have single delivery, or be limited to 32 vcpus, or have to  
implement some destination ID allocation scheme (which is more state  
that needs to be accessible somehow).



 More than 256 priorities?  Different levels of output (normal,
 critical, machine check)?  Programmable vector numbers?  Active
 high/low control?

There are plenty of bits free in the 64 bits per source that I have
allowed.  We can accommodate those things.


MPIC vector numbers take up 16 of the bits.  The architected interrupt  
level field is 8 bits, though only a handful of values are actually  
needed.  Add a couple binary flags, and it gets pretty tight if a third  
type of interrupt controller starts wanting something new.



 (BTW, I think having more
than 256 priorities would be insane - do you know of any actual
example that does?)


No, but hardware designers have been known to do insane things.


 The per-vcpu state isn't even part of this AFAICT.  It's an
 XICS-specific ONE_REG -- which is fine, but all that's left of the
 generic API is the get/set sources which is an imperfect match to
 our per-IRQ state and it's not clear how an 

Re: [PATCH 3/9] KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controller

2013-02-15 Thread Paul Mackerras
On Fri, Feb 15, 2013 at 09:57:06PM -0600, Scott Wood wrote:
 On 02/15/2013 08:56:14 PM, Paul Mackerras wrote:
 I have no particular objection to the device control API per se, but
 I have two objections to using it as the primary interface to the XICS
 emulation.
 
 First, I dislike the magical side-effect where creating a device of a
 particular type (e.g. MPIC or XICS) automatically attaches it to the
 interrupt lines of the vcpus.  I prefer an explicit request to do
 in-kernel interrupt control.
 
 OK.  This is device-specific behavior, so you could define it
 differently for XICS than MPIC.  I suppose we could change it for
 MPIC as well, to leave an opening for the unlikely case where we'd
 want to model an MPIC that isn't directly connected to the CPUs.

You can have cascaded MPICs; I once had a powerbook that had one MPIC
cascaded into another.

 How is the explicit request made in this patchset?

The KVM_CREATE_IRQCHIP_ARGS ioctl says that you want emulation of a
specific interrupt controller architecture connected to the vcpus'
external interrupt inputs.  In that sense it's explicit, compared to a
generic create device ioctl that could be for any device.

 Secondly, it means that we are completely abandoning any attempt to
 define an abstract or generic interface to in-kernel interrupt
 controller emulations.  Each device will have its own unique set of
 attribute groups and its own unique userspace code to drive it, with
 no commonality between them.
 
 Yes.  I am unconvinced that such an abstraction is well-advised
 (especially after seeing existing generic interfaces that are
 clearly APIC-oriented).  This isn't like normal driver interfaces
 where we're abstracting away hardware differences to let generic
 code use a device.  Userspace knows what kind of device it wants,
 and how it wants it to integrate with the rest of the emulated
 system.  We'd have to go out of our way to apply the abstraction on
 *both* ends.  What do we get from that other than a chance that the
 abstraction leaks?  What significant code actually becomes common?
 kvm_set_irq() is just a thin wrapper around the ioctl.

I'd think there could be some code reduction in the live migration
code, but I'd need a qemu hacker to chime in here.

  OK, KVM_IRQ_LINE is still used for interrupt injection.  I was
  hoping to avoid going through a standardized interface that forces a
  global interrupt numberspace.
 
 Why?
 
 The standardized interface doesn't make things any easier (as noted
 above, the caller is already mpic-specific code), and we'd have to
 come up with a scheme for flattening our interrupt numberspace
 (rather than introduce new attribute groups for things like IPI and
 timer interrupts).  It may still be necessary when it comes to
 irqfd, though...

With 24 bits of interrupt source number, you don't have to flatten it
very far.  IPIs are handled separately and don't appear in the
interrupt source space.

  How do MSIs get injected?
 
 Just like other interrupts - from the point of view of the interrupt
 controller they're edge-triggered interrupt sources.
 
 Ah right, I guess this is all set up via hcalls for XICS.
 
 With MPIC exposing its registers via the device control api,
 everything just works -- the PCI device generates a write to the
 MPIC's memory region, the QEMU MPIC stub sends the write to the
 kernel as for any other MMIO access (this passthrough is also useful
 for debugging), the in-kernel MPIC sees the write to the generate
 an MSI register and does its thing.  Compare that to all special
 the MSI code for APIC... :-)

You're doing a round trip to userspace for every MPIC register access
by the guest?  Seriously?  If that's so, then why bother with
in-kernel emulation at all?  Once you're back in userspace, it's just
as fast to do the emulation there as in the kernel.

 There are plenty of bits free in the 64 bits per source that I have
 allowed.  We can accommodate those things.
 
 MPIC vector numbers take up 16 of the bits.  The architected
 interrupt level field is 8 bits, though only a handful of values are
 actually needed.  Add a couple binary flags, and it gets pretty
 tight if a third type of interrupt controller starts wanting
 something new.

There's enough space for MPIC to have 16 bits of vector and some
flags.  We don't need to overdesign this.

 Yes, the names of the bitfields in the ICP state word are
 XICS-specific, but the concepts are pretty generic - current processor
 priority, identifier for interrupt awaiting service, pending IPI
 request priority, pending interrupt request priority.
 
 We don't have separate concepts of pending IPI request priority
 and pending interrupt request priority.  There can be multiple

Sorry, I meant pending interrupt request.  You do have that, it's
what you read from the interrupt acknowledge register.

 interrupts awaiting service (or even in service, if different
 priorities).  We have both current task priority (which is a
 user-set