Re: VIA Eden X4

2016-01-05 Thread Alex Williamson
On Tue, 2016-01-05 at 12:50 +0530, Bandan Das wrote:
> "Matwey V. Kornilov"  writes:
> 
> > Hello,
> > 
> > According to Wikipedia, VIA claims x86 hardware-assisted
> > virtualization for the VIA Eden X4 CPU.
> > Does anybody know if it is supported by Linux KVM?
> > 
> 
> I can't say for sure, but my guess is that it should work since VIA
> implements VT-x-like virtualization extensions, so KVM will find
> VMX-capable hardware.

I don't think it's that straightforward; ISTR the VMX capability on
previous VIA processors being broken, for instance:

https://bugs.launchpad.net/qemu/+bug/712416

Seems like Avi had investigated and found something pretty broken in
the implementation, but my google-fu isn't able to find the thread.  I
suspect nobody, maybe not even VIA, knows if it works on their latest
attempt.  Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio/iommu_type1: make use of info.flags

2016-01-04 Thread Alex Williamson
On Wed, 2015-12-23 at 13:08 +0100, Pierre Morel wrote:
> The flags entry is there to tell the user that some
> optional information is available.
> 
> Since we report iova_pgsizes, signal its presence to the user
> by setting VFIO_IOMMU_INFO_PGSIZES in the flags.
> 
> Signed-off-by: Pierre Morel 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 59d47cb..6f1ea3d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -995,7 +995,7 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>   if (info.argsz < minsz)
>   return -EINVAL;
>  
> - info.flags = 0;
> + info.flags = VFIO_IOMMU_INFO_PGSIZES;
>  
>   info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
>  

Applied to my next branch for v4.5, thanks!

Alex

PS - I haven't lost your qemu overflow patch, sorry for the delay.


Re: [RFC PATCH v2 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported

2016-01-04 Thread Alex Williamson
On Thu, 2015-12-31 at 16:50 +0800, Yongji Xie wrote:
> The current vfio-pci implementation disallows mmapping the MSI-X
> table in case the user gets to touch it directly.
> 
> However, EEH mechanism can ensure that a given pci device
> can only shoot the MSIs assigned for its PE. So we think
> it's safe to expose the MSI-X table to userspace because
> the exposed MSI-X table can't be used to do harm to other
> memory space.
> 
> And with MSI-X table mmapped, some performance issues which
> are caused when PCI adapters have critical registers in the
> same page as the MSI-X table also can be resolved.
> 
> So this patch adds a Kconfig option, VFIO_PCI_MMAP_MSIX,
> to support mmapping the MSI-X table.
> 
> Signed-off-by: Yongji Xie 
> ---
>  drivers/vfio/pci/Kconfig|4 
>  drivers/vfio/pci/vfio_pci.c |6 --
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 02912f1..67b0a2c 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -23,6 +23,10 @@ config VFIO_PCI_MMAP
>   depends on VFIO_PCI
>   def_bool y if !S390
>  
> +config VFIO_PCI_MMAP_MSIX
> + depends on VFIO_PCI_MMAP
> + def_bool y if EEH

Does CONFIG_EEH necessarily mean that EEH is enabled?  Could the system
not support EEH, or could EEH be disabled via kernel command-line
options?

> +
>  config VFIO_PCI_INTX
>   depends on VFIO_PCI
>   def_bool y if !S390
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 09b3805..d536985 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -555,7 +555,8 @@ static long vfio_pci_ioctl(void *device_data,
> 	    IORESOURCE_MEM && (info.size >= PAGE_SIZE ||
> 	    pci_resource_page_aligned)) {
> 		info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
> -		if (info.index == vdev->msix_bar) {
> +		if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP_MSIX) &&
> +		    info.index == vdev->msix_bar) {
> 			ret = msix_sparse_mmap_cap(vdev, &caps);
> 			if (ret)
> 				return ret;
> @@ -967,7 +968,8 @@ static int vfio_pci_mmap(void *device_data, struct vm_area_struct *vma)
>   if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
>   return -EINVAL;
>  
> - if (index == vdev->msix_bar) {
> + if (!IS_ENABLED(CONFIG_VFIO_PCI_MMAP_MSIX) &&
> + index == vdev->msix_bar) {
> 		/*
> 		 * Disallow mmaps overlapping the MSI-X table; users don't
> 		 * get to touch this directly.  We could find somewhere



Re: [RFC PATCH v2 1/3] PCI: Add support for enforcing all MMIO BARs to be page aligned

2016-01-04 Thread Alex Williamson
On Thu, 2015-12-31 at 16:50 +0800, Yongji Xie wrote:
> When vfio passes through a PCI device whose MMIO BARs
> are smaller than PAGE_SIZE, the guest will not handle the
> MMIO accesses to those BARs, which leads to MMIO emulation
> in the host.
> 
> This is because vfio will not allow passthrough of one
> BAR's MMIO page which may be shared with other BARs.
> 
> To solve this performance issue, this patch adds a kernel
> parameter "pci=resource_page_aligned=on" to enforce
> the alignments of all MMIO BARs to be at least PAGE_SIZE,
> so that one BAR's mmio page would not be shared with other
> BARs. We can also disable it through kernel parameter
> "pci=resource_page_aligned=off".

Shouldn't this somehow be associated with the realloc option?  I don't
think PCI code will attempt to reprogram anything unless it needs to
otherwise.

> For the default value of this parameter, we think it should be
> arch-independent, so we add a macro PCI_RESOURCE_PAGE_ALIGNED
> to change it. And we define this macro to enable this parameter
> by default on PPC64 platform which can easily hit this
> performance issue because its PAGE_SIZE is 64KB.
> 
> Signed-off-by: Yongji Xie 
> ---
>  Documentation/kernel-parameters.txt |4 
>  arch/powerpc/include/asm/pci.h  |   11 +++
>  drivers/pci/pci.c   |   17 +
>  drivers/pci/pci.h   |7 ++-
>  include/linux/pci.h |2 ++
>  5 files changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
> index 742f69d..a53aaee 100644
> --- a/Documentation/kernel-parameters.txt
> +++ b/Documentation/kernel-parameters.txt
> @@ -2857,6 +2857,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
>   PAGE_SIZE is used as alignment.
>   PCI-PCI bridge can be specified, if resource
>   windows need to be expanded.
> + resource_page_aligned=  Enable/disable enforcing the alignment
> + of all PCI devices' memory resources to be
> + at least PAGE_SIZE.
> + Format: { "on" | "off" }
>   ecrc=   Enable/disable PCIe ECRC (transaction layer
>   end-to-end CRC checking).
>   bios: Use BIOS/firmware settings. This is the
> diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
> index 3453bd8..27bff59 100644
> --- a/arch/powerpc/include/asm/pci.h
> +++ b/arch/powerpc/include/asm/pci.h
> @@ -136,6 +136,17 @@ extern pgprot_t pci_phys_mem_access_prot(struct file *file,
>    unsigned long pfn,
>    unsigned long size,
>    pgprot_t prot);
> +#ifdef CONFIG_PPC64
> +
> +/* For PPC64, we enforce all PCI MMIO BARs to be page aligned
> + * by default. This helps improve performance when we pass
> + * through a PCI device whose BARs are smaller than PAGE_SIZE
> + * (64KB). It can be disabled with the boot parameter
> + * "pci=resource_page_aligned=off".
> + */
> +#define PCI_ENABLE_RESOURCE_PAGE_ALIGNED
> +
> +#endif

This should be done with something like
HAVE_PCI_DEFAULT_RESOURCE_PAGE_ALIGNED in arch/powerpc/include/asm

>  #define HAVE_ARCH_PCI_RESOURCE_TO_USER
>  extern void pci_resource_to_user(const struct pci_dev *dev, int bar,
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 314db8c..9f14ba5 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -99,6 +99,13 @@ u8 pci_cache_line_size;
>   */
>  unsigned int pcibios_max_latency = 255;
>  
> +#ifdef PCI_ENABLE_RESOURCE_PAGE_ALIGNED
> +bool pci_resource_page_aligned = true;
> +#else
> +bool pci_resource_page_aligned;
> +#endif
> +EXPORT_SYMBOL(pci_resource_page_aligned);

Couldn't this be done in a single line with IS_ENABLED() macro?

Should this symbol be GPL-only?

> +
>  /* If set, the PCIe ARI capability will not be used. */
>  static bool pcie_ari_disabled;
>  
> @@ -4746,6 +4753,14 @@ static ssize_t pci_resource_alignment_store(struct bus_type *bus,
>  BUS_ATTR(resource_alignment, 0644, pci_resource_alignment_show,
>   pci_resource_alignment_store);
>  
> +static void pci_resource_get_page_aligned(char *str)
> +{
> + if (!strncmp(str, "off", 3))
> + pci_resource_page_aligned = false;
> + else if (!strncmp(str, "on", 2))
> + pci_resource_page_aligned = true;
> +}
> +
>  static int __init pci_resource_alignment_sysfs_init(void)
>  {
>   return bus_create_file(&pci_bus_type,
> @@ -4859,6 +4874,8 @@ static int __init pci_setup(char *str)
>   } else if (!strncmp(str, "resource_alignment=", 19)) {
>   pci_set_resource_alignment_param(str + 19,
>     

Re: [RFC 0/2] VFIO SRIOV support

2015-12-24 Thread Alex Williamson
On Thu, 2015-12-24 at 07:22 +, Ilya Lesokhin wrote:
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Wednesday, December 23, 2015 6:28 PM
> > To: Ilya Lesokhin ; kvm@vger.kernel.org; linux-
> > p...@vger.kernel.org
> > Cc: bhelg...@google.com; Noa Osherovich ;
> > Haggai
> > Eran ; Or Gerlitz ;
> > Liran
> > Liss 
> > Subject: Re: [RFC 0/2] VFIO SRIOV support
> > 
> > On Wed, 2015-12-23 at 07:43 +, Ilya Lesokhin wrote:
> > > Hi Alex,
> > > Regarding driver_override, as far as I know you can only use it
> > > on devices that were already discovered. Since the devices do
> > > not exist before the call to pci_enable_sriov(...) and are
> > > already probed after the call, it wouldn't really help us. I
> > > would have to unbind them from their default driver and bind
> > > them to VFIO, like solution (a) in my original mail.
> > 
> > If you allow them to be bound to their default driver, then you've
> > already created the scenario of a user-owned PF creating host-owned
> > VFs, which I think is unacceptable.  The driver_override can be set
> > before drivers are probed; the fact that pci_enable_sriov() doesn't
> > provide a hook for that is something that could be fixed.
> 
> That’s essentially the same as solution (b) in my original mail,
> which I was hoping to avoid.
> 
> > > You are right about the ownership problem and we would like to
> > > receive input regarding the correct way of solving this.
> > > But in the meantime I think our solution is quite useful even
> > > if it requires root privileges. We hacked libvirt so that it
> > > would run qemu as root and without device cgroup.
> > > 
> > > In any case, don't you think that assigning those devices to VFIO
> > > should be safe? Does the VFIO driver make any unsafe assumptions
> > > about the VFs that might allow a guest to crash the hypervisor?
> > > 
> > > I am somewhat concerned that the VM  could trigger some backdoor
> > > reset
> > > while the hypervisor is running pci_enable_sriov(...). But I'm
> > > not
> > > really sure how to solve it.
> > > I guess you have to either stop the guest entirely to enable
> > > sriov or
> > > make it privileged.
> > > 
> > > Regarding having the PF controlled by one user while the other
> > > VFs are
> > > controlled by other user, I actually think it might be an
> > > interesting
> > > use case.
> > 
> > It may be, but it needs to be an opt-in, not a security accident.
> > The interface between a PF and a VF is essentially device-specific
> > and we don't know exactly how isolated each VF is from the PF.  In
> > the typical scenario of the PF being owned by the host, we have a
> > certain degree of trust in the host; it's running the VM after all,
> > and if it wanted to compromise it, it could.  We have no implicit
> > reason to trust a PF running in a guest though.  Can they snoop VF
> > traffic?  Can they generate DMA outside of the container of the PF
> > using the VF?  We can't be sure.  So unless you can make the
> > default scenario be that VFs created by a user-owned PF are only
> > available for use by that user, without relying on userspace to
> > intervene, it seems like any potential usefulness is trumped by a
> > giant security issue.  Thanks,
> 
> I don't understand the security issue; don't you need root
> permission for device assignment?

No.  A privileged entity needs to grant a user ownership of a group and
sufficient locked memory limits to make it useful, but then use of the
group does not require root permission.


Re: [RFC 0/2] VFIO SRIOV support

2015-12-23 Thread Alex Williamson
On Wed, 2015-12-23 at 07:43 +, Ilya Lesokhin wrote:
> Hi Alex,
> Regarding driver_override, as far as I know you can only use it on
> devices that were already discovered. Since the devices do not exist
> before the call to pci_enable_sriov(...) and are already probed after
> the call, it wouldn't really help us. I would have to unbind them
> from their default driver and bind them to VFIO, like solution (a) in
> my original mail.

If you allow them to be bound to their default driver, then you've
already created the scenario of a user-owned PF creating host-owned
VFs, which I think is unacceptable.  The driver_override can be set
before drivers are probed; the fact that pci_enable_sriov() doesn't
provide a hook for that is something that could be fixed.

> You are right about the ownership problem and we would like to
> receive input regarding the correct way of solving this.
> But in the meantime I think our solution is quite useful even if
> it requires root privileges. We hacked libvirt so that it would
> run qemu as root and without device cgroup.
> 
> In any case, don't you think that assigning those devices to VFIO
> should be safe? Does the VFIO driver make any unsafe assumptions
> about the VFs that might allow a guest to crash the hypervisor?
> 
> I am somewhat concerned that the VM  could trigger some backdoor
> reset while the hypervisor is running pci_enable_sriov(...). But I'm
> not really sure how to solve it.
> I guess you have to either stop the guest entirely to enable sriov or
> make it privileged.
> 
> Regarding having the PF controlled by one user while the other VFs
> are controlled by other user, I actually think it might be an
> interesting use case.

It may be, but it needs to be an opt-in, not a security accident.  The
interface between a PF and a VF is essentially device-specific and we
don't know exactly how isolated each VF is from the PF.  In the typical
scenario of the PF being owned by the host, we have a certain degree of
trust in the host; it's running the VM after all, and if it wanted to
compromise it, it could.  We have no implicit reason to trust a PF
running in a guest though.  Can they snoop VF traffic?  Can they
generate DMA outside of the container of the PF using the VF?  We can't
be sure.  So unless you can make the default scenario be that VFs
created by a user-owned PF are only available for use by that user,
without relying on userspace to intervene, it seems like any potential
usefulness is trumped by a giant security issue.  Thanks,

Alex


Re: [PATCH kernel] vfio: Add explicit alignments in vfio_iommu_spapr_tce_create

2015-12-22 Thread Alex Williamson
On Fri, 2015-12-18 at 12:35 +1100, Alexey Kardashevskiy wrote:
> The vfio_iommu_spapr_tce_create struct has 4x32bit and 2x64bit fields
> which should have resulted in sizeof(struct vfio_iommu_spapr_tce_create)
> being equal to 32 bytes. However, due to gcc's default alignment, the
> actual size of this struct is 40 bytes.
> 
> This fills gaps with __resv1/2 fields.
> 
> This should not cause any change in behavior.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---

Applied to next for v4.5 with David's ack.  Thanks!

Alex

>  include/uapi/linux/vfio.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9fd7b5d..d117233 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -568,8 +568,10 @@ struct vfio_iommu_spapr_tce_create {
>   __u32 flags;
>   /* in */
>   __u32 page_shift;
> + __u32 __resv1;
>   __u64 window_size;
>   __u32 levels;
> + __u32 __resv2;
>   /* out */
>   __u64 start_addr;
>  };



Re: [patch] VFIO: platform: reset: fix a warning message condition

2015-12-22 Thread Alex Williamson
On Thu, 2015-12-17 at 15:27 +0300, Dan Carpenter wrote:
This loop ends with count set to -1, not zero, so the warning message
isn't printed when it should be.  I've fixed this by changing the
post-op to a pre-op.
> 
> Fixes: 0990822c9866 ('VFIO: platform: reset: AMD xgbe reset module')
> Signed-off-by: Dan Carpenter 

Applied to next for v4.5 with Eric's Reviewed-by.  Thanks!

Alex

> diff --git a/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c b/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c
> index da5356f..d4030d0 100644
> --- a/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c
> +++ b/drivers/vfio/platform/reset/vfio_platform_amdxgbe.c
> @@ -110,7 +110,7 @@ int vfio_platform_amdxgbe_reset(struct vfio_platform_device *vdev)
>   usleep_range(10, 15);
>  
>   count = 2000;
> -	while (count-- && (ioread32(xgmac_regs->ioaddr + DMA_MR) & 1))
> +	while (--count && (ioread32(xgmac_regs->ioaddr + DMA_MR) & 1))
>   usleep_range(500, 600);
>  
>   if (!count)



[PATCH v4] vfio: Include No-IOMMU mode

2015-12-22 Thread Alex Williamson
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson 
Acked-by: Michael S. Tsirkin 
---

v4: Fix build without CONFIG_VFIO_NOIOMMU (oops).  Also avoid local
noiommu variable in vfio_create_group() to avoid scope confusion
with global of the same name.

 drivers/vfio/Kconfig|   15 
 drivers/vfio/pci/vfio_pci.c |8 +-
 drivers/vfio/vfio.c |  184 ++-
 include/linux/vfio.h|3 +
 include/uapi/linux/vfio.h   |7 ++
 5 files changed, 210 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 850d86c..da6e2ce 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -31,6 +31,21 @@ menuconfig VFIO
 
  If you don't know what to do here, say N.
 
+menuconfig VFIO_NOIOMMU
+   bool "VFIO No-IOMMU support"
+   depends on VFIO
+   help
+ VFIO is built on the ability to isolate devices using the IOMMU.
+ Only with an IOMMU can userspace access to DMA capable devices be
+ considered secure.  VFIO No-IOMMU mode enables IOMMU groups for
+ devices without IOMMU backing for the purpose of re-using the VFIO
+ infrastructure in a non-secure mode.  Use of this mode will result
+ in an unsupportable kernel and will therefore taint the kernel.
+ Device assignment to virtual machines is also not possible with
+ this mode since there is no IOMMU to provide DMA translation.
+
+ If you don't know what to do here, say N.
+
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 56bf6db..2760a7b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -940,13 +940,13 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
return -EINVAL;
 
-   group = iommu_group_get(&pdev->dev);
+   group = vfio_iommu_group_get(&pdev->dev);
if (!group)
return -EINVAL;
 
vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
return -ENOMEM;
}
 
@@ -957,7 +957,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
kfree(vdev);
return ret;
}
@@ -993,7 +993,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
if (!vdev)
return;
 
-   iommu_group_put(pdev->dev.iommu_group);
+   vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
kfree(vdev);
 
if (vfio_pci_is_vga(pdev)) {
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b79..82f25cc 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -62,6 +62,7 @@ struct vfio_container {
struct rw_semaphore group_lock;
struct vfio_iommu_driver*iommu_driver;
void*iommu_data;
+   boolnoiommu;
 };
 
 struct vfio_unbound_dev {
@@ -84,6 +85,7 @@ struct vfio_group {
struct list_headunbound_list;
struct mutexunbound_lock;
atomic_topened;
+   boolnoiommu;
 };
 

[PATCH v3] vfio: Include No-IOMMU mode

2015-12-22 Thread Alex Williamson
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson 
Acked-by: Michael S. Tsirkin 
---

v3: Version 2 was dropped from kernel v4.4 due to lack of a user.  We
now have a working DPDK port to this interface, so I'm proposing
it again for v4.5.  The changes since v2 can be found split out
in the dpdk archive here:

http://dpdk.org/ml/archives/dev/2015-December/030561.html

The problem was that the NOIOMMU extension was only advertised
once a group was attached to a container.  While we want the
no-iommu backend to be used exclusively for no-iommu groups, we
should still advertise it when the module option is enabled.
Handling the no-iommu iommu driver less as a special case
accomplishes this.  Also fixed a mismatch in naming between module
parameter and description and tagged a struct as const.


 drivers/vfio/Kconfig|   15 
 drivers/vfio/pci/vfio_pci.c |8 +-
 drivers/vfio/vfio.c |  181 ++-
 include/linux/vfio.h|3 +
 include/uapi/linux/vfio.h   |7 ++
 5 files changed, 207 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 850d86c..da6e2ce 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -31,6 +31,21 @@ menuconfig VFIO
 
  If you don't know what to do here, say N.
 
+menuconfig VFIO_NOIOMMU
+   bool "VFIO No-IOMMU support"
+   depends on VFIO
+   help
+ VFIO is built on the ability to isolate devices using the IOMMU.
+ Only with an IOMMU can userspace access to DMA capable devices be
+ considered secure.  VFIO No-IOMMU mode enables IOMMU groups for
+ devices without IOMMU backing for the purpose of re-using the VFIO
+ infrastructure in a non-secure mode.  Use of this mode will result
+ in an unsupportable kernel and will therefore taint the kernel.
+ Device assignment to virtual machines is also not possible with
+ this mode since there is no IOMMU to provide DMA translation.
+
+ If you don't know what to do here, say N.
+
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 56bf6db..2760a7b 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -940,13 +940,13 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
return -EINVAL;
 
-   group = iommu_group_get(&pdev->dev);
+   group = vfio_iommu_group_get(&pdev->dev);
if (!group)
return -EINVAL;
 
vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
return -ENOMEM;
}
 
@@ -957,7 +957,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
kfree(vdev);
return ret;
}
@@ -993,7 +993,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
if (!vdev)
return;
 
-   iommu_group_put(pdev->dev.iommu_group);
+   vfio_iommu_group_put(pdev->dev.iommu_group, &pdev->dev);
kfree(vdev);
 
if (vfio_pci_is_vga(pdev)) {
diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6070b79..5c7ebf2 100644
-

Re: [RFC 0/2] VFIO SRIOV support

2015-12-22 Thread Alex Williamson
On Tue, 2015-12-22 at 15:42 +0200, Ilya Lesokhin wrote:
> Today the QEMU hypervisor allows assigning a physical device to a VM,
> facilitating driver development. However, it does not support
> enabling
> SR-IOV by the VM kernel driver. Our goal is to implement such
> support,
> allowing developers working on SR-IOV physical function drivers to
> work
> inside VMs as well.
> 
> This patch series implements the kernel side of our solution.  It
> extends
> the VFIO driver to support the PCIE SRIOV extended capability with
> following features:
> 1. The ability to probe SRIOV BAR sizes.
> 2. The ability to enable and disable sriov.
> 
> This patch series is going to be used by QEMU to expose sriov
> capabilities
> to VM. We already have an early prototype based on Knut Omang's
> patches for
> SRIOV[1]. 
> 
> Open issues:
> 1. Binding the new VFs to VFIO driver.
> Once the VM enables sriov it expects the new VFs to appear inside the
> VM.
> To this end we need to bind the new VFs to the VFIO driver and have
> QEMU grab them. We currently achieve this using:
> echo $vendor $device > /sys/bus/pci/drivers/vfio-pci/new_id
> but we are not happy about this solution as a system might have
> another
> device with the same id that is unrelated to our VM.
> Other solution we've considered are:
>  a. Having user space unbind and then bind the VFs to VFIO.
>  This typically results in unnecessary probing of the device.
>  b. Adding a driver argument to pci_enable_sriov(...) and having
> vfio call pci_enable_sriov with the vfio driver as argument.
> This solution avoids the unnecessary probing but is more intrusive.

You could use driver_override for this, but the open issue you haven't
listed is the ownership problem, VFs will be in separate iommu groups
and therefore create separate vfio groups.  How do those get associated
with the user so that we don't have one user controlling the VFs for
another user, or worse for the host kernel.  Whatever solution you come
up with needs to protect the host kernel, first and foremost.  It's not
sufficient to rely on userspace to grab the VFs and sequester them for
use only by that user, the host kernel needs to provide that security
automatically.  Thanks,

Alex

> 2. How to tell if it is safe to disable SRIOV?
> In the current implementation, userspace can enable sriov, grab one
> of the VFs and then call disable sriov without releasing the
> device.  This
> will result in a deadlock where the user process is stuck inside
> disable
> sriov waiting for itself to release the device. Killing the process
> leaves
> it in a zombie state.
> We also get a strange warning saying:
> [  181.668492] WARNING: CPU: 22 PID: 3684 at kernel/sched/core.c:7497
> __might_sleep+0x77/0x80() 
> [  181.668502] do not call blocking ops when !TASK_RUNNING; state=1
> set at [] prepare_to_wait_event+0x63/0xf0
> 
> 3. How to expose the Supported Page Sizes and System Page Size
> registers in
> the SRIOV capability? 
> Presently the hypervisor initializes Supported Page Sizes once and
> assumes
> it doesn't change therefore we cannot allow user space to change this
> register at will. The first solution that comes to mind is to expose
> a
> device that only supports the page size selected by the hypervisor.
> Unfortunately, per SR-IOV spec section 3.3.12, PFs are required to
> support
> 4-KB, 8-KB, 64-KB, 256-KB, 1-MB, and 4-MB page sizes. We currently
> map both
> registers as virtualized and read only and leave user space to worry
> about
> this problem.
> 
> 4. Other SRIOV capabilities.
> Do we want to hide capabilities we do not support in the SR-IOV
> Capabilities register? Or leave it to the userspace application?
> 
> [1] https://github.com/knuto/qemu/tree/sriov_patches_v6
> 
> Ilya Lesokhin (2):
>   PCI: Expose iov_set_numvfs and iov_resource_size for modules.
>   VFIO: Add support for SRIOV extended capability
> 
>  drivers/pci/iov.c  |   4 +-
>  drivers/vfio/pci/vfio_pci_config.c | 169
> +
>  include/linux/pci.h|   4 +
>  3 files changed, 159 insertions(+), 18 deletions(-)
> 



Re: [RFC PATCH 0/3] VFIO: capability chains

2015-12-17 Thread Alex Williamson
On Fri, 2015-12-18 at 13:05 +1100, Alexey Kardashevskiy wrote:
> On 11/24/2015 07:43 AM, Alex Williamson wrote:
> > Please see the commit log and comments in patch 1 for a general
> > explanation of the problems that this series tries to address.  The
> > general problem is that we have several cases where we want to
> > expose
> > variable sized information to the user, whether it's sparse mmaps
> > for
> > a region, as implemented here, or DMA mapping ranges of an IOMMU,
> > or
> > reserved MSI mapping ranges, etc.  Extending data structures is
> > hard;
> > extending them to report variable sized data is really hard.  After
> > considering several options, I think the best approach is to copy
> > how
> > PCI does capabilities.  This allows the ioctl to only expose the
> > capabilities that are relevant for them, avoids data structures
> > that
> > are too complicated to parse, and avoids creating a new ioctl each
> > time we think of something else that we'd like to report.  This
> > method
> > also doesn't preclude extensions to the fixed structure since the
> > offset of these capabilities is entirely dynamic.
> > 
> > Comments welcome, I'll also follow-up to the QEMU and KVM lists
> > with
> > an RFC making use of this for mmaps skipping over the MSI-X table.
> > Thanks,
> 
> Out of curiosity - could this information be exposed to the userspace
> via 
> /sys/bus/pci/devices/:xx:xx:x/vfio_? It seems not to change
> after 
> vfio_pci driver is bound to a device.

For what purpose?  vfio doesn't have a sysfs interface, why start one? 
Thanks,

Alex


Re: [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are page aligned

2015-12-17 Thread Alex Williamson
On Thu, 2015-12-17 at 18:26 +0800, yongji xie wrote:
> 
> On 2015/12/17 4:04, Alex Williamson wrote:
> > On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
> > > The current vfio-pci implementation disallows mmapping sub-page
> > > (size < PAGE_SIZE) MMIO BARs because these BARs' MMIO page may be
> > > shared with other BARs.
> > > 
> > > But we should allow mmapping these sub-page MMIO BARs if all MMIO
> > > BARs are page aligned, in which case a BAR's MMIO page would not
> > > be shared with other BARs.
> > > with other BARs.
> > > 
> > > This patch adds support for this case; we also add a
> > > VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED flag to notify userspace that
> > > the platform aligns all MMIO BARs to page boundaries.
> > > 
> > > Signed-off-by: Yongji Xie 
> > > ---
> > >   drivers/vfio/pci/vfio_pci.c |   10 +-
> > >   drivers/vfio/pci/vfio_pci_private.h |5 +
> > >   include/uapi/linux/vfio.h   |2 ++
> > >   3 files changed, 16 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/vfio/pci/vfio_pci.c
> > > b/drivers/vfio/pci/vfio_pci.c
> > > index 32b88bd..dbcad99 100644
> > > --- a/drivers/vfio/pci/vfio_pci.c
> > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > @@ -443,6 +443,9 @@ static long vfio_pci_ioctl(void *device_data,
> > >   if (vdev->reset_works)
> > >   info.flags |= VFIO_DEVICE_FLAGS_RESET;
> > >   
> > > + if (vfio_pci_bar_page_aligned())
> > > + info.flags |= VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
> > > +
> > >   info.num_regions = VFIO_PCI_NUM_REGIONS;
> > >   info.num_irqs = VFIO_PCI_NUM_IRQS;
> > >   
> > > @@ -479,7 +482,8 @@ static long vfio_pci_ioctl(void *device_data,
> > >    VFIO_REGION_INFO_FLAG_WRITE;
> > >   if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
> > >   pci_resource_flags(pdev, info.index) &
> > > - IORESOURCE_MEM && info.size >= PAGE_SIZE)
> > > + IORESOURCE_MEM && (info.size >= PAGE_SIZE ||
> > > + vfio_pci_bar_page_aligned()))
> > >   info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
> > >   break;
> > >   case VFIO_PCI_ROM_REGION_INDEX:
> > > @@ -855,6 +859,10 @@ static int vfio_pci_mmap(void *device_data,
> > > struct vm_area_struct *vma)
> > >   return -EINVAL;
> > >   
> > >   phys_len = pci_resource_len(pdev, index);
> > > +
> > > + if (vfio_pci_bar_page_aligned())
> > > + phys_len = PAGE_ALIGN(phys_len);
> > > +
> > >   req_len = vma->vm_end - vma->vm_start;
> > >   pgoff = vma->vm_pgoff &
> > >   ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> > > diff --git a/drivers/vfio/pci/vfio_pci_private.h
> > > b/drivers/vfio/pci/vfio_pci_private.h
> > > index 0e7394f..319352a 100644
> > > --- a/drivers/vfio/pci/vfio_pci_private.h
> > > +++ b/drivers/vfio/pci/vfio_pci_private.h
> > > @@ -69,6 +69,11 @@ struct vfio_pci_device {
> > >   #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
> > >   #define irq_is(vdev, type) (vdev->irq_type == type)
> > >   
> > > +static inline bool vfio_pci_bar_page_aligned(void)
> > > +{
> > > + return IS_ENABLED(CONFIG_PPC64);
> > > +}
> > I really dislike this.  This is a problem for any architecture that
> > runs on larger pages, and even an annoyance on 4k hosts.  Why are
> > we
> > only solving it for PPC64?
> Yes, I know it's a problem for other architectures. But I'm not sure
> if other archs prefer to enforce the alignment of all BARs to be at
> least PAGE_SIZE, which would result in some waste of address space.
> 
> So I just proposed a prototype and added PPC64 support here; other
> archs can decide whether to use it.
> > Can't we do something similar in the core PCI code and detect it?
> So you mean we can do it like this:
> 
> diff --git a/drivers/pci/pci

Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported

2015-12-17 Thread Alex Williamson
On Thu, 2015-12-17 at 18:37 +0800, yongji xie wrote:
> 
> On 2015/12/17 4:14, Alex Williamson wrote:
> > On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
> > > The current vfio-pci implementation disallows mmapping the MSI-X
> > > table in case the user gets to touch it directly.
> > > 
> > > However, the EEH mechanism can ensure that a given PCI device can
> > > only shoot the MSIs assigned to its PE, and the guest kernel also
> > > would not write to the MSI-X table in pci_enable_msix() because of
> > > para-virtualization on the PPC64 platform. So the MSI-X table is
> > > safe to access directly from the guest with the EEH mechanism
> > > enabled.
> > The MSI-X table is paravirtualized on vfio in general and interrupt
> > remapping theoretically protects against errant interrupts, so why
> > is
> > this PPC64 specific?  We have the same safeguards on x86 if we want
> > to
> > decide they're sufficient.  Offhand, the only way I can think that
> > a
> > device can touch the MSI-X table is via backdoors or p2p DMA with
> > another device.
> Maybe I didn't make my point clear. The reasons why we can mmap the
> MSI-X table on PPC64 are:
> 
> 1. The EEH mechanism ensures that a given PCI device can only shoot
> the MSIs assigned to its PE. So it would do no harm to other memory
> space when the guest writes a garbage MSI-X address/data to the
> vector table if we pass through MSI-X tables to the guest.

Interrupt remapping does the same on x86.

> 2. The guest kernel would not write to the MSI-X table on the PPC64
> platform when device drivers call pci_enable_msix() to initialize
> MSI-X interrupts.

This is irrelevant to the vfio API.  vfio is a userspace driver
interface, QEMU is just one possible consumer of the interface.  Even
in the case of PPC64 & QEMU, the guest is still capable of writing to
the vector table, it just probably won't.

> So I think it is safe to mmap/passthrough the MSI-X table on the
> PPC64 platform. And I'm not sure whether other architectures can
> ensure these two points.

There is another consideration, which is the API exposed to the user.
vfio currently enforces interrupt setup through ioctls by making the
PCI mechanisms for interrupt programming inaccessible through the
device regions.  Ignoring that you are only focused on PPC64 with QEMU,
does it make sense for the vfio API to allow a user to manipulate
interrupt programming in a way that not only will not work, but in a
way that we expect to fail and require error isolation to recover from?
I can't say I'm fully convinced that a footnote in the documentation
is sufficient for that.  Thanks,

Alex


Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported

2015-12-17 Thread Alex Williamson
On Thu, 2015-12-17 at 10:08 +, David Laight wrote:
> > The MSI-X table is paravirtualized on vfio in general and interrupt
> > remapping theoretically protects against errant interrupts, so why
> > is
> > this PPC64 specific? We have the same safeguards on x86 if we want
> > to
> > decide they're sufficient. Offhand, the only way I can think that a
> > device can touch the MSI-X table is via backdoors or p2p DMA with
> > another device.
> 
> Is this all related to the statements in the PCI(e) spec that the
> MSI-X table and Pending Bit Array should be in their own BARs?
> (ISTR it even suggests a BAR each.)
> 
> Since the MSI-X table exists in device memory/registers there is
> nothing to stop the device modifying the table contents (or even
> ignoring the contents and writing address+data pairs that are known
> to reference the CPUs MSI-X interrupt generation logic).
> 
> We've an fpga based PCIe slave that has some additional PCIe slaves
> (associated with the interrupt generation logic) that are currently
> next to the PBA (which is 8k from the MSI-X table).
> If we can't map the PBA we can't actually raise any interrupts.
> The same would be true if page size is 64k and mapping the MSI-X
> table banned.
> 
> Do we need to change our PCIe slave address map so we don't need
> to access anything in the same page (which might be 64k were we to
> target large ppc - which we don't at the moment) as both the
> MSI-X table and the PBA?
> 
> I'd also note that being able to read the MSI-X table is a useful
> diagnostic that the relevant interrupts are enabled properly.

Yes, the spec requirement is that MSI-X structures must reside in a 4k
aligned area that doesn't overlap with other configuration registers
for the device.  It's only an advisement to put them into their own
BAR, and 4k clearly wasn't as forward looking as we'd hope.  Vfio
doesn't particularly care about the PBA, but if it resides in the same
host PAGE_SIZE area as the MSI-X vector table, you currently won't be
able to get to it.  Most devices are not at all dependent on the PBA
for any sort of functionality.

It's really more correct to say that both the vector table and PBA are
emulated by QEMU than paravirtualized.  Only PPC64 has the guest OS
taking a paravirtual path to program the vector table, everyone else
attempts to read/write to the device MMIO space, which gets trapped and
emulated in QEMU.  This is why the QEMU side patch has further ugly
hacks to mess with the ordering of MemoryRegions since even if we can
access and mmap the MSI-X vector table, we'll still trap into QEMU for
emulation.

How exactly does the ability to map the PBA affect your ability to
raise an interrupt?  I can only think that maybe you're writing PBA
bits to clear them, but the spec indicates that software should never
write to the PBA, only read, and that writes are undefined.  So that
would be very non-standard, QEMU drops writes, they don't even make it
to the hardware.  Thanks,

Alex


Re: [RFC PATCH 3/3] vfio-pci: Allow to mmap MSI-X table if EEH is supported

2015-12-16 Thread Alex Williamson
On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
> The current vfio-pci implementation disallows mmapping the MSI-X
> table in case the user gets to touch it directly.
> 
> However, the EEH mechanism can ensure that a given PCI device can
> only shoot the MSIs assigned to its PE, and the guest kernel also
> would not write to the MSI-X table in pci_enable_msix() because of
> para-virtualization on the PPC64 platform. So the MSI-X table is safe
> to access directly from the guest with the EEH mechanism enabled.

The MSI-X table is paravirtualized on vfio in general and interrupt
remapping theoretically protects against errant interrupts, so why is
this PPC64 specific?  We have the same safeguards on x86 if we want to
decide they're sufficient.  Offhand, the only way I can think that a
device can touch the MSI-X table is via backdoors or p2p DMA with
another device.

> This patch adds support for this case and allows mmapping the MSI-X
> table if EEH is supported on the PPC64 platform.
> 
> We also add a VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP flag to notify
> userspace that it's safe to mmap the MSI-X table.
> 
> Signed-off-by: Yongji Xie 
> ---
>  drivers/vfio/pci/vfio_pci.c |5 -
>  drivers/vfio/pci/vfio_pci_private.h |5 +
>  include/uapi/linux/vfio.h   |2 ++
>  3 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index dbcad99..85d9980 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -446,6 +446,9 @@ static long vfio_pci_ioctl(void *device_data,
>   if (vfio_pci_bar_page_aligned())
>   info.flags |= VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
>  
> + if (vfio_msix_table_mmap_enabled())
> + info.flags |= VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP;
> +
>   info.num_regions = VFIO_PCI_NUM_REGIONS;
>   info.num_irqs = VFIO_PCI_NUM_IRQS;
>  
> @@ -871,7 +874,7 @@ static int vfio_pci_mmap(void *device_data, struct 
> vm_area_struct *vma)
>   if (phys_len < PAGE_SIZE || req_start + req_len > phys_len)
>   return -EINVAL;
>  
> - if (index == vdev->msix_bar) {
> + if (index == vdev->msix_bar && !vfio_msix_table_mmap_enabled()) {
>   /*
>    * Disallow mmaps overlapping the MSI-X table; users don't
>    * get to touch this directly.  We could find somewhere
> diff --git a/drivers/vfio/pci/vfio_pci_private.h 
> b/drivers/vfio/pci/vfio_pci_private.h
> index 319352a..835619e 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -74,6 +74,11 @@ static inline bool vfio_pci_bar_page_aligned(void)
>   return IS_ENABLED(CONFIG_PPC64);
>  }
>  
> +static inline bool vfio_msix_table_mmap_enabled(void)
> +{
> + return IS_ENABLED(CONFIG_EEH);
> +}

I really dislike these.

> +
>  extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
>  extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1fc8066..289e662 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -173,6 +173,8 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3) /* vfio-amba device */
>  /* Platform support all PCI MMIO BARs to be page aligned */
>  #define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED   (1 << 4)
> +/* Platform support mmapping PCI MSI-X vector table */
> +#define VFIO_DEVICE_FLAGS_PCI_MSIX_MMAP  (1 << 5)

Again, not sure why this is on the device versus the region, but I'd
prefer to investigate whether we can handle this with the sparse mmap
capability (or lack of) in the capability chains I proposed[1]. Thanks,

Alex

[1] https://lkml.org/lkml/2015/11/23/748

>   __u32   num_regions;/* Max region index + 1 */
>   __u32   num_irqs;   /* Max IRQ index + 1 */
>  };



Re: [RFC PATCH 2/3] vfio-pci: Allow to mmap sub-page MMIO BARs if all MMIO BARs are page aligned

2015-12-16 Thread Alex Williamson
On Fri, 2015-12-11 at 16:53 +0800, Yongji Xie wrote:
> The current vfio-pci implementation disallows mmapping sub-page
> (size < PAGE_SIZE) MMIO BARs because these BARs' MMIO page may be
> shared with other BARs.
> 
> But we should allow mmapping these sub-page MMIO BARs if all MMIO
> BARs are page aligned, in which case a BAR's MMIO page would not be
> shared with other BARs.
> 
> This patch adds support for this case; we also add a
> VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED flag to notify userspace that the
> platform aligns all MMIO BARs to page boundaries.
> 
> Signed-off-by: Yongji Xie 
> ---
>  drivers/vfio/pci/vfio_pci.c |   10 +-
>  drivers/vfio/pci/vfio_pci_private.h |5 +
>  include/uapi/linux/vfio.h   |2 ++
>  3 files changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c
> b/drivers/vfio/pci/vfio_pci.c
> index 32b88bd..dbcad99 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -443,6 +443,9 @@ static long vfio_pci_ioctl(void *device_data,
>   if (vdev->reset_works)
>   info.flags |= VFIO_DEVICE_FLAGS_RESET;
>  
> + if (vfio_pci_bar_page_aligned())
> + info.flags |= VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED;
> +
>   info.num_regions = VFIO_PCI_NUM_REGIONS;
>   info.num_irqs = VFIO_PCI_NUM_IRQS;
>  
> @@ -479,7 +482,8 @@ static long vfio_pci_ioctl(void *device_data,
>    VFIO_REGION_INFO_FLAG_WRITE;
>   if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
>   pci_resource_flags(pdev, info.index) &
> - IORESOURCE_MEM && info.size >= PAGE_SIZE)
> + IORESOURCE_MEM && (info.size >= PAGE_SIZE ||
> + vfio_pci_bar_page_aligned()))
>   info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
>   break;
>   case VFIO_PCI_ROM_REGION_INDEX:
> @@ -855,6 +859,10 @@ static int vfio_pci_mmap(void *device_data,
> struct vm_area_struct *vma)
>   return -EINVAL;
>  
>   phys_len = pci_resource_len(pdev, index);
> +
> + if (vfio_pci_bar_page_aligned())
> + phys_len = PAGE_ALIGN(phys_len);
> +
>   req_len = vma->vm_end - vma->vm_start;
>   pgoff = vma->vm_pgoff &
>   ((1U << (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT)) - 1);
> diff --git a/drivers/vfio/pci/vfio_pci_private.h
> b/drivers/vfio/pci/vfio_pci_private.h
> index 0e7394f..319352a 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -69,6 +69,11 @@ struct vfio_pci_device {
>  #define is_irq_none(vdev) (!(is_intx(vdev) || is_msi(vdev) || is_msix(vdev)))
>  #define irq_is(vdev, type) (vdev->irq_type == type)
>  
> +static inline bool vfio_pci_bar_page_aligned(void)
> +{
> + return IS_ENABLED(CONFIG_PPC64);
> +}

I really dislike this.  This is a problem for any architecture that
runs on larger pages, and even an annoyance on 4k hosts.  Why are we
only solving it for PPC64?  Can't we do something similar in the core
PCI code and detect it?

> +
>  extern void vfio_pci_intx_mask(struct vfio_pci_device *vdev);
>  extern void vfio_pci_intx_unmask(struct vfio_pci_device *vdev);
>  
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 751b69f..1fc8066 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -171,6 +171,8 @@ struct vfio_device_info {
>  #define VFIO_DEVICE_FLAGS_PCI  (1 << 1) /* vfio-pci device */
>  #define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)  /* vfio-platform device */
>  #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3) /* vfio-amba device */
> +/* Platform support all PCI MMIO BARs to be page aligned */
> +#define VFIO_DEVICE_FLAGS_PCI_PAGE_ALIGNED   (1 << 4)
>   __u32   num_regions;/* Max region index + 1 */
>   __u32   num_irqs;   /* Max IRQ index + 1 */
>  };

Why is this on the device info, shouldn't it be per region?  Do we even
need a flag or can we just set the existing mmap flag with the
clarification that sub-host page size regions can mmap an entire host-
page aligned, sized area in the documentation?  Thanks,

Alex


Re: [PATCH 3/5] VFIO: Support threaded interrupt handling on VFIO

2015-12-16 Thread Alex Williamson
On Thu, 2015-12-03 at 10:22 -0800, Yunhong Jiang wrote:
> For a VFIO device with the MSI interrupt type, it's possible to
> handle the interrupt in hard interrupt context without invoking the
> interrupt thread. Handling the interrupt in hard interrupt context
> reduces the interrupt latency.
> 
> Signed-off-by: Yunhong Jiang 
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 39 
> ++-
>  1 file changed, 34 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 3b3ba15558b7..108d335c5656 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -236,12 +236,35 @@ static void vfio_intx_disable(struct vfio_pci_device 
> *vdev)
>   kfree(vdev->ctx);
>  }
>  
> +static irqreturn_t vfio_msihandler(int irq, void *arg)
> +{
> + struct vfio_pci_irq_ctx *ctx = arg;
> + struct irq_bypass_producer *producer = &ctx->producer;
> + struct irq_bypass_consumer *consumer;
> + int ret = IRQ_HANDLED, idx;
> +
> + idx = srcu_read_lock(&producer->srcu);
> +
> + list_for_each_entry_rcu(consumer, &producer->consumers, sibling) {
> + /*
> +  * Invoke the thread handler if any consumer would block, but
> +  * finish all consumes.
> +  */
> + if (consumer->handle_irq(consumer->irq_context) == -EWOULDBLOCK)
> + ret = IRQ_WAKE_THREAD;
> + continue;
> + }
> +
> + srcu_read_unlock(&producer->srcu, idx);


There should be an irq bypass manager interface to abstract this.


Re: [PATCH 2/5] VIRT: Support runtime irq_bypass consumer

2015-12-16 Thread Alex Williamson
On Thu, 2015-12-03 at 10:22 -0800, Yunhong Jiang wrote:
> Extend the irq_bypass manager to support runtime consumers. A runtime
> irq_bypass consumer can handle an interrupt when it triggers. A
> runtime consumer has its handle_irq() function set and passes an
> irq_context for the irq handling.
> 
> A producer keeps a list of the runtime consumers, so that it can
> invoke each consumer's handle_irq() when an irq arrives.
> 
> Currently the irq_bypass manager has several code paths assuming
> there is only one consumer/producer pair for each token. For example,
> when registering the producer, it exits the loop after finding one
> matching consumer.  This is updated to support both static consumers
> (like the Posted Interrupt consumer) and runtime consumers.
> 
> Signed-off-by: Yunhong Jiang 
> ---
>  include/linux/irqbypass.h |  8 +
>  virt/lib/irqbypass.c  | 82 
> +++
>  2 files changed, 69 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/irqbypass.h b/include/linux/irqbypass.h
> index 1551b5b2f4c2..d5bec0c7be3a 100644
> --- a/include/linux/irqbypass.h
> +++ b/include/linux/irqbypass.h
> @@ -12,6 +12,7 @@
>  #define IRQBYPASS_H
>  
>  #include 
> +#include 
>  
>  struct irq_bypass_consumer;
>  
> @@ -47,6 +48,9 @@ struct irq_bypass_consumer;
>   */
>  struct irq_bypass_producer {
>   struct list_head node;
> + /* Update side is synchronized by the lock on irqbypass.c */
> + struct srcu_struct srcu;
> + struct list_head consumers;
>   void *token;
>   int irq;
>   int (*add_consumer)(struct irq_bypass_producer *,

Documentation?

> @@ -61,6 +65,7 @@ struct irq_bypass_producer {
>   * struct irq_bypass_consumer - IRQ bypass consumer definition
>   * @node: IRQ bypass manager private list management
>   * @token: opaque token to match between producer and consumer
> + * @sibling: consumers with same token list management
>   * @add_producer: Connect the IRQ consumer to an IRQ producer
>   * @del_producer: Disconnect the IRQ consumer from an IRQ producer
>   * @stop: Perform any quiesce operations necessary prior to add/del 
> (optional)

What about @handle_irq and @irq_context?

> @@ -73,6 +78,7 @@ struct irq_bypass_producer {
>   */
>  struct irq_bypass_consumer {
>   struct list_head node;
> + struct list_head sibling;
>   void *token;
>   int (*add_producer)(struct irq_bypass_consumer *,
>   struct irq_bypass_producer *);
> @@ -80,6 +86,8 @@ struct irq_bypass_consumer {
>    struct irq_bypass_producer *);
>   void (*stop)(struct irq_bypass_consumer *);
>   void (*start)(struct irq_bypass_consumer *);
> + int (*handle_irq)(void *arg);

If we called this with a pointer to the consumer, like the other
functions, the consumer could embed arg (irq_context) into their own
structure, or in this case, do a container_of and avoid storing the
irqfd pointer entirely.

> + void *irq_context;
>  };
>  
>  int irq_bypass_register_producer(struct irq_bypass_producer *);


Re: [PATCH 0/5] Threaded MSI interrupt for VFIO PCI device

2015-12-16 Thread Alex Williamson
On Wed, 2015-12-16 at 18:56 +0100, Paolo Bonzini wrote:
> Alex,
> 
> can you take a look at the extension to the irq bypass interface in
> patch 2?  I'm not sure I understand what is the case where you have
> multiple consumers for the same token.

The consumers would be, for instance, Intel PI + the threaded handler
added in this series.  These run independently, the PI bypass simply
makes the interrupt disappear from the host when it catches it, but if
the vCPU isn't running in the right place at the time of the interrupt,
it gets delivered to the host, in which case the secondary consumer
implementing handle_irq() provides a lower latency injection than the
eventfd path.  If PI isn't supported, only this latter consumer is
registered.

On the surface it seems like a reasonable solution, though having
multiple consumers implementing handle_irq() seems problematic.  Do we
get multiple injections if we call them all?  Should we have some way
to prioritize one handler versus another?  Perhaps KVM should have a
single unified consumer that can provide that sort of logic, though we
still need the srcu code added here to protect against registration and
irq_handler() races.  Thanks,

Alex

> On 03/12/2015 19:22, Yunhong Jiang wrote:
> > When assigning a VFIO device to a KVM guest with low latency
> > requirement, it  
> > is better to handle the interrupt in the hard interrupt context, to
> > reduce 
> > the context switch to/from the IRQ thread.
> > 
> > Based on discussion on https://lkml.org/lkml/2015/10/26/764, the
> > VFIO msi 
> > interrupt is changed to use request_threaded_irq(). The primary
> > interrupt 
> > handler tries to set the guest interrupt atomically. If it fails to
> > achieve 
> > it, a threaded interrupt handler will be invoked.
> > 
> > The irq_bypass manager is extended for this purpose. The KVM
> > eventfd will provide an irqbypass consumer to handle the interrupt
> > in hard interrupt context. The producer will then invoke the
> > consumer's handler.
> > 
> > Yunhong Jiang (5):
> >   Extract the irqfd_wakeup_pollin/irqfd_wakeup_pollup
> >   Support runtime irq_bypass consumer
> >   Support threaded interrupt handling on VFIO
> >   Add the irq handling consumer
> >   Expose x86 kvm_arch_set_irq_inatomic()
> > 
> >  arch/x86/kvm/Kconfig  |   1 +
> >  drivers/vfio/pci/vfio_pci_intrs.c |  39 ++--
> >  include/linux/irqbypass.h |   8 +++
> >  include/linux/kvm_host.h  |  19 +-
> >  include/linux/kvm_irqfd.h |   1 +
> >  virt/kvm/Kconfig  |   3 +
> >  virt/kvm/eventfd.c| 131
> > ++
> >  virt/lib/irqbypass.c  |  82 ++--
> >  8 files changed, 214 insertions(+), 70 deletions(-)
> > 



[GIT PULL] VFIO fixes for v4.4-rc5

2015-12-09 Thread Alex Williamson
Hi Linus,

The following changes since commit 8005c49d9aea74d382f474ce11afbbc7d7130bec:

  Linux 4.4-rc1 (2015-11-15 17:00:27 -0800)

are available in the git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v4.4-rc5

for you to fetch changes up to ae5515d66362b9d96cdcfce504567f0b8b7bd83e:

  Revert: "vfio: Include No-IOMMU mode" (2015-12-04 08:38:42 -0700)


VFIO fixes for v4.4-rc5
 - Various fixes for removing redundancy, const'ifying structs,
   avoiding stack usage, fixing WARN usage (Krzysztof Kozlowski,
   Julia Lawall, Kees Cook, Dan Carpenter)
 - Revert No-IOMMU mode as the intended user has not emerged
   (Alex Williamson)

----
Alex Williamson (1):
  Revert: "vfio: Include No-IOMMU mode"

Dan Carpenter (1):
  vfio: fix a warning message

Julia Lawall (1):
  vfio-pci: constify pci_error_handlers structures

Kees Cook (1):
  vfio: platform: remove needless stack usage

Krzysztof Kozlowski (1):
  vfio: Drop owner assignment from platform_driver

 drivers/vfio/Kconfig |  15 ---
 drivers/vfio/pci/vfio_pci.c  |  10 +-
 drivers/vfio/platform/vfio_platform.c|   1 -
 drivers/vfio/platform/vfio_platform_common.c |   5 +-
 drivers/vfio/vfio.c  | 188 +--
 include/linux/vfio.h |   3 -
 include/uapi/linux/vfio.h|   7 -
 7 files changed, 13 insertions(+), 216 deletions(-)



Re: [PATCH 0/5] Threaded MSI interrupt for VFIO PCI device

2015-12-03 Thread Alex Williamson
On Thu, 2015-12-03 at 10:22 -0800, Yunhong Jiang wrote:
> When assigning a VFIO device to a KVM guest with low latency requirement, it  
> is better to handle the interrupt in the hard interrupt context, to reduce 
> the context switch to/from the IRQ thread.
> 
> Based on discussion on https://lkml.org/lkml/2015/10/26/764, the VFIO msi 
> interrupt is changed to use request_threaded_irq(). The primary interrupt 
> handler tries to set the guest interrupt atomically. If it fails to achieve 
> it, a threaded interrupt handler will be invoked.
> 
> The irq_bypass manager is extended for this purpose. The KVM eventfd
> will provide an irqbypass consumer to handle the interrupt in hard
> interrupt context. The producer will then invoke the consumer's
> handler.

Do you have any performance data?  Thanks,

Alex



Re: [PATCH v2 0/3] Introduce MSI hardware mapping for VFIO

2015-12-03 Thread Alex Williamson
On Thu, 2015-12-03 at 16:16 +0300, Pavel Fedin wrote:
>  Hello!
> 
> > I like that you're making this transparent
> > for the user, but at the same time, directly calling function pointers
> > through the msi_domain_ops is quite ugly.
> 
>  Do you mean dereferencing info->ops->vfio_map() in .c code? I can
> introduce some wrappers in include/linux/msi.h like
> msi_domain_vfio_map()/msi_domain_vfio_unmap(), this would not
> conceptually change anything.

But otherwise we're parsing data structures that are possibly intended
to be private.  An interface also abstracts the implementation from the
caller.

> > There needs to be a real interface there that isn't specific to
> > vfio.
> 
>  Hm... What else is going to use this?

I don't know, but pushing vfio specific data structures and concepts
into core kernel callbacks is clearly the wrong direction to go.

>  Actually, in my implementation the only thing specific to vfio is
> using struct vfio_iommu_driver_ops. This is because we have to perform
> MSI mapping for all "vfio domains" registered for this container. At
> least this is how original type1 driver works.
>  Can anybody explain to me what these "vfio domains" are? From the
> code it looks like we can have several IOMMU instances belonging to
> one VFIO container, and in this case one IOMMU == one "vfio domain".
> So is my understanding correct that a "vfio domain" is an IOMMU
> instance?

There's no such thing as a vfio domain, I think you mean iommu domains.
A vfio container represents a user iommu context.  All of the groups
(and thus devices) within a container have the same iommu mappings.
However, not all of the groups are necessarily behind iommu hardware
units that support the same set of features.  We might therefore need to
mirror the user iommu context for the container across multiple physical
iommu contexts (aka domains).  When we walk the iommu->domain_list,
we're mirroring mappings across these multiple iommu domains within the
container.

>  And here come completely different ideas...
>  First of all, can anybody explain why I perform all mappings on a
> per-IOMMU basis, not on a per-device basis? AFAIK at least the ARM SMMU
> knows about "stream IDs", and therefore it should be capable of
> distinguishing between different devices. So can I have per-device
> mappings? This would make things much simpler.

vfio is built on iommu groups with the premise being that an iommu group
represents the smallest set of devices that are isolated from all other
devices both by iommu visibility and by DMA isolation (peer-to-peer).
Therefore we base iommu mappings on an iommu group because we cannot
enforce userspace isolation at a sub-group level.  In a system or
topology that is well architected for device isolation, there will be a
one-to-one mapping of iommu groups to devices.

So, iommu mappings are always on a per-group basis, but the user may
attach multiple groups to a single container, which as discussed above
represents a single iommu context.  That single context may be backed by
one or more iommu domains, depending on the capabilities of the iommu
hardware.  You're therefore not performing all mappings on a per-iommu
basis unless you have a user defined container spanning multiple iommus
which are incompatible in a way that requires us to manage them with
separate iommu domains.

The iommu's ability to do per device mappings here is irrelevant.
You're working within a user defined IOMMU context where they have
decided that all of the devices should have the same context.

>  So:
>  Idea 1: do per-device mappings. In this case I don't have to track
> down which devices belong to which group and which IOMMU...

Nak, that doesn't solve anything.

>  Idea 2: What if we indeed simply simulate x86 behavior? What if we
> just do a 1:1 mapping for the MSI register when the IOMMU is initialized
> and forget about it, so that MSI messages are guaranteed to reach the
> host? Or would this mean that we would have to do a 1:1 mapping for the
> whole address range? Looks like (1) tried to do something similar,
> with address reservation.

x86 isn't problem-free in this space.  An x86 VM is going to know that
the 0xfee00000 address range is special; it won't be backed by RAM and
won't be a DMA target, thus we'll never attempt to map it for an iova
address.  However, if we run a non-x86 VM or a userspace driver, it
doesn't necessarily know that there's anything special about that range
of iovas.  I intend to resolve this with an extension to the iommu info
ioctl that describes the available iova space for the iommu.  The
interrupt region would simply be excluded.

This may be an option for you too, but you need to consider whether it
precludes things like hotplug.  Even in the x86 case, if we have a
non-x86 VM and we try to hot-add a PCI device, we can't dynamically
remove the RAM that would interfere with the MSI vector block.  I
don't know what that looks like on your platform, whether you can pick a
fixed range for the V

Re: [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition

2015-12-03 Thread Alex Williamson
On Thu, 2015-12-03 at 16:40 +0800, Lan, Tianyu wrote:
> On 12/3/2015 6:25 AM, Alex Williamson wrote:
> > I didn't see a matching kernel patch series for this, but why is the
> > kernel more capable of doing this than userspace is already?
> The following link is the kernel patch.
> http://marc.info/?l=kvm&m=144837328920989&w=2
> 
> > These seem
> > like pointless ioctls, we're creating a purely virtual PCI capability,
> > the kernel doesn't really need to participate in that.
> 
> The VFIO kernel driver has pci_config_map, which indicates the PCI
> capability positions and lengths and helps to find free PCI config regs.
> The Qemu side doesn't have such info and can't get the exact table size
> of a PCI capability. If we want to add such support in Qemu, we need to
> duplicate a lot of code from vfio_pci_configs.c in Qemu.

That's an internal implementation detail of the kernel, not motivation
for creating a new userspace ABI.  QEMU can recreate this data on its
own.  The kernel is in no more of an authoritative position to determine
capability extents than userspace.

> > Also, why are we
> > restricting ourselves to standard capabilities?
> 
> This version is to check whether it's on the right track, and we can
> extend this to PCI extended capabilities later.
> 
> > That's often a crowded
> > space and we can't always know whether an area is free or not based only
> > on it being covered by a capability.  Some capabilities can also appear
> > more than once, so there's context that isn't being passed to the kernel
> > here.  Thanks,
> 
> The regions outside of PCI capabilities are not passed to the kernel or 
> used by Qemu for MSI/MSIX. It's possible to use these places for a new 
> capability. One concern is that the guest driver may abuse them, and a 
> quirk masking some special regs outside of capabilities may be helpful.

That's not correct, see kernel commit
a7d1ea1c11b33bda2691f3294b4d735ed635535a.  Gaps between capabilities are
exposed with raw read-write access from the kernel and some drivers and
devices depend on this.  There's also no guarantee that there's a
sufficiently sized gap in conventional space.  Thanks,

Alex



Re: [RFC PATCH V2 06/10] Qemu/PCI: Add macros for faked PCI migration capability

2015-12-02 Thread Alex Williamson
On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
> This patch is to extend the PCI CAP id for a migration cap and
> add reg macros. The CAP ID is tentative and we may find a better one if
> the solution is feasible.
> 
> *PCI_VF_MIGRATION_CAP
> For the VF driver to control whether the mailbox irq is triggered during migration.
> 
> *PCI_VF_MIGRATION_VMM_STATUS
> Qemu stores migration status in the reg
> 
> *PCI_VF_MIGRATION_VF_STATUS
> VF driver tells Qemu it is ready for migration
> 
> *PCI_VF_MIGRATION_IRQ
> VF driver stores mailbox interrupt vector in the reg for Qemu to trigger 
> during migration.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  include/hw/pci/pci_regs.h | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/include/hw/pci/pci_regs.h b/include/hw/pci/pci_regs.h
> index 57e8c80..0dcaf7e 100644
> --- a/include/hw/pci/pci_regs.h
> +++ b/include/hw/pci/pci_regs.h
> @@ -213,6 +213,7 @@
>  #define  PCI_CAP_ID_MSIX      0x11  /* MSI-X */
>  #define  PCI_CAP_ID_SATA      0x12  /* Serial ATA */
>  #define  PCI_CAP_ID_AF        0x13  /* PCI Advanced Features */
> +#define  PCI_CAP_ID_MIGRATION 0x14
>  #define PCI_CAP_LIST_NEXT     1     /* Next capability in the list */
>  #define PCI_CAP_FLAGS         2     /* Capability defined flags (16 bits) */
>  #define PCI_CAP_SIZEOF        4
> @@ -716,4 +717,22 @@
>  #define PCI_ACS_CTRL 0x06/* ACS Control Register */
>  #define PCI_ACS_EGRESS_CTL_V 0x08/* ACS Egress Control Vector */
>  
> +/* Migration*/
> +#define PCI_VF_MIGRATION_CAP0x04
> +#define PCI_VF_MIGRATION_VMM_STATUS  0x05
> +#define PCI_VF_MIGRATION_VF_STATUS   0x06
> +#define PCI_VF_MIGRATION_IRQ 0x07
> +
> +#define PCI_VF_MIGRATION_CAP_SIZE   0x08
> +
> +#define VMM_MIGRATION_END0x00
> +#define VMM_MIGRATION_START  0x01  
> +
> +#define PCI_VF_WAIT_FOR_MIGRATION   0x00  
> +#define PCI_VF_READY_FOR_MIGRATION  0x01
> +
> +#define PCI_VF_MIGRATION_DISABLE0x00
> +#define PCI_VF_MIGRATION_ENABLE 0x01
> +
> +
>  #endif /* LINUX_PCI_REGS_H */

This will of course break if the PCI SIG defines that capability index.
Couldn't this be done within a vendor defined capability?  Thanks,

Alex



Re: [RFC PATCH V2 09/10] Qemu/VFIO: Add SRIOV VF migration support

2015-12-02 Thread Alex Williamson
On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
> This patch is to add SRIOV VF migration support.
> Create a new device type "vfio-sriov" and add a faked PCI migration
> capability to the new device type.
> 
> The purpose of the new capability
> 1) sync migration status with VF driver in the VM
> 2) Get mailbox irq vector to notify VF driver during migration.
> 3) Provide a way to control injecting irq or not.
> 
> Qemu will migrate PCI config space regs and MSIX config for the VF.
> It injects the mailbox irq at the last stage of migration to notify the
> VF about the migration event and waits for the VF driver to become
> ready. The VF driver writes PCI config reg PCI_VF_MIGRATION_VF_STATUS
> in the new cap table to tell Qemu.

What makes this sr-iov specific?  Why wouldn't we simply extend vfio-pci
with a migration=on feature?  Thanks,

Alex



Re: [RFC PATCH V2 02/10] Qemu/VFIO: Add new VFIO_GET_PCI_CAP_INFO ioctl cmd definition

2015-12-02 Thread Alex Williamson
On Tue, 2015-11-24 at 21:35 +0800, Lan Tianyu wrote:
> Signed-off-by: Lan Tianyu 
> ---
>  linux-headers/linux/vfio.h | 16 
>  1 file changed, 16 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 0508d0b..732b0bd 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -495,6 +495,22 @@ struct vfio_eeh_pe_op {
>  
>  #define VFIO_EEH_PE_OP   _IO(VFIO_TYPE, VFIO_BASE + 21)
>  
> +
> +#define VFIO_FIND_FREE_PCI_CONFIG_REG   _IO(VFIO_TYPE, VFIO_BASE + 22)
> +
> +#define VFIO_GET_PCI_CAP_INFO   _IO(VFIO_TYPE, VFIO_BASE + 22)
> +
> +struct vfio_pci_cap_info {
> +__u32 argsz;
> +__u32 flags;
> +#define VFIO_PCI_CAP_GET_SIZE (1 << 0)
> +#define VFIO_PCI_CAP_GET_FREE_REGION (1 << 1)
> +__u32 index;
> +__u32 offset;
> +__u32 size;
> +__u8 cap;
> +};
> +
>  /* * */
>  
>  #endif /* VFIO_H */

I didn't see a matching kernel patch series for this, but why is the
kernel more capable of doing this than userspace is already?  These seem
like pointless ioctls, we're creating a purely virtual PCI capability,
the kernel doesn't really need to participate in that.  Also, why are we
restricting ourselves to standard capabilities?  That's often a crowded
space and we can't always know whether an area is free or not based only
on it being covered by a capability.  Some capabilities can also appear
more than once, so there's context that isn't being passed to the kernel
here.  Thanks,

Alex



Re: [PATCH v2 0/3] Introduce MSI hardware mapping for VFIO

2015-12-02 Thread Alex Williamson
On Tue, 2015-11-24 at 16:50 +0300, Pavel Fedin wrote:
> On some architectures (e.g. ARM64) if the device is behind an IOMMU, and
> is being mapped by VFIO, it is necessary to also add mappings for MSI
> translation register for interrupts to work. This series implements the
> necessary API to do this, and makes use of this API for GICv3 ITS on
> ARM64.
> 
> v1 => v2:
> - Added dependency on CONFIG_GENERIC_MSI_IRQ_DOMAIN in some parts of the
>   code; should fix the build without this option
> 
> Pavel Fedin (3):
>   vfio: Introduce map and unmap operations
>   gicv3, its: Introduce VFIO map and unmap operations
>   vfio: Introduce generic MSI mapping operations
> 
>  drivers/irqchip/irq-gic-v3-its.c   |  31 ++
>  drivers/vfio/pci/vfio_pci_intrs.c  |  11 
>  drivers/vfio/vfio.c| 116 
> +
>  drivers/vfio/vfio_iommu_type1.c|  29 ++
>  include/linux/irqchip/arm-gic-v3.h |   2 +
>  include/linux/msi.h|  12 
>  include/linux/vfio.h   |  17 +-
>  7 files changed, 217 insertions(+), 1 deletion(-)


Some good points and bad.  I like that you're making this transparent
for the user, but at the same time, directly calling function pointers
through the msi_domain_ops is quite ugly.  There needs to be a real
interface there that isn't specific to vfio.  The downside of making it
transparent to the user is that parts of their IOVA space are being
claimed and they have no way to figure out what they are.  In fact, the
IOMMU mappings bypass the rb-tree that the type1 driver uses, so these
mappings might stomp on existing mappings for the user or the user might
stomp on these.  Neither of which would be much fun to debug.

There have been previous efforts to support MSI mapping in VFIO[1,2],
but none of them have really gone anywhere.  Whatever solution we use
needs to support everyone who needs it.  Thanks,

Alex

[1] http://www.spinics.net/lists/kvm/msg121669.html, 
http://www.spinics.net/lists/kvm/msg121662.html
[2] http://www.spinics.net/lists/kvm/msg119236.html



Re: [PATCH 5/7] vfio: fix a problematic usage of WARN()

2015-11-25 Thread Alex Williamson
On Wed, 2015-11-25 at 21:12 +0800, Geliang Tang wrote:
> WARN() takes a condition and a format string. The condition was
> omitted. So I added it.
> 
> Signed-off-by: Geliang Tang 
> ---
>  drivers/vfio/vfio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index de632da..9da0703 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -682,7 +682,7 @@ static int vfio_group_nb_add_dev(struct vfio_group 
> *group, struct device *dev)
>   return 0;
>  
>   /* TODO Prevent device auto probing */
> - WARN("Device %s added to live group %d!\n", dev_name(dev),
> + WARN(1, "Device %s added to live group %d!\n", dev_name(dev),
>iommu_group_id(group->iommu_group));
>  
>   return 0;

This was already reported and I've got a patch queued to resolve it:

https://www.mail-archive.com/kvm@vger.kernel.org/msg123061.html

Thanks,

Alex



[RFC post-2.5 PATCH 5/5] vfio: Enable sparse mmap capability

2015-11-23 Thread Alex Williamson
The sparse mmap capability in a vfio region info allows vfio to tell
us which sub-areas of a region may be mmap'd.  Thus rather than
assuming a single mmap covers the entire region and later frobbing it
ourselves for things like the PCI MSI-X vector table, we can read that
directly from vfio.

Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c |   66 +++---
 trace-events |2 ++
 2 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index a73c6ad..3b7cf23 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -492,6 +492,52 @@ static void vfio_listener_release(VFIOContainer *container)
 memory_listener_unregister(&container->listener);
 }
 
+static struct vfio_info_cap_header *
+vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
+{
+struct vfio_info_cap_header *header;
+
+if (!(info->flags & VFIO_REGION_INFO_FLAG_CAPS)) {
+return NULL;
+}
+
+for (header = (void *)info + info->cap_offset;
+ (void *)header > (void *)info; header = (void *)info + header->next) {
+if (header->id == id) {
+return header;
+}
+}
+
+return NULL;
+}
+
+static void vfio_setup_region_sparse_mmaps(VFIORegion *region,
+   struct vfio_region_info *info)
+{
+struct vfio_region_info_cap_sparse_mmap *sparse;
+int i;
+
+sparse = (void *)vfio_get_region_info_cap(info,
+  VFIO_REGION_INFO_CAP_SPARSE_MMAP);
+if (!sparse) {
+return;
+}
+
+trace_vfio_region_sparse_mmap_header(region->vbasedev->name,
+ region->nr, sparse->nr_areas);
+
+region->nr_mmaps = sparse->nr_areas;
+region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
+
+for (i = 0; i < region->nr_mmaps; i++) {
+region->mmaps[i].offset = sparse->areas[i].offset;
+region->mmaps[i].size = sparse->areas[i].size;
+trace_vfio_region_sparse_mmap_entry(i, region->mmaps[i].offset,
+region->mmaps[i].offset +
+region->mmaps[i].size);
+}
+}
+
 int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
   int index, const char *name)
 {
@@ -515,11 +561,15 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, 
VFIORegion *region,
   region, name, region->size);
 
 if (!vbasedev->no_mmap && region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
-region->nr_mmaps = 1;
-region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
+vfio_setup_region_sparse_mmaps(region, info);
+
+if (!region->nr_mmaps) {
+region->nr_mmaps = 1;
+region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
 
-region->mmaps[0].offset = 0;
-region->mmaps[0].size = region->size;
+region->mmaps[0].offset = 0;
+region->mmaps[0].size = region->size;
+}
 }
 }
 
@@ -1079,6 +1129,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
 *info = g_malloc0(argsz);
 
 (*info)->index = index;
+retry:
 (*info)->argsz = argsz;
 
 if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
@@ -1086,6 +1137,13 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
 return -errno;
 }
 
+if ((*info)->argsz > argsz) {
+argsz = (*info)->argsz;
+*info = g_realloc(*info, argsz);
+
+goto retry;
+}
+
 return 0;
 }
 
diff --git a/trace-events b/trace-events
index 0128680..9c6c12b 100644
--- a/trace-events
+++ b/trace-events
@@ -1708,6 +1708,8 @@ vfio_region_mmap(const char *name, unsigned long offset, 
unsigned long end) "Reg
 vfio_region_exit(const char *name, int index) "Device %s, region %d"
 vfio_region_finalize(const char *name, int index) "Device %s, region %d"
vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps enabled: %d"
+vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
+vfio_region_sparse_mmap_entry(int i, off_t start, off_t end) "sparse entry %d [0x%lx - 0x%lx]"
 
 # hw/vfio/platform.c
vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"



[RFC post-2.5 PATCH 2/5] vfio: Generalize region support

2015-11-23 Thread Alex Williamson
Both platform and PCI vfio drivers create a "slow", I/O memory region
with one or more mmap memory regions overlaid when supported by the
device. Generalize this to a set of common helpers in the core that
pulls the region info from vfio, fills the region data, configures
slow mapping, and adds helpers for completing the mmap, enable/disable,
and teardown.  This can be immediately used by the PCI MSI-X code,
which needs to mmap around the MSI-X vector table and will also be
used with the vfio sparse mmap capability.

This also changes VFIORegion.mem to be dynamically allocated because
otherwise we don't know how the caller has allocated VFIORegion and
therefore don't know whether to unreference it to destroy the
MemoryRegion or not.

Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c  |  169 ++---
 hw/vfio/pci-quirks.c  |   24 +++---
 hw/vfio/pci.c |  164 
 hw/vfio/platform.c|   72 +++--
 include/hw/vfio/vfio-common.h |   23 --
 trace-events  |   10 ++
 6 files changed, 272 insertions(+), 190 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 901a2b9..a73c6ad 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -492,46 +492,159 @@ static void vfio_listener_release(VFIOContainer 
*container)
 memory_listener_unregister(&container->listener);
 }
 
-int vfio_mmap_region(Object *obj, VFIORegion *region,
- MemoryRegion *mem, MemoryRegion *submem,
- void **map, size_t size, off_t offset,
- const char *name)
+int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
+  int index, const char *name)
 {
-int ret = 0;
-VFIODevice *vbasedev = region->vbasedev;
+struct vfio_region_info *info;
+int ret;
+
+ret = vfio_get_region_info(vbasedev, index, &info);
+if (ret) {
+return ret;
+}
+
+region->vbasedev = vbasedev;
+region->flags = info->flags;
+region->size = info->size;
+region->fd_offset = info->offset;
+region->nr = index;
 
-if (!vbasedev->no_mmap && size && region->flags &
-VFIO_REGION_INFO_FLAG_MMAP) {
-int prot = 0;
+if (region->size) {
+region->mem = g_new0(MemoryRegion, 1);
+memory_region_init_io(region->mem, obj, &vfio_region_ops,
+  region, name, region->size);
 
-if (region->flags & VFIO_REGION_INFO_FLAG_READ) {
-prot |= PROT_READ;
+if (!vbasedev->no_mmap && region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
+region->nr_mmaps = 1;
+region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
+
+region->mmaps[0].offset = 0;
+region->mmaps[0].size = region->size;
 }
+}
 
-if (region->flags & VFIO_REGION_INFO_FLAG_WRITE) {
-prot |= PROT_WRITE;
+g_free(info);
+
+trace_vfio_region_setup(vbasedev->name, index, name,
+region->flags, region->fd_offset, region->size);
+return 0;
+}
+
+int vfio_region_mmap(VFIORegion *region)
+{
+int i, prot = 0;
+char *name;
+
+if (!region->mem) {
+return 0;
+}
+
+prot |= region->flags & VFIO_REGION_INFO_FLAG_READ ? PROT_READ : 0;
+prot |= region->flags & VFIO_REGION_INFO_FLAG_WRITE ? PROT_WRITE : 0;
+
+for (i = 0; i < region->nr_mmaps; i++) {
+region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
+ MAP_SHARED, region->vbasedev->fd,
+ region->fd_offset +
+ region->mmaps[i].offset);
+if (region->mmaps[i].mmap == MAP_FAILED) {
+int ret = -errno;
+
+trace_vfio_region_mmap_fault(memory_region_name(region->mem), i,
+ region->fd_offset +
+ region->mmaps[i].offset,
+ region->fd_offset +
+ region->mmaps[i].offset +
+ region->mmaps[i].size - 1, ret);
+
+region->mmaps[i].mmap = NULL;
+
+for (i--; i >= 0; i--) {
+memory_region_del_subregion(region->mem, &region->mmaps[i].mem);
+munmap(region->mmaps[i].mmap, region->mmaps[i].size);
+object_unparent(OBJECT(®ion->mmaps[i].mem));
+region->mmaps[i].mmap = NULL;
+}
+
+return ret;
 }
 
-*map = mmap(NULL, size, prot, MAP_SHARED,

[RFC post-2.5 PATCH 4/5] linux-headers/vfio: Update for proposed capabilities list

2015-11-23 Thread Alex Williamson
Signed-off-by: Alex Williamson 
---
 linux-headers/linux/vfio.h |   53 +++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index aa276bc..c3860f6 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -52,6 +52,33 @@
 #define VFIO_TYPE  (';')
 #define VFIO_BASE  100
 
+/*
+ * For extension of INFO ioctls, VFIO makes use of a capability chain
+ * designed after PCI/e capabilities.  A flag bit indicates whether
+ * this capability chain is supported and a field defined in the fixed
+ * structure defines the offset of the first capability in the chain.
+ * This field is only valid when the corresponding bit in the flags
+ * bitmap is set.  This offset field is relative to the start of the
+ * INFO buffer, as is the next field within each capability header.
+ * The id within the header is a shared address space per INFO ioctl,
+ * while the version field is specific to the capability id.  The
+ * contents following the header are specific to the capability id.
+ */
+struct vfio_info_cap_header {
+   __u16   id; /* Identifies capability */
+   __u16   version;/* Version specific to the capability ID */
+   __u32   next;   /* Offset of next capability */
+};
+
+/*
+ * Callers of INFO ioctls passing insufficiently sized buffers will see
+ * the capability chain flag bit set, a zero value for the first capability
+ * offset (if available within the provided argsz), and argsz will be
+ * updated to report the necessary buffer size.  For compatibility, the
+ * INFO ioctl will not report error in this case, but the capability chain
+ * will not be available.
+ */
+
 /*  IOCTLs for VFIO file descriptor (/dev/vfio/vfio)  */
 
 /**
@@ -187,13 +214,37 @@ struct vfio_region_info {
 #define VFIO_REGION_INFO_FLAG_READ (1 << 0) /* Region supports read */
 #define VFIO_REGION_INFO_FLAG_WRITE(1 << 1) /* Region supports write */
 #define VFIO_REGION_INFO_FLAG_MMAP (1 << 2) /* Region supports mmap */
+#define VFIO_REGION_INFO_FLAG_CAPS (1 << 3) /* Info supports caps */
__u32   index;  /* Region index */
-   __u32   resv;   /* Reserved for alignment */
+   __u32   cap_offset; /* Offset within info struct of first cap */
__u64   size;   /* Region size (bytes) */
__u64   offset; /* Region offset from start of device fd */
 };
 #define VFIO_DEVICE_GET_REGION_INFO_IO(VFIO_TYPE, VFIO_BASE + 8)
 
+/*
+ * The sparse mmap capability allows finer granularity of specifying areas
+ * within a region with mmap support.  When specified, the user should only
+ * mmap the offset ranges specified by the areas array.  mmaps outside of the
+ * areas specified may fail (such as the range covering a PCI MSI-X table) or
+ * may result in improper device behavior.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_REGION_INFO_CAP_SPARSE_MMAP   1
+
+struct vfio_region_sparse_mmap_area {
+   __u64   offset; /* Offset of mmap'able area within region */
+   __u64   size;   /* Size of mmap'able area */
+};
+
+struct vfio_region_info_cap_sparse_mmap {
+   struct vfio_info_cap_header header;
+   __u32   nr_areas;
+   __u32   reserved;
+   struct vfio_region_sparse_mmap_area areas[];
+};
+
 /**
  * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
  * struct vfio_irq_info)



[RFC post-2.5 PATCH 3/5] vfio/pci: Convert all MemoryRegion to dynamic alloc and consistent functions

2015-11-23 Thread Alex Williamson
Match common vfio code with setup, exit, and finalize functions for
BAR, quirk, and VGA management.  VGA is also changed to dynamic
allocation to match the other MemoryRegions.

Signed-off-by: Alex Williamson 
---
 hw/vfio/pci-quirks.c |   38 -
 hw/vfio/pci.c|  114 +-
 hw/vfio/pci.h|   10 ++--
 3 files changed, 71 insertions(+), 91 deletions(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index 92a2d9d..d4b076f 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -289,10 +289,10 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice 
*vdev)
 
 memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_ati_3c3_quirk, vdev,
   "vfio-ati-3c3-quirk", 1);
-memory_region_add_subregion(&vdev->vga.region[QEMU_PCI_VGA_IO_HI].mem,
+memory_region_add_subregion(&vdev->vga->region[QEMU_PCI_VGA_IO_HI].mem,
 3 /* offset 3 bytes from 0x3c0 */, quirk->mem);
 
-QLIST_INSERT_HEAD(&vdev->vga.region[QEMU_PCI_VGA_IO_HI].quirks,
+QLIST_INSERT_HEAD(&vdev->vga->region[QEMU_PCI_VGA_IO_HI].quirks,
   quirk, next);
 
 trace_vfio_quirk_ati_3c3_probe(vdev->vbasedev.name);
@@ -427,7 +427,7 @@ static uint64_t vfio_nvidia_3d4_quirk_read(void *opaque,
 
 quirk->state = NONE;
 
-return vfio_vga_read(&vdev->vga.region[QEMU_PCI_VGA_IO_HI],
+return vfio_vga_read(&vdev->vga->region[QEMU_PCI_VGA_IO_HI],
  addr + 0x14, size);
 }
 
@@ -464,7 +464,7 @@ static void vfio_nvidia_3d4_quirk_write(void *opaque, 
hwaddr addr,
 break;
 }
 
-vfio_vga_write(&vdev->vga.region[QEMU_PCI_VGA_IO_HI],
+vfio_vga_write(&vdev->vga->region[QEMU_PCI_VGA_IO_HI],
addr + 0x14, data, size);
 }
 
@@ -480,7 +480,7 @@ static uint64_t vfio_nvidia_3d0_quirk_read(void *opaque,
 VFIONvidia3d0Quirk *quirk = opaque;
 VFIOPCIDevice *vdev = quirk->vdev;
 VFIONvidia3d0State old_state = quirk->state;
-uint64_t data = vfio_vga_read(&vdev->vga.region[QEMU_PCI_VGA_IO_HI],
+uint64_t data = vfio_vga_read(&vdev->vga->region[QEMU_PCI_VGA_IO_HI],
   addr + 0x10, size);
 
 quirk->state = NONE;
@@ -522,7 +522,7 @@ static void vfio_nvidia_3d0_quirk_write(void *opaque, 
hwaddr addr,
 }
 }
 
-vfio_vga_write(&vdev->vga.region[QEMU_PCI_VGA_IO_HI],
+vfio_vga_write(&vdev->vga->region[QEMU_PCI_VGA_IO_HI],
addr + 0x10, data, size);
 }
 
@@ -550,15 +550,15 @@ static void vfio_vga_probe_nvidia_3d0_quirk(VFIOPCIDevice 
*vdev)
 
 memory_region_init_io(&quirk->mem[0], OBJECT(vdev), &vfio_nvidia_3d4_quirk,
   data, "vfio-nvidia-3d4-quirk", 2);
-memory_region_add_subregion(&vdev->vga.region[QEMU_PCI_VGA_IO_HI].mem,
+memory_region_add_subregion(&vdev->vga->region[QEMU_PCI_VGA_IO_HI].mem,
 0x14 /* 0x3c0 + 0x14 */, &quirk->mem[0]);
 
 memory_region_init_io(&quirk->mem[1], OBJECT(vdev), &vfio_nvidia_3d0_quirk,
   data, "vfio-nvidia-3d0-quirk", 2);
-memory_region_add_subregion(&vdev->vga.region[QEMU_PCI_VGA_IO_HI].mem,
+memory_region_add_subregion(&vdev->vga->region[QEMU_PCI_VGA_IO_HI].mem,
 0x10 /* 0x3c0 + 0x10 */, &quirk->mem[1]);
 
-QLIST_INSERT_HEAD(&vdev->vga.region[QEMU_PCI_VGA_IO_HI].quirks,
+QLIST_INSERT_HEAD(&vdev->vga->region[QEMU_PCI_VGA_IO_HI].quirks,
   quirk, next);
 
 trace_vfio_quirk_nvidia_3d0_probe(vdev->vbasedev.name);
@@ -969,28 +969,28 @@ void vfio_vga_quirk_setup(VFIOPCIDevice *vdev)
 vfio_vga_probe_nvidia_3d0_quirk(vdev);
 }
 
-void vfio_vga_quirk_teardown(VFIOPCIDevice *vdev)
+void vfio_vga_quirk_exit(VFIOPCIDevice *vdev)
 {
 VFIOQuirk *quirk;
 int i, j;
 
-for (i = 0; i < ARRAY_SIZE(vdev->vga.region); i++) {
-QLIST_FOREACH(quirk, &vdev->vga.region[i].quirks, next) {
+for (i = 0; i < ARRAY_SIZE(vdev->vga->region); i++) {
+QLIST_FOREACH(quirk, &vdev->vga->region[i].quirks, next) {
 for (j = 0; j < quirk->nr_mem; j++) {
-memory_region_del_subregion(&vdev->vga.region[i].mem,
+memory_region_del_subregion(&vdev->vga->region[i].mem,
 &quirk->mem[j]);
 }
 }
 }
 }
 
-void vfio_vga_quirk_free(VFIOPCIDevice *vdev)
+void vfio_vga_quirk_finalize(VFIOPCIDevice *vdev)
 {
 int i, j;
 
-for (i = 0; i < ARRAY_SIZE(vdev->vga.region); i++) {
-while (!QLIST_EMPTY(&vdev->vga.region[i].quirks)) {
-  

[RFC post-2.5 PATCH 1/5] vfio: Wrap VFIO_DEVICE_GET_REGION_INFO

2015-11-23 Thread Alex Williamson
In preparation for supporting capability chains on regions, wrap
ioctl(VFIO_DEVICE_GET_REGION_INFO) so we don't duplicate the code for
each caller.

Signed-off-by: Alex Williamson 
---
 hw/vfio/common.c  |   18 +
 hw/vfio/pci.c |   81 +
 hw/vfio/platform.c|   13 ---
 include/hw/vfio/vfio-common.h |3 ++
 4 files changed, 69 insertions(+), 46 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 6797208..901a2b9 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -958,6 +958,24 @@ void vfio_put_base_device(VFIODevice *vbasedev)
 close(vbasedev->fd);
 }
 
+int vfio_get_region_info(VFIODevice *vbasedev, int index,
+ struct vfio_region_info **info)
+{
+size_t argsz = sizeof(struct vfio_region_info);
+
+*info = g_malloc0(argsz);
+
+(*info)->index = index;
+(*info)->argsz = argsz;
+
+if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
+g_free(*info);
+return -errno;
+}
+
+return 0;
+}
+
 static int vfio_container_do_ioctl(AddressSpace *as, int32_t groupid,
int req, void *param)
 {
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1fb868c..a4f3f1f 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -768,25 +768,25 @@ static void vfio_update_msi(VFIOPCIDevice *vdev)
 
 static void vfio_pci_load_rom(VFIOPCIDevice *vdev)
 {
-struct vfio_region_info reg_info = {
-.argsz = sizeof(reg_info),
-.index = VFIO_PCI_ROM_REGION_INDEX
-};
+struct vfio_region_info *reg_info;
 uint64_t size;
 off_t off = 0;
 ssize_t bytes;
 
-if (ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info)) {
+if (vfio_get_region_info(&vdev->vbasedev,
+ VFIO_PCI_ROM_REGION_INDEX, ®_info)) {
 error_report("vfio: Error getting ROM info: %m");
 return;
 }
 
-trace_vfio_pci_load_rom(vdev->vbasedev.name, (unsigned long)reg_info.size,
-(unsigned long)reg_info.offset,
-(unsigned long)reg_info.flags);
+trace_vfio_pci_load_rom(vdev->vbasedev.name, (unsigned long)reg_info->size,
+(unsigned long)reg_info->offset,
+(unsigned long)reg_info->flags);
+
+vdev->rom_size = size = reg_info->size;
+vdev->rom_offset = reg_info->offset;
 
-vdev->rom_size = size = reg_info.size;
-vdev->rom_offset = reg_info.offset;
+g_free(reg_info);
 
 if (!vdev->rom_size) {
 vdev->rom_read_failed = true;
@@ -2010,7 +2010,7 @@ static VFIODeviceOps vfio_pci_ops = {
 static int vfio_populate_device(VFIOPCIDevice *vdev)
 {
 VFIODevice *vbasedev = &vdev->vbasedev;
-struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) };
+struct vfio_region_info *reg_info;
 struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) };
 int i, ret = -1;
 
@@ -2032,72 +2032,73 @@ static int vfio_populate_device(VFIOPCIDevice *vdev)
 }
 
 for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) {
-reg_info.index = i;
-
-ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+ret = vfio_get_region_info(vbasedev, i, &reg_info);
 if (ret) {
 error_report("vfio: Error getting region %d info: %m", i);
 goto error;
 }
 
 trace_vfio_populate_device_region(vbasedev->name, i,
-  (unsigned long)reg_info.size,
-  (unsigned long)reg_info.offset,
-  (unsigned long)reg_info.flags);
+  (unsigned long)reg_info->size,
+  (unsigned long)reg_info->offset,
+  (unsigned long)reg_info->flags);
 
 vdev->bars[i].region.vbasedev = vbasedev;
-vdev->bars[i].region.flags = reg_info.flags;
-vdev->bars[i].region.size = reg_info.size;
-vdev->bars[i].region.fd_offset = reg_info.offset;
+vdev->bars[i].region.flags = reg_info->flags;
+vdev->bars[i].region.size = reg_info->size;
+vdev->bars[i].region.fd_offset = reg_info->offset;
 vdev->bars[i].region.nr = i;
 QLIST_INIT(&vdev->bars[i].quirks);
-}
 
-reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX;
+g_free(reg_info);
+}
 
-ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info);
+ret = vfio_get_region_info(vbasedev,
+   VFIO_PCI_CONFIG_REGION_INDEX, &reg_info);
 if (ret) {
error_report("vfio: Error getting config info: %m");

[RFC post-2.5 PATCH 0/5] VFIO: capability chains

2015-11-23 Thread Alex Williamson
This is the matching QEMU changes for the proposed kernel-side
capability chains.  Unfortunately there's also a lot of churn to get
to consistent interfaces involved in this series, which allow us to
generically handle multiple mmaps overlapping a region and making the
actual step of consuming the new kernel data trivial.  The last patch
shows the proof of concept for reallocating the info buffer when
necessary and finding and using capability data.  Apologies for the
patches being a little rough, but I hope the concept shows through.
As noted above, this is of course post-2.5 material and of course
kernel header updates would only be included when accepted upstream.
Thanks,

Alex

---

Alex Williamson (5):
  vfio: Wrap VFIO_DEVICE_GET_REGION_INFO
  vfio: Generalize region support
  vfio/pci: Convert all MemoryRegion to dynamic alloc and consistent 
functions
  linux-headers/vfio: Update for proposed capabilities list
  vfio: Enable sparse mmap capability


 hw/vfio/common.c  |  245 +++
 hw/vfio/pci-quirks.c  |   62 
 hw/vfio/pci.c |  329 +++--
 hw/vfio/pci.h |   10 +
 hw/vfio/platform.c|   71 ++---
 include/hw/vfio/vfio-common.h |   26 ++-
 linux-headers/linux/vfio.h|   53 ++-
 trace-events  |   12 +
 8 files changed, 502 insertions(+), 306 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 3/3] vfio/pci: Include sparse mmap capability for MSI-X table regions

2015-11-23 Thread Alex Williamson
vfio-pci has never allowed the user to directly mmap the MSI-X vector
table, but we've always relied on implicit knowledge of the user that
they cannot do this.  Now that we have capability chains that we can
expose in the region info ioctl and a sparse mmap capability that
represents the sub-areas within the region that can be mmap'd, we can
make the mmap constraints more explicit.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/pci/vfio_pci.c |  101 +++
 1 file changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 32b88bd..46e7aed 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -421,6 +421,77 @@ static int vfio_pci_for_each_slot_or_bus(struct pci_dev 
*pdev,
return walk.ret;
 }
 
+struct caps {
+   struct vfio_info_cap_header *buf;
+   size_t size;
+   size_t head;
+};
+
+static void *add_region_info_cap(struct caps *caps,
+size_t size, u16 id, u16 version)
+{
+   void *tmp;
+   struct vfio_info_cap_header *header;
+
+   /* This would be ridiculous and exceeds the ioctl's abilities */
+   BUG_ON(caps->size + size + sizeof(struct vfio_region_info) > U32_MAX);
+
+   tmp = krealloc(caps->buf, caps->size + size, GFP_KERNEL);
+   if (!tmp) {
+   kfree(caps->buf);
+   caps->size = 0;
+   return ERR_PTR(-ENOMEM);
+   }
+
+   caps->buf = tmp;
+   header = tmp + caps->size;
+   header->id = id;
+   header->version = version;
+   header->next = caps->head;
+   caps->head = caps->size + sizeof(struct vfio_region_info);
+   caps->size += size;
+
+   return header;
+}
+
+static int msix_sparse_mmap_cap(struct vfio_pci_device *vdev, struct caps 
*caps)
+{
+   struct vfio_region_info_cap_sparse_mmap *sparse;
+   size_t end, size;
+   int nr_areas = 2, i = 0;
+
+   end = pci_resource_len(vdev->pdev, vdev->msix_bar);
+
+   /* If MSI-X table is aligned to the start or end, only one area */
+   if (((vdev->msix_offset & PAGE_MASK) == 0) ||
+   (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) >= end))
+   nr_areas = 1;
+
+   size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+   sparse = add_region_info_cap(caps, size,
+VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
+   if (IS_ERR(sparse))
+   return PTR_ERR(sparse);
+
+   sparse->nr_areas = nr_areas;
+
+   if (vdev->msix_offset & PAGE_MASK) {
+   sparse->areas[i].offset = 0;
+   sparse->areas[i].size = vdev->msix_offset & PAGE_MASK;
+   i++;
+   }
+
+   if (PAGE_ALIGN(vdev->msix_offset + vdev->msix_size) < end) {
+   sparse->areas[i].offset = PAGE_ALIGN(vdev->msix_offset +
+vdev->msix_size);
+   sparse->areas[i].size = end - sparse->areas[i].offset;
+   i++;
+   }
+
+   return 0;
+}
+
 static long vfio_pci_ioctl(void *device_data,
   unsigned int cmd, unsigned long arg)
 {
@@ -451,6 +522,8 @@ static long vfio_pci_ioctl(void *device_data,
} else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
struct pci_dev *pdev = vdev->pdev;
struct vfio_region_info info;
+   struct caps caps = { .buf = NULL, .size = 0, .head = 0 };
+   int ret;
 
minsz = offsetofend(struct vfio_region_info, offset);
 
@@ -479,8 +552,15 @@ static long vfio_pci_ioctl(void *device_data,
 VFIO_REGION_INFO_FLAG_WRITE;
if (IS_ENABLED(CONFIG_VFIO_PCI_MMAP) &&
pci_resource_flags(pdev, info.index) &
-   IORESOURCE_MEM && info.size >= PAGE_SIZE)
+   IORESOURCE_MEM && info.size >= PAGE_SIZE) {
info.flags |= VFIO_REGION_INFO_FLAG_MMAP;
+   if (info.index == vdev->msix_bar) {
+   ret = msix_sparse_mmap_cap(vdev, &caps);
+   if (ret)
+   return ret;
+   }
+   }
+
break;
case VFIO_PCI_ROM_REGION_INDEX:
{
@@ -520,6 +600,25 @@ static long vfio_pci_ioctl(void *device_data,
return -EINVAL;
}
 
+   if (caps.size) {
+   info.flags |= VFIO_REGION_INFO_FLAG_CAPS;
+   if (info.argsz < sizeof(info) + caps.size) {
+

[RFC PATCH 1/3] vfio: Define capability chains

2015-11-23 Thread Alex Williamson
We have a few cases where we need to extend the data returned from the
INFO ioctls in VFIO.  For instance we already have devices exposed
through vfio-pci where VFIO_DEVICE_GET_REGION_INFO reports the region
as mmap-capable, but really only supports sparse mmaps, avoiding the
MSI-X table.  If we wanted to provide in-kernel emulation or extended
functionality for devices, we'd also want the ability to tell the
user not to mmap various regions, rather than forcing them to figure
it out on their own.

Another example is VFIO_IOMMU_GET_INFO.  We'd really like to expose
the actual IOVA capabilities of the IOMMU rather than letting the
user assume the address space they have available to them.  We could
add IOVA base and size fields to struct vfio_iommu_type1_info, but
what if we have multiple IOVA ranges.  For instance x86 uses a range
of addresses at 0xfee00000 for MSI vectors.  These typically are not
available for standard DMA IOVA mappings and splits our available IOVA
space into two ranges.  POWER systems have both an IOVA window below
4G as well as dynamic data window which they can use to remap all of
guest memory.

Representing variable sized arrays within a fixed structure makes it
very difficult to parse, we'd therefore like to put this data beyond
fixed fields within the data structures.  One way to do this is to
emulate capabilities in PCI configuration space.  A new flag indicates
whether capabilities are supported and a new fixed field reports the
offset of the first entry.  Users can then walk the chain to find
capabilities, adding capabilities does not require additional fields
in the fixed structure, and parsing variable sized data becomes
trivial.

This patch outlines the theory and base header structure, which
should be shared by all future users.

Signed-off-by: Alex Williamson 
---
 include/uapi/linux/vfio.h |   27 +++
 1 file changed, 27 insertions(+)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 751b69f..432569f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -59,6 +59,33 @@
 #define VFIO_TYPE  (';')
 #define VFIO_BASE  100
 
+/*
+ * For extension of INFO ioctls, VFIO makes use of a capability chain
+ * designed after PCI/e capabilities.  A flag bit indicates whether
+ * this capability chain is supported and a field defined in the fixed
+ * structure defines the offset of the first capability in the chain.
+ * This field is only valid when the corresponding bit in the flags
+ * bitmap is set.  This offset field is relative to the start of the
+ * INFO buffer, as is the next field within each capability header.
+ * The id within the header is a shared address space per INFO ioctl,
+ * while the version field is specific to the capability id.  The
+ * contents following the header are specific to the capability id.
+ */
+struct vfio_info_cap_header {
+   __u16   id; /* Identifies capability */
+   __u16   version;/* Version specific to the capability ID */
+   __u32   next;   /* Offset of next capability */
+};
+
+/*
+ * Callers of INFO ioctls passing insufficiently sized buffers will see
+ * the capability chain flag bit set, a zero value for the first capability
+ * offset (if available within the provided argsz), and argsz will be
+ * updated to report the necessary buffer size.  For compatibility, the
+ * INFO ioctl will not report error in this case, but the capability chain
+ * will not be available.
+ */
+
 /*  IOCTLs for VFIO file descriptor (/dev/vfio/vfio)  */
 
 /**

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 2/3] vfio: Define sparse mmap capability for regions

2015-11-23 Thread Alex Williamson
We can't always support mmap across an entire device region, for
example we deny mmaps covering the MSI-X table of PCI devices, but
we don't really have a way to report it.  We expect the user to
implicitly know this restriction.  We also can't split the region
because vfio-pci defines an API with fixed region index to BAR
number mapping.  We therefore define a new capability which lists
areas within the region that may be mmap'd.  In addition to the
MSI-X case, this potentially enables in-kernel emulation and
extensions to devices.

Signed-off-by: Alex Williamson 
---
 include/uapi/linux/vfio.h |   26 +-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 432569f..d3f6499 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -221,13 +221,37 @@ struct vfio_region_info {
 #define VFIO_REGION_INFO_FLAG_READ (1 << 0) /* Region supports read */
#define VFIO_REGION_INFO_FLAG_WRITE    (1 << 1) /* Region supports write */
 #define VFIO_REGION_INFO_FLAG_MMAP (1 << 2) /* Region supports mmap */
+#define VFIO_REGION_INFO_FLAG_CAPS (1 << 3) /* Info supports caps */
__u32   index;  /* Region index */
-   __u32   resv;   /* Reserved for alignment */
+   __u32   cap_offset; /* Offset within info struct of first cap */
__u64   size;   /* Region size (bytes) */
__u64   offset; /* Region offset from start of device fd */
 };
#define VFIO_DEVICE_GET_REGION_INFO    _IO(VFIO_TYPE, VFIO_BASE + 8)
 
+/*
+ * The sparse mmap capability allows finer granularity of specifying areas
+ * within a region with mmap support.  When specified, the user should only
+ * mmap the offset ranges specified by the areas array.  mmaps outside of the
+ * areas specified may fail (such as the range covering a PCI MSI-X table) or
+ * may result in improper device behavior.
+ *
+ * The structures below define version 1 of this capability.
+ */
+#define VFIO_REGION_INFO_CAP_SPARSE_MMAP   1
+
+struct vfio_region_sparse_mmap_area {
+   __u64   offset; /* Offset of mmap'able area within region */
+   __u64   size;   /* Size of mmap'able area */
+};
+
+struct vfio_region_info_cap_sparse_mmap {
+   struct vfio_info_cap_header header;
+   __u32   nr_areas;
+   __u32   reserved;
+   struct vfio_region_sparse_mmap_area areas[];
+};
+
 /**
  * VFIO_DEVICE_GET_IRQ_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 9,
  * struct vfio_irq_info)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC PATCH 0/3] VFIO: capability chains

2015-11-23 Thread Alex Williamson
Please see the commit log and comments in patch 1 for a general
explanation of the problems that this series tries to address.  The
general problem is that we have several cases where we want to expose
variable sized information to the user, whether it's sparse mmaps for
a region, as implemented here, or DMA mapping ranges of an IOMMU, or
reserved MSI mapping ranges, etc.  Extending data structures is hard;
extending them to report variable sized data is really hard.  After
considering several options, I think the best approach is to copy how
PCI does capabilities.  This allows the ioctl to only expose the
capabilities that are relevant for them, avoids data structures that
are too complicated to parse, and avoids creating a new ioctl each
time we think of something else that we'd like to report.  This method
also doesn't preclude extensions to the fixed structure since the
offset of these capabilities is entirely dynamic.

Comments welcome, I'll also follow-up to the QEMU and KVM lists with
an RFC making use of this for mmaps skipping over the MSI-X table.
Thanks,

Alex

---

Alex Williamson (3):
  vfio: Define capability chains
  vfio: Define sparse mmap capability for regions
  vfio/pci: Include sparse mmap capability for MSI-X table regions


 drivers/vfio/pci/vfio_pci.c |  101 +++
 include/uapi/linux/vfio.h   |   53 ++-
 2 files changed, 152 insertions(+), 2 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 9/9] vfio-pci: constify pci_error_handlers structures

2015-11-19 Thread Alex Williamson
On Sat, 2015-11-14 at 11:07 +0100, Julia Lawall wrote:
> This pci_error_handlers structure is never modified, like all the other
> pci_error_handlers structures, so declare it as const.
> 
> Done with the help of Coccinelle.
> 
> Signed-off-by: Julia Lawall 
> 
> ---
> There are no dependencies between these patches.
> 
>  drivers/vfio/pci/vfio_pci.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 32b88bd..2760a7b 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1035,7 +1035,7 @@ static pci_ers_result_t 
> vfio_pci_aer_err_detected(struct pci_dev *pdev,
>   return PCI_ERS_RESULT_CAN_RECOVER;
>  }
>  
> -static struct pci_error_handlers vfio_err_handlers = {
> +static const struct pci_error_handlers vfio_err_handlers = {
>   .error_detected = vfio_pci_aer_err_detected,
>  };
>  
> 

Thank you!  I'll queue this one in my tree.

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RESEND PATCH] vfio: Drop owner assignment from platform_driver

2015-11-19 Thread Alex Williamson
On Thu, 2015-11-19 at 13:00 +0900, Krzysztof Kozlowski wrote:
> platform_driver does not need to set an owner because
> platform_driver_register() will set it.
> 
> Signed-off-by: Krzysztof Kozlowski 
> Acked-by: Baptiste Reynal 
> 
> ---

Oops, sorry I dropped it.  Since it's a fix, I'll queue it for 4.4.
Thanks,

Alex


> 
> The coccinelle script which generated the patch was sent here:
> http://www.spinics.net/lists/kernel/msg2029903.html
> ---
>  drivers/vfio/platform/vfio_platform.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/drivers/vfio/platform/vfio_platform.c 
> b/drivers/vfio/platform/vfio_platform.c
> index f1625dcfbb23..b1cc3a768784 100644
> --- a/drivers/vfio/platform/vfio_platform.c
> +++ b/drivers/vfio/platform/vfio_platform.c
> @@ -92,7 +92,6 @@ static struct platform_driver vfio_platform_driver = {
>   .remove = vfio_platform_remove,
>   .driver = {
>   .name   = "vfio-platform",
> - .owner  = THIS_MODULE,
>   },
>  };
>  



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] VFIO updates for v4.4-rc1

2015-11-13 Thread Alex Williamson
Hi Linus,

The following changes since commit 32b88194f71d6ae7768a29f87fbba454728273ee:

  Linux 4.3-rc7 (2015-10-25 10:39:47 +0900)

are available in the git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v4.4-rc1

for you to fetch changes up to 222e684ca762e9288108fcf852eb5d08cbe10ae3:

  vfio/pci: make an array larger (2015-11-09 08:59:11 -0700)


VFIO updates for v4.4-rc1
 - Use kernel interfaces for VPD emulation (Alex Williamson)
 - Platform fix for releasing IRQs (Eric Auger)
 - Type1 IOMMU always advertises PAGE_SIZE support when smaller
   mapping sizes are available (Eric Auger)
 - Platform fixes for incorrectly using copies of structures rather
   than pointers to structures (James Morse)
 - Rework platform reset modules, fix leak, and add AMD xgbe reset
   module (Eric Auger)
 - Fix vfio_device_get_from_name() return value (Joerg Roedel)
 - No-IOMMU interface (Alex Williamson)
 - Fix potential out of bounds array access in PCI config handling
   (Dan Carpenter)


Alex Williamson (3):
  vfio: Whitelist PCI bridges
  vfio/pci: Use kernel VPD access functions
  vfio: Include No-IOMMU mode

Dan Carpenter (1):
  vfio/pci: make an array larger

Eric Auger (11):
  VFIO: platform: clear IRQ_NOAUTOEN when de-assigning the IRQ
  vfio/type1: handle case where IOMMU does not support PAGE_SIZE size
  vfio: platform: introduce vfio-platform-base module
  vfio: platform: add capability to register a reset function
  vfio: platform: introduce module_vfio_reset_handler macro
  vfio: platform: reset: calxedaxgmac: add reset function registration
  vfio: platform: add compat in vfio_platform_device
  vfio: platform: use list of registered reset function
  vfio: platform: add dev_info on device reset
  vfio: platform: reset: calxedaxgmac: fix ioaddr leak
  VFIO: platform: reset: AMD xgbe reset module

James Morse (1):
  vfio/platform: store mapped memory in region, instead of an on-stack copy

Joerg Roedel (1):
  vfio: Fix bug in vfio_device_get_from_name()

 drivers/vfio/Kconfig   |  15 ++
 drivers/vfio/pci/vfio_pci.c|   8 +-
 drivers/vfio/pci/vfio_pci_config.c |  74 ++-
 drivers/vfio/platform/Makefile |   6 +-
 drivers/vfio/platform/reset/Kconfig|   8 +
 drivers/vfio/platform/reset/Makefile   |   2 +
 .../vfio/platform/reset/vfio_platform_amdxgbe.c| 127 
 .../platform/reset/vfio_platform_calxedaxgmac.c|  19 +-
 drivers/vfio/platform/vfio_amba.c  |   1 +
 drivers/vfio/platform/vfio_platform.c  |   1 +
 drivers/vfio/platform/vfio_platform_common.c   | 155 +-
 drivers/vfio/platform/vfio_platform_irq.c  |   1 +
 drivers/vfio/platform/vfio_platform_private.h  |  40 +++-
 drivers/vfio/vfio.c| 224 +++--
 drivers/vfio/vfio_iommu_type1.c|  15 +-
 include/linux/vfio.h   |   3 +
 include/uapi/linux/vfio.h  |   7 +
 17 files changed, 616 insertions(+), 90 deletions(-)
 create mode 100644 drivers/vfio/platform/reset/vfio_platform_amdxgbe.c

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch v2] vfio/pci: make an array larger

2015-11-10 Thread Alex Williamson
On Mon, 2015-11-09 at 15:24 +0300, Dan Carpenter wrote:
> Smatch complains about a possible out of bounds error:
> 
>   drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
>   error: buffer overflow 'pci_cap_length' 20 <= 20
> 
> The problem is that pci_cap_length[] was defined as large enough to
> hold "PCI_CAP_ID_AF + 1" elements.  The code in vfio_cap_init() assumes
> it has PCI_CAP_ID_MAX + 1 elements.  Originally, PCI_CAP_ID_AF and
> PCI_CAP_ID_MAX were the same but then we introduced PCI_CAP_ID_EA in
> f80b0ba95964 ('PCI: Add Enhanced Allocation register entries') so now
> the array is too small.
> 
> Let's fix this by making the array size PCI_CAP_ID_MAX + 1.  And let's
> make a similar change to pci_ext_cap_length[] for consistency.  Also
> both these arrays can be made const.
> 
> Signed-off-by: Dan Carpenter 
> ---

Applied to next for v4.4.  Thanks!

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] vfio: Include No-IOMMU mode

2015-11-04 Thread Alex Williamson
There is really no way to safely give a user full access to a DMA
capable device without an IOMMU to protect the host system.  There is
also no way to provide DMA translation, for use cases such as device
assignment to virtual machines.  However, there are still those users
that want userspace drivers even under those conditions.  The UIO
driver exists for this use case, but does not provide the degree of
device access and programming that VFIO has.  In an effort to avoid
code duplication, this introduces a No-IOMMU mode for VFIO.

This mode requires building VFIO with CONFIG_VFIO_NOIOMMU and enabling
the "enable_unsafe_noiommu_mode" option on the vfio driver.  This
should make it very clear that this mode is not safe.  Additionally,
CAP_SYS_RAWIO privileges are necessary to work with groups and
containers using this mode.  Groups making use of this support are
named /dev/vfio/noiommu-$GROUP and can only make use of the special
VFIO_NOIOMMU_IOMMU for the container.  Use of this mode, specifically
binding a device without a native IOMMU group to a VFIO bus driver
will taint the kernel and should therefore not be considered
supported.  This patch includes no-iommu support for the vfio-pci bus
driver only.

Signed-off-by: Alex Williamson 
Acked-by: Michael S. Tsirkin 
---

v2: A minor change to the vfio_for_each_iommu_driver macro:

@@ -229,7 +229,7 @@ static struct vfio_iommu_driver vfio_noiommu_driver = {
for (pos = con->noiommu ? &vfio_noiommu_driver :\
 list_first_entry(&vfio.iommu_drivers_list, \
  struct vfio_iommu_driver, vfio_next); \
-(con->noiommu && pos) || (!con->noiommu && \
+(con->noiommu ? pos != NULL :  \
&pos->vfio_next != &vfio.iommu_drivers_list);   \
  pos = con->noiommu ? NULL : list_next_entry(pos, vfio_next))
 #else

The 0-day coccinelle test seems to think that driver can be null for callers
using this for-loop, perhaps it's not seeing this as simply a modified
list_for_each_entry macro.  I don't think that's possible, but using a
ternary here makes it more readable and makes all parts of this bi-modal
loop more consistent.

 drivers/vfio/Kconfig|   15 +++
 drivers/vfio/pci/vfio_pci.c |8 +-
 drivers/vfio/vfio.c |  186 ++-
 include/linux/vfio.h|3 +
 include/uapi/linux/vfio.h   |7 ++
 5 files changed, 209 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 4540179..b6d3cdc 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -31,5 +31,20 @@ menuconfig VFIO
 
  If you don't know what to do here, say N.
 
+menuconfig VFIO_NOIOMMU
+   bool "VFIO No-IOMMU support"
+   depends on VFIO
+   help
+ VFIO is built on the ability to isolate devices using the IOMMU.
+ Only with an IOMMU can userspace access to DMA capable devices be
+ considered secure.  VFIO No-IOMMU mode enables IOMMU groups for
+ devices without IOMMU backing for the purpose of re-using the VFIO
+ infrastructure in a non-secure mode.  Use of this mode will result
+ in an unsupportable kernel and will therefore taint the kernel.
+ Device assignment to virtual machines is also not possible with
+ this mode since there is no IOMMU to provide DMA translation.
+
+ If you don't know what to do here, say N.
+
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 964ad57..32b88bd 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -940,13 +940,13 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
return -EINVAL;
 
-   group = iommu_group_get(&pdev->dev);
+   group = vfio_iommu_group_get(&pdev->dev);
if (!group)
return -EINVAL;
 
vdev = kzalloc(sizeof(*vdev), GFP_KERNEL);
if (!vdev) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
return -ENOMEM;
}
 
@@ -957,7 +957,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const 
struct pci_device_id *id)
 
ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
if (ret) {
-   iommu_group_put(group);
+   vfio_iommu_group_put(group, &pdev->dev);
kfree(vdev);
return ret;
}
@@ -993,7 +993,7 @@ static void vfio_pci_remove(struct pci_dev *pdev)
if (!vdev)
return;
 
-   iommu_group_put(pdev

Re: [patch] vfio: make an array larger

2015-11-04 Thread Alex Williamson
On Wed, 2015-11-04 at 21:20 +0300, Dan Carpenter wrote:
> Sorry, I should have said that I am on linux-next at the start.
> 
> > > -static u8 pci_cap_length[] = {
> > > +static u8 pci_cap_length[PCI_CAP_ID_MAX + 1] = {
> > >   [PCI_CAP_ID_BASIC]  = PCI_STD_HEADER_SIZEOF, /* pci config header */
> > >   [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
> > >   [PCI_CAP_ID_AGP]= PCI_AGP_SIZEOF,
> > 
> > This doesn't make a whole lot of sense to me.  The last entry we define
> > is:
> > 
> > [PCI_CAP_ID_AF] = PCI_CAP_AF_SIZEOF,
> 
> Yes.
> 
> > };
> > 
> > and PCI_CAP_ID_MAX is defined as:
> > 
> > #define  PCI_CAP_ID_MAX PCI_CAP_ID_AF
> 
> No.  I am on linux-next and we appear to have added a new element
> beyond PCI_CAP_ID_AF.
> 
> #define  PCI_CAP_ID_AF  0x13/* PCI Advanced Features */
> #define  PCI_CAP_ID_EA  0x14/* PCI Enhanced Allocation */
> #define  PCI_CAP_ID_MAX PCI_CAP_ID_EA
> 
> > 
> > So the array is implicitly sized to PCI_CAP_ID_MAX + 1 already, this
> > doesn't make it any larger.
> 
> In linux-next it makes it larger.  But also explicitly using
> PCI_CAP_ID_MAX + 1 is cleaner as well as fixing the bug in case we add
> more elements later again.

Ok, all the pieces line up now.  Please add mention of that to the
commit log and I'll look for the respin including the same for
pci_ext_cap_length.  Thanks for spotting this!

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] vfio: make an array larger

2015-11-04 Thread Alex Williamson
On Wed, 2015-11-04 at 16:26 +0300, Dan Carpenter wrote:
> Smatch complains about a possible out of bounds error:
> 
>   drivers/vfio/pci/vfio_pci_config.c:1241 vfio_cap_init()
>   error: buffer overflow 'pci_cap_length' 20 <= 20
> 
> Fix this by making the array larger.
> 
> Signed-off-by: Dan Carpenter 
> 
> diff --git a/drivers/vfio/pci/vfio_pci_config.c 
> b/drivers/vfio/pci/vfio_pci_config.c
> index ff75ca3..001d48a 100644
> --- a/drivers/vfio/pci/vfio_pci_config.c
> +++ b/drivers/vfio/pci/vfio_pci_config.c
> @@ -46,7 +46,7 @@
>   *   0: Removed from the user visible capability list
>   *   FF: Variable length
>   */
> -static u8 pci_cap_length[] = {
> +static u8 pci_cap_length[PCI_CAP_ID_MAX + 1] = {
>   [PCI_CAP_ID_BASIC]  = PCI_STD_HEADER_SIZEOF, /* pci config header */
>   [PCI_CAP_ID_PM] = PCI_PM_SIZEOF,
>   [PCI_CAP_ID_AGP]= PCI_AGP_SIZEOF,

This doesn't make a whole lot of sense to me.  The last entry we define
is:

[PCI_CAP_ID_AF] = PCI_CAP_AF_SIZEOF,
};

and PCI_CAP_ID_MAX is defined as:

#define  PCI_CAP_ID_MAX PCI_CAP_ID_AF

So the array is implicitly sized to PCI_CAP_ID_MAX + 1 already, this
doesn't make it any larger.  I imagine this silences smatch because it's
hitting this:

if (cap <= PCI_CAP_ID_MAX) {
len = pci_cap_length[cap];

And it doesn't like that we're indexing an array that has entries up to
PCI_CAP_ID_AF and we're testing against PCI_CAP_ID_MAX.  They happen to
be the same now, but that could change and then we'd index off the end
of the array.  That's unlikely, but valid.  Is that the real
justification for this patch?  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio: Fix bug in vfio_device_get_from_name()

2015-11-04 Thread Alex Williamson
On Wed, 2015-11-04 at 13:53 +0100, Joerg Roedel wrote:
> From: Joerg Roedel 
> 
> The vfio_device_get_from_name() function might return a
> non-NULL pointer, when called with a device name that is not
> found in the list. This causes undefined behavior, in my
> case calling an invalid function pointer later on:
> 
>  kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
>  BUG: unable to handle kernel paging request at 8800cb3ddc08
> 
> [...]
> 
>  Call Trace:
>   [] ? vfio_group_fops_unl_ioctl+0x253/0x410 [vfio]
>   [] do_vfs_ioctl+0x2cd/0x4c0
>   [] ? __fget+0x77/0xb0
>   [] SyS_ioctl+0x79/0x90
>   [] ? syscall_return_slowpath+0x50/0x130
>   [] entry_SYSCALL_64_fastpath+0x16/0x75
> 
> Fix the issue by returning NULL when there is no device with
> the requested name in the list.
> 
> Cc: sta...@vger.kernel.org # v4.2+
> Fixes: 4bc94d5dc95d ("vfio: Fix lockdep issue")
> Signed-off-by: Joerg Roedel 
> ---

Thanks for tracking this down, Joerg!  Looks right, I'll queue it for
next.  Thanks,

Alex

>  drivers/vfio/vfio.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 563c510..8c50ea6 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -692,11 +692,12 @@ EXPORT_SYMBOL_GPL(vfio_device_get_from_dev);
>  static struct vfio_device *vfio_device_get_from_name(struct vfio_group 
> *group,
>char *buf)
>  {
> - struct vfio_device *device;
> + struct vfio_device *it, *device = NULL;
>  
>   mutex_lock(&group->device_lock);
> - list_for_each_entry(device, &group->device_list, group_next) {
> - if (!strcmp(dev_name(device->dev), buf)) {
> + list_for_each_entry(it, &group->device_list, group_next) {
> + if (!strcmp(dev_name(it->dev), buf)) {
> + device = it;
>   vfio_device_get(device);
>   break;
>   }





Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 18:05 +0100, Paolo Bonzini wrote:
> 
> On 28/10/2015 17:00, Alex Williamson wrote:
> > > Alex, would it make sense to use the IRQ bypass infrastructure always,
> > > not just for VT-d, to do the MSI injection directly from the VFIO
> > > interrupt handler and bypass the eventfd?  Basically this would add an
> > > RCU-protected list of consumers matching the token to struct
> > > irq_bypass_producer, and a
> > > 
> > >   int (*inject)(struct irq_bypass_consumer *);
> > > 
> > > callback to struct irq_bypass_consumer.  If any callback returns true,
> > > the eventfd is not signaled.
> >
> > Yeah, that might be a good idea, it's probably more plausible than
> > making the eventfd_signal() code friendly to call from hard interrupt
> > context.  On the vfio side can we use request_threaded_irq() directly
> > for this?
> 
> I don't know if that gives you a non-threaded IRQ with the real-time
> kernel...  CCing Marcelo to get some insight.
> 
> > Making the hard irq handler return IRQ_HANDLED if we can use
> > the irq bypass manager or IRQ_WAKE_THREAD if we need to use the eventfd.
> > I think we need some way to get back to irq thread context to use
> > eventfd_signal().
> 
> The irqfd is already able to schedule a work item, because it runs with
> interrupts disabled, so I think we can always return IRQ_HANDLED.

I'm confused by this.  The problem with adding IRQF_NO_THREAD to our
current handler is that it hits the spinlock that can sleep in
eventfd_signal() and the waitqueue further down the stack before we get
to the irqfd.  So if we split to a non-threaded handler vs a threaded
handler, where the non-threaded handler either returns IRQ_HANDLED or
IRQ_WAKE_THREAD to queue the threaded handler, there's only so much that
the non-threaded handler can do before we start running into the same
problem.  I think that means that the non-threaded handler needs to
return IRQ_WAKE_THREAD if we need to use the current eventfd_signal()
path, such as if the bypass path is not available.  If we can get
through the bypass path and the KVM irqfd side is safe for the
non-threaded handler, inject succeeds and we return IRQ_HANDLED, right?
Thanks,

Alex



Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 10:50 -0700, Yunhong Jiang wrote:
> On Wed, Oct 28, 2015 at 01:44:55AM +0100, Paolo Bonzini wrote:
> > 
> > 
> > On 27/10/2015 22:26, Yunhong Jiang wrote:
> > >> > On RT kernels however can you call eventfd_signal from interrupt
> > >> > context?  You cannot call spin_lock_irqsave (which can sleep) from a
> > >> > non-threaded interrupt handler, can you?  You would need a raw spin 
> > >> > lock.
> > > Thanks for pointing this out. Yes, we can't call spin_lock_irqsave on an RT 
> > > kernel. Will do it this way in the next patch. But not sure if it's overkill
> > > to use raw_spinlock there since eventfd_signal is used by other callers too.
> > 
> > No, I don't think you can use raw_spinlock there.  The problem is not
> > just eventfd_signal, it is especially wake_up_locked_poll.  You cannot
> > convert the whole workqueue infrastructure to use raw_spinlock.
> 
> You mean the waitqueue, instead of workqueue, right? One choice is to change 
> the eventfd to use the simple wait queue, which uses a raw_spinlock. But using 
> the simple waitqueue in eventfd may in fact impact real-time latency outside 
> this scenario.
> 
> > 
> > Alex, would it make sense to use the IRQ bypass infrastructure always,
> > not just for VT-d, to do the MSI injection directly from the VFIO
> > interrupt handler and bypass the eventfd?  Basically this would add an
> > RCU-protected list of consumers matching the token to struct
> > irq_bypass_producer, and a
> > 
> > int (*inject)(struct irq_bypass_consumer *);
> > 
> > callback to struct irq_bypass_consumer.  If any callback returns true,
> > the eventfd is not signaled.  The KVM implementation would be like this
> > (compare with virt/kvm/eventfd.c):
> > 
> > /* Extracted out of irqfd_wakeup */
> > static int
> > irqfd_wakeup_pollin(struct kvm_kernel_irqfd *irqfd)
> > {
> > ...
> > }
> > 
> > /* Extracted out of irqfd_wakeup */
> > static int
> > irqfd_wakeup_pollhup(struct kvm_kernel_irqfd *irqfd)
> > {
> > ...
> > }
> > 
> > static int
> > irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync,
> >  void *key)
> > {
> > struct _irqfd *irqfd = container_of(wait,
> > struct _irqfd, wait);
> > unsigned long flags = (unsigned long)key;
> > 
> > if (flags & POLLIN)
> > irqfd_wakeup_pollin(irqfd);
> > if (flags & POLLHUP)
> > irqfd_wakeup_pollhup(irqfd);
> > 
> > return 0;
> > }
> > 
> > static int kvm_arch_irq_bypass_inject(
> > struct irq_bypass_consumer *cons)
> > {
> > struct kvm_kernel_irqfd *irqfd =
> > container_of(cons, struct kvm_kernel_irqfd,
> >  consumer); 
> > 
> > irqfd_wakeup_pollin(irqfd);
> > }
> > 
> This is a good idea IMHO. So for MSI interrupt, the 
> kvm_arch_irq_bypass_inject will be used, and the irqfd_wakeup will not be 
> invoked anymore, am I right?
> 
> I noticed the irq bypass manager is not merged yet, are there any git branch 
> for it?

It's in linux-next via the kvm.git next branch:

git://git.kernel.org/pub/scm/virt/kvm/kvm.git

Thanks,
Alex



Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 19:00 +0100, Eric Auger wrote:
> On 10/28/2015 06:55 PM, Will Deacon wrote:
> > On Wed, Oct 28, 2015 at 06:48:41PM +0100, Eric Auger wrote:
> >> On 10/28/2015 06:37 PM, Alex Williamson wrote:
> >>> Ok, so with hopefully correcting my understanding of what this does, isn't
> >>> this effectively the same:
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c 
> >>> b/drivers/vfio/vfio_iommu_type1.c
> >>> index 57d8c37..7db4f5a 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -403,13 +403,19 @@ static void vfio_remove_dma(struct vfio_iommu 
> >>> *iommu, stru
> >>>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >>>  {
> >>> struct vfio_domain *domain;
> >>> -   unsigned long bitmap = PAGE_MASK;
> >>> +   unsigned long bitmap = ULONG_MAX;
> >>>  
> >>> mutex_lock(&iommu->lock);
> >>> list_for_each_entry(domain, &iommu->domain_list, next)
> >>> bitmap &= domain->domain->ops->pgsize_bitmap;
> >>> mutex_unlock(&iommu->lock);
> >>>  
> >>> +   /* Some comment about how the IOMMU API splits requests */
> >>> +   if (bitmap & ~PAGE_MASK) {
> >>> +   bitmap &= PAGE_MASK;
> >>> +   bitmap |= PAGE_SIZE;
> >>> +   }
> >>> +
> >>> return bitmap;
> >>>  }
> >> Yes, to me it is indeed the same
> >>>  
> >>> This would also expose to the user that we're accepting PAGE_SIZE, which
> >>> we weren't before, so it was not quite right to just let them do it
> >>> anyway.  I don't think we even need to get rid of the WARN_ONs, do we?
> >>> Thanks,
> >>
> >> The end-user might be afraid of the latter. Personally I would get rid
> >> of them but that's definitively up to you.
> > 
> > I think Alex's point is that the WARN_ON's won't trigger with this patch,
> > because he clears those lower bits in the bitmap.
> ah yes sure!

The WARN_ON triggers when the IOMMU mask is greater than PAGE_SIZE,
which means we can't operate on the IOMMU with PAGE_SIZE granularity,
which we do in a couple places.  So I think the WARN_ON is actually
valid for the code and won't trigger for you now that the IOMMU mask is
always at least ~PAGE_MASK if we can use the IOMMU at anything less than
PAGE_SIZE granularity.  Thanks,

Alex



Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 18:10 +0100, Eric Auger wrote:
> Hi Alex,
> On 10/28/2015 05:27 PM, Alex Williamson wrote:
> > On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
> >> Current vfio_pgsize_bitmap code hides the supported IOMMU page
> >> sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
> >> does not support PAGE_SIZE page, the alignment check on map/unmap
> >> is done with larger page sizes, if any. This can fail although
> >> mapping could be done with pages smaller than PAGE_SIZE.
> >>
> >> vfio_pgsize_bitmap is modified to expose the IOMMU page sizes,
> >> supported by all domains, even those smaller than PAGE_SIZE. The
> >> alignment check on map is performed against PAGE_SIZE if the minimum
> >> IOMMU size is less than PAGE_SIZE or against the min page size greater
> >> than PAGE_SIZE.
> >>
> >> Signed-off-by: Eric Auger 
> >>
> >> ---
> >>
> >> This was tested on AMD Seattle with 64kB page host. ARM MMU 401
> >> currently exposes 4kB, 2MB and 1GB page support. With a 64kB page host,
> >> the map/unmap check is done against 2MB. Some alignment checks fail
> >> so VFIO_IOMMU_MAP_DMA fails while we could map using 4kB IOMMU page
> >> size.
> >> ---
> >>  drivers/vfio/vfio_iommu_type1.c | 25 +++--
> >>  1 file changed, 11 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c 
> >> b/drivers/vfio/vfio_iommu_type1.c
> >> index 57d8c37..13fb974 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> >> struct vfio_dma *dma)
> >>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> >>  {
> >>struct vfio_domain *domain;
> >> -  unsigned long bitmap = PAGE_MASK;
> >> +  unsigned long bitmap = ULONG_MAX;
> > 
> > Isn't this and removing the WARN_ON()s the only real change in this
> > patch?  The rest looks like conversion to use IS_ALIGNED and the
> > following test, that I don't really understand...
> Yes basically you're right.


Ok, so with hopefully correcting my understanding of what this does, isn't
this effectively the same:

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 57d8c37..7db4f5a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -403,13 +403,19 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, stru
 static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
 {
struct vfio_domain *domain;
-   unsigned long bitmap = PAGE_MASK;
+   unsigned long bitmap = ULONG_MAX;
 
mutex_lock(&iommu->lock);
list_for_each_entry(domain, &iommu->domain_list, next)
bitmap &= domain->domain->ops->pgsize_bitmap;
mutex_unlock(&iommu->lock);
 
+   /* Some comment about how the IOMMU API splits requests */
+   if (bitmap & ~PAGE_MASK) {
+   bitmap &= PAGE_MASK;
+   bitmap |= PAGE_SIZE;
+   }
+
return bitmap;
 }
 
This would also expose to the user that we're accepting PAGE_SIZE, which
we weren't before, so it was not quite right to just let them do it
anyway.  I don't think we even need to get rid of the WARN_ONs, do we?
Thanks,

Alex

> > 
> >>  
> >>mutex_lock(&iommu->lock);
> >>list_for_each_entry(domain, &iommu->domain_list, next)
> >> @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
> >> vfio_iommu *iommu)
> >>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> >> struct vfio_iommu_type1_dma_unmap *unmap)
> >>  {
> >> -  uint64_t mask;
> >>struct vfio_dma *dma;
> >>size_t unmapped = 0;
> >>int ret = 0;
> >> +  unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> >> +  unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> >> +  PAGE_SIZE : min_pagesz;
> > 
> > This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
> > care to cap alignment at PAGE_SIZE?
> My intent in this patch isn't to allow the user-space to map/unmap
> sub-PAGE_SIZE buffers. The new test makes sure the mapped area is bigger
> than or equal to a host page whatever the supported page sizes.
> 
> I noticed that chunk construction, pinning and many other things are
> b

Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 17:14 +, Will Deacon wrote:
> On Wed, Oct 28, 2015 at 10:27:28AM -0600, Alex Williamson wrote:
> > On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > > b/drivers/vfio/vfio_iommu_type1.c
> > > index 57d8c37..13fb974 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> > > struct vfio_dma *dma)
> > >  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
> > >  {
> > >   struct vfio_domain *domain;
> > > - unsigned long bitmap = PAGE_MASK;
> > > + unsigned long bitmap = ULONG_MAX;
> > 
> > Isn't this and removing the WARN_ON()s the only real change in this
> > patch?  The rest looks like conversion to use IS_ALIGNED and the
> > following test, that I don't really understand...
> > 
> > >  
> > >   mutex_lock(&iommu->lock);
> > >   list_for_each_entry(domain, &iommu->domain_list, next)
> > > @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
> > > vfio_iommu *iommu)
> > >  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > >struct vfio_iommu_type1_dma_unmap *unmap)
> > >  {
> > > - uint64_t mask;
> > >   struct vfio_dma *dma;
> > >   size_t unmapped = 0;
> > >   int ret = 0;
> > > + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> > > + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> > > + PAGE_SIZE : min_pagesz;
> > 
> > This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
> > care to cap alignment at PAGE_SIZE?
> 
> Eric can clarify, but I think the intention here is to have VFIO continue
> doing things in PAGE_SIZE chunks precisely so that we don't have to rework
> all of the pinning code etc. The IOMMU API can then deal with the smaller
> page size.

Gak, I read this wrong.  So really we're just artificially adding
PAGE_SIZE as a supported IOMMU size so long as the IOMMU support
something smaller than PAGE_SIZE, where PAGE_SIZE is obviously a
multiple of that smaller size.  Ok, but should we just do this once in
vfio_pgsize_bitmap()?  This is exactly why VT-d just reports ~(4k - 1)
for the iommu bitmap.

> > > - mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > > -
> > > - if (unmap->iova & mask)
> > > + if (!IS_ALIGNED(unmap->iova, requested_alignment))
> > >   return -EINVAL;
> > > - if (!unmap->size || unmap->size & mask)
> > > + if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
> > >   return -EINVAL;
> > >  
> > > - WARN_ON(mask & PAGE_MASK);
> > > -
> > >   mutex_lock(&iommu->lock);
> > >  
> > >   /*
> > > @@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > >   size_t size = map->size;
> > >   long npage;
> > >   int ret = 0, prot = 0;
> > > - uint64_t mask;
> > >   struct vfio_dma *dma;
> > >   unsigned long pfn;
> > > + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> > > + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> > > + PAGE_SIZE : min_pagesz;
> > >  
> > >   /* Verify that none of our __u64 fields overflow */
> > >   if (map->size != size || map->vaddr != vaddr || map->iova != iova)
> > >   return -EINVAL;
> > >  
> > > - mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> > > -
> > > - WARN_ON(mask & PAGE_MASK);
> > > -
> > >   /* READ/WRITE from device perspective */
> > >   if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> > >   prot |= IOMMU_WRITE;
> > >   if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> > >   prot |= IOMMU_READ;
> > >  
> > > - if (!prot || !size || (size | iova | vaddr) & mask)
> > > + if (!prot || !size ||
> > > + !IS_ALIGNED(size | iova | vaddr, requested_alignment))
> > >   return -EINVAL;
> > >  
> > >   /* Don't allow IOVA or virtual address wrap */
> > 
> > This is mostly ignoring the problems with sub-PAGE_SIZE mappings.  For
> > instance, we can only pin on PAGE_SIZE and therefore we only do
> > accounting on PAGE_SIZE, so if the user does 4K mappings across your 64K
> > page, that page gets pinned and accounted 16 times.  Are we going to
> > tell users that their locked memory limit needs to be 16x now?  The rest
> > of the code would need an audit as well to see what other sub-page bugs
> > might be hiding.  Thanks,
> 
> I don't see that. The pinning all happens the same in VFIO, which can
> then happily pass a 64k region to iommu_map. iommu_map will then call
> ->map in 4k chunks on the IOMMU driver ops.

Yep, I see now that this isn't doing sub-page mappings.  Thanks,

Alex



Re: [RFC] vfio/type1: handle case where IOMMU does not support PAGE_SIZE size

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 13:12 +, Eric Auger wrote:
> Current vfio_pgsize_bitmap code hides the supported IOMMU page
> sizes smaller than PAGE_SIZE. As a result, in case the IOMMU
> does not support PAGE_SIZE page, the alignment check on map/unmap
> is done with larger page sizes, if any. This can fail although
> mapping could be done with pages smaller than PAGE_SIZE.
> 
> vfio_pgsize_bitmap is modified to expose the IOMMU page sizes,
> supported by all domains, even those smaller than PAGE_SIZE. The
> alignment check on map is performed against PAGE_SIZE if the minimum
> IOMMU size is less than PAGE_SIZE or against the min page size greater
> than PAGE_SIZE.
> 
> Signed-off-by: Eric Auger 
> 
> ---
> 
> This was tested on AMD Seattle with 64kB page host. ARM MMU 401
> currently exposes 4kB, 2MB and 1GB page support. With a 64kB page host,
> the map/unmap check is done against 2MB. Some alignment checks fail
> so VFIO_IOMMU_MAP_DMA fails while we could map using 4kB IOMMU page
> size.
> ---
>  drivers/vfio/vfio_iommu_type1.c | 25 +++--
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 57d8c37..13fb974 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -403,7 +403,7 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, 
> struct vfio_dma *dma)
>  static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
>  {
>   struct vfio_domain *domain;
> - unsigned long bitmap = PAGE_MASK;
> + unsigned long bitmap = ULONG_MAX;

Isn't this and removing the WARN_ON()s the only real change in this
patch?  The rest looks like conversion to use IS_ALIGNED and the
following test, that I don't really understand...

>  
>   mutex_lock(&iommu->lock);
>   list_for_each_entry(domain, &iommu->domain_list, next)
> @@ -416,20 +416,18 @@ static unsigned long vfio_pgsize_bitmap(struct 
> vfio_iommu *iommu)
>  static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
>struct vfio_iommu_type1_dma_unmap *unmap)
>  {
> - uint64_t mask;
>   struct vfio_dma *dma;
>   size_t unmapped = 0;
>   int ret = 0;
> + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> + PAGE_SIZE : min_pagesz;

This one.  If we're going to support sub-PAGE_SIZE mappings, why do we
care to cap alignment at PAGE_SIZE?

> - mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> -
> - if (unmap->iova & mask)
> + if (!IS_ALIGNED(unmap->iova, requested_alignment))
>   return -EINVAL;
> - if (!unmap->size || unmap->size & mask)
> + if (!unmap->size || !IS_ALIGNED(unmap->size, requested_alignment))
>   return -EINVAL;
>  
> - WARN_ON(mask & PAGE_MASK);
> -
>   mutex_lock(&iommu->lock);
>  
>   /*
> @@ -553,25 +551,24 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>   size_t size = map->size;
>   long npage;
>   int ret = 0, prot = 0;
> - uint64_t mask;
>   struct vfio_dma *dma;
>   unsigned long pfn;
> + unsigned int min_pagesz = __ffs(vfio_pgsize_bitmap(iommu));
> + unsigned int requested_alignment = (min_pagesz < PAGE_SIZE) ?
> + PAGE_SIZE : min_pagesz;
>  
>   /* Verify that none of our __u64 fields overflow */
>   if (map->size != size || map->vaddr != vaddr || map->iova != iova)
>   return -EINVAL;
>  
> - mask = ((uint64_t)1 << __ffs(vfio_pgsize_bitmap(iommu))) - 1;
> -
> - WARN_ON(mask & PAGE_MASK);
> -
>   /* READ/WRITE from device perspective */
>   if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
>   prot |= IOMMU_WRITE;
>   if (map->flags & VFIO_DMA_MAP_FLAG_READ)
>   prot |= IOMMU_READ;
>  
> - if (!prot || !size || (size | iova | vaddr) & mask)
> + if (!prot || !size ||
> + !IS_ALIGNED(size | iova | vaddr, requested_alignment))
>   return -EINVAL;
>  
>   /* Don't allow IOVA or virtual address wrap */

This is mostly ignoring the problems with sub-PAGE_SIZE mappings.  For
instance, we can only pin on PAGE_SIZE and therefore we only do
accounting on PAGE_SIZE, so if the user does 4K mappings across your 64K
page, that page gets pinned and accounted 16 times.  Are we going to
tell users that their locked memory limit needs to be 16x now?  The rest
of the code would need an audit as well to see what other sub-page bugs
might be hiding.  Thanks,

Alex





Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-28 Thread Alex Williamson
On Wed, 2015-10-28 at 01:44 +0100, Paolo Bonzini wrote:
> 
> On 27/10/2015 22:26, Yunhong Jiang wrote:
> >> > On RT kernels however can you call eventfd_signal from interrupt
> >> > context?  You cannot call spin_lock_irqsave (which can sleep) from a
> >> > non-threaded interrupt handler, can you?  You would need a raw spin lock.
> > Thanks for pointing this out. Yes, we can't call spin_lock_irqsave on an RT 
> > kernel. Will do it this way in the next patch. But not sure if it's overkill
> > to use raw_spinlock there since eventfd_signal is used by other callers too.
> 
> No, I don't think you can use raw_spinlock there.  The problem is not
> just eventfd_signal, it is especially wake_up_locked_poll.  You cannot
> convert the whole workqueue infrastructure to use raw_spinlock.
> 
> Alex, would it make sense to use the IRQ bypass infrastructure always,
> not just for VT-d, to do the MSI injection directly from the VFIO
> interrupt handler and bypass the eventfd?  Basically this would add an
> RCU-protected list of consumers matching the token to struct
> irq_bypass_producer, and a
> 
>   int (*inject)(struct irq_bypass_consumer *);
> 
> callback to struct irq_bypass_consumer.  If any callback returns true,
> the eventfd is not signaled.  The KVM implementation would be like this
> (compare with virt/kvm/eventfd.c):
> 
>   /* Extracted out of irqfd_wakeup */
>   static int
>   irqfd_wakeup_pollin(struct kvm_kernel_irqfd *irqfd)
>   {
>   ...
>   }
> 
>   /* Extracted out of irqfd_wakeup */
>   static int
>   irqfd_wakeup_pollhup(struct kvm_kernel_irqfd *irqfd)
>   {
>   ...
>   }
> 
>   static int
>   irqfd_wakeup(wait_queue_t *wait, unsigned mode, int sync,
>void *key)
>   {
>   struct _irqfd *irqfd = container_of(wait,
>   struct _irqfd, wait);
>   unsigned long flags = (unsigned long)key;
> 
>   if (flags & POLLIN)
>   irqfd_wakeup_pollin(irqfd);
>   if (flags & POLLHUP)
>   irqfd_wakeup_pollhup(irqfd);
> 
>   return 0;
>   }
> 
>   static int kvm_arch_irq_bypass_inject(
>   struct irq_bypass_consumer *cons)
>   {
>   struct kvm_kernel_irqfd *irqfd =
>   container_of(cons, struct kvm_kernel_irqfd,
>consumer); 
> 
>   irqfd_wakeup_pollin(irqfd);
>   }
> 
> Or do you think it would be a hack?  The latency improvement might
> actually be even better than what Yunhong is already reporting.

Yeah, that might be a good idea, it's probably more plausible than
making the eventfd_signal() code friendly to call from hard interrupt
context.  On the vfio side can we use request_threaded_irq() directly
for this?  Making the hard irq handler return IRQ_HANDLED if we can use
the irq bypass manager or IRQ_WAKE_THREAD if we need to use the eventfd.
I think we need some way to get back to irq thread context to use
eventfd_signal().  Would we ever not want to use the direct bypass
manager path if available?  Thanks,

Alex



Re: [RFC PATCH] vfio/type1: Do not support IOMMUs that allow bypass

2015-10-27 Thread Alex Williamson
[cc +iommu]

On Tue, 2015-10-27 at 15:40 +, Will Deacon wrote:
> On Fri, Oct 16, 2015 at 09:51:22AM -0600, Alex Williamson wrote:
> > On Fri, 2015-10-16 at 16:03 +0200, Eric Auger wrote:
> > > Hi Alex,
> > > On 10/15/2015 10:52 PM, Alex Williamson wrote:
> > > > We can only provide isolation if DMA is forced through the IOMMU
> > > > aperture.  Don't allow type1 to be used if this is not the case.
> > > > 
> > > > Signed-off-by: Alex Williamson 
> > > > ---
> > > > 
> > > > Eric, I see a number of IOMMU drivers enable this, do the ones you
> > > > care about for ARM set geometry.force_aperture?  Thanks,
> > > I am currently using arm-smmu.c. I don't see force_aperture being set.
> > 
> > Hi Will,
> 
> Hi Alex,
> 
> > Would it be possible to add iommu_domain_geometry support to arm-smmu.c?
> > In addition to this test to verify that DMA cannot bypass the IOMMU, I'd
> > eventually like to pass the aperture information out through the VFIO
> > API.  Thanks,
> 
> The slight snag here is that we don't know which SMMU in the system the
> domain is attached to at the point when the geometry is queried, so I
> can't give you an upper bound on the aperture. For example, if there is
> an SMMU with a 32-bit input address and another with a 48-bit input
> address.
> 
> We could play the same horrible games that we do with the pgsize bitmap,
> and truncate some global aperture every time we probe an SMMU device, but
> I'd really like to have fewer hacks like that if possible. The root of
> the problem is still that domains are allocated for a bus, rather than
> an IOMMU instance.

Hi Will,

Yes, Intel VT-d has this issue as well.  In theory we can have
heterogeneous IOMMU hardware units (DRHDs) in a system and the upper
bound of the geometry could be diminished if we add a less capable DRHD
into the domain.  I suspect this is more a theoretical problem than a
practical one though as we're typically mixing similar DRHDs and I think
we're still capable of 39-bit addressing in the least capable version
per the spec.

In any case, I really want to start testing geometry.force_aperture,
even if we're not yet comfortable to expose the IOMMU limits to the
user.  The vfio type1 shouldn't be enabled at all for underlying
hardware that allows DMA bypass.  Thanks,

Alex



Re: [RFC PATCH] VFIO: Add a parameter to force nonthread IRQ

2015-10-26 Thread Alex Williamson
On Mon, 2015-10-26 at 18:20 -0700, Yunhong Jiang wrote:
> An option to force VFIO PCI MSI/MSI-X handler as non-threaded IRQ,
> even when CONFIG_IRQ_FORCED_THREADING=y. This is useful when
> assigning a device to a guest with low latency requirements since it
> reduces the context switches to/from the IRQ thread.

Is there any way we can do this automatically?  Perhaps detecting that
we're on a RT kernel or maybe that the user is running with RT priority?
I find that module options are mostly misunderstood and misused.

> An experiment was conducted on a HSW platform for 1 minutes, with the
> guest vCPU bound to isolated pCPU. The assigned device triggered the
> interrupt every 1ms. The average EXTERNAL_INTERRUPT exit handling time
> is dropped from 5.3us to 2.2us.
> 
> Another choice is to change VFIO_DEVICE_SET_IRQS ioctl, to apply this
> option only to specific devices when in kernel irq_chip is enabled. It
> provides more flexibility but is more complex, not sure if we need go
> through that way.

Allowing the user to decide whether or not to use a threaded IRQ seems
like a privilege violation; a chance for the user to game the system and
give themselves better latency, maybe at the cost of others.  I think
we're better off trying to infer the privilege from the task priority or
kernel config or, if we run out of options, make a module option as you
have here requiring the system admin to provide the privilege.  Thanks,

Alex


> Signed-off-by: Yunhong Jiang 
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 1f577b4..ca1f95a 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -22,9 +22,13 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "vfio_pci_private.h"
>  
> +static bool nonthread_msi = 1;
> +module_param(nonthread_msi, bool, 0444);
> +
>  /*
>   * INTx
>   */
> @@ -313,6 +317,7 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>   char *name = msix ? "vfio-msix" : "vfio-msi";
>   struct eventfd_ctx *trigger;
>   int ret;
> + unsigned long irqflags = 0;
>  
>   if (vector >= vdev->num_ctx)
>   return -EINVAL;
> @@ -352,7 +357,10 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>   pci_write_msi_msg(irq, &msg);
>   }
>  
> - ret = request_irq(irq, vfio_msihandler, 0,
> + if (nonthread_msi)
> + irqflags = IRQF_NO_THREAD;
> +
> + ret = request_irq(irq, vfio_msihandler, irqflags,
> vdev->ctx[vector].name, trigger);
>   if (ret) {
>   kfree(vdev->ctx[vector].name);





Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-23 Thread Alex Williamson
On Fri, 2015-10-23 at 11:36 -0700, Alexander Duyck wrote:
> On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> > This patchset is to propose a new solution to add live migration support 
> > for 82599
> > SRIOV network card.
> >
> > In our solution, we prefer to put all device specific operations into VF and
> > PF driver and make code in the Qemu more general.
> >
> >
> > VF status migration
> > =
> > VF status can be divided into 4 parts
> > 1) PCI configure regs
> > 2) MSIX configure
> > 3) VF status in the PF driver
> > 4) VF MMIO regs
> >
> > The first three are all handled by Qemu.
> > The PCI configure space regs and MSIX configure are originally
> > stored in Qemu. To save and restore "VF status in the PF driver"
> > by Qemu during migration, adds new sysfs node "state_in_pf" under
> > VF sysfs directory.
> >
> > For VF MMIO regs, we introduce self emulation layer in the VF
> > driver to record MMIO reg values during reading or writing MMIO
> > and put these data in the guest memory. It will be migrated with
> > guest memory to new machine.
> >
> >
> > VF function restoration
> > 
> > Restoring VF function operation are done in the VF and PF driver.
> >
> > In order to let VF driver to know migration status, Qemu fakes VF
> > PCI configure regs to indicate migration status and add new sysfs
> > node "notify_vf" to trigger VF mailbox irq in order to notify VF
> > about migration status change.
> >
> > Transmit/Receive descriptor head regs are read-only and can't
> > be restored via writing back recording reg value directly and they
> > are set to 0 during VF reset. To reuse original tx/rx rings, shift
> > desc ring in order to move the desc pointed by original head reg to
> > first entry of the ring and then enable tx/rx rings. VF restarts to
> > receive and transmit from original head desc.
> >
> >
> > Tracking DMA accessed memory
> > =
> > Migration relies on tracking dirty page to migrate memory.
> > Hardware can't automatically mark a page as dirty after DMA
> > memory access. VF descriptor rings and data buffers are modified
> > by hardware when receive and transmit data. To track such dirty memory
> > manually, do dummy writes(read a byte and write it back) when receive
> > and transmit data.
> 
> I was thinking about it and I am pretty sure the dummy write approach is 
> problematic at best.  Specifically the issue is that while you are 
> performing a dummy write you risk pulling in descriptors for data that 
> hasn't been dummy written to yet.  So when you resume and restore your 
> descriptors you will have once that may contain Rx descriptors 
> indicating they contain data when after the migration they don't.
> 
> I really think the best approach to take would be to look at 
> implementing an emulated IOMMU so that you could track DMA mapped pages 
> and avoid migrating the ones marked as DMA_FROM_DEVICE until they are 
> unmapped.  The advantage to this is that in the case of the ixgbevf 
> driver it now reuses the same pages for Rx DMA.  As a result it will be 
> rewriting the same pages often and if you are marking those pages as 
> dirty and transitioning them it is possible for a flow of small packets 
> to really make a mess of things since you would be rewriting the same 
> pages in a loop while the device is processing packets.

I'd be concerned that an emulated IOMMU on the DMA path would reduce
throughput to the point where we shouldn't even bother with assigning
the device in the first place and should be using virtio-net instead.
POWER systems have a guest visible IOMMU and it's been challenging for
them to get to 10Gbps, requiring real-mode tricks.  virtio-net may add
some latency, but it's not that hard to get it to 10Gbps and it already
supports migration.  An emulated IOMMU in the guest is really only good
for relatively static mappings, the latency for anything else is likely
too high.  Maybe there are shadow page table tricks that could help, but
it's imposing overhead the whole time the guest is running, not only on
migration.  Thanks,

Alex



Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-22 Thread Alex Williamson
On Thu, 2015-10-22 at 18:58 +0300, Or Gerlitz wrote:
> On Wed, Oct 21, 2015 at 10:20 PM, Alex Williamson
>  wrote:
> 
> > This is why the typical VF agnostic approach here is to use bonding
> > and fail over to an emulated device during migration, so performance
> > suffers, but downtime is something acceptable.
> 
> bonding in the VM isn't a zero touch solution, right? Is it really acceptable?

The bonding solution requires configuring the bond in the guest and
doing the hot unplug/re-plug around migration.  It's zero touch in that
it works on current code with any PF/VF, but it's certainly not zero
configuration in the guest.  Is what acceptable?  The configuration?
The performance?  The downtime?  I don't think we can hope to improve on
the downtime of an emulated device, but obviously the configuration and
performance are not always acceptable or we wouldn't be seeing so many
people working on migration of assigned devices.  Thanks,

Alex



Re: [PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Alex Williamson
On Thu, 2015-10-22 at 16:23 +0200, Eric Auger wrote:
> On 10/22/2015 04:10 PM, Arnd Bergmann wrote:
> > On Thursday 22 October 2015 15:26:55 Eric Auger wrote:
>  @@ -181,6 +182,8 @@ static int vfio_platform_open(void *device_data)
>  if (ret)
>  goto err_irq;
>   
>  +   vfio_platform_get_reset(vdev);
>  +
>  if (vdev->reset)
>  vdev->reset(vdev);
> 
> >>>
> >>> This needs some error handling to ensure that the open() fails
> >>> if there is no reset handler.
> >>
> >> Is that really what we want? The code was meant to allow the use case
> >> where the VFIO platform driver would be used without such reset module.
> >>
> >> I think the imperative need for a reset module depends on the device and
> >> more importantly depends on the IOMMU mapping. With QEMU VFIO
> >> integration this is needed because the whole VM memory is IOMMU mapped
> >> but in a simpler user-space driver context, we might live without.
> >>
> >> Any thought?
> > 
> > I would think we need a reset driver for any device that can start DMA,
> > otherwise things can go wrong as soon as you attach it to a different domain
> > while there is ongoing DMA.
> > 
> > Maybe we could just allow devices to be attached without a reset handler,
> > but then disallow DMA on them?
> 
> Well I am tempted to think that most assigned devices will perform DMA
> accesses, so to me this somehow comes to the same result, i.e. disallowing
> functional passthrough for devices not properly/fully integrated.
> 
> Alex/Baptiste, any opinion on this?

We have an IOMMU and the user doesn't get access to the device until the
IOMMU domain is established.  So, ideally yes, we should have a way to
reset the device, but I don't see it as a requirement.  Thanks,

Alex



Re: [Qemu-devel] [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-22 Thread Alex Williamson
On Thu, 2015-10-22 at 15:32 +0300, Michael S. Tsirkin wrote:
> On Wed, Oct 21, 2015 at 01:20:27PM -0600, Alex Williamson wrote:
> > The trouble here is that the VF needs to be unplugged prior to the start
> > of migration because we can't do effective dirty page tracking while the
> > device is connected and doing DMA.
> 
> That's exactly what patch 12/12 is trying to accomplish.
> 
> I do see some problems with it, but I also suggested some solutions.

I was replying to:

> So... what would you expect service down wise for the following
> solution which is zero touch and I think should work for any VF
> driver:

And then later note:

"Here it's done via an enlightened guest driver."



Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Alex Williamson
On Wed, 2015-10-21 at 21:45 +0300, Or Gerlitz wrote:
> On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu  wrote:
> > This patchset is to propose a new solution to add live migration support
> > for 82599 SRIOV network card.
> 
> > In our solution, we prefer to put all device specific operation into VF and
> > PF driver and make code in the Qemu more general.
> 
> [...]
> 
> > Service down time test
> > So far, we have tested migration between two laptops with 82599 NICs
> > which are connected to a gigabit switch. We ping the VF at a 0.001s
> > interval during migration from the source-side host. Its service down
> > time is about 180ms.
> 
> So... what would you expect service down wise for the following
> solution which is zero touch and I think should work for any VF
> driver:
> 
> on host A: unplug the VM and conduct live migration to host B ala the
> no-SRIOV case.

The trouble here is that the VF needs to be unplugged prior to the start
of migration because we can't do effective dirty page tracking while the
device is connected and doing DMA.  So the downtime, assuming we're
counting only VF connectivity, is dependent on memory size, rate of
dirtying, and network bandwidth; seconds for small guests, minutes or
more (maybe much, much more) for large guests.

This is why the typical VF agnostic approach here is to use bonding
and fail over to an emulated device during migration, so performance
suffers, but the downtime is acceptable.

If we want the ability to defer the VF unplug until just before the
final stages of the migration, we need the VF to participate in dirty
page tracking.  Here it's done via an enlightened guest driver.  Alex
Graf presented a solution using a device specific enlightenment in QEMU.
Otherwise we'd need hardware support from the IOMMU.  Thanks,

Alex

> on host B:
> 
> when the VM "gets back to live", probe a VF there with the same assigned mac
> 
> next, udev on the VM will call the VF driver to create netdev instance
> 
> DHCP client would run to get the same IP address
> 
> + under config directive (or from Qemu) send Gratuitous ARP to notify
> the switch/es on the new location for that mac.
> 
> Or.





Re: [RFC PATCH 0/3] Qemu/IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Alex Williamson
On Thu, 2015-10-22 at 00:52 +0800, Lan Tianyu wrote:
> This patchset is Qemu part for live migration support for SRIOV NIC.
> kernel part patch information is in the following link.
> http://marc.info/?l=kvm&m=144544635330193&w=2
> 
> 
> Lan Tianyu (3):
>   Qemu: Add pci-assign.h to share functions and struct definition with
> new file
>   Qemu: Add post_load_state() to run after restoring CPU state
>   Qemu: Introduce pci-sriov device type to support VF live migration
> 
>  hw/i386/kvm/Makefile.objs   |   2 +-
>  hw/i386/kvm/pci-assign.c| 113 +--
>  hw/i386/kvm/pci-assign.h| 109 +++
>  hw/i386/kvm/sriov.c | 213 
> 
>  include/migration/vmstate.h |   2 +
>  migration/savevm.c  |  15 
>  6 files changed, 344 insertions(+), 110 deletions(-)
>  create mode 100644 hw/i386/kvm/pci-assign.h
>  create mode 100644 hw/i386/kvm/sriov.c
> 

Hi Lan,

Seems like there are a couple immediate problems with this approach.
The first is that you're modifying legacy KVM device assignment, which
is deprecated upstream and not even enabled by some distros.  VFIO is
the supported mechanism for doing PCI device assignment now and any
features like this need to be added there first.  It's not only more
secure than legacy KVM device assignment, but it also doesn't limit this
to an x86-only solution.  Surely you want to support 82599 VF migration
on other platforms as well.

Using sysfs to interact with the PF is also problematic since that means
that libvirt needs to grant qemu access to these files, adding one more
layer to the stack.  If we were to use VFIO, we could potentially enable
this through a save-state region on the device file descriptor and if
necessary, virtual interrupt channels for the device as well.  This of
course implies that the kernel internal channels are made as general as
possible in order to support any PF driver.

That said, there are some nice features here.  Using unused PCI config
bytes to communicate with the guest driver and enable guest-based page
dirtying is a nice hack.  However, if we want to add this capability to
other devices, we're not always going to be able to use fixed addresses
0xf0 and 0xf1.  I would suggest that we probably want to create a
virtual capability in the config space of the VF, perhaps a Vendor
Specific capability.  Obviously some devices won't have room for a full
capability in the standard config space, so we may need to optionally
expose it in extended config space.  Those device would be limited to
only supporting migration in PCI-e configurations in the guest.  Also,
plenty of devices make use of undefined PCI config space, so we may not
be able to simply add a capability to a region we think is unused, maybe
it needs to happen through reserved space in another capability or
perhaps defining a virtual BAR that unenlightened guest drivers would
ignore.  The point is that we somehow need to standardize that, rather
than implicitly knowing that it's at 0xf0/0xf1 on 82599 VFs.

Also, I haven't looked at the kernel-side patches yet, but the saved
state received from and loaded into the PF driver needs to be versioned
and maybe we need some way to know whether versions are compatible.
Migration version information is difficult enough for QEMU, it's a
completely foreign concept in the kernel.  Thanks,

Alex



Re: [PATCH 1/4] vfio: platform: add capability to register a reset function

2015-10-19 Thread Alex Williamson
On Sun, 2015-10-18 at 18:00 +0200, Eric Auger wrote:
> In preparation for subsequent changes in reset function lookup,
> lets introduce a dynamic list of reset combos (compat string,
> reset module, reset function). The list can be populated/voided with
> two new functions, vfio_platform_register/unregister_reset. Those are
> not yet used in this patch.
> 
> Signed-off-by: Eric Auger 
> ---
>  drivers/vfio/platform/vfio_platform_common.c  | 55 
> +++
>  drivers/vfio/platform/vfio_platform_private.h | 14 +++
>  2 files changed, 69 insertions(+)
> 
> diff --git a/drivers/vfio/platform/vfio_platform_common.c 
> b/drivers/vfio/platform/vfio_platform_common.c
> index e43efb5..d36afc9 100644
> --- a/drivers/vfio/platform/vfio_platform_common.c
> +++ b/drivers/vfio/platform/vfio_platform_common.c
> @@ -23,6 +23,8 @@
>  
>  #include "vfio_platform_private.h"
>  
> +struct list_head reset_list;

What's the purpose of this one above?

> +LIST_HEAD(reset_list);

static

I think you also need a mutex protecting this list, otherwise nothing
prevents concurrent list changes and walks.  A rw lock probably fits the
usage model best, but I don't expect much contention if you just want to
use a mutex.

>  static DEFINE_MUTEX(driver_lock);
>  
>  static const struct vfio_platform_reset_combo reset_lookup_table[] = {
> @@ -573,3 +575,56 @@ struct vfio_platform_device 
> *vfio_platform_remove_common(struct device *dev)
>   return vdev;
>  }
>  EXPORT_SYMBOL_GPL(vfio_platform_remove_common);
> +
> +int vfio_platform_register_reset(struct module *reset_owner, char *compat,

const char *

> +  vfio_platform_reset_fn_t reset)
> +{
> + struct vfio_platform_reset_node *node, *iter;
> + bool found = false;
> +
> + list_for_each_entry(iter, &reset_list, link) {
> + if (!strcmp(iter->compat, compat)) {
> + found = true;
> + break;
> + }
> + }
> + if (found)
> + return -EINVAL;
> +
> + node = kmalloc(sizeof(*node), GFP_KERNEL);
> + if (!node)
> + return -ENOMEM;
> +
> + node->compat = kstrdup(compat, GFP_KERNEL);
> + if (!node->compat)
> + return -ENOMEM;

Leaks node

> +
> + node->owner = reset_owner;
> + node->reset = reset;
> +
> + list_add(&node->link, &reset_list);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_platform_register_reset);
> +
> +int vfio_platform_unregister_reset(char *compat)

const char *

> +{
> + struct vfio_platform_reset_node *iter;
> + bool found = false;
> +
> + list_for_each_entry(iter, &reset_list, link) {
> + if (!strcmp(iter->compat, compat)) {
> + found = true;
> + break;
> + }
> + }
> + if (!found)
> + return -EINVAL;
> +
> + list_del(&iter->link);
> + kfree(iter->compat);
> + kfree(iter);
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vfio_platform_unregister_reset);
> +
> diff --git a/drivers/vfio/platform/vfio_platform_private.h 
> b/drivers/vfio/platform/vfio_platform_private.h
> index 1c9b3d5..17323f0 100644
> --- a/drivers/vfio/platform/vfio_platform_private.h
> +++ b/drivers/vfio/platform/vfio_platform_private.h
> @@ -76,6 +76,15 @@ struct vfio_platform_reset_combo {
>   const char *module_name;
>  };
>  
> +typedef int (*vfio_platform_reset_fn_t)(struct vfio_platform_device *vdev);
> +
> +struct vfio_platform_reset_node {
> + struct list_head link;
> + char *compat;
> + struct module *owner;
> + vfio_platform_reset_fn_t reset;
> +};
> +
>  extern int vfio_platform_probe_common(struct vfio_platform_device *vdev,
> struct device *dev);
>  extern struct vfio_platform_device *vfio_platform_remove_common
> @@ -89,4 +98,9 @@ extern int vfio_platform_set_irqs_ioctl(struct 
> vfio_platform_device *vdev,
>   unsigned start, unsigned count,
>   void *data);
>  
> +extern int vfio_platform_register_reset(struct module *owner,
> + char *compat,
> + vfio_platform_reset_fn_t reset);
> +extern int vfio_platform_unregister_reset(char *compat);
> +
>  #endif /* VFIO_PLATFORM_PRIVATE_H */





[PATCH v3 0/2] Enable specifying a non-default Hyper-V vendor ID

2015-10-16 Thread Alex Williamson
v3: Incorporate suggestion from Igor to move the string length test
out to the property parsing.  A string that's too long will now
get an error like:

$ qemu-system-x86_64 -cpu qemu64,hv_vendor_id=123456789abcd
qemu-system-x86_64: Property 'host-x86_64-cpu.hv-vendor-id' doesn't take 
value '123456789abcd'

v2: Remove abort, but truncate string

---

Alex Williamson (2):
  qapi: Create DEFINE_PROP_STRING_LEN
  kvm: Allow the Hyper-V vendor ID to be specified


 hw/core/qdev-properties.c|7 +++
 include/hw/qdev-core.h   |1 +
 include/hw/qdev-properties.h |   16 ++--
 target-i386/cpu-qom.h|1 +
 target-i386/cpu.c|1 +
 target-i386/kvm.c|8 +++-
 6 files changed, 31 insertions(+), 3 deletions(-)


[PATCH v3 2/2] kvm: Allow the Hyper-V vendor ID to be specified

2015-10-16 Thread Alex Williamson
According to Microsoft documentation, the signature in the standard
hypervisor CPUID leaf at 0x4000 identifies the Vendor ID and is
for reporting and diagnostic purposes only.  We can therefore allow
the user to change it to whatever they want, within the 12 character
limit.  Add a new hyperv-vendor-id option to the -cpu flag to allow
for this, ex:

 -cpu host,hv_time,hv_vendor_id=KeenlyKVM

Link: http://msdn.microsoft.com/library/windows/hardware/hh975392
Signed-off-by: Alex Williamson 
---
 target-i386/cpu-qom.h |1 +
 target-i386/cpu.c |1 +
 target-i386/kvm.c |8 +++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
index c35b624..6c1eaaa 100644
--- a/target-i386/cpu-qom.h
+++ b/target-i386/cpu-qom.h
@@ -88,6 +88,7 @@ typedef struct X86CPU {
 bool hyperv_vapic;
 bool hyperv_relaxed_timing;
 int hyperv_spinlock_attempts;
+char *hyperv_vendor_id;
 bool hyperv_time;
 bool hyperv_crash;
 bool check_cpuid;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 05d7f26..f9304ea 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -3146,6 +3146,7 @@ static Property x86_cpu_properties[] = {
 DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
 DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
 DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
+DEFINE_PROP_STRING_LEN("hv-vendor-id", X86CPU, hyperv_vendor_id, 12),
 DEFINE_PROP_END_OF_LIST()
 };
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 80d1a7e..c4108ab 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -490,7 +490,13 @@ int kvm_arch_init_vcpu(CPUState *cs)
 if (hyperv_enabled(cpu)) {
 c = &cpuid_data.entries[cpuid_i++];
 c->function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
-memcpy(signature, "Microsoft Hv", 12);
+if (!cpu->hyperv_vendor_id) {
+memcpy(signature, "Microsoft Hv", 12);
+} else {
+memset(signature, 0, 12);
+memcpy(signature, cpu->hyperv_vendor_id,
+   strlen(cpu->hyperv_vendor_id));
+}
 c->eax = HYPERV_CPUID_MIN;
 c->ebx = signature[0];
 c->ecx = signature[1];



[PATCH v3 1/2] qapi: Create DEFINE_PROP_STRING_LEN

2015-10-16 Thread Alex Williamson
A slight addition to set_string(), which optionally tests that the
string length is within the bounds specified.

Signed-off-by: Alex Williamson 
---
 hw/core/qdev-properties.c|7 +++
 include/hw/qdev-core.h   |1 +
 include/hw/qdev-properties.h |   16 ++--
 3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index 33e245e..51a05c7 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -422,6 +422,13 @@ static void set_string(Object *obj, Visitor *v, void 
*opaque,
 error_propagate(errp, local_err);
 return;
 }
+
+if (prop->strlen >= 0 && strlen(str) > prop->strlen) {
+error_set_from_qdev_prop_error(errp, EINVAL, dev, prop, str);
+g_free(str);
+return;
+}
+
 g_free(*ptr);
 *ptr = str;
 }
diff --git a/include/hw/qdev-core.h b/include/hw/qdev-core.h
index 8057aed..12e3031 100644
--- a/include/hw/qdev-core.h
+++ b/include/hw/qdev-core.h
@@ -244,6 +244,7 @@ struct Property {
 int  arrayoffset;
 PropertyInfo *arrayinfo;
 int  arrayfieldsize;
+int  strlen;
 };
 
 struct PropertyInfo {
diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index 77538a8..2a46640 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -69,6 +69,20 @@ extern PropertyInfo qdev_prop_arraylen;
 .qtype = QTYPE_QBOOL,\
 .defval= (bool)_defval,  \
 }
+#define DEFINE_PROP_STRING(_name, _state, _field) {\
+.name  = (_name),  \
+.info  = &(qdev_prop_string),  \
+.strlen= -1,   \
+.offset= offsetof(_state, _field)  \
++ type_check(char*, typeof_field(_state, _field)), \
+}
+#define DEFINE_PROP_STRING_LEN(_name, _state, _field, _strlen) {   \
+.name  = (_name),  \
+.info  = &(qdev_prop_string),  \
+.strlen= (_strlen),\
+.offset= offsetof(_state, _field)  \
++ type_check(char*, typeof_field(_state, _field)), \
+}
 
 #define PROP_ARRAY_LEN_PREFIX "len-"
 
@@ -144,8 +158,6 @@ extern PropertyInfo qdev_prop_arraylen;
 
 #define DEFINE_PROP_CHR(_n, _s, _f) \
 DEFINE_PROP(_n, _s, _f, qdev_prop_chr, CharDriverState*)
-#define DEFINE_PROP_STRING(_n, _s, _f) \
-DEFINE_PROP(_n, _s, _f, qdev_prop_string, char*)
 #define DEFINE_PROP_NETDEV(_n, _s, _f) \
 DEFINE_PROP(_n, _s, _f, qdev_prop_netdev, NICPeers)
 #define DEFINE_PROP_VLAN(_n, _s, _f) \



Re: [RFC PATCH] vfio/type1: Do not support IOMMUs that allow bypass

2015-10-16 Thread Alex Williamson
On Fri, 2015-10-16 at 16:03 +0200, Eric Auger wrote:
> Hi Alex,
> On 10/15/2015 10:52 PM, Alex Williamson wrote:
> > We can only provide isolation if DMA is forced through the IOMMU
> > aperture.  Don't allow type1 to be used if this is not the case.
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> > 
> > Eric, I see a number of IOMMU drivers enable this, do the ones you
> > care about for ARM set geometry.force_aperture?  Thanks,
> I am currently using arm-smmu.c. I don't see force_aperture being set.

Hi Will,

Would it be possible to add iommu_domain_geometry support to arm-smmu.c?
In addition to this test to verify that DMA cannot bypass the IOMMU, I'd
eventually like to pass the aperture information out through the VFIO
API.  Thanks,

Alex
 
> >  drivers/vfio/vfio_iommu_type1.c |   12 
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c 
> > b/drivers/vfio/vfio_iommu_type1.c
> > index 57d8c37..6afa9d4 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -728,6 +728,7 @@ static int vfio_iommu_type1_attach_group(void 
> > *iommu_data,
> > struct vfio_group *group, *g;
> > struct vfio_domain *domain, *d;
> > struct bus_type *bus = NULL;
> > +   struct iommu_domain_geometry geometry;
> > int ret;
> >  
> > mutex_lock(&iommu->lock);
> > @@ -762,6 +763,17 @@ static int vfio_iommu_type1_attach_group(void 
> > *iommu_data,
> > goto out_free;
> > }
> >  
> > +   /*
> > +* If a domain does not force DMA within the aperture, devices are not
> > +* isolated and type1 is not an appropriate IOMMU model.
> > +*/
> > +   ret = iommu_domain_get_attr(domain->domain,
> > +   DOMAIN_ATTR_GEOMETRY, &geometry);
> > +   if (ret || !geometry.force_aperture) {
> > +   ret = -EPERM;
> > +   goto out_domain;
> > +   }
> > +
> > if (iommu->nesting) {
> > int attr = 1;
> >  
> > 
> 





[PATCH v2] kvm: Allow the Hyper-V vendor ID to be specified

2015-10-16 Thread Alex Williamson
According to Microsoft documentation, the signature in the standard
hypervisor CPUID leaf at 0x4000 identifies the Vendor ID and is
for reporting and diagnostic purposes only.  We can therefore allow
the user to change it to whatever they want, within the 12 character
limit.  Add a new hyperv-vendor-id option to the -cpu flag to allow
for this, ex:

 -cpu host,hv_time,hv_vendor_id=KeenlyKVM

Link: http://msdn.microsoft.com/library/windows/hardware/hh975392
Signed-off-by: Alex Williamson 
---

v2: Replace abort() with truncating the string, error report updated

Igor also had the idea of creating a DEFINE_PROP_STRING_LEN property
where we could enforce the length earlier in the parameter checking.
If we like that idea, we probably need to do it first since we don't
want to switch from truncating to erroring between releases.  I can
work on that if preferred.  Thanks,

Alex

 target-i386/cpu-qom.h |1 +
 target-i386/cpu.c |1 +
 target-i386/kvm.c |   14 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
index c35b624..6c1eaaa 100644
--- a/target-i386/cpu-qom.h
+++ b/target-i386/cpu-qom.h
@@ -88,6 +88,7 @@ typedef struct X86CPU {
 bool hyperv_vapic;
 bool hyperv_relaxed_timing;
 int hyperv_spinlock_attempts;
+char *hyperv_vendor_id;
 bool hyperv_time;
 bool hyperv_crash;
 bool check_cpuid;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 05d7f26..71df546 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -3146,6 +3146,7 @@ static Property x86_cpu_properties[] = {
 DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
 DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
 DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
+DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
 DEFINE_PROP_END_OF_LIST()
 };
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 80d1a7e..9d25fd7 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -490,7 +490,19 @@ int kvm_arch_init_vcpu(CPUState *cs)
 if (hyperv_enabled(cpu)) {
 c = &cpuid_data.entries[cpuid_i++];
 c->function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
-memcpy(signature, "Microsoft Hv", 12);
+if (!cpu->hyperv_vendor_id) {
+memcpy(signature, "Microsoft Hv", 12);
+} else {
+size_t len = strlen(cpu->hyperv_vendor_id);
+
+if (len > 12) {
+fprintf(stderr,
+"hyperv-vendor-id too long, truncated to 12 characters");
+len = 12;
+}
+memset(signature, 0, 12);
+memcpy(signature, cpu->hyperv_vendor_id, len);
+}
 c->eax = HYPERV_CPUID_MIN;
 c->ebx = signature[0];
 c->ecx = signature[1];



Re: [RESEND PATCH] kvm: Allow the Hyper-V vendor ID to be specified

2015-10-16 Thread Alex Williamson
On Fri, 2015-10-16 at 09:30 +0200, Paolo Bonzini wrote:
> 
> On 16/10/2015 00:16, Alex Williamson wrote:
> > According to Microsoft documentation, the signature in the standard
> > hypervisor CPUID leaf at 0x4000 identifies the Vendor ID and is
> > for reporting and diagnostic purposes only.  We can therefore allow
> > the user to change it to whatever they want, within the 12 character
> > limit.  Add a new hyperv-vendor-id option to the -cpu flag to allow
> > for this, ex:
> > 
> >  -cpu host,hv_time,hv_vendor_id=KeenlyKVM
> > 
> > Link: http://msdn.microsoft.com/library/windows/hardware/hh975392
> > Signed-off-by: Alex Williamson 
> > ---
> > 
> > Cc'ing get_maintainers this time.  Any takers?  Thanks,
> > Alex
> > 
> >  target-i386/cpu-qom.h |1 +
> >  target-i386/cpu.c |1 +
> >  target-i386/kvm.c |   14 +-
> >  3 files changed, 15 insertions(+), 1 deletion(-)
> > 
> > diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
> > index c35b624..6c1eaaa 100644
> > --- a/target-i386/cpu-qom.h
> > +++ b/target-i386/cpu-qom.h
> > @@ -88,6 +88,7 @@ typedef struct X86CPU {
> >  bool hyperv_vapic;
> >  bool hyperv_relaxed_timing;
> >  int hyperv_spinlock_attempts;
> > +char *hyperv_vendor_id;
> >  bool hyperv_time;
> >  bool hyperv_crash;
> >  bool check_cpuid;
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index 05d7f26..71df546 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -3146,6 +3146,7 @@ static Property x86_cpu_properties[] = {
> >  DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
> >  DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
> >  DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
> > +DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
> >  DEFINE_PROP_END_OF_LIST()
> >  };
> >  
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 80d1a7e..5e3ab22 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -490,7 +490,19 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >  if (hyperv_enabled(cpu)) {
> >  c = &cpuid_data.entries[cpuid_i++];
> >  c->function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
> > -memcpy(signature, "Microsoft Hv", 12);
> > +if (!cpu->hyperv_vendor_id) {
> > +memcpy(signature, "Microsoft Hv", 12);
> > +} else {
> > +size_t len = strlen(cpu->hyperv_vendor_id);
> > +
> > +if (len > 12) {
> > +fprintf(stderr,
> > +"hyperv-vendor-id too long, limited to 12 characters");
> > +abort();
> 
> I'm removing this abort and queueing the patch.  I'll send a pull
> request today.

If we don't abort then we should really set len = 12 here.  Thanks,

Alex

> > +}
> > +memset(signature, 0, 12);
> > +memcpy(signature, cpu->hyperv_vendor_id, len);
> > +}
> >  c->eax = HYPERV_CPUID_MIN;
> >  c->ebx = signature[0];
> >  c->ecx = signature[1];
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html





[RESEND PATCH] kvm: Allow the Hyper-V vendor ID to be specified

2015-10-15 Thread Alex Williamson
According to Microsoft documentation, the signature in the standard
hypervisor CPUID leaf at 0x4000 identifies the Vendor ID and is
for reporting and diagnostic purposes only.  We can therefore allow
the user to change it to whatever they want, within the 12 character
limit.  Add a new hyperv-vendor-id option to the -cpu flag to allow
for this, ex:

 -cpu host,hv_time,hv_vendor_id=KeenlyKVM

Link: http://msdn.microsoft.com/library/windows/hardware/hh975392
Signed-off-by: Alex Williamson 
---

Cc'ing get_maintainers this time.  Any takers?  Thanks,
Alex

 target-i386/cpu-qom.h |1 +
 target-i386/cpu.c |1 +
 target-i386/kvm.c |   14 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
index c35b624..6c1eaaa 100644
--- a/target-i386/cpu-qom.h
+++ b/target-i386/cpu-qom.h
@@ -88,6 +88,7 @@ typedef struct X86CPU {
 bool hyperv_vapic;
 bool hyperv_relaxed_timing;
 int hyperv_spinlock_attempts;
+char *hyperv_vendor_id;
 bool hyperv_time;
 bool hyperv_crash;
 bool check_cpuid;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 05d7f26..71df546 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -3146,6 +3146,7 @@ static Property x86_cpu_properties[] = {
 DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
 DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
 DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
+DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
 DEFINE_PROP_END_OF_LIST()
 };
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 80d1a7e..5e3ab22 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -490,7 +490,19 @@ int kvm_arch_init_vcpu(CPUState *cs)
 if (hyperv_enabled(cpu)) {
 c = &cpuid_data.entries[cpuid_i++];
 c->function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
-memcpy(signature, "Microsoft Hv", 12);
+if (!cpu->hyperv_vendor_id) {
+memcpy(signature, "Microsoft Hv", 12);
+} else {
+size_t len = strlen(cpu->hyperv_vendor_id);
+
+if (len > 12) {
+fprintf(stderr,
+"hyperv-vendor-id too long, limited to 12 characters");
+abort();
+}
+memset(signature, 0, 12);
+memcpy(signature, cpu->hyperv_vendor_id, len);
+}
 c->eax = HYPERV_CPUID_MIN;
 c->ebx = signature[0];
 c->ecx = signature[1];



[PATCH] vfio/pci: Use kernel VPD access functions

2015-10-15 Thread Alex Williamson
The PCI VPD capability operates on a set of window registers in PCI
config space.  Writing to the address register triggers either a read
or write, depending on the setting of the PCI_VPD_ADDR_F bit within
the address register.  The data register provides either the source
for writes or the target for reads.

This model is susceptible to being broken by concurrent access, for
which the kernel has adopted a set of access functions to serialize
these registers.  Additionally, commits like 932c435caba8 ("PCI: Add
dev_flags bit to access VPD through function 0") and 7aa6ca4d39ed
("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
that VPD registers can be shared between functions on multifunction
devices creating dependencies between otherwise independent devices.

Fortunately it's quite easy to emulate the VPD registers, simply
storing copies of the address and data registers in memory and
triggering a VPD read or write on writes to the address register.
This allows vfio users to avoid seeing spurious register changes from
accesses on other devices and enables the use of shared quirks in the
host kernel.  We can theoretically still race with access through
sysfs, but the window of opportunity is much smaller.

Signed-off-by: Alex Williamson 
Acked-by: Mark Rustad 
---

I posted this about a month ago as an RFC and it got positive feedback
as a thing we should do.  Therefore, officially proposing it for v4.4.

 drivers/vfio/pci/vfio_pci_config.c |   70 +++-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index ff75ca3..a8657ef 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -671,6 +671,73 @@ static int __init init_pci_cap_pm_perm(struct perm_bits 
*perm)
return 0;
 }
 
+static int vfio_vpd_config_write(struct vfio_pci_device *vdev, int pos,
+int count, struct perm_bits *perm,
+int offset, __le32 val)
+{
+   struct pci_dev *pdev = vdev->pdev;
+   __le16 *paddr = (__le16 *)(vdev->vconfig + pos - offset + PCI_VPD_ADDR);
+   __le32 *pdata = (__le32 *)(vdev->vconfig + pos - offset + PCI_VPD_DATA);
+   u16 addr;
+   u32 data;
+
+   /*
+* Write through to emulation.  If the write includes the upper byte
+* of PCI_VPD_ADDR, then the PCI_VPD_ADDR_F bit is written and we
+* have work to do.
+*/
+   count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+   if (count < 0 || offset > PCI_VPD_ADDR + 1 ||
+   offset + count <= PCI_VPD_ADDR + 1)
+   return count;
+
+   addr = le16_to_cpu(*paddr);
+
+   if (addr & PCI_VPD_ADDR_F) {
+   data = le32_to_cpu(*pdata);
+   if (pci_write_vpd(pdev, addr & ~PCI_VPD_ADDR_F, 4, &data) != 4)
+   return count;
+   } else {
+   if (pci_read_vpd(pdev, addr, 4, &data) != 4)
+   return count;
+   *pdata = cpu_to_le32(data);
+   }
+
+   /*
+* Toggle PCI_VPD_ADDR_F in the emulated PCI_VPD_ADDR register to
+* signal completion.  If an error occurs above, we assume that not
+* toggling this bit will induce a driver timeout.
+*/
+   addr ^= PCI_VPD_ADDR_F;
+   *paddr = cpu_to_le16(addr);
+
+   return count;
+}
+
+/* Permissions for Vital Product Data capability */
+static int __init init_pci_cap_vpd_perm(struct perm_bits *perm)
+{
+   if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_VPD]))
+   return -ENOMEM;
+
+   perm->writefn = vfio_vpd_config_write;
+
+   /*
+* We always virtualize the next field so we can remove
+* capabilities from the chain if we want to.
+*/
+   p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+   /*
+* Both the address and data registers are virtualized to
+* enable access through the pci_vpd_read/write functions
+*/
+   p_setw(perm, PCI_VPD_ADDR, (u16)ALL_VIRT, (u16)ALL_WRITE);
+   p_setd(perm, PCI_VPD_DATA, ALL_VIRT, ALL_WRITE);
+
+   return 0;
+}
+
 /* Permissions for PCI-X capability */
 static int __init init_pci_cap_pcix_perm(struct perm_bits *perm)
 {
@@ -790,6 +857,7 @@ void vfio_pci_uninit_perm_bits(void)
free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
 
free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
+   free_perm_bits(&cap_perms[PCI_CAP_ID_VPD]);
free_perm_bits(&cap_perms[PCI_CAP_ID_PCIX]);
free_perm_bits(&cap_perms[PCI_CAP_ID_EXP]);
free_perm_bits(&cap_perms[PCI_CAP_ID_AF]);
@@ -807,7 +875,7 @@ int __init vfio_pci_init_perm_bits(void)
 
/* Capabilities */
ret |= init_pci_cap_pm_perm(&cap_perms[PCI_C

[RFC PATCH] vfio/type1: Do not support IOMMUs that allow bypass

2015-10-15 Thread Alex Williamson
We can only provide isolation if DMA is forced through the IOMMU
aperture.  Don't allow type1 to be used if this is not the case.

Signed-off-by: Alex Williamson 
---

Eric, I see a number of IOMMU drivers enable this; do the ones you
care about for ARM set geometry.force_aperture?  Thanks,

Alex

 drivers/vfio/vfio_iommu_type1.c |   12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 57d8c37..6afa9d4 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -728,6 +728,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
struct vfio_group *group, *g;
struct vfio_domain *domain, *d;
struct bus_type *bus = NULL;
+   struct iommu_domain_geometry geometry;
int ret;
 
mutex_lock(&iommu->lock);
@@ -762,6 +763,17 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
goto out_free;
}
 
+   /*
+* If a domain does not force DMA within the aperture, devices are not
+* isolated and type1 is not an appropriate IOMMU model.
+*/
+   ret = iommu_domain_get_attr(domain->domain,
+   DOMAIN_ATTR_GEOMETRY, &geometry);
+   if (ret || !geometry.force_aperture) {
+   ret = -EPERM;
+   goto out_domain;
+   }
+
if (iommu->nesting) {
int attr = 1;
 



Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Alex Williamson
On Thu, 2015-10-15 at 21:42 +0200, Christoffer Dall wrote:
> On Thu, Oct 15, 2015 at 10:53:17AM -0600, Alex Williamson wrote:
> > On Thu, 2015-10-15 at 16:46 +0200, Eric Auger wrote:
> > > Hi Arnd,
> > > On 10/15/2015 03:59 PM, Arnd Bergmann wrote:
> > > > On Thursday 15 October 2015 14:12:28 Christoffer Dall wrote:
> > > >>>
> > > >>> enum vfio_platform_op {
> > > >>>   VFIO_PLATFORM_BIND,
> > > >>>   VFIO_PLATFORM_UNBIND,
> > > >>>   VFIO_PLATFORM_RESET,
> > > >>> };
> > > >>>
> > > >>> struct platform_driver {
> > > >>> int (*probe)(struct platform_device *);
> > > >>> int (*remove)(struct platform_device *);
> > > >>>   ...
> > > >>>   int (*vfio_manage)(struct platform_device *, enum 
> > > >>> vfio_platform_op);
> > > >>> struct device_driver driver;
> > > >>> };
> > > >>>
> > > >>> This would integrate much more closely into the platform driver 
> > > >>> framework,
> > > >>> just like the regular vfio driver integrates into the PCI framework.
> > > >>> Unlike PCI however, you can't just use the generic driver framework to
> > > >>> unbind the driver, because you still need device specific code.
> > > >>>
> > > >> Thanks for these suggestions, really helpful.
> > > >>
> > > >> What I don't understand in the latter example is how VFIO knows which
> > > >> struct platform_driver to interact with?
> > > > 
> > > > This would assume that the driver remains bound to the device, so VFIO
> > > > gets a pointer to the device from somewhere (as it does today) and then
> > > > follows the dev->driver pointer to get to the platform_driver.
> > 
> > The complexity of managing a bi-modal driver seems like far more than a
> > little bit of code duplication in a device specific reset module and
> > extends into how userspace makes devices available through vfio, so I
> > think it's too late for that discussion.
> >   
> 
> I have had extremely limited exposure to the implementation details of
> the drivers for devices relevant for VFIO platform, so apologies for
> asking stupid questions.
> 
> I'm sure that your point is valid, I just don't fully understand how
> the complexities of a bi-modal driver arise.
> 
> Is it simply that the reset function in a particular device driver may
> not be self-contained so therefore the whole driver would need to be
> refactored to be able to do a reset for the purpose of VFIO?

Yes, I would expect that a reset function in a driver is not typically
self-contained, probably referencing driver-specific data structures for
register offsets, relying on various mappings that are expected to be
created from the driver probe() function, etc.  It also creates a
strange dependency on the host driver: how is the user to know they need
the native host driver loaded for full functionality in a device
assignment scenario?  Are we going to need to do a request_module() of
the host driver in vfio platform anyway?  What if there are multiple
devices and the host driver claims the others when loaded?  In the case
of PCI and SR-IOV virtual functions, I often blacklist the host VF
driver because I shouldn't need it when I only intend to use the device
in guests.  Not to mention that we'd have to drop a little bit of vfio
knowledge into each host driver that we intend to enlighten like this,
and how do we resolve whether the host driver, potentially compiled from
a separate source tree, has this support.

I really don't see the layering violation in having a set of reset
functions and some lookup mechanism to pick the correct one.  The vfio
platform driver is a meta driver and sometimes we need to enlighten it a
little bit more about the device it's operating on.  For PCI we have all
sorts of special functionality for reset, but we have a standard to work
with there, so we may need to choose between a bus reset, PM reset, AF
FLR, PCIe FLR, or device specific reset, but it's all buried in the PCI
core code; where device specific resets are the exception on PCI, they
are the norm on platform.

> > > >> Also, just so I'm sure I understand correctly, VFIO_PLATFORM_UNBIND is
> > > >> then called by VFIO before the VFIO driver unbinds from the device
> > > >> (unbinding the platform driver from the device being a completely
> > > >> separate thing)?
> &

Re: [PATCH] VFIO: platform: AMD xgbe reset module

2015-10-15 Thread Alex Williamson
On Thu, 2015-10-15 at 16:46 +0200, Eric Auger wrote:
> Hi Arnd,
> On 10/15/2015 03:59 PM, Arnd Bergmann wrote:
> > On Thursday 15 October 2015 14:12:28 Christoffer Dall wrote:
> >>>
> >>> enum vfio_platform_op {
> >>>   VFIO_PLATFORM_BIND,
> >>>   VFIO_PLATFORM_UNBIND,
> >>>   VFIO_PLATFORM_RESET,
> >>> };
> >>>
> >>> struct platform_driver {
> >>> int (*probe)(struct platform_device *);
> >>> int (*remove)(struct platform_device *);
> >>>   ...
> >>>   int (*vfio_manage)(struct platform_device *, enum vfio_platform_op);
> >>> struct device_driver driver;
> >>> };
> >>>
> >>> This would integrate much more closely into the platform driver framework,
> >>> just like the regular vfio driver integrates into the PCI framework.
> >>> Unlike PCI however, you can't just use the generic driver framework to
> >>> unbind the driver, because you still need device specific code.
> >>>
> >> Thanks for these suggestions, really helpful.
> >>
> >> What I don't understand in the latter example is how VFIO knows which
> >> struct platform_driver to interact with?
> > 
> > This would assume that the driver remains bound to the device, so VFIO
> > gets a pointer to the device from somewhere (as it does today) and then
> > follows the dev->driver pointer to get to the platform_driver.

The complexity of managing a bi-modal driver seems like far more than a
little bit of code duplication in a device specific reset module and
extends into how userspace makes devices available through vfio, so I
think it's too late for that discussion.
  
> >> Also, just so I'm sure I understand correctly, VFIO_PLATFORM_UNBIND is
> >> then called by VFIO before the VFIO driver unbinds from the device
> >> (unbinding the platform driver from the device being a completely
> >> separate thing)?
> > 
> > This is where we'd need a little more changes for this approach. Instead
> > of unbinding the device from its driver, the idea would be that the
> > driver remains bound as far as the driver model is concerned, but
> > it would be in a quiescent state where no other subsystem interacts with
> > it (i.e. it gets unregistered from networking core or whichever it uses).
> 
> Currently we use the same mechanism as for PCI, ie. unbind the native
> driver and then bind VFIO platform driver in its place. Don't you think
> changing this may be a pain for user-space tools that are designed to
> work that way for PCI?
> 
> My personal preference would be to start with your first proposal since
> it looks (to me) less complex and "unknown" that the 2d approach.
> 
> Let's wait for Alex opinion too...

I thought the reason we took the approach we have now is so that we
don't have reset code loaded into the kernel unless we have a device
that needs it.  Therefore we don't really want to preemptively load all
the reset drivers and have them do a registration.  The unfortunate
side-effect of that is the platform code needs to go looking for the
driver.  We do that via the __symbol_get() trick, which only fails
without modules because the underscore variant isn't defined in that
case.  I remember asking Eric previously why we're using that rather
than symbol_get(); I've since forgotten his answer, but the fact that
__symbol_get() is only defined for modules makes it moot: we either need
to make symbol_get() work or define __symbol_get() for non-module
builds.

Otherwise, we should probably abandon the idea of these reset functions
being modules and build them into the vfio platform driver (which would
still be less loaded, dead code than a bi-modal host driver).  Thanks,

Alex



Re: BUG: unable to handle kernel paging request with v4.3-rc4

2015-10-09 Thread Alex Williamson
On Fri, 2015-10-09 at 16:58 +0200, Joerg Roedel wrote:
> Hi Alex,
> 
> while playing around with attaching a 32bit PCI device to a guest via
> VFIO I triggered this oops:
> 
> [  192.289917] kernel tried to execute NX-protected page - exploit attempt? 
> (uid: 0)
> [  192.298245] BUG: unable to handle kernel paging request at 880224582608
> [  192.306195] IP: [] 0x880224582608
> [  192.312302] PGD 2026067 PUD 2029067 PMD 8002244001e3 
> [  192.318589] Oops: 0011 [#1] PREEMPT SMP 
> [  192.323363] Modules linked in: kvm_amd kvm vfio_pci vfio_iommu_type1 
> vfio_virqfd vfio bnep bluetooth rfkill iscsi_ibft iscsi_boot_sysfs af_packet 
> snd_hda_codec_via snd_hda_codec_generic snd_hda_codec_hdmi raid1 
> snd_hda_intel crct10dif_pclmul crc32_pclmul snd_hda_codec crc32c_intel 
> ghash_clmulni_intel snd_hwdep snd_hda_core snd_pcm snd_timer aesni_intel 
> aes_x86_64 md_mod glue_helper lrw gf128mul ablk_helper be2net snd serio_raw 
> cryptd sp5100_tco pcspkr xhci_pci vxlan ip6_udp_tunnel fam15h_power sky2 
> udp_tunnel xhci_hcd soundcore dm_mod k10temp i2c_piix4 shpchp wmi 
> acpi_cpufreq asus_atk0110 button processor ata_generic firewire_ohci 
> firewire_core ohci_pci crc_itu_t radeon i2c_algo_bit drm_kms_helper 
> pata_jmicron syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm sg [last 
> unloaded: kvm]
> [  192.399986] CPU: 4 PID: 2037 Comm: qemu-system-x86 Not tainted 4.3.0-rc4+ 
> #4
> [  192.408260] Hardware name: System manufacturer System Product 
> Name/Crosshair IV Formula, BIOS 3027 10/28/2011
> [  192.419746] task: 880223e24040 ti: 8800cae5c000 task.ti: 
> 8800cae5c000
> [  192.428506] RIP: 0010:[]  [] 
> 0x880224582608
> [  192.437376] RSP: 0018:8800cae5fe58  EFLAGS: 00010286
> [  192.443940] RAX: 8800cb3c8800 RBX: 8800cba55800 RCX: 
> 0004
> [  192.452370] RDX: 0004 RSI: 8802233e7887 RDI: 
> 0001
> [  192.460796] RBP: 8800cae5fe98 R08: 0ff8 R09: 
> 0008
> [  192.469145] R10: 0001d300 R11:  R12: 
> 8800cba55800
> [  192.477584] R13: 8802233e7880 R14: 8800cba55830 R15: 
> 7fff43b30b50
> [  192.486025] FS:  7f94375b2c00() GS:88022ed0() 
> knlGS:
> [  192.495445] CS:  0010 DS:  ES:  CR0: 80050033
> [  192.502481] CR2: 880224582608 CR3: cb9d9000 CR4: 
> 000406e0
> [  192.510850] Stack:
> [  192.514094]  a03f9733 0001 0001 
> 880223c74600
> [  192.522876]  8800ca4f6d88 7fff43b30b50 3b6a 
> 7fff43b30b50
> [  192.531582]  8800cae5ff08 811efc7d 8800cae5fec8 
> 880223c74600
> [  192.540439] Call Trace:
> [  192.544145]  [] ? vfio_group_fops_unl_ioctl+0x253/0x410 
> [vfio]
> [  192.552898]  [] do_vfs_ioctl+0x2cd/0x4c0
> [  192.559713]  [] ? __fget+0x77/0xb0
> [  192.565998]  [] SyS_ioctl+0x79/0x90
> [  192.572373]  [] ? syscall_return_slowpath+0x50/0x130
> [  192.580258]  [] entry_SYSCALL_64_fastpath+0x16/0x75
> [  192.588049] Code: 88 ff ff d8 25 58 24 02 88 ff ff e8 25 58 24 02 88 ff ff 
> e8 25 58 24 02 88 ff ff 58 a2 70 21 02 88 ff ff c0 65 39 cb 00 88 ff ff <08> 
> 2d 58 24 02 88 ff ff 08 88 3c cb 00 88 ff ff d8 58 c1 24 02 
> [  192.610309] RIP  [] 0x880224582608
> [  192.616940]  RSP 
> [  192.621805] CR2: 880224582608
> [  192.632826] ---[ end trace ce135ef0c9b1869f ]---
> 
> I am not sure whether this is an IOMMU or VFIO bug, have you seen
> something like this before?

Hey Joerg,

I have not seen this one yet.  There literally have been no changes for
vfio in 4.3, so if this is new, it may be collateral from changes
elsewhere.  32bit devices really shouldn't make any difference to vfio,
I'll see if I can reproduce it myself though.  Thanks,

Alex



Re: [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region

2015-10-06 Thread Alex Williamson
On Tue, 2015-10-06 at 09:39 +, Bhushan Bharat wrote:
> 
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Tuesday, October 06, 2015 4:15 AM
> > To: Bhushan Bharat-R65777 
> > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > christoffer.d...@linaro.org; eric.au...@linaro.org; pranavku...@linaro.org;
> > marc.zyng...@arm.com; will.dea...@arm.com
> > Subject: Re: [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova
> > region
> > 
> > On Mon, 2015-10-05 at 04:55 +, Bhushan Bharat wrote:
> > > Hi Alex,
> > >
> > > > -Original Message-
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Saturday, October 03, 2015 4:16 AM
> > > > To: Bhushan Bharat-R65777 
> > > > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > > > christoffer.d...@linaro.org; eric.au...@linaro.org;
> > > > pranavku...@linaro.org; marc.zyng...@arm.com; will.dea...@arm.com
> > > > Subject: Re: [RFC PATCH 1/6] vfio: Add interface for add/del
> > > > reserved iova region
> > > >
> > > > On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> > > > > This Patch adds the VFIO APIs to add and remove reserved iova regions.
> > > > > The reserved iova region can be used for mapping some specific
> > > > > physical address in iommu.
> > > > >
> > > > > Currently we are planning to use this interface for adding iova
> > > > > regions for creating iommu of msi-pages. But the API are designed
> > > > > for future extension where some other physical address can be
> > mapped.
> > > > >
> > > > > Signed-off-by: Bharat Bhushan 
> > > > > ---
> > > > >  drivers/vfio/vfio_iommu_type1.c | 201
> > > > +++-
> > > > >  include/uapi/linux/vfio.h   |  43 +
> > > > >  2 files changed, 243 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > > > b/drivers/vfio/vfio_iommu_type1.c index 57d8c37..fa5d3e4 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -59,6 +59,7 @@ struct vfio_iommu {
> > > > >   struct rb_root  dma_list;
> > > > >   boolv2;
> > > > >   boolnesting;
> > > > > + struct list_headreserved_iova_list;
> > > >
> > > > This alignment leads to poor packing in the structure, put it above the
> > bools.
> > >
> > > ok
> > >
> > > >
> > > > >  };
> > > > >
> > > > >  struct vfio_domain {
> > > > > @@ -77,6 +78,15 @@ struct vfio_dma {
> > > > >   int prot;   /* IOMMU_READ/WRITE */
> > > > >  };
> > > > >
> > > > > +struct vfio_resvd_region {
> > > > > + dma_addr_t  iova;
> > > > > + size_t  size;
> > > > > + int prot;   /* IOMMU_READ/WRITE */
> > > > > + int refcount;   /* ref count of mappings */
> > > > > + uint64_t map_paddr; /* Mapped Physical Address */
> > > >
> > > > phys_addr_t
> > >
> > > Ok,
> > >
> > > >
> > > > > + struct list_head next;
> > > > > +};
> > > > > +
> > > > >  struct vfio_group {
> > > > >   struct iommu_group  *iommu_group;
> > > > >   struct list_headnext;
> > > > > @@ -106,6 +116,38 @@ static struct vfio_dma *vfio_find_dma(struct
> > > > vfio_iommu *iommu,
> > > > >   return NULL;
> > > > >  }
> > > > >
> > > > > +/* This function must be called with iommu->lock held */
> > > > > +static bool vfio_overlap_with_resvd_region(struct vfio_iommu *iommu,
> > > > > + dma_addr_t start, size_t size)
> > > > > +{
> > > > > + struct vfio_resvd_region *region;
> > > > >

Re: [RFC PATCH 4/6] vfio: Add interface to iommu-map/unmap MSI pages

2015-10-06 Thread Alex Williamson
On Tue, 2015-10-06 at 09:05 +, Bhushan Bharat wrote:
> 
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Tuesday, October 06, 2015 4:15 AM
> > To: Bhushan Bharat-R65777 
> > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > christoffer.d...@linaro.org; eric.au...@linaro.org; pranavku...@linaro.org;
> > marc.zyng...@arm.com; will.dea...@arm.com
> > Subject: Re: [RFC PATCH 4/6] vfio: Add interface to iommu-map/unmap MSI
> > pages
> > 
> > On Mon, 2015-10-05 at 06:27 +, Bhushan Bharat wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Saturday, October 03, 2015 4:16 AM
> > > > To: Bhushan Bharat-R65777 
> > > > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > > > christoffer.d...@linaro.org; eric.au...@linaro.org;
> > > > pranavku...@linaro.org; marc.zyng...@arm.com; will.dea...@arm.com
> > > > Subject: Re: [RFC PATCH 4/6] vfio: Add interface to iommu-map/unmap
> > > > MSI pages
> > > >
> > > > On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> > > > > For MSI interrupts to work for a pass-through device we need to
> > > > > have a mapping of msi-pages in the iommu. Now on some platforms
> > > > > (like x86) this msi-pages mapping happens magically, and in these
> > > > > cases they choose an iova which they somehow know will never
> > > > > overlap with guest memory. But this magic iova selection may not
> > > > > always hold true for all platforms (like PowerPC and ARM64).
> > > > >
> > > > > Also on the x86 platform there is no problem as long as we are
> > > > > running an x86 guest on an x86 host, but there can be issues when
> > > > > running a non-x86 guest on an x86 host, or with other userspace
> > > > > applications (I think ODP/DPDK), as in these cases there is a
> > > > > chance that it overlaps with the guest memory mapping.
> > > >
> > > > Wow, it's amazing anything works... smoke and mirrors.
> > > >
> > > > > This patch adds interfaces to iommu-map and iommu-unmap msi-pages
> > > > > at a reserved iova chosen by userspace.
> > > > >
> > > > > Signed-off-by: Bharat Bhushan 
> > > > > ---
> > > > >  drivers/vfio/vfio.c |  52 +++
> > > > >  drivers/vfio/vfio_iommu_type1.c | 111
> > > > 
> > > > >  include/linux/vfio.h|   9 +++-
> > > > >  3 files changed, 171 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > > > 2fb29df..a817d2d 100644
> > > > > --- a/drivers/vfio/vfio.c
> > > > > +++ b/drivers/vfio/vfio.c
> > > > > @@ -605,6 +605,58 @@ static int vfio_iommu_group_notifier(struct
> > > > notifier_block *nb,
> > > > >   return NOTIFY_OK;
> > > > >  }
> > > > >
> > > > > +int vfio_device_map_msi(struct vfio_device *device, uint64_t msi_addr,
> > > > > + uint32_t size, uint64_t *msi_iova)
> > > > > +{
> > > > > + struct vfio_container *container = device->group->container;
> > > > > + struct vfio_iommu_driver *driver;
> > > > > + int ret;
> > > > > +
> > > > > + /* Validate address and size */
> > > > > + if (!msi_addr || !size || !msi_iova)
> > > > > + return -EINVAL;
> > > > > +
> > > > > + down_read(&container->group_lock);
> > > > > +
> > > > > + driver = container->iommu_driver;
> > > > > + if (!driver || !driver->ops || !driver->ops->msi_map) {
> > > > > + up_read(&container->group_lock);
> > > > > + return -EINVAL;
> > > > > + }
> > > > > +
> > > > > + ret = driver->ops->msi_map(container->iommu_data,
> > > > > +msi_addr, size, msi_iova);
> > > > > +
> > > > > + up_read(&container->group_lock);
> > > > > + return ret

Re: [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state

2015-10-06 Thread Alex Williamson
On Tue, 2015-10-06 at 08:53 +, Bhushan Bharat wrote:
> 
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Tuesday, October 06, 2015 4:15 AM
> > To: Bhushan Bharat-R65777 
> > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > christoffer.d...@linaro.org; eric.au...@linaro.org; pranavku...@linaro.org;
> > marc.zyng...@arm.com; will.dea...@arm.com
> > Subject: Re: [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs
> > automap state
> > 
> > On Mon, 2015-10-05 at 06:00 +, Bhushan Bharat wrote:
> > > > > @@ -1138,6 +1156,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
> > > > >   }
> > > > >   } else if (cmd == VFIO_IOMMU_GET_INFO) {
> > > > >   struct vfio_iommu_type1_info info;
> > > > > + struct iommu_domain_msi_maps msi_maps;
> > > > > + int ret;
> > > > >
> > > > >   minsz = offsetofend(struct vfio_iommu_type1_info,
> > > > iova_pgsizes);
> > > > >
> > > > > @@ -1149,6 +1169,18 @@ static long vfio_iommu_type1_ioctl(void
> > > > > *iommu_data,
> > > > >
> > > > >   info.flags = 0;
> > > > >
> > > > > + ret = vfio_domains_get_msi_maps(iommu, &msi_maps);
> > > > > + if (ret)
> > > > > + return ret;
> > > >
> > > > And now ioctl(VFIO_IOMMU_GET_INFO) no longer works for any IOMMU
> > > > implementing domain_get_attr but not supporting
> > > > DOMAIN_ATTR_MSI_MAPPING.
> > >
> > > With this current patch version this will get the default assumed
> > > behavior, as you commented on the previous patch.
> > 
> > How so?
> 
> You are right, the ioctl will return failure. But that should be ok, right?

Not remotely.  ioctl(VFIO_IOMMU_GET_INFO) can't suddenly stop working on
some platforms.

> > 
> > +   msi_maps->automap = true;
> > +   msi_maps->override_automap = false;
> > +
> > +   if (domain->ops->domain_get_attr)
> > +   ret = domain->ops->domain_get_attr(domain, attr,
> > + data);
> > 
> > If domain_get_attr is implemented, but DOMAIN_ATTR_MSI_MAPPING is
> > not, ret should be an error code.
> 
> Currently it returns same error code returned by 
> domain->ops->domain_get_attr(). 
> I do not think we want to complicate that we return an error to user-space 
> that msi's probably cannot be used but user-space can continue with Legacy 
> interrupt, or you want that?

I can't really parse your statement, but ioctl(VFIO_IOMMU_GET_INFO)
works today and it must work with your changes.  Your change should only
affect whether some flags are visible, MSI has worked just fine up to
this point on other platforms.





Re: [RFC PATCH 5/6] vfio-pci: Create iommu mapping for msi interrupt

2015-10-06 Thread Alex Williamson
On Tue, 2015-10-06 at 08:32 +, Bhushan Bharat wrote:
> 
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Tuesday, October 06, 2015 4:15 AM
> > To: Bhushan Bharat-R65777 
> > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > christoffer.d...@linaro.org; eric.au...@linaro.org; pranavku...@linaro.org;
> > marc.zyng...@arm.com; will.dea...@arm.com
> > Subject: Re: [RFC PATCH 5/6] vfio-pci: Create iommu mapping for msi
> > interrupt
> > 
> > On Mon, 2015-10-05 at 07:20 +, Bhushan Bharat wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Saturday, October 03, 2015 4:17 AM
> > > > To: Bhushan Bharat-R65777 
> > > > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > > > christoffer.d...@linaro.org; eric.au...@linaro.org;
> > > > pranavku...@linaro.org; marc.zyng...@arm.com; will.dea...@arm.com
> > > > Subject: Re: [RFC PATCH 5/6] vfio-pci: Create iommu mapping for msi
> > > > interrupt
> > > >
> > > > On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> > > > > An MSI-address is allocated and programmed in pcie device during
> > > > > interrupt configuration. Now for a pass-through device, try to
> > > > > create the iommu mapping for this allocted/programmed msi-address.
> > > > > If the iommu mapping is created and the msi address programmed in
> > > > > the pcie device is different from msi-iova as per iommu
> > > > > programming then reconfigure the pci device to use msi-iova as msi
> > address.
> > > > >
> > > > > Signed-off-by: Bharat Bhushan 
> > > > > ---
> > > > >  drivers/vfio/pci/vfio_pci_intrs.c | 36
> > > > > ++--
> > > > >  1 file changed, 34 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c
> > > > > b/drivers/vfio/pci/vfio_pci_intrs.c
> > > > > index 1f577b4..c9690af 100644
> > > > > --- a/drivers/vfio/pci/vfio_pci_intrs.c
> > > > > +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> > > > > @@ -312,13 +312,23 @@ static int vfio_msi_set_vector_signal(struct
> > > > vfio_pci_device *vdev,
> > > > >   int irq = msix ? vdev->msix[vector].vector : pdev->irq + vector;
> > > > >   char *name = msix ? "vfio-msix" : "vfio-msi";
> > > > >   struct eventfd_ctx *trigger;
> > > > > + struct msi_msg msg;
> > > > > + struct vfio_device *device;
> > > > > + uint64_t msi_addr, msi_iova;
> > > > >   int ret;
> > > > >
> > > > >   if (vector >= vdev->num_ctx)
> > > > >   return -EINVAL;
> > > > >
> > > > > + device = vfio_device_get_from_dev(&pdev->dev);
> > > >
> > > > Have you looked at this function?  I don't think we want to be doing
> > > > that every time we want to poke the interrupt configuration.
> > >
> > > I am trying to describe what I understood, a device can have many
> > interrupts and we should setup iommu only once, when called for the first
> > time to enable/setup interrupt.
> > > Similarly when disabling the interrupt we should iommu-unmap when
> > > called for the last enabled interrupt for that device. Now with this
> > > understanding, should I move this map-unmap to separate functions and
> > > call them from vfio_msi_set_block() rather than in
> > > vfio_msi_set_vector_signal()
> > 
> > Interrupts can be setup and torn down at any time and I don't see how one
> > function or the other makes much difference.
> > vfio_device_get_from_dev() is enough overhead that the data we need
> > should be cached if we're going to call it with some regularity.  Maybe
> > vfio_iommu_driver_ops.open() should be called with a pointer to the
> > vfio_device... or the vfio_group.
> 
> vfio_iommu_driver_ops.open() ? or do you mean vfio_pci_open() should be 
> called with vfio_device or vfio_group, and we will cache that in 
> vfio_pci_device ?

vfio_pci_open() is an implementation of vfio_iommu_driver_ops.open().
The internal API between vfio and vfio bus drivers would need to have a
parameter added.



Re: [RFC PATCH 6/6] arm-smmu: Allow to set iommu mapping for MSI

2015-10-05 Thread Alex Williamson
On Mon, 2015-10-05 at 08:33 +, Bhushan Bharat wrote:
> 
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Saturday, October 03, 2015 4:17 AM
> > To: Bhushan Bharat-R65777 
> > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > christoffer.d...@linaro.org; eric.au...@linaro.org; pranavku...@linaro.org;
> > marc.zyng...@arm.com; will.dea...@arm.com
> > Subject: Re: [RFC PATCH 6/6] arm-smmu: Allow to set iommu mapping for
> > MSI
> > 
> > On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> > > Finally ARM SMMU declare that iommu-mapping for MSI-pages are not set
> > > automatically and it should be set explicitly.
> > >
> > > Signed-off-by: Bharat Bhushan 
> > > ---
> > >  drivers/iommu/arm-smmu.c | 7 ++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> > index
> > > a3956fb..9d37e72 100644
> > > --- a/drivers/iommu/arm-smmu.c
> > > +++ b/drivers/iommu/arm-smmu.c
> > > @@ -1401,13 +1401,18 @@ static int arm_smmu_domain_get_attr(struct
> > iommu_domain *domain,
> > >   enum iommu_attr attr, void *data)  {
> > >   struct arm_smmu_domain *smmu_domain =
> > to_smmu_domain(domain);
> > > + struct iommu_domain_msi_maps *msi_maps;
> > >
> > >   switch (attr) {
> > >   case DOMAIN_ATTR_NESTING:
> > >   *(int *)data = (smmu_domain->stage ==
> > ARM_SMMU_DOMAIN_NESTED);
> > >   return 0;
> > >   case DOMAIN_ATTR_MSI_MAPPING:
> > > - /* Dummy handling added */
> > > + msi_maps = data;
> > > +
> > > + msi_maps->automap = false;
> > > + msi_maps->override_automap = true;
> > > +
> > >   return 0;
> > >   default:
> > >   return -ENODEV;
> > 
> > In previous discussions I understood one of the problems you were trying to
> > solve was having a limited number of MSI banks and while you may be able
> > to get isolated MSI banks for some number of users, it wasn't unlimited and
> > sharing may be required.  I don't see any of that addressed in this series.
> 
> That problem was on PowerPC. Infact there were two problems, one which MSI 
> bank to be used and second how to create iommu-mapping for device assigned to 
> userspace.
> First problem was PowerPC specific and that will be solved separately.
> For second problem, earlier I tried to added a couple of MSI specific ioctls 
> and you suggested (IIUC) that we should have a generic reserved-iova type of 
> API and then we can map MSI bank using reserved-iova and this will not 
> require involvement of user-space.
> 
> > 
> > Also, the management of reserved IOVAs vs MSI addresses looks really
> > dubious to me.  How does your platform pick an MSI address and what are
> > we breaking by covertly changing it?  We seem to be masking over at the
> > VFIO level, where there should be lower level interfaces doing the right 
> > thing
> > when we configure MSI on the device.
> 
> Yes, In my understanding the right solution should be:
>  1) VFIO driver should know what physical-msi-address will be used for 
> devices in an iommu-group.
> I did not find an generic API, on PowerPC I added some function in 
> ffrescale msi-driver and called from vfio-iommu-fsl-pamu.c (not yet 
> upstreamed).
>  2) VFIO driver should know what IOVA to be used for creating iommu-mapping 
> (VFIO APIs patch of this patch series)
>  3) VFIO driver will create the iommu-mapping using (1) and (2)
>  4) VFIO driver should be able to tell the msi-driver that for a given device 
> it should use different IOVA. So when composing the msi message (for the 
> devices is the given iommu-group) it should use that programmed iova as 
> MSI-address. This interface also needed to be developed.
> 
> I was not sure of which approach we should take. The current approach in the 
> patch is simple to develop so I went ahead to take input but I agree this 
> does not look very good.
> What do you think, should drop this approach and work out the approach as 
> described above.

I'm certainly not interested in applying and maintaining an interim
solution that isn't the right one.  It seems like VFIO is too involved
in this process in your example.  On x86 we have per vector isolation
and the only thing we're missing is reporting back of the region used by
MSI vectors as reserved IOVA space (but it's stan

Re: [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region

2015-10-05 Thread Alex Williamson
On Mon, 2015-10-05 at 04:55 +, Bhushan Bharat wrote:
> Hi Alex,
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Saturday, October 03, 2015 4:16 AM
> > To: Bhushan Bharat-R65777 
> > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > christoffer.d...@linaro.org; eric.au...@linaro.org; pranavku...@linaro.org;
> > marc.zyng...@arm.com; will.dea...@arm.com
> > Subject: Re: [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova
> > region
> > 
> > On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> > > This Patch adds the VFIO APIs to add and remove reserved iova regions.
> > > The reserved iova region can be used for mapping some specific
> > > physical address in iommu.
> > >
> > > Currently we are planning to use this interface for adding iova
> > > regions for creating iommu of msi-pages. But the API are designed for
> > > future extension where some other physical address can be mapped.
> > >
> > > Signed-off-by: Bharat Bhushan 
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 201
> > +++-
> > >  include/uapi/linux/vfio.h   |  43 +
> > >  2 files changed, 243 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_iommu_type1.c index 57d8c37..fa5d3e4 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -59,6 +59,7 @@ struct vfio_iommu {
> > >   struct rb_root  dma_list;
> > >   boolv2;
> > >   boolnesting;
> > > + struct list_headreserved_iova_list;
> > 
> > This alignment leads to poor packing in the structure, put it above the 
> > bools.
> 
> ok
> 
> > 
> > >  };
> > >
> > >  struct vfio_domain {
> > > @@ -77,6 +78,15 @@ struct vfio_dma {
> > >   int prot;   /* IOMMU_READ/WRITE */
> > >  };
> > >
> > > +struct vfio_resvd_region {
> > > + dma_addr_t  iova;
> > > + size_t  size;
> > > + int prot;   /* IOMMU_READ/WRITE */
> > > + int refcount;   /* ref count of mappings */
> > > + uint64_tmap_paddr;  /* Mapped Physical Address
> > */
> > 
> > phys_addr_t
> 
> Ok,
> 
> > 
> > > + struct list_head next;
> > > +};
> > > +
> > >  struct vfio_group {
> > >   struct iommu_group  *iommu_group;
> > >   struct list_headnext;
> > > @@ -106,6 +116,38 @@ static struct vfio_dma *vfio_find_dma(struct
> > vfio_iommu *iommu,
> > >   return NULL;
> > >  }
> > >
> > > +/* This function must be called with iommu->lock held */ static bool
> > > +vfio_overlap_with_resvd_region(struct vfio_iommu *iommu,
> > > +dma_addr_t start, size_t size) {
> > > + struct vfio_resvd_region *region;
> > > +
> > > + list_for_each_entry(region, &iommu->reserved_iova_list, next) {
> > > + if (region->iova < start)
> > > + return (start - region->iova < region->size);
> > > + else if (start < region->iova)
> > > + return (region->iova - start < size);
> > 
> > <= on both of the return lines?
> 
> I think is should be "<" and not "=<", no ?

Yep, looks like you're right.  Maybe there's a more straightforward way
to do this.

> > 
> > > +
> > > + return (region->size > 0 && size > 0);
> > > + }
> > > +
> > > + return false;
> > > +}
> > > +
> > > +/* This function must be called with iommu->lock held */ static
> > > +struct vfio_resvd_region *vfio_find_resvd_region(struct vfio_iommu
> > *iommu,
> > > +  dma_addr_t start, size_t
> > size) {
> > > + struct vfio_resvd_region *region;
> > > +
> > > + list_for_each_entry(region, &iommu->reserved_iova_list, next)
> > > + if (region->iova == start && region->size == size)
> > > + return region;
> > > +
> > > + return NULL;
> > > +}
> > > +
> > >  static void vfio_link

Re: [RFC PATCH 4/6] vfio: Add interface to iommu-map/unmap MSI pages

2015-10-05 Thread Alex Williamson
On Mon, 2015-10-05 at 06:27 +, Bhushan Bharat wrote:
> 
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Saturday, October 03, 2015 4:16 AM
> > To: Bhushan Bharat-R65777 
> > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > christoffer.d...@linaro.org; eric.au...@linaro.org; pranavku...@linaro.org;
> > marc.zyng...@arm.com; will.dea...@arm.com
> > Subject: Re: [RFC PATCH 4/6] vfio: Add interface to iommu-map/unmap MSI
> > pages
> > 
> > On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> > > For MSI interrupts to work for a pass-through devices we need to have
> > > mapping of msi-pages in iommu. Now on some platforms (like x86) does
> > > this msi-pages mapping happens magically and in these case they
> > > chooses an iova which they somehow know that it will never overlap
> > > with guest memory. But this magic iova selection may not be always
> > > true for all platform (like PowerPC and ARM64).
> > >
> > > Also on x86 platform, there is no problem as long as running a
> > > x86-guest on x86-host but there can be issues when running a non-x86
> > > guest on
> > > x86 host or other userspace applications like (I think ODP/DPDK).
> > > As in these cases there can be chances that it overlaps with guest
> > > memory mapping.
> > 
> > Wow, it's amazing anything works... smoke and mirrors.
> > 
> > > This patch add interface to iommu-map and iommu-unmap msi-pages at
> > > reserved iova chosen by userspace.
> > >
> > > Signed-off-by: Bharat Bhushan 
> > > ---
> > >  drivers/vfio/vfio.c |  52 +++
> > >  drivers/vfio/vfio_iommu_type1.c | 111
> > 
> > >  include/linux/vfio.h|   9 +++-
> > >  3 files changed, 171 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c index
> > > 2fb29df..a817d2d 100644
> > > --- a/drivers/vfio/vfio.c
> > > +++ b/drivers/vfio/vfio.c
> > > @@ -605,6 +605,58 @@ static int vfio_iommu_group_notifier(struct
> > notifier_block *nb,
> > >   return NOTIFY_OK;
> > >  }
> > >
> > > +int vfio_device_map_msi(struct vfio_device *device, uint64_t msi_addr,
> > > + uint32_t size, uint64_t *msi_iova) {
> > > + struct vfio_container *container = device->group->container;
> > > + struct vfio_iommu_driver *driver;
> > > + int ret;
> > > +
> > > + /* Validate address and size */
> > > + if (!msi_addr || !size || !msi_iova)
> > > + return -EINVAL;
> > > +
> > > + down_read(&container->group_lock);
> > > +
> > > + driver = container->iommu_driver;
> > > + if (!driver || !driver->ops || !driver->ops->msi_map) {
> > > + up_read(&container->group_lock);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + ret = driver->ops->msi_map(container->iommu_data,
> > > +msi_addr, size, msi_iova);
> > > +
> > > + up_read(&container->group_lock);
> > > + return ret;
> > > +}
> > > +
> > > +int vfio_device_unmap_msi(struct vfio_device *device, uint64_t
> > msi_iova,
> > > +   uint64_t size)
> > > +{
> > > + struct vfio_container *container = device->group->container;
> > > + struct vfio_iommu_driver *driver;
> > > + int ret;
> > > +
> > > + /* Validate address and size */
> > > + if (!msi_iova || !size)
> > > + return -EINVAL;
> > > +
> > > + down_read(&container->group_lock);
> > > +
> > > + driver = container->iommu_driver;
> > > + if (!driver || !driver->ops || !driver->ops->msi_unmap) {
> > > + up_read(&container->group_lock);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + ret = driver->ops->msi_unmap(container->iommu_data,
> > > +  msi_iova, size);
> > > +
> > > + up_read(&container->group_lock);
> > > + return ret;
> > > +}
> > > +
> > >  /**
> > >   * VFIO driver API
> > >   */
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > > b/drivers/vfio/vfio_

Re: [RFC PATCH 5/6] vfio-pci: Create iommu mapping for msi interrupt

2015-10-05 Thread Alex Williamson
On Mon, 2015-10-05 at 07:20 +, Bhushan Bharat wrote:
> 
> 
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Saturday, October 03, 2015 4:17 AM
> > To: Bhushan Bharat-R65777 
> > Cc: kvm...@lists.cs.columbia.edu; kvm@vger.kernel.org;
> > christoffer.d...@linaro.org; eric.au...@linaro.org; pranavku...@linaro.org;
> > marc.zyng...@arm.com; will.dea...@arm.com
> > Subject: Re: [RFC PATCH 5/6] vfio-pci: Create iommu mapping for msi
> > interrupt
> > 
> > On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> > > An MSI-address is allocated and programmed in pcie device during
> > > interrupt configuration. Now for a pass-through device, try to create
> > > the iommu mapping for this allocted/programmed msi-address.  If the
> > > iommu mapping is created and the msi address programmed in the pcie
> > > device is different from msi-iova as per iommu programming then
> > > reconfigure the pci device to use msi-iova as msi address.
> > >
> > > Signed-off-by: Bharat Bhushan 
> > > ---
> > >  drivers/vfio/pci/vfio_pci_intrs.c | 36
> > > ++--
> > >  1 file changed, 34 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c
> > > b/drivers/vfio/pci/vfio_pci_intrs.c
> > > index 1f577b4..c9690af 100644
> > > --- a/drivers/vfio/pci/vfio_pci_intrs.c
> > > +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> > > @@ -312,13 +312,23 @@ static int vfio_msi_set_vector_signal(struct
> > vfio_pci_device *vdev,
> > >   int irq = msix ? vdev->msix[vector].vector : pdev->irq + vector;
> > >   char *name = msix ? "vfio-msix" : "vfio-msi";
> > >   struct eventfd_ctx *trigger;
> > > + struct msi_msg msg;
> > > + struct vfio_device *device;
> > > + uint64_t msi_addr, msi_iova;
> > >   int ret;
> > >
> > >   if (vector >= vdev->num_ctx)
> > >   return -EINVAL;
> > >
> > > + device = vfio_device_get_from_dev(&pdev->dev);
> > 
> > Have you looked at this function?  I don't think we want to be doing that
> > every time we want to poke the interrupt configuration.
> 
> I am trying to describe what I understood, a device can have many interrupts 
> and we should setup iommu only once, when called for the first time to 
> enable/setup interrupt.
> Similarly when disabling the interrupt we should iommu-unmap when called for 
> the last enabled interrupt for that device. Now with this understanding, 
> should I move this map-unmap to separate functions and call them from 
> vfio_msi_set_block() rather than in vfio_msi_set_vector_signal()

Interrupts can be setup and torn down at any time and I don't see how
one function or the other makes much difference.
vfio_device_get_from_dev() is enough overhead that the data we need
should be cached if we're going to call it with some regularity.  Maybe
vfio_iommu_driver_ops.open() should be called with a pointer to the
vfio_device... or the vfio_group.

> >  Also note that
> > IOMMU mappings don't operate on devices, but groups, so maybe we want
> > to pass the group.
> 
> Yes, it operates on group. I hesitated to add an API to get group. Do you 
> suggest to that it is ok to add API to get group from device.

No, the above suggestion is probably better.

> > 
> > > + if (device == NULL)
> > > + return -EINVAL;
> > 
> > This would be a legitimate BUG_ON(!device)
> > 
> > > +
> > >   if (vdev->ctx[vector].trigger) {
> > >   free_irq(irq, vdev->ctx[vector].trigger);
> > > + get_cached_msi_msg(irq, &msg);
> > > + msi_iova = ((u64)msg.address_hi << 32) | msg.address_lo;
> > > + vfio_device_unmap_msi(device, msi_iova, PAGE_SIZE);
> > >   kfree(vdev->ctx[vector].name);
> > >   eventfd_ctx_put(vdev->ctx[vector].trigger);
> > >   vdev->ctx[vector].trigger = NULL;
> > > @@ -346,12 +356,11 @@ static int vfio_msi_set_vector_signal(struct
> > vfio_pci_device *vdev,
> > >* cached value of the message prior to enabling.
> > >*/
> > >   if (msix) {
> > > - struct msi_msg msg;
> > > -
> > >   get_cached_msi_msg(irq, &msg);
> > >   pci_write_msi_msg(irq, &msg);
> > >   }
> > >
> > > +
> > 
> > gratuitous newline
> > 
> > >   ret = request_ir

Re: [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state

2015-10-05 Thread Alex Williamson
On Mon, 2015-10-05 at 06:00 +, Bhushan Bharat wrote:
> > -1138,6 +1156,8 @@
> > > static long vfio_iommu_type1_ioctl(void *iommu_data,
> > >   }
> > >   } else if (cmd == VFIO_IOMMU_GET_INFO) {
> > >   struct vfio_iommu_type1_info info;
> > > + struct iommu_domain_msi_maps msi_maps;
> > > + int ret;
> > >
> > >   minsz = offsetofend(struct vfio_iommu_type1_info,
> > iova_pgsizes);
> > >
> > > @@ -1149,6 +1169,18 @@ static long vfio_iommu_type1_ioctl(void
> > > *iommu_data,
> > >
> > >   info.flags = 0;
> > >
> > > + ret = vfio_domains_get_msi_maps(iommu, &msi_maps);
> > > + if (ret)
> > > + return ret;
> > 
> > And now ioctl(VFIO_IOMMU_GET_INFO) no longer works for any IOMMU
> > implementing domain_get_attr but not supporting
> > DOMAIN_ATTR_MSI_MAPPING.
> 
> With this current patch version this will get the default assumed behavior as 
> you commented on previous patch. 

How so?

+   msi_maps->automap = true;
+   msi_maps->override_automap = false;
+
+   if (domain->ops->domain_get_attr)
+   ret = domain->ops->domain_get_attr(domain, attr, data);

If domain_get_attr is implemented, but DOMAIN_ATTR_MSI_MAPPING is not,
ret should be an error code.



Re: [RFC PATCH 5/6] vfio-pci: Create iommu mapping for msi interrupt

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> An MSI-address is allocated and programmed in pcie device
> during interrupt configuration. Now for a pass-through device,
> try to create the iommu mapping for this allocted/programmed
> msi-address.  If the iommu mapping is created and the msi
> address programmed in the pcie device is different from
> msi-iova as per iommu programming then reconfigure the pci
> device to use msi-iova as msi address.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/vfio/pci/vfio_pci_intrs.c | 36 ++--
>  1 file changed, 34 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 1f577b4..c9690af 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -312,13 +312,23 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>   int irq = msix ? vdev->msix[vector].vector : pdev->irq + vector;
>   char *name = msix ? "vfio-msix" : "vfio-msi";
>   struct eventfd_ctx *trigger;
> + struct msi_msg msg;
> + struct vfio_device *device;
> + uint64_t msi_addr, msi_iova;
>   int ret;
>  
>   if (vector >= vdev->num_ctx)
>   return -EINVAL;
>  
> + device = vfio_device_get_from_dev(&pdev->dev);

Have you looked at this function?  I don't think we want to be doing
that every time we want to poke the interrupt configuration.  Also note
that IOMMU mappings don't operate on devices, but groups, so maybe we
want to pass the group.

> + if (device == NULL)
> + return -EINVAL;

This would be a legitimate BUG_ON(!device)

> +
>   if (vdev->ctx[vector].trigger) {
>   free_irq(irq, vdev->ctx[vector].trigger);
> + get_cached_msi_msg(irq, &msg);
> + msi_iova = ((u64)msg.address_hi << 32) | msg.address_lo;
> + vfio_device_unmap_msi(device, msi_iova, PAGE_SIZE);
>   kfree(vdev->ctx[vector].name);
>   eventfd_ctx_put(vdev->ctx[vector].trigger);
>   vdev->ctx[vector].trigger = NULL;
> @@ -346,12 +356,11 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>* cached value of the message prior to enabling.
>*/
>   if (msix) {
> - struct msi_msg msg;
> -
>   get_cached_msi_msg(irq, &msg);
>   pci_write_msi_msg(irq, &msg);
>   }
>  
> +

gratuitous newline

>   ret = request_irq(irq, vfio_msihandler, 0,
> vdev->ctx[vector].name, trigger);
>   if (ret) {
> @@ -360,6 +369,29 @@ static int vfio_msi_set_vector_signal(struct 
> vfio_pci_device *vdev,
>   return ret;
>   }
>  
> + /* Re-program the new-iova in pci-device in case there is
> +  * different iommu-mapping created for programmed msi-address.
> +  */
> + get_cached_msi_msg(irq, &msg);
> + msi_iova = 0;
> + msi_addr = (u64)(msg.address_hi) << 32 | (u64)(msg.address_lo);
> + ret = vfio_device_map_msi(device, msi_addr, PAGE_SIZE, &msi_iova);
> + if (ret) {
> + free_irq(irq, vdev->ctx[vector].trigger);
> + kfree(vdev->ctx[vector].name);
> + eventfd_ctx_put(trigger);
> + return ret;
> + }
> +
> + /* Reprogram only if iommu-mapped iova is different from msi-address */
> + if (msi_iova && (msi_iova != msi_addr)) {
> + msg.address_hi = (u32)(msi_iova >> 32);
> + /* Keep Lower bits from original msi message address */
> + msg.address_lo &= PAGE_MASK;
> + msg.address_lo |= (u32)(msi_iova & 0x);

Seems like you're making some assumptions here that are dependent on the
architecture and maybe the platform.

> + pci_write_msi_msg(irq, &msg);
> + }
> +
>   vdev->ctx[vector].trigger = trigger;
>  
>   return 0;





Re: [RFC PATCH 2/6] iommu: Add interface to get msi-pages mapping attributes

2015-10-02 Thread Alex Williamson
[really ought to consider cc'ing the iommu list]

On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> This APIs return the capability of automatically mapping msi-pages
> in iommu with some magic iova. Which is what currently most of
> iommu's does and is the default behaviour.
> 
> Further API returns whether iommu allows the user to define different
> iova for mai-page mapping for the domain. This is required when a msi
> capable device is directly assigned to user-space/VM and user-space/VM
> need to define a non-overlapping (from other dma-able address space)
> iova for msi-pages mapping in iommu.
> 
> This patch just define the interface and follow up patches will
> extend this interface.

This is backwards, generally you want to add the infrastructure and only
expose it once all the pieces are in place for it to work.  For
instance, patch 1/6 exposes a new userspace interface for vfio that
doesn't do anything yet.  How does the user know if it's there, *and*
works?

> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/iommu/arm-smmu.c|  3 +++
>  drivers/iommu/fsl_pamu_domain.c |  3 +++
>  drivers/iommu/iommu.c   | 14 ++
>  include/linux/iommu.h   |  9 -
>  4 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index 66a803b..a3956fb 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -1406,6 +1406,9 @@ static int arm_smmu_domain_get_attr(struct iommu_domain 
> *domain,
>   case DOMAIN_ATTR_NESTING:
>   *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
>   return 0;
> + case DOMAIN_ATTR_MSI_MAPPING:
> + /* Dummy handling added */
> + return 0;
>   default:
>   return -ENODEV;
>   }
> diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
> index 1d45293..9a94430 100644
> --- a/drivers/iommu/fsl_pamu_domain.c
> +++ b/drivers/iommu/fsl_pamu_domain.c
> @@ -856,6 +856,9 @@ static int fsl_pamu_get_domain_attr(struct iommu_domain 
> *domain,
>   case DOMAIN_ATTR_FSL_PAMUV1:
>   *(int *)data = DOMAIN_ATTR_FSL_PAMUV1;
>   break;
> + case DOMAIN_ATTR_MSI_MAPPING:
> + /* Dummy handling added */
> + break;
>   default:
>   pr_debug("Unsupported attribute type\n");
>   ret = -EINVAL;
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index d4f527e..16c2eab 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -1216,6 +1216,7 @@ int iommu_domain_get_attr(struct iommu_domain *domain,
>   bool *paging;
>   int ret = 0;
>   u32 *count;
> + struct iommu_domain_msi_maps *msi_maps;
>  
>   switch (attr) {
>   case DOMAIN_ATTR_GEOMETRY:
> @@ -1236,6 +1237,19 @@ int iommu_domain_get_attr(struct iommu_domain *domain,
>   ret = -ENODEV;
>  
>   break;
> + case DOMAIN_ATTR_MSI_MAPPING:
> + msi_maps = data;
> +
> + /* Default MSI-pages are magically mapped with some iova and
> +  * do now allow to configure with different iova.
> +  */
> + msi_maps->automap = true;
> + msi_maps->override_automap = false;

There's no magic.  I think what you're trying to express is the
difference between platforms that support MSI within the IOMMU IOVA
space and thus need explicit IOMMU mappings vs platforms where MSI
mappings either bypass the IOMMU entirely or are setup implicitly with
interrupt remapping support.

Why does it make sense to impose any sort of defaults?  If the IOMMU
driver doesn't tell us what to do, I don't think we want to assume
anything.

> +
> + if (domain->ops->domain_get_attr)
> + ret = domain->ops->domain_get_attr(domain, attr, data);
> +
> + break;
>   default:
>   if (!domain->ops->domain_get_attr)
>   return -EINVAL;
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 0546b87..6d49f3f 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -83,6 +83,13 @@ struct iommu_domain {
>   struct iommu_domain_geometry geometry;
>  };
>  
> +struct iommu_domain_msi_maps {
> + dma_addr_t base_address;
> + dma_addr_t size;

size_t?

> + bool automap;
> + bool override_automap;
> +};
> +
>  enum iommu_cap {
>   IOMMU_CAP_CACHE_COHERENCY,  /* IOMMU can enforce cache coherent DMA
>  transactions */
> @@ -111,6 +118,7 @@ enum iommu_attr {
>   DOMAIN_ATTR_FSL_PAMU_ENABLE,
>   DOMAIN_ATTR_FSL_PAMUV1,
>   DOMAIN_ATTR_NESTING,/* two stages of translation */
> + DOMAIN_ATTR_MSI_MAPPING, /* Provides MSIs mapping in iommu */
>   DOMAIN_ATTR_MAX,
>  };
>  
> @@ -167,7 +175,6 @@ struct iommu_ops {
>   int (*domain_set_windows)(struct iommu_doma

Re: [RFC PATCH 1/6] vfio: Add interface for add/del reserved iova region

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> This Patch adds the VFIO APIs to add and remove reserved iova
> regions. The reserved iova region can be used for mapping some
> specific physical address in iommu.
> 
> Currently we are planning to use this interface for adding iova
> regions for creating iommu of msi-pages. But the API are designed
> for future extension where some other physical address can be mapped.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 201 
> +++-
>  include/uapi/linux/vfio.h   |  43 +
>  2 files changed, 243 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 57d8c37..fa5d3e4 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -59,6 +59,7 @@ struct vfio_iommu {
>   struct rb_root  dma_list;
>   boolv2;
>   boolnesting;
> + struct list_headreserved_iova_list;

This alignment leads to poor packing in the structure, put it above the
bools.

>  };
>  
>  struct vfio_domain {
> @@ -77,6 +78,15 @@ struct vfio_dma {
>   int prot;   /* IOMMU_READ/WRITE */
>  };
>  
> +struct vfio_resvd_region {
> + dma_addr_t  iova;
> + size_t  size;
> + int prot;   /* IOMMU_READ/WRITE */
> + int refcount;   /* ref count of mappings */
> + uint64_tmap_paddr;  /* Mapped Physical Address */

phys_addr_t

> + struct list_head next;
> +};
> +
>  struct vfio_group {
>   struct iommu_group  *iommu_group;
>   struct list_headnext;
> @@ -106,6 +116,38 @@ static struct vfio_dma *vfio_find_dma(struct vfio_iommu 
> *iommu,
>   return NULL;
>  }
>  
> +/* This function must be called with iommu->lock held */
> +static bool vfio_overlap_with_resvd_region(struct vfio_iommu *iommu,
> +dma_addr_t start, size_t size)
> +{
> + struct vfio_resvd_region *region;
> +
> + list_for_each_entry(region, &iommu->reserved_iova_list, next) {
> + if (region->iova < start)
> + return (start - region->iova < region->size);
> + else if (start < region->iova)
> + return (region->iova - start < size);

<= on both of the return lines?

> +
> + return (region->size > 0 && size > 0);
> + }
> +
> + return false;
> +}
> +
> +/* This function must be called with iommu->lock held */
> +static
> +struct vfio_resvd_region *vfio_find_resvd_region(struct vfio_iommu *iommu,
> +  dma_addr_t start, size_t size)
> +{
> + struct vfio_resvd_region *region;
> +
> + list_for_each_entry(region, &iommu->reserved_iova_list, next)
> + if (region->iova == start && region->size == size)
> + return region;
> +
> + return NULL;
> +}
> +
>  static void vfio_link_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
>  {
>   struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> @@ -580,7 +622,8 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>  
>   mutex_lock(&iommu->lock);
>  
> - if (vfio_find_dma(iommu, iova, size)) {
> + if (vfio_find_dma(iommu, iova, size) ||
> + vfio_overlap_with_resvd_region(iommu, iova, size)) {
>   mutex_unlock(&iommu->lock);
>   return -EEXIST;
>   }
> @@ -626,6 +669,127 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu,
>   return ret;
>  }
>  
> +/* This function must be called with iommu->lock held */
> +static
> +int vfio_iommu_resvd_region_del(struct vfio_iommu *iommu,
> + dma_addr_t iova, size_t size, int prot)
> +{
> + struct vfio_resvd_region *res_region;

Have some consistency in naming; just use "region".
> +
> + res_region = vfio_find_resvd_region(iommu, iova, size);
> + /* Region should not be mapped in iommu */
> + if (res_region == NULL || res_region->map_paddr)
> + return -EINVAL;

Are these two separate errors?  !region is -EINVAL, but being mapped is
-EBUSY.

> +
> + list_del(&res_region->next);
> + kfree(res_region);
> + return 0;
> +}
> +
> +/* This function must be called with iommu->lock held */
> +static int vfio_iommu_resvd_region_add(struct vfio_iommu *iommu,
> +dma_addr_t iova, size_t size, int prot)
> +{
> + struct vfio_resvd_region *res_region;
> +
> + /* Check overlap with with dma maping and reserved regions */
> + if (vfio_find_dma(iommu, iova, size) ||
> + vfio_find_resvd_region(iommu, iova, size))
> + return -EEXIST;
> +
> + res_region = kzalloc(sizeof(*res_region), GFP_KERNEL);
> + if (res_region == NULL)
> + return -ENOMEM;
> +

Re: [RFC PATCH 6/6] arm-smmu: Allow to set iommu mapping for MSI

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> Finally ARM SMMU declare that iommu-mapping for MSI-pages are not
> set automatically and it should be set explicitly.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/iommu/arm-smmu.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index a3956fb..9d37e72 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -1401,13 +1401,18 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>   enum iommu_attr attr, void *data)
>  {
>   struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> + struct iommu_domain_msi_maps *msi_maps;
>  
>   switch (attr) {
>   case DOMAIN_ATTR_NESTING:
>   *(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
>   return 0;
>   case DOMAIN_ATTR_MSI_MAPPING:
> - /* Dummy handling added */
> + msi_maps = data;
> +
> + msi_maps->automap = false;
> + msi_maps->override_automap = true;
> +
>   return 0;
>   default:
>   return -ENODEV;

In previous discussions I understood one of the problems you were trying
to solve was having a limited number of MSI banks and while you may be
able to get isolated MSI banks for some number of users, it wasn't
unlimited and sharing may be required.  I don't see any of that
addressed in this series.

Also, the management of reserved IOVAs vs MSI addresses looks really
dubious to me.  How does your platform pick an MSI address and what are
we breaking by covertly changing it?  We seem to be masking over at the
VFIO level, where there should be lower level interfaces doing the right
thing when we configure MSI on the device.

The problem of reporting "automap" base address isn't addressed more
than leaving some unused field in iommu_domain_msi_maps.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 4/6] vfio: Add interface to iommu-map/unmap MSI pages

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> For MSI interrupts to work for a pass-through devices we need
> to have mapping of msi-pages in iommu. Now on some platforms
> (like x86) does this msi-pages mapping happens magically and in these
> case they chooses an iova which they somehow know that it will never
> overlap with guest memory. But this magic iova selection
> may not be always true for all platform (like PowerPC and ARM64).
> 
> Also on x86 platform, there is no problem as long as running a x86-guest
> on x86-host but there can be issues when running a non-x86 guest on
> x86 host or other userspace applications like (I think ODP/DPDK).
> As in these cases there can be chances that it overlaps with guest
> memory mapping.

Wow, it's amazing anything works... smoke and mirrors.

> This patch add interface to iommu-map and iommu-unmap msi-pages at
> reserved iova chosen by userspace.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/vfio/vfio.c |  52 +++
>  drivers/vfio/vfio_iommu_type1.c | 111 
>  include/linux/vfio.h|   9 +++-
>  3 files changed, 171 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 2fb29df..a817d2d 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -605,6 +605,58 @@ static int vfio_iommu_group_notifier(struct notifier_block *nb,
>   return NOTIFY_OK;
>  }
>  
> +int vfio_device_map_msi(struct vfio_device *device, uint64_t msi_addr,
> + uint32_t size, uint64_t *msi_iova)
> +{
> + struct vfio_container *container = device->group->container;
> + struct vfio_iommu_driver *driver;
> + int ret;
> +
> + /* Validate address and size */
> + if (!msi_addr || !size || !msi_iova)
> + return -EINVAL;
> +
> + down_read(&container->group_lock);
> +
> + driver = container->iommu_driver;
> + if (!driver || !driver->ops || !driver->ops->msi_map) {
> + up_read(&container->group_lock);
> + return -EINVAL;
> + }
> +
> + ret = driver->ops->msi_map(container->iommu_data,
> +msi_addr, size, msi_iova);
> +
> + up_read(&container->group_lock);
> + return ret;
> +}
> +
> +int vfio_device_unmap_msi(struct vfio_device *device, uint64_t msi_iova,
> +   uint64_t size)
> +{
> + struct vfio_container *container = device->group->container;
> + struct vfio_iommu_driver *driver;
> + int ret;
> +
> + /* Validate address and size */
> + if (!msi_iova || !size)
> + return -EINVAL;
> +
> + down_read(&container->group_lock);
> +
> + driver = container->iommu_driver;
> + if (!driver || !driver->ops || !driver->ops->msi_unmap) {
> + up_read(&container->group_lock);
> + return -EINVAL;
> + }
> +
> + ret = driver->ops->msi_unmap(container->iommu_data,
> +  msi_iova, size);
> +
> + up_read(&container->group_lock);
> + return ret;
> +}
> +
>  /**
>   * VFIO driver API
>   */
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 3315fb6..ab376c2 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -1003,12 +1003,34 @@ out_free:
>   return ret;
>  }
>  
> +static void vfio_iommu_unmap_all_reserved_regions(struct vfio_iommu *iommu)
> +{
> + struct vfio_resvd_region *region;
> + struct vfio_domain *d;
> +
> + list_for_each_entry(region, &iommu->reserved_iova_list, next) {
> + list_for_each_entry(d, &iommu->domain_list, next) {
> + if (!region->map_paddr)
> + continue;
> +
> + if (!iommu_iova_to_phys(d->domain, region->iova))
> + continue;
> +
> + iommu_unmap(d->domain, region->iova, PAGE_SIZE);

PAGE_SIZE?  Why not region->size?

> + region->map_paddr = 0;
> + cond_resched();
> + }
> + }
> +}
> +
>  static void vfio_iommu_unmap_unpin_all(struct vfio_iommu *iommu)
>  {
>   struct rb_node *node;
>  
>   while ((node = rb_first(&iommu->dma_list)))
>   vfio_remove_dma(iommu, rb_entry(node, struct vfio_dma, node));
> +
> + vfio_iommu_unmap_all_reserved_regions(iommu);
>  }
>  
>  static void vfio_iommu_type1_detach_group(void *iommu_data,
> @@ -1048,6 +1070,93 @@ done:
>   mutex_unlock(&iommu->lock);
>  }
>  
> +static int vfio_iommu_type1_msi_map(void *iommu_data, uint64_t msi_addr,
> + uint64_t size, uint64_t *msi_iova)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> + struct vfio_resvd_region *region;
> + int ret;
> +
> + mutex_lock(&iommu->lock);
> +
> + /* Do not try ceate iommu-mapping if msi reconfig not allowed */
> + if (!iommu->allow_m

Re: [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state

2015-10-02 Thread Alex Williamson
On Wed, 2015-09-30 at 20:26 +0530, Bharat Bhushan wrote:
> This patch allows the user-space to know whether msi-pages
> are automatically mapped with some magic iova or not.
> 
> Even if the msi-pages are automatically mapped, still user-space
> wants to over-ride the automatic iova selection for msi-mapping.
> For this user-space need to know whether it is allowed to change
> the automatic mapping or not and this API provides this mechanism.
> Follow up patches will provide how to over-ride this.
> 
> Signed-off-by: Bharat Bhushan 
> ---
>  drivers/vfio/vfio_iommu_type1.c | 32 
>  include/uapi/linux/vfio.h   |  3 +++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index fa5d3e4..3315fb6 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -59,6 +59,7 @@ struct vfio_iommu {
>   struct rb_root  dma_list;
>   boolv2;
>   boolnesting;
> + boolallow_msi_reconfig;
>   struct list_headreserved_iova_list;
>  };
>  
> @@ -1117,6 +1118,23 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
>   return ret;
>  }
>  
> +static
> +int vfio_domains_get_msi_maps(struct vfio_iommu *iommu,
> +   struct iommu_domain_msi_maps *msi_maps)
> +{
> + struct vfio_domain *d;
> + int ret;
> +
> + mutex_lock(&iommu->lock);
> + /* All domains have same msi-automap property, pick first */
> + d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
> + ret = iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_MAPPING,
> + msi_maps);
> + mutex_unlock(&iommu->lock);
> +
> + return ret;
> +}
> +
>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>  unsigned int cmd, unsigned long arg)
>  {
> @@ -1138,6 +1156,8 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>   }
>   } else if (cmd == VFIO_IOMMU_GET_INFO) {
>   struct vfio_iommu_type1_info info;
> + struct iommu_domain_msi_maps msi_maps;
> + int ret;
>  
>   minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
>  
> @@ -1149,6 +1169,18 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>  
>   info.flags = 0;
>  
> + ret = vfio_domains_get_msi_maps(iommu, &msi_maps);
> + if (ret)
> + return ret;

And now ioctl(VFIO_IOMMU_GET_INFO) no longer works for any IOMMU
implementing domain_get_attr but not supporting DOMAIN_ATTR_MSI_MAPPING.

> +
> + if (msi_maps.override_automap) {
> + info.flags |= VFIO_IOMMU_INFO_MSI_ALLOW_RECONFIG;
> + iommu->allow_msi_reconfig = true;
> + }
> +
> + if (msi_maps.automap)
> + info.flags |= VFIO_IOMMU_INFO_MSI_AUTOMAP;
> +
>   info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
>  
>   return copy_to_user((void __user *)arg, &info, minsz);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 1abd1a9..9998f6e 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -391,6 +391,9 @@ struct vfio_iommu_type1_info {
>   __u32   argsz;
>   __u32   flags;
>  #define VFIO_IOMMU_INFO_PGSIZES (1 << 0) /* supported page sizes info */
> +#define VFIO_IOMMU_INFO_MSI_AUTOMAP (1 << 1) /* MSI pages are auto-mapped
> +in iommu */
> +#define VFIO_IOMMU_INFO_MSI_ALLOW_RECONFIG (1 << 2) /* Allows reconfig automap*/
>   __u64   iova_pgsizes;   /* Bitmap of supported page sizes */
>  };
>  

Once again, exposing interfaces to the user before they actually do
anything is backwards.



[PATCH] kvm: Allow the Hyper-V vendor ID to be specified

2015-10-02 Thread Alex Williamson
According to Microsoft documentation, the signature in the standard
hypervisor CPUID leaf at 0x4000 identifies the Vendor ID and is
for reporting and diagnostic purposes only.  We can therefore allow
the user to change it to whatever they want, within the 12 character
limit.  Add a new hyperv-vendor-id option to the -cpu flag to allow
for this, ex:

 -cpu host,hv_time,hv_vendor_id=KeenlyKVM

Link: http://msdn.microsoft.com/library/windows/hardware/hh975392
Signed-off-by: Alex Williamson 
---
 target-i386/cpu-qom.h |1 +
 target-i386/cpu.c |1 +
 target-i386/kvm.c |   14 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/target-i386/cpu-qom.h b/target-i386/cpu-qom.h
index c35b624..6c1eaaa 100644
--- a/target-i386/cpu-qom.h
+++ b/target-i386/cpu-qom.h
@@ -88,6 +88,7 @@ typedef struct X86CPU {
 bool hyperv_vapic;
 bool hyperv_relaxed_timing;
 int hyperv_spinlock_attempts;
+char *hyperv_vendor_id;
 bool hyperv_time;
 bool hyperv_crash;
 bool check_cpuid;
diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index bd411b9..101c405 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -3122,6 +3122,7 @@ static Property x86_cpu_properties[] = {
 DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
 DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
 DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
+DEFINE_PROP_STRING("hv-vendor-id", X86CPU, hyperv_vendor_id),
 DEFINE_PROP_END_OF_LIST()
 };
 
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 7b0ba17..85aa612 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -489,7 +489,19 @@ int kvm_arch_init_vcpu(CPUState *cs)
 if (hyperv_enabled(cpu)) {
 c = &cpuid_data.entries[cpuid_i++];
 c->function = HYPERV_CPUID_VENDOR_AND_MAX_FUNCTIONS;
-memcpy(signature, "Microsoft Hv", 12);
+if (!cpu->hyperv_vendor_id) {
+memcpy(signature, "Microsoft Hv", 12);
+} else {
+size_t len = strlen(cpu->hyperv_vendor_id);
+
+if (len > 12) {
+fprintf(stderr,
"hyperv-vendor-id too long, limited to 12 characters");
+abort();
+}
+memset(signature, 0, 12);
+memcpy(signature, cpu->hyperv_vendor_id, len);
+}
 c->eax = HYPERV_CPUID_MIN;
 c->ebx = signature[0];
 c->ecx = signature[1];



Re: PCI passthrough problem

2015-10-01 Thread Alex Williamson
On Thu, 2015-10-01 at 22:38 -0400, Phil (list) wrote:
> On Thu, 2015-10-01 at 08:32 -0400, Mauricio Tavares wrote:
> > On Thu, Oct 1, 2015 at 3:27 AM, Phil (list) 
> > wrote:
> > > If this isn't the right place to ask, any pointers to the correct
> > > place
> > > are appreciated...
> > > 
> > > I'm trying to see if I can get PCI passthrough working for a video
> > > capture card (Hauppauge Colossus 1x PCIe) under a Windows XP guest
> > > (32
> > > -bit).  Things appear to be somewhat working (Windows is seeing the
> > > device, the drivers successfully installed, and device manager
> > > indicates everything is working) however when I fire up the capture
> > > application, it is not able to find the device despite Windows
> > > recognizing it (no errors, it just doesn't 'see' any installed
> > > capture
> > > devices).  There is also a secondary capture/viewer application
> > > that
> > > won't even install due to not being able to find a capture card. 
> > >  Since
> > > that wasn't the behavior when running it natively under Windows,
> > > I'm
> > > assuming that the issue is related to PCI passthrough but it's
> > > difficult to be certain since I'm not seeing any errors beyond the
> > > capture applications not being able to find the device.
> > > 
> >   I think you need to find out if the problem follows the
> > program,
> > the card, or the passthrough thingie. For instance, is there any
> > other
> > program you can run to see if it sees the card? If you can't think of
> > anything, you could run a, say, ubuntu/fedora livecd (start you vm
> > client and tell it to boot from iso) and see if it can see and use
> > the
> > card.
> > 
> 
> I only have the two capture apps that came with the card as I don't
> really use Windows for much other than this card anymore.
> 
> To try to verify that everything is fine from a hardware / Windows
> driver standpoint: I took a spare drive and performed a bare metal Win
> XP install, installed the drivers, and then the capture software (i.e.
> the same sequence and software versions as I used in the VM) and
> everything works properly (i.e. both capture applications were able to
> detect and use the capture card as expected).   Other than using a
> different hard drive, all other system hardware was identical.  So that
> would seem to rule out everything from the hardware through to the
> Windows applications and leave it back in the realm of kvm/PCI
> passthrough.
> 
> Unfortunately, no Linux drivers exist for this card (i.e. the reason
> I'm attempting to use it under Windows in a VM) so any other Linux
> distro would have about the same level of support in that it would
> recognize that the PCI card exists but then not be able to do anything
> with it.  If you're thinking that there is a problem with version of
> kvm in Debian, I would be open to trying another distro if that would
> help troubleshoot it.  I'm also reasonably comfortable navigating
> around kvm, it's the PCI passthrough functionality that is new to me.

Are you using vfio to do the device assignment or legacy KVM device
assignment?  If the latter, try the former.  Since you're using a 32-bit
Windows guest, what CPU model are you exposing to the VM?  Windows can
be rather particular about enabling MSI for devices if the processor
model seen by the VM is too old (does the device support MSI?).  '-cpu
host' might help or "host-passthrough" if using libvirt.  You can look
in /proc/interrupts on the host and see if you're getting interrupts
(non-zero count on at least one of the CPUs for the interrupt associated
with the device).  If instead the device is using INTx interrupts,
interrupt masking might be broken.  You can try using the nointxmask=1
module option to vfio-pci, to force masking at the APIC rather than the
device, but be forewarned that you'll need to make the interrupt for the
device exclusive, either by locating it in a slot where it won't share
interrupts or unloading drivers from devices sharing the interrupt line.

There's always the chance that the device is simply not compatible with
PCI device assignment.  We do rely on some degree of good behavior on
the part of the device.  Some environments also expect to find the
device behind a PCIe root port, which is not the topology we expose on
the default 440fx VM chipset.  It's possible that such devices might
work on the Q35 chipset or by placing the device behind a pci-bridge to
fool the software.  It's really hard to tell what might be wrong,
especially since the driver appears to work and only the application
fails, and it's all proprietary code within the black box of a VM.
Thanks,

Alex



Re: [PATCH 0/2] VFIO: Accept IOMMU group (PE) ID

2015-09-21 Thread Alex Williamson
On Mon, 2015-09-21 at 22:11 +1000, Gavin Shan wrote:
> On Mon, Sep 21, 2015 at 11:42:28AM +1000, David Gibson wrote:
> >On Sat, Sep 19, 2015 at 04:22:47PM +1000, David Gibson wrote:
> >> On Fri, Sep 18, 2015 at 09:47:32AM -0600, Alex Williamson wrote:
> >> > On Fri, 2015-09-18 at 16:24 +1000, Gavin Shan wrote:
> >> > > This allows to accept IOMMU group (PE) ID from the parameter from 
> >> > > userland
> >> > > when handling EEH operation so that the operation only affects the 
> >> > > target
> >> > > IOMMU group (PE). If the IOMMU group (PE) ID in the parameter from 
> >> > > userland
> >> > > is invalid, all IOMMU groups (PEs) attached to the specified container 
> >> > > are
> >> > > affected as before.
> >> > > 
> >> > > Gavin Shan (2):
> >> > >   drivers/vfio: Support EEH API revision
> >> > >   drivers/vfio: Support IOMMU group for EEH operations
> >> > > 
> >> > >  drivers/vfio/vfio_iommu_spapr_tce.c | 50 ++---
> >> > >  drivers/vfio/vfio_spapr_eeh.c   | 46 ++
> >> > >  include/linux/vfio.h| 13 +++---
> >> > >  include/uapi/linux/vfio.h   |  6 +
> >> > >  4 files changed, 93 insertions(+), 22 deletions(-)
> >> > 
> >> > This interface is terrible.  A function named foo_enabled() should
> >> > return a bool, yes or no, don't try to overload it to also return a
> >> > version.
> >> 
> >> Sorry, that one's my fault.  I suggested that approach to Gavin
> >> without really thinking it through.
> >> 
> >> 
> >> > AFAICT, patch 2/2 breaks current users by changing the offset
> >> > of the union in struct vfio_eeh_pe_err.
> >> 
> >> Yeah, this one's ugly.  We have to preserve the offset, but that means
> >> putting the group in a very awkward place.  Especially since I'm not
> >> sure if there even are any existing users of the single extant union
> >> branch.
> >> 
> >> Sigh.
> >> 
> >> > Also, we generally pass group
> >> > file descriptors rather than a group ID because we can prove the
> >> > ownership of the group through the file descriptor and we don't need to
> >> > worry about races with the group because we can hold a reference to it.
> >
> >Duh.  I finally realised the better, simpler, obvious solution.
> >
> >Rather than changing the parameter structure, we should move the
> >ioctl()s so they're on the group fd instead of the container fd.
> >
> >Obviously we need to keep it on the container fd for backwards compat,
> >but I think we should just error out if there is more than one group
> >in the container there.
> >
> >We will need a new capability too, obviously.  VFIO_EEH_GROUPFD maybe?
> >
> 
> Yeah, the patches should be marked as "RFC" actually as they're actually
> prototypes. I agree with David that the EEH ioctl commands should be routed
> through IOMMU group as I proposed long time ago. However, if we're going
> to do it now, we have to maintain two set the interfaces: one handled by
> container's ioctl() and another one is handled by IOMMU group's ioctl().
> Would it be a problem?
> 
> Actually, the code change is made based on the fact: nobody is using
> the union (struct vfio_eeh_pe_err) yet before the QEMU changes to do
> error injection gets merged by David. So I think it's fine to introduce
> another field in struct vfio_eeh_pe_op though there is gap?

We really need to get away from this mindset of assuming that we know
every user of the code and every dependency it may have.  The reality is
that this is an exposed ABI and we shouldn't break it just because we
don't know of any users.  Thanks,

Alex




Re: [PATCH v9 00/18] Add VT-d Posted-Interrupts support - including prerequisite series

2015-09-18 Thread Alex Williamson
On Fri, 2015-09-18 at 16:58 +0200, Paolo Bonzini wrote:
> 
> On 18/09/2015 16:29, Feng Wu wrote:
> > VT-d Posted-Interrupts is an enhancement to CPU side Posted-Interrupt.
> > With VT-d Posted-Interrupts enabled, external interrupts from
> > direct-assigned devices can be delivered to guests without VMM
> > intervention when guest is running in non-root mode.
> > 
> > You can find the VT-d Posted-Interrtups Spec. in the following URL:
> > http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/vt-directed-io-spec.html
> 
> Thanks.  I will squash patches 2 and 14 together, and drop patch 3.
> 
> Signed-off-bys are missing in patch 1 and 4.  The patches exist
> elsewhere in the mailing list archives, so not a big deal.  Or just
> reply to them with the S-o-b line.
> 
> Alex, can you ack the series and review patch 12?

I sent an ack for 12 separately, I got a bit lost in 16 & 17, but for
all the others that don't already have some tag from me,

Reviewed-by: Alex Williamson 

> 
> Joerg, can you ack patch 18?
> 
> Paolo
> 
> > v9:
> > - Include the whole series:
> > [01/18]: irq bypasser manager
> > [02/18] - [06/18]: Common non-architecture part for VT-d PI and ARM side 
> > forwarded irq
> > [07/18] - [18/18]: VT-d PI part
> > 
> > v8:
> > refer to the changelog in each patch
> > 
> > v7:
> > * Define two weak irq bypass callbacks:
> >   - kvm_arch_irq_bypass_start()
> >   - kvm_arch_irq_bypass_stop()
> > * Remove the x86 dummy implementation of the above two functions.
> > * Print some useful information instead of WARN_ON() when the
> >   irq bypass consumer unregistration fails.
> > * Fix an issue when calling pi_pre_block and pi_post_block.
> > 
> > v6:
> > * Rebase on 4.2.0-rc6
> > * Rebase on https://lkml.org/lkml/2015/8/6/526 and 
> > http://www.gossamer-threads.com/lists/linux/kernel/2235623
> > * Make the add_consumer and del_consumer callbacks static
> > * Remove pointless INIT_LIST_HEAD to 'vdev->ctx[vector].producer.node)'
> > * Use dev_info instead of WARN_ON() when irq_bypass_register_producer fails
> > * Remove optional dummy callbacks for irq producer
> > 
> > v4:
> > * For lowest-priority interrupt, only support single-CPU destination
> > interrupts at the current stage, more common lowest priority support
> > will be added later.
> > * Accoring to Marcelo's suggestion, when vCPU is blocked, we handle
> > the posted-interrupts in the HLT emulation path.
> > * Some small changes (coding style, typo, add some code comments)
> > 
> > v3:
> > * Adjust the Posted-interrupts Descriptor updating logic when vCPU is
> >   preempted or blocked.
> > * KVM_DEV_VFIO_DEVICE_POSTING_IRQ --> KVM_DEV_VFIO_DEVICE_POST_IRQ
> > * __KVM_HAVE_ARCH_KVM_VFIO_POSTING --> __KVM_HAVE_ARCH_KVM_VFIO_POST
> > * Add KVM_DEV_VFIO_DEVICE_UNPOST_IRQ attribute for VFIO irq, which
> >   can be used to change back to remapping mode.
> > * Fix typo
> > 
> > v2:
> > * Use VFIO framework to enable this feature, the VFIO part of this series is
> >   base on Eric's patch "[PATCH v3 0/8] KVM-VFIO IRQ forward control"
> > * Rebase this patchset on 
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git,
> >   then revise some irq logic based on the new hierarchy irqdomain patches 
> > provided
> >   by Jiang Liu 
> > 
> > 
> > *** BLURB HERE ***
> > 
> > Alex Williamson (1):
> >   virt: IRQ bypass manager
> > 
> > Eric Auger (4):
> >   KVM: arm/arm64: select IRQ_BYPASS_MANAGER
> >   KVM: create kvm_irqfd.h
> >   KVM: introduce kvm_arch functions for IRQ bypass
> >   KVM: eventfd: add irq bypass consumer management
> > 
> > Feng Wu (13):
> >   KVM: x86: select IRQ_BYPASS_MANAGER
> >   KVM: Extend struct pi_desc for VT-d Posted-Interrupts
> >   KVM: Add some helper functions for Posted-Interrupts
> >   KVM: Define a new interface kvm_intr_is_single_vcpu()
> >   KVM: Make struct kvm_irq_routing_table accessible
> >   KVM: make kvm_set_msi_irq() public
> >   vfio: Register/unregister irq_bypass_producer
> >   KVM: x86: Update IRTE for posted-interrupts
> >   KVM: Implement IRQ bypass consumer callbacks for x86
> >   KVM: Add an arch specific hooks in 'struct kvm_kernel_irqfd'
> >   KVM: Update Posted-Interrupts Descriptor when vCPU is preempted
> >   KVM: Update Posted-Interrupts Descriptor when vCPU is blocked
> >   iommu/vt-d: Add a command line parameter for VT-d posted-interrupts
> > 

Re: [PATCH v9 12/18] vfio: Register/unregister irq_bypass_producer

2015-09-18 Thread Alex Williamson
On Fri, 2015-09-18 at 22:29 +0800, Feng Wu wrote:
> This patch adds the registration/unregistration of an
> irq_bypass_producer for MSI/MSIx on vfio pci devices.
> 
> Signed-off-by: Feng Wu 

One nit: Paolo, could you please fix the spelling of "registration" in the
dev_info, otherwise:

Acked-by: Alex Williamson 


> ---
> v8:
> - Merge "[PATCH v7 08/17] vfio: Select IRQ_BYPASS_MANAGER for vfio PCI 
> devices"
>   into this patch.
> 
> v6:
> - Make the add_consumer and del_consumer callbacks static
> - Remove pointless INIT_LIST_HEAD to 'vdev->ctx[vector].producer.node)'
> - Use dev_info instead of WARN_ON() when irq_bypass_register_producer fails
> - Remove optional dummy callbacks for irq producer
> 
>  drivers/vfio/pci/Kconfig| 1 +
>  drivers/vfio/pci/vfio_pci_intrs.c   | 9 +
>  drivers/vfio/pci/vfio_pci_private.h | 2 ++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
> index 579d83b..02912f1 100644
> --- a/drivers/vfio/pci/Kconfig
> +++ b/drivers/vfio/pci/Kconfig
> @@ -2,6 +2,7 @@ config VFIO_PCI
>   tristate "VFIO support for PCI devices"
>   depends on VFIO && PCI && EVENTFD
>   select VFIO_VIRQFD
> + select IRQ_BYPASS_MANAGER
>   help
> Support for the PCI VFIO bus driver.  This is required to make
> use of PCI drivers using the VFIO framework.
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
> index 1f577b4..c65299d 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -319,6 +319,7 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
>  
>   if (vdev->ctx[vector].trigger) {
>   free_irq(irq, vdev->ctx[vector].trigger);
> + irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
>   kfree(vdev->ctx[vector].name);
>   eventfd_ctx_put(vdev->ctx[vector].trigger);
>   vdev->ctx[vector].trigger = NULL;
> @@ -360,6 +361,14 @@ static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
>   return ret;
>   }
>  
> + vdev->ctx[vector].producer.token = trigger;
> + vdev->ctx[vector].producer.irq = irq;
> + ret = irq_bypass_register_producer(&vdev->ctx[vector].producer);
> + if (unlikely(ret))
> + dev_info(&pdev->dev,
> + "irq bypass producer (token %p) registeration fails: %d\n",
> + vdev->ctx[vector].producer.token, ret);
> +
>   vdev->ctx[vector].trigger = trigger;
>  
>   return 0;
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index ae0e1b4..0e7394f 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -13,6 +13,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #ifndef VFIO_PCI_PRIVATE_H
>  #define VFIO_PCI_PRIVATE_H
> @@ -29,6 +30,7 @@ struct vfio_pci_irq_ctx {
>   struct virqfd   *mask;
>   char*name;
>   boolmasked;
> + struct irq_bypass_producer  producer;
>  };
>  
>  struct vfio_pci_device {





Re: [PATCH 0/2] VFIO: Accept IOMMU group (PE) ID

2015-09-18 Thread Alex Williamson
On Fri, 2015-09-18 at 16:24 +1000, Gavin Shan wrote:
> This allows to accept IOMMU group (PE) ID from the parameter from userland
> when handling EEH operation so that the operation only affects the target
> IOMMU group (PE). If the IOMMU group (PE) ID in the parameter from userland
> is invalid, all IOMMU groups (PEs) attached to the specified container are
> affected as before.
> 
> Gavin Shan (2):
>   drivers/vfio: Support EEH API revision
>   drivers/vfio: Support IOMMU group for EEH operations
> 
>  drivers/vfio/vfio_iommu_spapr_tce.c | 50 ++---
>  drivers/vfio/vfio_spapr_eeh.c   | 46 ++
>  include/linux/vfio.h| 13 +++---
>  include/uapi/linux/vfio.h   |  6 +
>  4 files changed, 93 insertions(+), 22 deletions(-)

This interface is terrible.  A function named foo_enabled() should
return a bool, yes or no, don't try to overload it to also return a
version.  AFAICT, patch 2/2 breaks current users by changing the offset
of the union in struct vfio_eeh_pe_err.  Also, we generally pass group
file descriptors rather than a group ID because we can prove the
ownership of the group through the file descriptor and we don't need to
worry about races with the group because we can hold a reference to it.



Re: [Qemu-devel] [PATCH 1/2] target-i386: disable LINT0 after reset

2015-09-15 Thread Alex Williamson
On Mon, 2015-04-13 at 02:32 +0300, Nadav Amit wrote:
> Due to an old SeaBIOS bug, QEMU re-enables LINT0 after reset. This bug is
> long gone and therefore this hack is no longer needed.  Since it violates
> the specification, it is removed.
> 
> Signed-off-by: Nadav Amit 
> ---
>  hw/intc/apic_common.c | 9 -
>  1 file changed, 9 deletions(-)

Please see bug: https://bugs.launchpad.net/qemu/+bug/1488363

Is this bug perhaps not as long gone as we thought, or is there
something else going on here?  Thanks,

Alex

> diff --git a/hw/intc/apic_common.c b/hw/intc/apic_common.c
> index 042e960..d38d24b 100644
> --- a/hw/intc/apic_common.c
> +++ b/hw/intc/apic_common.c
> @@ -243,15 +243,6 @@ static void apic_reset_common(DeviceState *dev)
>  info->vapic_base_update(s);
>  
>  apic_init_reset(dev);
> -
> -if (bsp) {
> -/*
> - * LINT0 delivery mode on CPU #0 is set to ExtInt at initialization
> - * time typically by BIOS, so PIC interrupt can be delivered to the
> - * processor when local APIC is enabled.
> - */
> -s->lvt[APIC_LVT_LINT0] = 0x700;
> -}
>  }
>  
>  /* This function is only used for old state version 1 and 2 */





Re: [RFC PATCH] vfio/pci: Use kernel VPD access functions

2015-09-14 Thread Alex Williamson
On Sat, 2015-09-12 at 01:11 +, Rustad, Mark D wrote:
> Alex,
> 
> > On Sep 11, 2015, at 11:16 AM, Alex Williamson  
> > wrote:
> > 
> > RFC - Is this something we should do?
> 
> Superficially this looks pretty good. I need to think harder to be sure of 
> the details.
> 
> > Should we consider providing
> > similar emulation through PCI sysfs to allow lspci to also make use
> > of the vpd interfaces?
> 
> It looks to me like lspci already uses the vpd attribute in sysfs to access 
> VPD, so maybe nothing more than this is needed. No doubt lspci can be coerced 
> into accessing VPD directly, but is that really worth going after? I'm not so 
> sure.
> 
> An strace of lspci accessing a device with VPD shows me:
> 
> write(1, "\tCapabilities: [e0] Vital Produc"..., 39   Capabilities: [e0] 
> Vital Product Data
> ) = 39
> open("/sys/bus/pci/devices/:02:00.0/vpd", O_RDONLY) = 4
> ^^^ accesses to this should be safe, I think
> 
> pread(4, "\202", 1, 0)  = 1
> pread(4, "\10\0", 2, 1) = 2
> pread(4, "PVL Dell", 8, 3)  = 8
> write(1, "\t\tProduct Name: PVL Dell\n", 25   Product Name: PVL Dell
> ) = 25
> 
> and so forth.

Oh good, so aside from some rogue admin poking around with setpci, access
through pci-sysfs is hopefully not an issue.  Thanks for looking into it.

Alex



[PATCH] vfio: Whitelist PCI bridges

2015-09-11 Thread Alex Williamson
When determining whether a group is viable, we already allow devices
bound to pcieport.  Generalize this to include any PCI bridge device.

Signed-off-by: Alex Williamson 
---
 drivers/vfio/vfio.c |   31 +--
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 563c510..1c0f98c 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -438,16 +439,33 @@ static struct vfio_device *vfio_group_get_device(struct vfio_group *group,
 }
 
 /*
- * Whitelist some drivers that we know are safe (no dma) or just sit on
- * a device.  It's not always practical to leave a device within a group
- * driverless as it could get re-bound to something unsafe.
+ * Some drivers, like pci-stub, are only used to prevent other drivers from
+ * claiming a device and are therefore perfectly legitimate for a user owned
+ * group.  The pci-stub driver has no dependencies on DMA or the IOVA mapping
+ * of the device, but it does prevent the user from having direct access to
+ * the device, which is useful in some circumstances.
+ *
+ * We also assume that we can include PCI interconnect devices, ie. bridges.
+ * IOMMU grouping on PCI necessitates that if we lack isolation on a bridge
+ * then all of the downstream devices will be part of the same IOMMU group as
+ * the bridge.  Thus, if placing the bridge into the user owned IOVA space
+ * breaks anything, it only does so for user owned devices downstream.  Note
+ * that error notification via MSI can be affected for platforms that handle
+ * MSI within the same IOVA space as DMA.
  */
-static const char * const vfio_driver_whitelist[] = { "pci-stub", "pcieport" };
+static const char * const vfio_driver_whitelist[] = { "pci-stub" };
 
-static bool vfio_whitelisted_driver(struct device_driver *drv)
+static bool vfio_dev_whitelisted(struct device *dev, struct device_driver *drv)
 {
int i;
 
+   if (dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+
+   if (pdev->hdr_type != PCI_HEADER_TYPE_NORMAL)
+   return true;
+   }
+
for (i = 0; i < ARRAY_SIZE(vfio_driver_whitelist); i++) {
if (!strcmp(drv->name, vfio_driver_whitelist[i]))
return true;
@@ -462,6 +480,7 @@ static bool vfio_whitelisted_driver(struct device_driver *drv)
  *  - driver-less
  *  - bound to a vfio driver
  *  - bound to a whitelisted driver
+ *  - a PCI interconnect device
  *
  * We use two methods to determine whether a device is bound to a vfio
  * driver.  The first is to test whether the device exists in the vfio
@@ -486,7 +505,7 @@ static int vfio_dev_viable(struct device *dev, void *data)
}
mutex_unlock(&group->unbound_lock);
 
-   if (!ret || !drv || vfio_whitelisted_driver(drv))
+   if (!ret || !drv || vfio_dev_whitelisted(dev, drv))
return 0;
 
device = vfio_group_get_device(group, dev);



[RFC PATCH] vfio/pci: Use kernel VPD access functions

2015-09-11 Thread Alex Williamson
The PCI VPD capability operates on a set of window registers in PCI
config space.  Writing to the address register triggers either a read
or write, depending on the setting of the PCI_VPD_ADDR_F bit within
the address register.  The data register provides either the source
for writes or the target for reads.

This model is susceptible to being broken by concurrent access, for
which the kernel has adopted a set of access functions to serialize
these registers.  Additionally, commits like 932c435caba8 ("PCI: Add
dev_flags bit to access VPD through function 0") and 7aa6ca4d39ed
("PCI: Add VPD function 0 quirk for Intel Ethernet devices") indicate
that VPD registers can be shared between functions on multifunction
devices creating dependencies between otherwise independent devices.

Fortunately it's quite easy to emulate the VPD registers, simply
storing copies of the address and data registers in memory and
triggering a VPD read or write on writes to the address register.
This allows vfio users to avoid seeing spurious register changes from
accesses on other devices and enables the use of shared quirks in the
host kernel.  We can theoretically still race with access through
sysfs, but the window of opportunity is much smaller.

Signed-off-by: Alex Williamson 
---

RFC - Is this something we should do?  Should we consider providing
similar emulation through PCI sysfs to allow lspci to also make use
of the vpd interfaces?

 drivers/vfio/pci/vfio_pci_config.c |   70 +++-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c
index ff75ca3..a8657ef 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -671,6 +671,73 @@ static int __init init_pci_cap_pm_perm(struct perm_bits *perm)
return 0;
 }
 
+static int vfio_vpd_config_write(struct vfio_pci_device *vdev, int pos,
+int count, struct perm_bits *perm,
+int offset, __le32 val)
+{
+   struct pci_dev *pdev = vdev->pdev;
+   __le16 *paddr = (__le16 *)(vdev->vconfig + pos - offset + PCI_VPD_ADDR);
+   __le32 *pdata = (__le32 *)(vdev->vconfig + pos - offset + PCI_VPD_DATA);
+   u16 addr;
+   u32 data;
+
+   /*
+* Write through to emulation.  If the write includes the upper byte
+* of PCI_VPD_ADDR, then the PCI_VPD_ADDR_F bit is written and we
+* have work to do.
+*/
+   count = vfio_default_config_write(vdev, pos, count, perm, offset, val);
+   if (count < 0 || offset > PCI_VPD_ADDR + 1 ||
+   offset + count <= PCI_VPD_ADDR + 1)
+   return count;
+
+   addr = le16_to_cpu(*paddr);
+
+   if (addr & PCI_VPD_ADDR_F) {
+   data = le32_to_cpu(*pdata);
+   if (pci_write_vpd(pdev, addr & ~PCI_VPD_ADDR_F, 4, &data) != 4)
+   return count;
+   } else {
+   if (pci_read_vpd(pdev, addr, 4, &data) != 4)
+   return count;
+   *pdata = cpu_to_le32(data);
+   }
+
+   /*
+* Toggle PCI_VPD_ADDR_F in the emulated PCI_VPD_ADDR register to
+* signal completion.  If an error occurs above, we assume that not
+* toggling this bit will induce a driver timeout.
+*/
+   addr ^= PCI_VPD_ADDR_F;
+   *paddr = cpu_to_le16(addr);
+
+   return count;
+}
+
+/* Permissions for Vital Product Data capability */
+static int __init init_pci_cap_vpd_perm(struct perm_bits *perm)
+{
+   if (alloc_perm_bits(perm, pci_cap_length[PCI_CAP_ID_VPD]))
+   return -ENOMEM;
+
+   perm->writefn = vfio_vpd_config_write;
+
+   /*
+* We always virtualize the next field so we can remove
+* capabilities from the chain if we want to.
+*/
+   p_setb(perm, PCI_CAP_LIST_NEXT, (u8)ALL_VIRT, NO_WRITE);
+
+   /*
+* Both the address and data registers are virtualized to
+* enable access through the pci_vpd_read/write functions
+*/
+   p_setw(perm, PCI_VPD_ADDR, (u16)ALL_VIRT, (u16)ALL_WRITE);
+   p_setd(perm, PCI_VPD_DATA, ALL_VIRT, ALL_WRITE);
+
+   return 0;
+}
+
 /* Permissions for PCI-X capability */
 static int __init init_pci_cap_pcix_perm(struct perm_bits *perm)
 {
@@ -790,6 +857,7 @@ void vfio_pci_uninit_perm_bits(void)
free_perm_bits(&cap_perms[PCI_CAP_ID_BASIC]);
 
free_perm_bits(&cap_perms[PCI_CAP_ID_PM]);
+   free_perm_bits(&cap_perms[PCI_CAP_ID_VPD]);
free_perm_bits(&cap_perms[PCI_CAP_ID_PCIX]);
free_perm_bits(&cap_perms[PCI_CAP_ID_EXP]);
free_perm_bits(&cap_perms[PCI_CAP_ID_AF]);
@@ -807,7 +875,7 @@ int __init vfio_pci_init_perm_bits(void)
 
/* Capabilities */
ret |= init_pci_cap_pm_perm(&cap_perms[PCI_CAP_ID_PM]);
-   cap_p
