Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-12-01 Thread Alexey Kardashevskiy
On 29/11/11 16:48, Alex Williamson wrote:
> On Tue, 2011-11-29 at 15:34 +1100, Alexey Kardashevskiy wrote:
>> Hi!
>>
>> On 29/11/11 14:46, Alex Williamson wrote:
>>> On Tue, 2011-11-29 at 12:52 +1100, Alexey Kardashevskiy wrote:
 Hi!

 I tried (successfully) to run it on POWER and while doing that I found 
 some issues. I'll try to
 explain them in separate mails.
>>>
>>> Great!
>>>
 IOMMU domain setup. On POWER, the linux drivers capable of DMA transfer 
 want to know
 a DMA window, i.e. its start and length in the PHB address space. This 
 comes from
 hardware. On X86 (correct me if I am wrong), every device driver in the guest 
 allocates
 memory from the same pool.
>>>
>>> Yes, current VT-d/AMD-Vi provide independent IOVA spaces for each
>>> device.
>>>
  On POWER, device drivers get DMA window and allocate pages
 for DMA within this window. In the case of VFIO, that means that QEMU has 
 to
 preallocate this DMA window before running a guest, pass it to a guest (via
 device tree) and then a guest tells the host what pages are taken/released 
 by
 calling map/unmap callbacks of iommu_ops. Deallocation is made in a device 
 detach
 callback as I did not want to add more ioctls.
 So, there are 2 patches:

 - new VFIO_IOMMU_SETUP ioctl introduced which allocates a DMA window via 
 IOMMU API on
 POWER.
 btw do we need an additional capability bit for it?

 KERNEL PATCH:

 diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
 index 10615ad..a882e08 100644
 --- a/drivers/iommu/iommu.c
 +++ b/drivers/iommu/iommu.c
 @@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned 
 int *groupid)
return -ENODEV;
  }
  EXPORT_SYMBOL_GPL(iommu_device_group);
 +
 +int iommu_setup(struct iommu_domain *domain,
 +  size_t requested_size, size_t *allocated_size,
 +  phys_addr_t *start_address)
 +{
 +  return domain->ops->setup(domain, requested_size, allocated_size,
 +  start_address);
 +}
 +EXPORT_SYMBOL_GPL(iommu_setup);
>>>
>>> requested_size seems redundant both here and in struct vfio_setup.  We
>>> can just pre-load size/start with desired values.  I assume x86 IOMMUs
>>> would ignore requested values and return start = 0 and size = hardware
>>> decoder address bits.  The IOMMU API currently allows:
>>>
>>> iommu_domain_alloc
>>> [iommu_attach_device]
>>> [iommu_map]
>>> [iommu_unmap]
>>> [iommu_detach_device]
>>> iommu_domain_free
>>>
>>> where everything between alloc and free can be called in any order.  How
>>> does setup fit into that model?
>>
>> This is why I posted a QEMU patch :)
> 
> Right, but qemu/vfio is by no means the de facto standard of how one must
> use the IOMMU API.  KVM currently orders the map vs attach differently.
> When is it valid to call setup when factoring in hot attached/detached
> devices?
> 


>>> For this it seems like we'd almost want
>>> to combine alloc, setup, and the first attach into a single call (ie.
>>> create a domain with this initial device and these parameters), then
>>> subsequent attaches would only allow compatible devices.
>>
>>
>> Not exactly. This setup is more likely to get combined with domain alloc 
>> only.
> 
> At domain_alloc we don't have any association to actual hardware other
> than a bus_type, how would you know which iommu is being setup?


Yes. This is exactly the problem. We do have preallocated PEs (aka groups) on
POWER but cannot use this information during setup until the first device is
attached.

Generally speaking, we should not be adding devices to an IOMMU domain; we
should be adding groups instead, as a group is the smallest entity the IOMMU
can handle. The API, at least, should be as simple as the idea it implements.

For example, we could implement a tool to put devices into a group (at the
moment on POWER it would check that a domain has no more than a single group
in it). We could pass this group ID to QEMU instead of passing "-device
vfio-pci ..." (as we still must pass _all_ devices of a group to QEMU).

Sure, we'll have to change VFIO to tell QEMU which devices are included in
which group; that is quite easy to do.

It is a tree: domain -> group -> device. Let's reflect it in the API.
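
To make this concrete, here is a hypothetical sketch of a group-level attach
API; the names struct iommu_group, iommu_attach_group() and
iommu_detach_group() are invented for illustration and are not part of the
posted patches:

/*
 * Hypothetical sketch: domains contain groups, groups contain the
 * devices the IOMMU cannot tell apart.  Callers attach whole groups,
 * never individual devices.
 */
struct iommu_group;     /* opaque handle owned by the IOMMU driver */

int iommu_attach_group(struct iommu_domain *domain,
                       struct iommu_group *group);
void iommu_detach_group(struct iommu_domain *domain,
                        struct iommu_group *group);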




>> On POWER, we have iommu_table per DMA window which may or may not be shared
>> between devices. At the moment there is one window per PCIe _device_ (so 
>> multiple
>> functions of multiport network adapter share one DMA window) and one window 
>> for
>> all the devices behind PCIe-to-PCI bridge. It is more or less so.
>>
>>
>>> I'm a little confused though, is the window determined by hardware or is
>>> it configurable via requested_size?
>>
>>
>> The window parameters are calculated by software and then written to 
>> hardware so
>> hardware does filtering and prevents bad devices from memory corruption.

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-29 Thread Alex Williamson
On Mon, 2011-11-28 at 20:54 -0700, Alex Williamson wrote:
> On Tue, 2011-11-29 at 13:01 +1100, Alexey Kardashevskiy wrote:
> > Hi all,
> > 
> > Another problem I hit on POWER - MSI interrupts allocation. The existing
> > VFIO does not expect a PHB to support fewer interrupts than a device might
> > request. In my case, PHB's limit is 8 interrupts while my test card (10Gb
> > ethernet CXGB3) wants 9. Below are the patches to demonstrate the idea.
> 
> Seems reasonable.  I assume we'd need similar for vfio_pci_setup_msi,
> though I haven't seen anything use more than a single MSI interrupt.
> Thanks,

Hmm, I'm thinking maybe we should reflect the pci_enable_msix() behavior
directly and let the caller decide if they want to make do with fewer
MSI vectors.  So the ioctl will return <0: error, 0:success, >0: number
of MSIs we think we can setup, without actually setting them.  Sound
good?
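
In caller terms, the convention would look like this sketch (a fragment, not
actual QEMU code; how the shrunk vector count is encoded in the fds argument
depends on the ioctl layout and is left out):

int ret = ioctl(devfd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds);
if (ret < 0) {
        /* hard failure: tear down the eventfds */
} else if (ret > 0) {
        /* only 'ret' vectors are available: shrink the request and
         * issue the ioctl again, or give up */
} else {
        /* success: all requested vectors were set up */
}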

BTW, github now has updated trees:

git://github.com/awilliam/linux-vfio.git vfio-next-2029
git://github.com/awilliam/qemu-vfio.git vfio-ng

Thanks,

Alex

> > KERNEL patch:
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> > b/drivers/vfio/pci/vfio_pci_intrs.c
> > index 7d45c6b..d44b9bf 100644
> > --- a/drivers/vfio/pci/vfio_pci_intrs.c
> > +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> > @@ -458,17 +458,32 @@ int vfio_pci_setup_msix(struct vfio_pci_device *vdev, 
> > int nvec, int __user *inta
> > vdev->msix[i].entry = i;
> > vdev->ev_msix[i] = ctx;
> > }
> > -   if (!ret)
> > +   if (!ret) {
> > ret = pci_enable_msix(pdev, vdev->msix, nvec);
> > +   /*
> > +  The kernel was unable to allocate the requested number
> > +  of IRQs and returned the number available.
> > +*/
> > +   if (0 < ret) {
> > +   ret = pci_enable_msix(pdev, vdev->msix, ret);
> > +   }
> > +   }
> > vdev->msix_nvec = 0;
> > -   for (i = 0; i < nvec && !ret; i++) {
> > -   ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> > - "vfio", vdev->ev_msix[i]);
> > -   if (ret)
> > -   break;
> > -   vdev->msix_nvec = i+1;
> > +   if (0 == ret) {
> > +   vdev->msix_nvec = 0;
> > +   ret = 0;
> > +   for (i = 0; i < nvec && !ret; i++) {
> > +   ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> > +   "vfio", vdev->ev_msix[i]);
> > +   if (ret)
> > +   break;
> > +   vdev->msix_nvec = i+1;
> > +   }
> > +   if ((0 == vdev->msix_nvec) && (0 != ret))
> > +   vfio_pci_drop_msix(vdev);
> > +   else
> > +   ret = vdev->msix_nvec;
> > }
> > -   if (ret)
> > -   vfio_pci_drop_msix(vdev);
> > +
> > return ret;
> >  }
> > 
> > === end ===
> > 
> > 
> > QEMU patch:
> > 
> > diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> > index 020961a..980eec7 100644
> > --- a/hw/vfio_pci.c
> > +++ b/hw/vfio_pci.c
> > @@ -341,7 +341,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
> >  }
> >  }
> > 
> > -if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds)) {
> > +ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds);
> > +if (0 > ret) {
> >  fprintf(stderr, "vfio: Error: Failed to setup MSI/X fds %s\n",
> >  strerror(errno));
> >  for (i = 0; i < vdev->nr_vectors; i++) {
> > @@ -355,6 +356,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
> >  qemu_free(vdev->msi_vectors);
> >  vdev->nr_vectors = 0;
> >  return;
> > +} else if (0 < ret) {
> > +vdev->nr_vectors = ret;
> >  }
> > 
> >  vdev->interrupt = msix ? INT_MSIX : INT_MSI;
> > 
> > 
> > === end ===
> 





Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-28 Thread Alex Williamson
On Tue, 2011-11-29 at 15:34 +1100, Alexey Kardashevskiy wrote:
> Hi!
> 
> On 29/11/11 14:46, Alex Williamson wrote:
> > On Tue, 2011-11-29 at 12:52 +1100, Alexey Kardashevskiy wrote:
> >> Hi!
> >>
> >> I tried (successfully) to run it on POWER and while doing that I found 
> >> some issues. I'll try to
> >> explain them in separate mails.
> > 
> > Great!
> > 
> >> IOMMU domain setup. On POWER, the linux drivers capable of DMA transfer 
> >> want to know
> >> a DMA window, i.e. its start and length in the PHB address space. This 
> >> comes from
> >> hardware. On X86 (correct me if I am wrong), every device driver in the guest 
> >> allocates
> >> memory from the same pool.
> > 
> > Yes, current VT-d/AMD-Vi provide independent IOVA spaces for each
> > device.
> > 
> >>  On POWER, device drivers get DMA window and allocate pages
> >> for DMA within this window. In the case of VFIO, that means that QEMU has 
> >> to
> >> preallocate this DMA window before running a guest, pass it to a guest (via
> >> device tree) and then a guest tells the host what pages are taken/released 
> >> by
> >> calling map/unmap callbacks of iommu_ops. Deallocation is made in a device 
> >> detach
> >> callback as I did not want to add more ioctls.
> >> So, there are 2 patches:
> >>
> >> - new VFIO_IOMMU_SETUP ioctl introduced which allocates a DMA window via 
> >> IOMMU API on
> >> POWER.
> >> btw do we need an additional capability bit for it?
> >>
> >> KERNEL PATCH:
> >>
> >> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> >> index 10615ad..a882e08 100644
> >> --- a/drivers/iommu/iommu.c
> >> +++ b/drivers/iommu/iommu.c
> >> @@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned 
> >> int *groupid)
> >>return -ENODEV;
> >>  }
> >>  EXPORT_SYMBOL_GPL(iommu_device_group);
> >> +
> >> +int iommu_setup(struct iommu_domain *domain,
> >> +  size_t requested_size, size_t *allocated_size,
> >> +  phys_addr_t *start_address)
> >> +{
> >> +  return domain->ops->setup(domain, requested_size, allocated_size,
> >> +  start_address);
> >> +}
> >> +EXPORT_SYMBOL_GPL(iommu_setup);
> > 
> > requested_size seems redundant both here and in struct vfio_setup.  We
> > can just pre-load size/start with desired values.  I assume x86 IOMMUs
> > would ignore requested values and return start = 0 and size = hardware
> > decoder address bits.  The IOMMU API currently allows:
> > 
> > iommu_domain_alloc
> > [iommu_attach_device]
> > [iommu_map]
> > [iommu_unmap]
> > [iommu_detach_device]
> > iommu_domain_free
> > 
> > where everything between alloc and free can be called in any order.  How
> > does setup fit into that model?
> 
> This is why I posted a QEMU patch :)

Right, but qemu/vfio is by no means the de facto standard of how one must
use the IOMMU API.  KVM currently orders the map vs attach differently.
When is it valid to call setup when factoring in hot attached/detached
devices?

> > For this it seems like we'd almost want
> > to combine alloc, setup, and the first attach into a single call (ie.
> > create a domain with this initial device and these parameters), then
> > subsequent attaches would only allow compatible devices.
> 
> 
> Not exactly. This setup is more likely to get combined with domain alloc only.

At domain_alloc we don't have any association to actual hardware other
than a bus_type, how would you know which iommu is being setup?

> On POWER, we have iommu_table per DMA window which may or may not be shared
> between devices. At the moment there is one window per PCIe _device_ (so 
> multiple
> functions of multiport network adapter share one DMA window) and one window 
> for
> all the devices behind PCIe-to-PCI bridge. It is more or less so.
> 
> 
> > I'm a little confused though, is the window determined by hardware or is
> > it configurable via requested_size?
> 
> 
> The window parameters are calculated by software and then written to hardware 
> so
> hardware does filtering and prevents bad devices from memory corruption.
> 
> 
> > David had suggested that we could
> > implement a VFIO_IOMMU_GET_INFO ioctl that returns something like:
> > 
> > struct vfio_iommu_info {
> > __u32   argsz;
> > __u32   flags;
> > __u64   iova_max;   /* Maximum IOVA address */
> > __u64   iova_min;   /* Minimum IOVA address */
> > __u64   pgsize_bitmap;  /* Bitmap of supported page sizes */
> > };
> > 
> > The thought being a TBD IOMMU API interface reports the hardware
> > determined IOVA range and we could fudge it on x86 for now reporting
> > 0/~0.  Maybe we should replace iova_max/iova_min with
> > iova_base/iova_size and allow the caller to request a size by setting
> > iova_size and a matching bit in the flags.
> 
> 
> No, we need some sort of SET_INFO, not GET, as we want QEMU to decide on a DMA
> window size.

Right, GET_INFO is no longer the right name if we're really passing in
requests.  Maybe it becomes VFIO_IOMMU_SETUP as you 

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-28 Thread Alexey Kardashevskiy
Hi!

On 29/11/11 14:46, Alex Williamson wrote:
> On Tue, 2011-11-29 at 12:52 +1100, Alexey Kardashevskiy wrote:
>> Hi!
>>
>> I tried (successfully) to run it on POWER and while doing that I found some 
>> issues. I'll try to
>> explain them in separate mails.
> 
> Great!
> 
>> IOMMU domain setup. On POWER, the linux drivers capable of DMA transfer want 
>> to know
>> a DMA window, i.e. its start and length in the PHB address space. This comes 
>> from
>> hardware. On X86 (correct me if I am wrong), every device driver in the guest 
>> allocates
>> memory from the same pool.
> 
> Yes, current VT-d/AMD-Vi provide independent IOVA spaces for each
> device.
> 
>>  On POWER, device drivers get DMA window and allocate pages
>> for DMA within this window. In the case of VFIO, that means that QEMU has to
>> preallocate this DMA window before running a guest, pass it to a guest (via
>> device tree) and then a guest tells the host what pages are taken/released by
>> calling map/unmap callbacks of iommu_ops. Deallocation is made in a device 
>> detach
>> callback as I did not want to add more ioctls.
>> So, there are 2 patches:
>>
>> - new VFIO_IOMMU_SETUP ioctl introduced which allocates a DMA window via 
>> IOMMU API on
>> POWER.
>> btw do we need an additional capability bit for it?
>>
>> KERNEL PATCH:
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 10615ad..a882e08 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned int 
>> *groupid)
>>  return -ENODEV;
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_device_group);
>> +
>> +int iommu_setup(struct iommu_domain *domain,
>> +size_t requested_size, size_t *allocated_size,
>> +phys_addr_t *start_address)
>> +{
>> +return domain->ops->setup(domain, requested_size, allocated_size,
>> +start_address);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_setup);
> 
> requested_size seems redundant both here and in struct vfio_setup.  We
> can just pre-load size/start with desired values.  I assume x86 IOMMUs
> would ignore requested values and return start = 0 and size = hardware
> decoder address bits.  The IOMMU API currently allows:
> 
> iommu_domain_alloc
> [iommu_attach_device]
> [iommu_map]
> [iommu_unmap]
> [iommu_detach_device]
> iommu_domain_free
> 
> where everything between alloc and free can be called in any order.  How
> does setup fit into that model?

This is why I posted a QEMU patch :)

> For this it seems like we'd almost want
> to combine alloc, setup, and the first attach into a single call (ie.
> create a domain with this initial device and these parameters), then
> subsequent attaches would only allow compatible devices.


Not exactly. This setup is more likely to get combined with domain alloc only.
On POWER, we have iommu_table per DMA window which may or may not be shared
between devices. At the moment there is one window per PCIe _device_ (so 
multiple
functions of multiport network adapter share one DMA window) and one window for
all the devices behind PCIe-to-PCI bridge. It is more or less so.


> I'm a little confused though, is the window determined by hardware or is
> it configurable via requested_size?


The window parameters are calculated by software and then written to hardware so
hardware does filtering and prevents bad devices from memory corruption.


> David had suggested that we could
> implement a VFIO_IOMMU_GET_INFO ioctl that returns something like:
> 
> struct vfio_iommu_info {
> __u32   argsz;
> __u32   flags;
> __u64   iova_max;   /* Maximum IOVA address */
> __u64   iova_min;   /* Minimum IOVA address */
> __u64   pgsize_bitmap;  /* Bitmap of supported page sizes */
> };
> 
> The thought being a TBD IOMMU API interface reports the hardware
> determined IOVA range and we could fudge it on x86 for now reporting
> 0/~0.  Maybe we should replace iova_max/iova_min with
> iova_base/iova_size and allow the caller to request a size by setting
> iova_size and a matching bit in the flags.


No, we need some sort of SET_INFO, not GET, as we want QEMU to decide on a DMA
window size.

Or simply add these parameters to the domain allocation callback.
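
For illustration, a hypothetical variant of the allocation call with the
window parameters folded in (an invented signature, only to show where the
parameters would go):

/*
 * Hypothetical only: negotiate the DMA window at allocation time.
 * The caller pre-loads the size it wants; the IOMMU driver returns
 * what it actually carved out and where the window starts.
 */
struct iommu_domain *iommu_domain_alloc_sized(struct bus_type *bus,
                                              size_t requested_size,
                                              size_t *allocated_size,
                                              phys_addr_t *start_address);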


>> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
>> index 029dae3..57fb70d 100644
>> --- a/drivers/vfio/vfio_iommu.c
>> +++ b/drivers/vfio/vfio_iommu.c
>> @@ -507,6 +507,23 @@ static long vfio_iommu_unl_ioctl(struct file *filep,
>>
>>  if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
>>  ret = -EFAULT;
>> +
>> +} else if (cmd == VFIO_IOMMU_SETUP) {
>> +struct vfio_setup setup;
>> +size_t allocated_size = 0;
>> +phys_addr_t start_address = 0;
>> +
>> +if (copy_from_user(&setup, (void __user *)arg, sizeof setup))
>> +return -EFAULT;
>> +
>> +printk("udomain %p, priv=%p\n", iommu->domain, iommu->domain->priv);

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-28 Thread Alex Williamson
On Tue, 2011-11-29 at 13:01 +1100, Alexey Kardashevskiy wrote:
> Hi all,
> 
> Another problem I hit on POWER - MSI interrupts allocation. The existing VFIO
> does not expect a PHB to support fewer interrupts than a device might request.
> In my case, PHB's limit is 8 interrupts while my test card (10Gb ethernet
> CXGB3) wants 9. Below are the patches to demonstrate the idea.

Seems reasonable.  I assume we'd need similar for vfio_pci_setup_msi,
though I haven't seen anything use more than a single MSI interrupt.
Thanks,

Alex

> KERNEL patch:
> 
> diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
> b/drivers/vfio/pci/vfio_pci_intrs.c
> index 7d45c6b..d44b9bf 100644
> --- a/drivers/vfio/pci/vfio_pci_intrs.c
> +++ b/drivers/vfio/pci/vfio_pci_intrs.c
> @@ -458,17 +458,32 @@ int vfio_pci_setup_msix(struct vfio_pci_device *vdev, 
> int nvec, int __user *inta
>   vdev->msix[i].entry = i;
>   vdev->ev_msix[i] = ctx;
>   }
> - if (!ret)
> + if (!ret) {
>   ret = pci_enable_msix(pdev, vdev->msix, nvec);
> + /*
> +The kernel was unable to allocate the requested number
> +of IRQs and returned the number available.
> +  */
> + if (0 < ret) {
> + ret = pci_enable_msix(pdev, vdev->msix, ret);
> + }
> + }
>   vdev->msix_nvec = 0;
> - for (i = 0; i < nvec && !ret; i++) {
> - ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> -   "vfio", vdev->ev_msix[i]);
> - if (ret)
> - break;
> - vdev->msix_nvec = i+1;
> + if (0 == ret) {
> + vdev->msix_nvec = 0;
> + ret = 0;
> + for (i = 0; i < nvec && !ret; i++) {
> + ret = request_irq(vdev->msix[i].vector, msihandler, 0,
> + "vfio", vdev->ev_msix[i]);
> + if (ret)
> + break;
> + vdev->msix_nvec = i+1;
> + }
> + if ((0 == vdev->msix_nvec) && (0 != ret))
> + vfio_pci_drop_msix(vdev);
> + else
> + ret = vdev->msix_nvec;
>   }
> - if (ret)
> - vfio_pci_drop_msix(vdev);
> +
>   return ret;
>  }
> 
> === end ===
> 
> 
> QEMU patch:
> 
> diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> index 020961a..980eec7 100644
> --- a/hw/vfio_pci.c
> +++ b/hw/vfio_pci.c
> @@ -341,7 +341,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
>  }
>  }
> 
> -if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds)) {
> +ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds);
> +if (0 > ret) {
>  fprintf(stderr, "vfio: Error: Failed to setup MSI/X fds %s\n",
>  strerror(errno));
>  for (i = 0; i < vdev->nr_vectors; i++) {
> @@ -355,6 +356,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
>  qemu_free(vdev->msi_vectors);
>  vdev->nr_vectors = 0;
>  return;
> +} else if (0 < ret) {
> +vdev->nr_vectors = ret;
>  }
> 
>  vdev->interrupt = msix ? INT_MSIX : INT_MSI;
> 
> 
> === end ===




Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-28 Thread Alex Williamson
On Tue, 2011-11-29 at 12:52 +1100, Alexey Kardashevskiy wrote:
> Hi!
> 
> I tried (successfully) to run it on POWER and while doing that I found some 
> issues. I'll try to
> explain them in separate mails.

Great!

> IOMMU domain setup. On POWER, the linux drivers capable of DMA transfer want 
> to know
> a DMA window, i.e. its start and length in the PHB address space. This comes 
> from
> hardware. On X86 (correct me if I am wrong), every device driver in the guest 
> allocates
> memory from the same pool.

Yes, current VT-d/AMD-Vi provide independent IOVA spaces for each
device.

>  On POWER, device drivers get DMA window and allocate pages
> for DMA within this window. In the case of VFIO, that means that QEMU has to
> preallocate this DMA window before running a guest, pass it to a guest (via
> device tree) and then a guest tells the host what pages are taken/released by
> calling map/unmap callbacks of iommu_ops. Deallocation is made in a device 
> detach
> callback as I did not want to add more ioctls.
> So, there are 2 patches:
> 
> - new VFIO_IOMMU_SETUP ioctl introduced which allocates a DMA window via 
> IOMMU API on
> POWER.
> btw do we need an additional capability bit for it?
> 
> KERNEL PATCH:
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 10615ad..a882e08 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned int 
> *groupid)
>   return -ENODEV;
>  }
>  EXPORT_SYMBOL_GPL(iommu_device_group);
> +
> +int iommu_setup(struct iommu_domain *domain,
> + size_t requested_size, size_t *allocated_size,
> + phys_addr_t *start_address)
> +{
> + return domain->ops->setup(domain, requested_size, allocated_size,
> + start_address);
> +}
> +EXPORT_SYMBOL_GPL(iommu_setup);

requested_size seems redundant both here and in struct vfio_setup.  We
can just pre-load size/start with desired values.  I assume x86 IOMMUs
would ignore requested values and return start = 0 and size = hardware
decoder address bits.  The IOMMU API currently allows:

iommu_domain_alloc
[iommu_attach_device]
[iommu_map]
[iommu_unmap]
[iommu_detach_device]
iommu_domain_free

where everything between alloc and free can be called in any order.  How
does setup fit into that model?  For this it seems like we'd almost want
to combine alloc, setup, and the first attach into a single call (ie.
create a domain with this initial device and these parameters), then
subsequent attaches would only allow compatible devices.

I'm a little confused though, is the window determined by hardware or is
it configurable via requested_size?  David had suggested that we could
implement a VFIO_IOMMU_GET_INFO ioctl that returns something like:

struct vfio_iommu_info {
__u32   argsz;
__u32   flags;
__u64   iova_max;   /* Maximum IOVA address */
__u64   iova_min;   /* Minimum IOVA address */
__u64   pgsize_bitmap;  /* Bitmap of supported page sizes */
};

The thought being a TBD IOMMU API interface reports the hardware
determined IOVA range and we could fudge it on x86 for now reporting
0/~0.  Maybe we should replace iova_max/iova_min with
iova_base/iova_size and allow the caller to request a size by setting
iova_size and a matching bit in the flags.
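
From userspace that might look like the sketch below; the flag name
VFIO_IOMMU_INFO_REQ_SIZE is invented, and iova_base/iova_size follow the
rename suggested above:

struct vfio_iommu_info info = {
        .argsz     = sizeof(info),
        .flags     = VFIO_IOMMU_INFO_REQ_SIZE, /* invented flag name */
        .iova_size = 1ULL << 30,               /* example: ask for a 1GB window */
};

if (ioctl(iommu_fd, VFIO_IOMMU_GET_INFO, &info) == 0)
        printf("window: base=0x%llx size=0x%llx pgsizes=0x%llx\n",
               (unsigned long long)info.iova_base,
               (unsigned long long)info.iova_size,
               (unsigned long long)info.pgsize_bitmap);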

> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> index 029dae3..57fb70d 100644
> --- a/drivers/vfio/vfio_iommu.c
> +++ b/drivers/vfio/vfio_iommu.c
> @@ -507,6 +507,23 @@ static long vfio_iommu_unl_ioctl(struct file *filep,
> 
>   if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
>   ret = -EFAULT;
> +
> + } else if (cmd == VFIO_IOMMU_SETUP) {
> + struct vfio_setup setup;
> + size_t allocated_size = 0;
> + phys_addr_t start_address = 0;
> +
> + if (copy_from_user(&setup, (void __user *)arg, sizeof setup))
> + return -EFAULT;
> +
> + printk("udomain %p, priv=%p\n", iommu->domain, iommu->domain->priv);
> + ret = iommu_setup(iommu->domain, setup.requested_size,
> + &allocated_size, &start_address);
> + setup.allocated_size = allocated_size;
> + setup.start_address = start_address;
> +
> + if (!ret && copy_to_user((void __user *)arg, &setup, sizeof setup))
> + ret = -EFAULT;
>   }
>   return ret;
>  }
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 93617e7..355cf8b 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -45,6 +45,7 @@ struct iommu_domain {
> 
>  #define IOMMU_CAP_CACHE_COHERENCY0x1
>  #define IOMMU_CAP_INTR_REMAP 0x2 /* isolates device intrs */
> > +#define IOMMU_CAP_SETUP_REQUIRED 0x3 /* requires setup to be called */
> 
>  #ifdef CONFIG_IOMMU_API
> 
> @@ -62,6 +63,9 @@ struct iommu_ops {

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-28 Thread Alexey Kardashevskiy
Hi all again,

It was actually the very first problem - endianness :-)
I am still not sure what format is better for the cached config space, or
whether we should cache it at all.

Also, as Benh already mentioned, vfio_virt_init reads the whole config space
into a cache using pci_read_config_dword, and some devices may not like that
as they can distinguish the length of PCI transactions.
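
The underlying convention in two lines: pci_read_config_dword() returns a
native-endian value, so for the byte-addressed cache to look like real config
space on a big-endian host, each word has to be stored in a fixed
little-endian format (vdev and pos below are stand-ins):

u32 val;

pci_read_config_dword(pdev, pos, &val);            /* native endian */
*(__le32 *)&vdev->vconfig[pos] = cpu_to_le32(val); /* store as LE bytes */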



KERNEL patch:

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index b3bab99..9d563b4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -757,6 +757,16 @@ static int vfio_virt_init(struct vfio_pci_device *vdev)
vdev->rbar[5] = *(u32 *)&vdev->vconfig[PCI_BASE_ADDRESS_5];
vdev->rbar[6] = *(u32 *)&vdev->vconfig[PCI_ROM_ADDRESS];

+   /*
+* As pci_read_config_ returns data in native format,
+* and the cached copy is used in assumption that it is
+* native PCI format, fix endianness in the cached copy.
+*/
+   lp = (u32 *)vdev->vconfig;
+   for (i = 0; i < pdev->cfg_size/sizeof(u32); i++, lp++) {
+   *lp = cpu_to_le32(*lp);
+   }
+
/* for sr-iov devices */
vdev->vconfig[PCI_VENDOR_ID] = pdev->vendor & 0xFF;
vdev->vconfig[PCI_VENDOR_ID+1] = pdev->vendor >> 8;
@@ -807,18 +817,18 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
else
mask = 0;
lp = (u32 *)(vdev->vconfig + PCI_BASE_ADDRESS_0 + 4*bar);
-   *lp &= (u32)mask;
+   *lp &= cpu_to_le32((u32)mask);

if (pci_resource_flags(pdev, bar) & IORESOURCE_IO)
-   *lp |= PCI_BASE_ADDRESS_SPACE_IO;
+   *lp |= cpu_to_le32(PCI_BASE_ADDRESS_SPACE_IO);
else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
-   *lp |= PCI_BASE_ADDRESS_SPACE_MEMORY;
+   *lp |= cpu_to_le32(PCI_BASE_ADDRESS_SPACE_MEMORY);
if (pci_resource_flags(pdev, bar) & IORESOURCE_PREFETCH)
-   *lp |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+   *lp |= cpu_to_le32(PCI_BASE_ADDRESS_MEM_PREFETCH);
if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM_64) {
-   *lp |= PCI_BASE_ADDRESS_MEM_TYPE_64;
+   *lp |= cpu_to_le32(PCI_BASE_ADDRESS_MEM_TYPE_64);
lp++;
-   *lp &= (u32)(mask >> 32);
+   *lp &= cpu_to_le32((u32)(mask >> 32));
bar++;
}
}
@@ -830,7 +840,7 @@ static void vfio_bar_fixup(struct vfio_pci_device *vdev)
} else
mask = 0;
lp = (u32 *)(vdev->vconfig + PCI_ROM_ADDRESS);
-   *lp &= (u32)mask;
+   *lp &= cpu_to_le32((u32)mask);

vdev->bardirty = 0;
 }


=== end ===




QEMU patch:


diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 980eec7..1c97c35 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -405,6 +405,8 @@ static void vfio_resource_write(void *opaque, 
target_phys_addr_t addr,
 {
 PCIResource *res = opaque;

+fprintf(stderr, "change endianness\n");
+
 if (pwrite(res->fd, &data, size, res->offset + addr) != size) {
 fprintf(stderr, "%s(,0x%"PRIx64", 0x%"PRIx64", %d) failed: %s\n",
 __FUNCTION__, addr, data, size, strerror(errno));
@@ -429,6 +431,9 @@ static uint64_t vfio_resource_read(void *opaque,
 DPRINTF("%s(BAR%d+0x%"PRIx64", %d) = 0x%"PRIx64"\n",
 __FUNCTION__, res->bar, addr, size, data);

+data = le32_to_cpu(data);
+DPRINTF("%s(BAR%d+0x%"PRIx64", %d) = 0x%"PRIx64" --- CPU\n",
+__FUNCTION__, res->bar, addr, size, data);
 return data;
 }

@@ -454,13 +459,25 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, 
uint32_t addr, int len)

 val = pci_default_read_config(pdev, addr, len);
 } else {
-if (pread(vdev->fd, &val, len, vdev->config_offset + addr) != len) {
+u8 buf[4] = {0};
+if (pread(vdev->fd, buf, len, vdev->config_offset + addr) != len) {
 fprintf(stderr, "%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %s\n",
 __FUNCTION__, vdev->host.seg, vdev->host.bus,
 vdev->host.dev, vdev->host.func, addr, len,
 strerror(errno));
 return -1;
 }
+   switch (len) {
+case 1: val = buf[0]; break;
+case 2: val = le16_to_cpupu((uint16_t*)buf); break;
+case 4: val = le32_to_cpupu((uint32_t*)buf); break;
+default:
+fprintf(stderr, "%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %s\n",
+__FUNCTION__, vdev->host.seg, vdev->host.bus,
+vdev->host.dev, vdev->host.func,

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-28 Thread Alexey Kardashevskiy
Hi all,

Another problem I hit on POWER - MSI interrupts allocation. The existing VFIO
does not expect a PHB to support fewer interrupts than a device might request.
In my case, PHB's limit is 8 interrupts while my test card (10Gb ethernet
CXGB3) wants 9. Below are the patches to demonstrate the idea.
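
The patch hinges on the pci_enable_msix() return convention of that era:
0 on success, a negative errno on failure, and a positive value meaning
"only this many vectors could be allocated". The retry idiom in isolation:

int rc = pci_enable_msix(pdev, entries, nvec);
if (rc > 0)                     /* fewer vectors available than requested */
        rc = pci_enable_msix(pdev, entries, rc);
/* rc is now 0 on success or a negative errno */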


KERNEL patch:

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c 
b/drivers/vfio/pci/vfio_pci_intrs.c
index 7d45c6b..d44b9bf 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -458,17 +458,32 @@ int vfio_pci_setup_msix(struct vfio_pci_device *vdev, int 
nvec, int __user *inta
vdev->msix[i].entry = i;
vdev->ev_msix[i] = ctx;
}
-   if (!ret)
+   if (!ret) {
ret = pci_enable_msix(pdev, vdev->msix, nvec);
+   /*
+  The kernel was unable to allocate the requested number
+  of IRQs and returned the number available.
+*/
+   if (0 < ret) {
+   ret = pci_enable_msix(pdev, vdev->msix, ret);
+   }
+   }
vdev->msix_nvec = 0;
-   for (i = 0; i < nvec && !ret; i++) {
-   ret = request_irq(vdev->msix[i].vector, msihandler, 0,
- "vfio", vdev->ev_msix[i]);
-   if (ret)
-   break;
-   vdev->msix_nvec = i+1;
+   if (0 == ret) {
+   vdev->msix_nvec = 0;
+   ret = 0;
+   for (i = 0; i < nvec && !ret; i++) {
+   ret = request_irq(vdev->msix[i].vector, msihandler, 0,
+   "vfio", vdev->ev_msix[i]);
+   if (ret)
+   break;
+   vdev->msix_nvec = i+1;
+   }
+   if ((0 == vdev->msix_nvec) && (0 != ret))
+   vfio_pci_drop_msix(vdev);
+   else
+   ret = vdev->msix_nvec;
}
-   if (ret)
-   vfio_pci_drop_msix(vdev);
+
return ret;
 }

=== end ===


QEMU patch:

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index 020961a..980eec7 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -341,7 +341,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
 }
 }

-if (ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds)) {
+ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds);
+if (0 > ret) {
 fprintf(stderr, "vfio: Error: Failed to setup MSI/X fds %s\n",
 strerror(errno));
 for (i = 0; i < vdev->nr_vectors; i++) {
@@ -355,6 +356,8 @@ static void vfio_enable_msi(VFIODevice *vdev, bool msix)
 qemu_free(vdev->msi_vectors);
 vdev->nr_vectors = 0;
 return;
+} else if (0 < ret) {
+vdev->nr_vectors = ret;
 }

 vdev->interrupt = msix ? INT_MSIX : INT_MSI;


=== end ===




On 29/11/11 12:52, Alexey Kardashevskiy wrote:
> Hi!
> 
> I tried (successfully) to run it on POWER and while doing that I found some 
> issues. I'll try to
> explain them in separate mails.
> 
> 
> 
> On 04/11/11 07:12, Alex Williamson wrote:
>> VFIO provides a secure, IOMMU based interface for user space
>> drivers, including device assignment to virtual machines.
>> This provides the base management of IOMMU groups, devices,
>> and IOMMU objects.  See Documentation/vfio.txt included in
>> this patch for user and kernel API description.
>>
>> Note, this implements the new API discussed at KVM Forum
>> 2011, as represented by the driver version 0.2.  It's hoped
>> that this provides a modular enough interface to support PCI
>> and non-PCI userspace drivers across various architectures
>> and IOMMU implementations.
>>
>> Signed-off-by: Alex Williamson 
>> ---
>>
>> Fingers crossed, this is the last RFC for VFIO, but we need
>> the iommu group support before this can go upstream
>> (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
>> hoping this helps push that along.
>>
>> Since the last posting, this version completely modularizes
>> the device backends and better defines the APIs between the
>> core VFIO code and the device backends.  I expect that we
>> might also adopt a modular IOMMU interface as iommu_ops learns
>> about different types of hardware.  Also many, many cleanups.
>> Check the complete git history for details:
>>
>> git://github.com/awilliam/linux-vfio.git vfio-ng
>>
>> (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
>>
>> This version, along with the supporting VFIO PCI backend can
>> be found here:
>>
>> git://github.com/awilliam/linux-vfio.git vfio-next-2003
>>
>> I've held off on implementing a kernel->user signaling
>> mechanism for now since the previous netlink version produced
>> too many gag reflexes.  It's easy enough to set a bit in the
>> group flags to indicate such support in the future, so I
>> think we can move ahead without it.
>>
>> Appreciate any 

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-28 Thread Alexey Kardashevskiy
Hi!

I tried (successfully) to run it on POWER and while doing that I found some 
issues. I'll try to
explain them in separate mails.

IOMMU domain setup. On POWER, the linux drivers capable of DMA transfer want to 
know
a DMA window, i.e. its start and length in the PHB address space. This comes 
from
hardware. On X86 (correct me if I am wrong), every device driver in the guest 
allocates
memory from the same pool. On POWER, device drivers get DMA window and allocate 
pages
for DMA within this window. In the case of VFIO, that means that QEMU has to
preallocate this DMA window before running a guest, pass it to a guest (via
device tree) and then a guest tells the host what pages are taken/released by
calling map/unmap callbacks of iommu_ops. Deallocation is made in a device 
detach
callback as I did not want to add more ioctls.
So, there are 2 patches:

- new VFIO_IOMMU_SETUP ioctl introduced which allocates a DMA window via IOMMU 
API on
POWER.
btw do we need an additional capability bit for it?

KERNEL PATCH:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 10615ad..a882e08 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -247,3 +247,12 @@ int iommu_device_group(struct device *dev, unsigned int 
*groupid)
return -ENODEV;
 }
 EXPORT_SYMBOL_GPL(iommu_device_group);
+
+int iommu_setup(struct iommu_domain *domain,
+   size_t requested_size, size_t *allocated_size,
+   phys_addr_t *start_address)
+{
+   return domain->ops->setup(domain, requested_size, allocated_size,
+   start_address);
+}
+EXPORT_SYMBOL_GPL(iommu_setup);
diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
index 029dae3..57fb70d 100644
--- a/drivers/vfio/vfio_iommu.c
+++ b/drivers/vfio/vfio_iommu.c
@@ -507,6 +507,23 @@ static long vfio_iommu_unl_ioctl(struct file *filep,

if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
ret = -EFAULT;
+
+   } else if (cmd == VFIO_IOMMU_SETUP) {
+   struct vfio_setup setup;
+   size_t allocated_size = 0;
+   phys_addr_t start_address = 0;
+
+   if (copy_from_user(&setup, (void __user *)arg, sizeof setup))
+   return -EFAULT;
+
+   printk("udomain %p, priv=%p\n", iommu->domain, 
iommu->domain->priv);
+   ret = iommu_setup(iommu->domain, setup.requested_size,
+   &allocated_size, &start_address);
+   setup.allocated_size = allocated_size;
+   setup.start_address = start_address;
+
+   if (!ret && copy_to_user((void __user *)arg, &setup, sizeof setup))
+   ret = -EFAULT;
}
return ret;
 }
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 93617e7..355cf8b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -45,6 +45,7 @@ struct iommu_domain {

 #define IOMMU_CAP_CACHE_COHERENCY  0x1
 #define IOMMU_CAP_INTR_REMAP   0x2 /* isolates device intrs */
+#define IOMMU_CAP_SETUP_REQUIRED   0x3 /* requires setup to be called */

 #ifdef CONFIG_IOMMU_API

@@ -62,6 +63,9 @@ struct iommu_ops {
int (*domain_has_cap)(struct iommu_domain *domain,
  unsigned long cap);
int (*device_group)(struct device *dev, unsigned int *groupid);
+   int (*setup)(struct iommu_domain *domain,
+size_t requested_size, size_t *allocated_size,
+phys_addr_t *start_address);
 };

 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
@@ -80,6 +84,9 @@ extern phys_addr_t iommu_iova_to_phys(struct iommu_domain 
*domain,
  unsigned long iova);
 extern int iommu_domain_has_cap(struct iommu_domain *domain,
unsigned long cap);
+extern int iommu_setup(struct iommu_domain *domain,
+  size_t requested_size, size_t *allocated_size,
+  phys_addr_t *start_address);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
iommu_fault_handler_t handler);
 extern int iommu_device_group(struct device *dev, unsigned int *groupid);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 971e3b1..5e0ee75 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -26,6 +26,7 @@
  * Author: Michael S. Tsirkin 
  */
 #include 
+#include 

 #ifndef VFIO_H
 #define VFIO_H
@@ -172,4 +173,13 @@ enum {
VFIO_PCI_NUM_IRQS
 };

+/* Setup domain */
+#define VFIO_IOMMU_SETUP   _IOWR(';', 150, struct vfio_setup)
+
+struct vfio_setup {
+   __u64   requested_size;
+   __u64   allocated_size;
+   __u64   start_address;
+};
+
  #endif /* VFIO_H */

=== end ===
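
Before the QEMU patch, a minimal userspace sketch of driving the new ioctl
(hypothetical caller; iommu_fd is an open VFIO IOMMU fd and the requested
size is just an example):

struct vfio_setup setup = {
        .requested_size = 1ULL << 30,   /* example: ask for a 1GB window */
};

if (ioctl(iommu_fd, VFIO_IOMMU_SETUP, &setup) == 0)
        printf("DMA window: start=0x%llx size=0x%llx\n",
               (unsigned long long)setup.start_address,
               (unsigned long long)setup.allocated_size);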


QEMU PATCH:

diff --git a/hw/linux-vfio.h b/hw/linux-vfio.h
index ac48d85..a2c719f 100644
--- a/hw/linux-vfio.h
+++ b/hw/linux-vfio.h

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-22 Thread Alex Williamson
On Mon, 2011-11-21 at 13:47 +1100, David Gibson wrote:
> On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
> > On Thu, 2011-11-17 at 11:02 +1100, David Gibson wrote:
> > > On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
> > > > On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> > > > > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
 
> > > > As we've discussed previously, configfs provides part of this, but has
> > > > no ioctl support.  It doesn't make sense to me to go play with groups in
> > > > configfs, but then still interact with them via a char dev.
> > > 
> > > Why not?  You configure, say, loopback devices with losetup, then use
> > > them as a block device.  Similar with nbd.  You can configure serial
> > > devices with setserial, then use them as a char dev.
> > > 
> > > >  It also
> > > > splits the ownership model 
> > > 
> > > I'm not even sure what that means.
> > > 
> > > > and makes it harder to enforce who gets to
> > > > interact with the devices vs who gets to manipulate groups.
> > > 
> > > How so.
> > 
> > Let's map out what a configfs interface would look like, maybe I'll
> > convince myself it's on the table.  We'd probably start with
> 
> Hrm, assuming we used configfs, which is not the only option.

I'm not writing vfiofs, configfs seems most like what we'd need.  If
there are others we should consider, please note them.

> > /config/vfio/$bus_type.name/
> > 
> > That would probably be pre-populated with a bunch of $groupid files,
> > matching /dev/vfio/$bus_type.name/$groupid char dev files (assuming
> > configfs can pre-populate files).  To make a user defined group, we
> > might then do:
> > 
> > mkdir /config/vfio/$bus_type.name/my_group
> > 
> > That would generate a /dev/vfio/$bus_type.name/my_group char dev.  To
> > add groups to the new my_group "super group", we'd need to do something
> > like:
> > 
> > ln -s /config/vfio/$bus_type.name/$groupidA 
> > /config/vfio/$bus_type.name/my_group/nic_group
> > 
> > I might then add a second group as:
> > 
> > ln -s /config/vfio/$bus_type.name/$groupidB 
> > /config/vfio/$bus_type.name/my_group/hba_group
> > 
> > Either link could fail if the target group is not viable,
> 
> The link op shouldn't fail because the subgroup isn't viable.
> Instead, the supergroup just won't be viable until all devices in all
> subgroups are bound to vfio.

The supergroup may already be in use if it's a hotplug.  What does it
mean to have an incompatible group linked into the supergroup?  When
does the subgroup actually become part of the supergroup?  Does the
userspace driver using the supergroup get notified somehow?  Does the
vfio driver get notified independently?  This example continues to show
what an administration nightmare it becomes when we split management
from usage.

> > the group is
> > already in use, or the second link could fail if the iommu domains were
> > incompatible.
> > 
> > Do these links cause /dev/vfio/$bus_type.name/{$groupidA,$groupidB} to
> > disappear?  If not, do we allow them to be opened?  Linking would also
> > have to fail if we later tried to link one of these groupids to a
> > different super group.
> 
> Again, I think some confusion is coming in here from calling both the
> hardware determined thing and the admin determined thing a "group".
> So for now I'm going to call the first a "group" and the second a
> "predomain" (because once it's viable and the right conditions are set
> up it will become an iommu domain).
> 
> So another option is that "groups" *only* participate in the merging
> interface; getting iommu and device handles occurs only on a
> predomain.  Therefore there would be no /dev/vfio/$group, you would
> have to configure a predomain with at least one group before you had a
> device file.

I think this actually leads to a more complicated, more difficult to use
interface that interposes an unnecessary administration layer into a
driver's decisions about how to manage the iommu.

> > Now we want to give my_group to a user, so we have to go back to /dev
> > and
> > 
> > chown $user /dev/vfio/$bus_type.name/my_group
> > 
> > At this point my_group would have the existing set of group ioctls sans
> > {UN}MERGE, of course.
> > 
> > So $user can use the super group, but not manipulate its members.  Do
> > we then allow:
> > 
> > chown $user /config/vfio/$bus_type.name/my_group
> > 
> > If so, what does it imply about the user then doing:
> > 
> > ln -s /config/vfio/$bus_type.name/$groupidC 
> > /config/vfio/$bus_type.name/my_group/stolen_group
> > 
> > Would we instead need to chown the configfs groups as well as the super
> > group?
> > 
> > chown $user /config/vfio/$bus_type.name/my_group
> > chown $user /config/vfio/$bus_type.name/$groupidA
> > chown $user /config/vfio/$bus_type.name/$groupidB
> > 
> > ie:
> > 
> > # chown $user:$user /config/vfio/$bus_type.name/$groupC
> > $ ln -s /config/vfio/$bus_type.name/$groupidC 
> > /config/vfio/$bus_type.name/my_group/stolen_group

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-20 Thread David Gibson
On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
> On Thu, 2011-11-17 at 11:02 +1100, David Gibson wrote:
> > On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
> > > On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> > > > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> 
> > > > > +Groups, Devices, IOMMUs, oh my
> > > > > +---
> > > > > +
> > > > > +A fundamental component of VFIO is the notion of IOMMU groups.  
> > > > > IOMMUs
> > > > > +can't always distinguish transactions from each individual device in
> > > > > +the system.  Sometimes this is because of the IOMMU design, such as 
> > > > > with
> > > > > +PEs, other times it's caused by the I/O topology, for instance a
> > > > > +PCIe-to-PCI bridge masking all devices behind it.  We call the sets 
> > > > > of
> > > > > +devices created by these restrictions IOMMU groups (or just "groups" 
> > > > > for
> > > > > +this document).
> > > > > +
> > > > > +The IOMMU cannot distiguish transactions between the individual 
> > > > > devices
> > > > > +within the group, therefore the group is the basic unit of ownership 
> > > > > for
> > > > > +a userspace process.  Because of this, groups are also the primary
> > > > > +interface to both devices and IOMMU domains in VFIO.
> > > > > +
> > > > > +The VFIO representation of groups is created as devices are added 
> > > > > into
> > > > > +the framework by a VFIO bus driver.  The vfio-pci module is an 
> > > > > example
> > > > > +of a bus driver.  This module registers devices along with a set of 
> > > > > bus
> > > > > +specific callbacks with the VFIO core.  These callbacks provide the
> > > > > +interfaces later used for device access.  As each new group is 
> > > > > created,
> > > > > +as determined by iommu_device_group(), VFIO creates a 
> > > > > /dev/vfio/$GROUP
> > > > > +character device.
> > > > 
> > > > Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
> > > > bus driver is per bus type, not per bus instance.   But grouping
> > > > constraints could be per bus instance, if you have a couple of
> > > > different models of PCI host bridge with IOMMUs of different
> > > > capabilities built in, for example.
> > > 
> > > Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
> > > instance.
> > 
> > Ok, how can that work.  vfio-pci is responsible for generating the
> > groupings, yes?  For which it needs to know the iommu/host bridge's
> > isolation capabilities, which vary depending on the type of host
> > bridge.
> 
> No, grouping is done at the iommu driver level.  vfio gets groupings via
> iommu_device_group(), which uses the iommu_ops for the bus_type of the
> requested device.  I'll attempt to clarify where groups come from in the
> documentation.

Hrm, but still per bus type, not bus instance.  Hrm.  Yeah, I need to
look at the earlier iommu patches in more detail.

[snip]
> > Yeah, I'm starting to get my head around the model, but the current
> > description of it doesn't help very much.  In particular the terms
> > "merge" and "unmerge" lead one to the wrong mental model, I think.
> > 
> > > > The semantics of "merge" and "unmerge" under those names are really
> > > > non-obvious.  Merge kind of has to merge two whole metagroups, but
> > > > it's unclear if unmerge reverses one merge, or just takes out one
> > > > (atom) group.  These operations need better names, at least.
> > > 
> > > Christian suggested a change to UNMERGE that we do not need to
> > > specify a group to unmerge "from".  This makes it more like a list
> > > implementation except there's no defined list_head.  Any member of the
> > > list can pull in a new entry.  Calling UNMERGE on any member extracts
> > > that member.
> > 
> > I think that's a good idea, but "unmerge" is not a good word for it.
> 
> I can't think of better, if you can, please suggest.

Well, I think addgroup and removegroup would be better than merge and
unmerge, although they have their own problems.

> > > > Then it's unclear what order you can do various operations, and which
> > > > order you can open and close various things.  You can kind of figure
> > > > it out but it takes far more thinking than it should.
> > > > 
> > > > 
> > > > So at the _very_ least, we need to invent new terminology and find a
> > > > much better way of describing this API's semantics.  I still think an
> > > > entirely different interface, where metagroups are created from
> > > > outside with a lifetime that's not tied to an fd would be a better
> > > > idea.
> > > 
> > > As we've discussed previously, configfs provides part of this, but has
> > > no ioctl support.  It doesn't make sense to me to go play with groups in
> > > configfs, but then still interact with them via a char dev.
> > 
> > Why not?  You configure, say, loopback devices with losetup, then use
> > them as a block device.  Similar with nbd.

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-18 Thread Scott Wood
On Fri, Nov 18, 2011 at 01:32:56PM -0700, Alex Williamson wrote:
> Hmm, that might be cleaner than eliminating the size with just using
> _IO().  So we might have something like:
> 
> #define VFIO_IOMMU_MAP_DMA  _IOWR(';', 106, struct vfio_dma_map)
> #define VFIO_IOMMU_MAP_DMA_V2   _IOWR(';', 106, struct 
> vfio_dma_map_v2)
> 
> For which the driver might do:
> 
> case VFIO_IOMMU_MAP_DMA:
> case VFIO_IOMMU_MAP_DMA_V2:
> {
>   struct vfio_dma_map map;
> 
>   /* We don't care about the extra v2 bits */
>   if (copy_from_user(&map, (void __user *)arg, sizeof map))
>   return -EFAULT;

That won't work if you have an old kernel that doesn't know about v2, and
a new user that uses v2.  To make this work you'd have to strip out the
size from the ioctl number before switching (but still use it when
considering whether to access the v2 fields).  Simpler to just leave it
out of the ioctl number and put it in the struct field as currently
planned.
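
A sketch of that size-in-struct approach (the struct and field names are
illustrative; it assumes argsz is the first field, as in the proposed
vfio_iommu_info):

struct vfio_dma_map_v2 map;
u32 minsz = offsetof(struct vfio_dma_map_v2, some_v2_field); /* illustrative */

if (copy_from_user(&map, (void __user *)arg, minsz))
        return -EFAULT;
if (map.argsz < minsz)
        return -EINVAL;         /* userspace too old for this interface */
if (map.argsz >= sizeof(map) &&
    copy_from_user(&map, (void __user *)arg, sizeof(map)))
        return -EFAULT;         /* new userspace: v2 fields are valid too */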

> > > I think all our structure sizes are independent of host width.  If I'm
> > > missing something, let me know.
> > 
> > Ah, for structures, that might be true.  I was seeing the bunch of
> > ioctl()s that take ints.
> 
> Ugh, I suppose you're thinking of an ILP64 platform with ILP32 compat
> mode.

Does Linux support ILP64?  There are "int" ioctls all over the place, and
I don't think we do compat wrappers for them.  In fact, some of the
ioctls in linux/fs.h use "int" for the compatible version of ioctls
originally defined as "long".

It's cleaner to always use the fixed types, though.

> Darn it, guess we need to make everything 64bit, including file
> descriptors.

What's wrong with __u32/__s32 (or uint32_t/int32_t)?

I really do not see Linux supporting an ABI that has no 32-bit type at
all, especially in a situation where userspace compatibility is needed. 
If that does happen, the ABI breakage will go well beyond VFIO.

> The point of the group is to provide a unit of ownership.  We can't let
> $userA open $groupid and fetch a device, then have $userB do the same,
> grabbing a different device.  The mappings will step on each other and
> the devices have no isolation.  We can't restrict that purely by file
> permissions or we'll have the same problem with sudo.

What is the problem with sudo?  If you're running processes as the same
user, regardless of how, they're going to be able to mess with each
other.

Is it possible to expose access to only specific groups via an
individually-permissionable /dev/device, so only the entity handing out
access to devices needs access to everything?

> At one point we discussed a single open instance, but that
> unnecessarily limits the user, so we settled on the mm.  Thanks,

It would be nice if this limitation weren't excessively integrated into
the design -- in the embedded space we've got unusual partitioning
setups, including failover arrangements where partitions share devices. 
The device may be configured with the IOMMU pointing only at regions that
are shared by both mms, or the non-shared regions may be reconfigured as
active ownership of the device gets handed around.

It would be up to userspace code to make sure that the mappings don't
"step on each other".  The mapping could be done with whichever mm issued
the map call for a given region.

For this use case, there is unlikely to be an issue with ownership
because there will not be separate privilege domains creating partitions
-- other use cases could refrain from enabling multiple-mm support unless
ownership issues are resolved.

This doesn't need to be supported initially, but we should try to avoid
letting the assumption permeate the code.

-Scott



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-18 Thread Alex Williamson
On Thu, 2011-11-17 at 11:02 +1100, David Gibson wrote:
> On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
> > On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> > > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:

> > > > +Groups, Devices, IOMMUs, oh my
> > > > +---
> > > > +
> > > > +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> > > > +can't always distinguish transactions from each individual device in
> > > > +the system.  Sometimes this is because of the IOMMU design, such as 
> > > > with
> > > > +PEs, other times it's caused by the I/O topology, for instance a
> > > > +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> > > > +devices created by these restrictions IOMMU groups (or just "groups" for
> > > > +this document).
> > > > +
> > > > +The IOMMU cannot distinguish transactions between the individual devices
> > > > +within the group, therefore the group is the basic unit of ownership 
> > > > for
> > > > +a userspace process.  Because of this, groups are also the primary
> > > > +interface to both devices and IOMMU domains in VFIO.
> > > > +
> > > > +The VFIO representation of groups is created as devices are added into
> > > > +the framework by a VFIO bus driver.  The vfio-pci module is an example
> > > > +of a bus driver.  This module registers devices along with a set of bus
> > > > +specific callbacks with the VFIO core.  These callbacks provide the
> > > > +interfaces later used for device access.  As each new group is created,
> > > > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > > > +character device.
> > > 
> > > Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
> > > bus driver is per bus type, not per bus instance.   But grouping
> > > constraints could be per bus instance, if you have a couple of
> > > different models of PCI host bridge with IOMMUs of different
> > > capabilities built in, for example.
> > 
> > Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
> > instance.
> 
> Ok, how can that work.  vfio-pci is responsible for generating the
> groupings, yes?  For which it needs to know the iommu/host bridge's
> isolation capabilities, which vary depending on the type of host
> bridge.

No, grouping is done at the iommu driver level.  vfio gets groupings via
iommu_device_group(), which uses the iommu_ops for the bus_type of the
requested device.  I'll attempt to clarify where groups come from in the
documentation.

> >  IOMMUs also register drivers per bus type, not per bus
> > instance.  The IOMMU driver is free to impose any constraints it wants.
> > 
> > > > +In addition to the device enumeration and callbacks, the VFIO bus 
> > > > driver
> > > > +also provides a traditional device driver and is able to bind to 
> > > > devices
> > > > +on its bus.  When a device is bound to the bus driver it's available 
> > > > to
> > > > +VFIO.  When all the devices within a group are bound to their bus 
> > > > drivers,
> > > > +the group becomes "viable" and a user with sufficient access to the 
> > > > VFIO
> > > > +group chardev can obtain exclusive access to the set of group devices.
> > > > +
> > > > +As documented in linux/vfio.h, several ioctls are provided on the
> > > > +group chardev:
> > > > +
> > > > +#define VFIO_GROUP_GET_FLAGS_IOR(';', 100, __u64)
> > > > + #define VFIO_GROUP_FLAGS_VIABLE(1 << 0)
> > > > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> > > > +#define VFIO_GROUP_MERGE_IOW(';', 101, int)
> > > > +#define VFIO_GROUP_UNMERGE  _IOW(';', 102, int)
> > > > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> > > > +#define VFIO_GROUP_GET_DEVICE_FD_IOW(';', 104, char *)
> > > > +
> > > > +The last two ioctls return new file descriptors for accessing
> > > > +individual devices within the group and programming the IOMMU.  Each of
> > > > +these new file descriptors provides its own set of file interfaces.
> > > > +These ioctls will fail if any of the devices within the group are not
> > > > +bound to their VFIO bus driver.  Additionally, when either of these
> > > > +interfaces is used, the group is then bound to the struct_mm of the
> > > > +caller.  The GET_FLAGS ioctl can be used to view the state of the group.
> > > > +
> > > > +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> > > > +new IOMMU domain is created and all of the devices in the group are
> > > > +attached to it.  This is the only way to ensure full IOMMU isolation
> > > > +of the group, but potentially wastes resources and cycles if the user
> > > > +intends to manage multiple groups with the same set of IOMMU mappings.
> > > > +VFIO therefore provides a group MERGE and UNMERGE interface, which
> > > > +allows multiple groups to share an IOMMU domain.  Not all IOMMUs allow
> > 

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-17 Thread Scott Wood
On Thu, Nov 17, 2011 at 01:22:17PM -0700, Alex Williamson wrote:
> On Wed, 2011-11-16 at 11:52 -0500, Konrad Rzeszutek Wilk wrote:
> > On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
> > What would be the return value if somebody tried to unmask an edge one?
> > Should that be documented here? -ENOSPEC?
> 
> I would assume EINVAL or EFAULT since the user is providing an invalid
> argument/bad address.

EINVAL.  EFAULT is normally only used for when the user passes a bad
virtual memory address to the kernel.  This isn't an address at all, it's
an index that points to an object for which this operation does not make
sense.
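
As a sketch (the vfio_device layout here is hypothetical, with field names
mirroring the vfio_irq_info flags quoted elsewhere in this thread), the
unmask path would simply be:

static int vfio_validate_unmask(struct vfio_device *vdev, __u32 index)
{
	if (index >= vdev->num_irqs)
		return -EINVAL;		/* no such index */
	if (!(vdev->irq_info[index].flags & VFIO_IRQ_INFO_FLAG_LEVEL))
		return -EINVAL;		/* edge triggered: nothing to unmask */
	return 0;
}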

-Scott



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-17 Thread Alex Williamson
On Wed, 2011-11-16 at 11:47 -0600, Scott Wood wrote:
> On 11/11/2011 04:10 PM, Alex Williamson wrote:
> > 
> > Thanks Konrad!  Comments inline.
> > 
> > On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> >> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> >>> +When supported, as indicated by the device flags, reset the device.
> >>> +
> >>> +#define VFIO_DEVICE_RESET   _IO(';', 116)
> >>
> >> Does it disable the 'count'? Err, does it disable the IRQ on the
> >> device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
> >> to set new eventfds? Or does it re-use the eventfds and the device
> >> is enabled after this?
> > 
> > It doesn't affect the interrupt programming.  Should it?
> 
> It should probably clear any currently pending interrupts, as if the
> unmask IOCTL were called.

Sounds reasonable.

> >>> +device tree properties of the device:
> >>> +
> >>> +struct vfio_dtpath {
> >>> +__u32   len;/* length of structure */
> >>> +__u32   index;
> >>
> >> 0 based I presume?
> > 
> > Everything else is, I would assume so.
> 
> Yes, it should be zero-based -- this matches how such indices are done
> in the kernel device tree APIs.
> 
> >>> +__u64   flags;
> > >>> +#define VFIO_DTPATH_FLAGS_REGION   (1 << 0)
> >>
> >> What is region in this context?? Or would this make much more sense
> >> if I knew what Device Tree actually is.
> > 
> > Powerpc guys, any comments?  This was their suggestion.  These are
> > effectively the first device specific extension, available when
> > VFIO_DEVICE_FLAGS_DT is set.
> 
> An assigned device may consist of an entire subtree of the device tree,
> and both register banks and interrupts can come from any node in the
> tree.  Region versus IRQ here indicates the context in which to
> interpret index, in order to retrieve the path of the node that supplied
> this particular region or IRQ.

Ok.  Thanks for the clarification.  We'll wait for the vfio-dt bus
driver before actually including this.

> >>> +};
> >>> +#define VFIO_DEVICE_GET_DTPATH  _IOWR(';', 117, struct 
> >>> vfio_dtpath)
> >>> +
> >>> +struct vfio_dtindex {
> >>> +__u32   len;/* length of structure */
> >>> +__u32   index;
> >>> +__u32   prop_type;
> >>
> > >> Is that an enum type? Is this defined somewhere?
> >>> +__u32   prop_index;
> >>
> >> What is the purpose of this field?
> > 
> > Need input from powerpc folks here
> 
> To identify what this resource (register bank or IRQ) this is, we need
> both the path to the node and the index into the reg or interrupts
> property within the node.
> 
> We also need to distinguish reg from ranges, and interrupts from
> interrupt-map.  As you suggested elsewhere in the thread, the device
> tree API should probably be left out for now, and added later along with
> the device tree "bus" driver.

Yep, I'll do that.

> >>> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> >>> + struct vfio_device *device)
> >>> +{
> >>> + BUG_ON(!iommu->domain && device->attached);
> >>
> >> Whoa. Heavy hammer there.
> >>
> >> Perhaps WARN_ON as you do check it later on.
> > 
> > I think it's warranted, internal consistency is broken if we have a
> > device that thinks it's attached to an iommu domain that doesn't exist.
> > It should, of course, never happen and this isn't a performance path.
> > 
> [snip]
> >>> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> >>> +struct vfio_device *device)
> >>> +{
> >>> + int ret;
> >>> +
> >>> + BUG_ON(device->attached);
> >>
> >> How about:
> >>
> >> WARN_ON(device->attached, "The engineer who wrote the user-space device 
> >> driver is trying to register
> >> the device again! Tell him/her to stop please.\n");
> > 
> > I would almost demote this one to a WARN_ON, but userspace isn't in
> > control of attaching and detaching devices from the iommu.  That's a
> > side effect of getting the iommu or device file descriptor.  So again,
> > this is an internal consistency check and it should never happen,
> > regardless of userspace.
> 
> The rule isn't to use BUG for internal consistency checks and WARN for
> stuff userspace can trigger, but rather to use BUG if you cannot
> reasonably continue, WARN for "significant issues that need prompt
> attention" that are reasonably recoverable.  Most instances of WARN are
> internal consistency checks.

That makes sense.

> From include/asm-generic/bug.h:
> > If you're tempted to BUG(), think again:  is completely giving up
> > really the *only* solution?  There are usually better options, where
> > users don't need to reboot ASAP and can mostly shut down cleanly.
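
Concretely, the demotion could look like this (a sketch; the function body
beyond the check is unchanged from the patch):

static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
				    struct vfio_device *device)
{
	/* Was: BUG_ON(!iommu->domain && device->attached);
	 * WARN_ON() returns its condition, so we can log the broken
	 * internal state and recover instead of halting the machine. */
	if (WARN_ON(!iommu->domain && device->attached))
		return;

	/* ... actual detach work as before ... */
}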

Ok, I'll make a cleanup pass of demoting BUG_ONs to WARN_ONs.  Thanks,

Alex


Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-17 Thread Alex Williamson
On Wed, 2011-11-16 at 11:52 -0500, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:

> > > > +
> > > > +Regions are described by a struct vfio_region_info, which is retrieved by
> > > > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > > > +the desired region (0 based index).  Note that devices may implement zero
> > > > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > > > +mapping).
> > > 
> > > Huh?
> > 
> > PCI has the following static mapping:
> > 
> > enum {
> > VFIO_PCI_BAR0_REGION_INDEX,
> > VFIO_PCI_BAR1_REGION_INDEX,
> > VFIO_PCI_BAR2_REGION_INDEX,
> > VFIO_PCI_BAR3_REGION_INDEX,
> > VFIO_PCI_BAR4_REGION_INDEX,
> > VFIO_PCI_BAR5_REGION_INDEX,
> > VFIO_PCI_ROM_REGION_INDEX,
> > VFIO_PCI_CONFIG_REGION_INDEX,
> > VFIO_PCI_NUM_REGIONS
> > };
> > 
> > So 8 regions are always reported regardless of whether the device
> > implements all the BARs and the ROM.  Then we have a fixed bar:index
> > mapping so we don't have to create a region_info field to describe the
> > bar number for the index.
> 
> OK. Is that a problem if the real device actually has a zero sized BAR?
> Or is zero sized BAR in PCI spec equal to "disabled, not in use" ? Just
> wondering whether (-1ULL) should be used instead? (Which seems the case
> in QEMU code).

Yes, PCI spec defines that unimplemented BARs are hardwired to zero, so
the sizing operation returns zero for the size.
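
For reference, the classic sizing computation for a 32-bit memory BAR (a
sketch, not code from the patch): write ~0 to the BAR, read it back, and a
read-back of zero means the BAR is unimplemented, hence size 0.

#include <stdint.h>

static uint64_t pci_mem_bar_size(uint32_t readback)
{
	uint32_t mask;

	if (!readback)
		return 0;		/* unimplemented BAR: hardwired to zero */

	mask = readback & ~0xfU;	/* strip memory BAR type bits */
	return (uint64_t)(~mask + 1);	/* two's complement yields the size */
}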


> > > > +struct vfio_irq_info {
> > > > +__u32   len;/* length of structure */
> > > > +__u32   index;  /* IRQ number */
> > > > +__u32   count;  /* number of individual IRQs */
> > > > +__u64   flags;
> > > > +#define VFIO_IRQ_INFO_FLAG_LEVEL    (1 << 0)
> > > > +};
> > > > +
> > > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > > +type to index mapping).
> > > 
> > > I am not really sure what that means.
> > 
> > This is so PCI can expose:
> > 
> > enum {
> > VFIO_PCI_INTX_IRQ_INDEX,
> > VFIO_PCI_MSI_IRQ_INDEX,
> > VFIO_PCI_MSIX_IRQ_INDEX,
> > VFIO_PCI_NUM_IRQS
> > };
> > 
> > So like regions it always exposes 3 IRQ indexes where count=0 if the
> > device doesn't actually support that type of interrupt.  I just want to
> > spell out that bus drivers have this kind of flexibility.
> 
> I think you should change the comment that says 'IRQ number', as the
> first thing that comes to mind is 'GSI' or MSI/MSI-x vector.
> Perhaps '/* index to be used with return value from GET_NUM_IRQS ioctl.
> Order of structures can be unsorted. */

Ah, yes.  I see the confusion.  They can't really be unsorted though,
the user needs some point of reference.  For PCI they will be strictly
ordered.  For Device Tree, I assume there will be a path referencing the
index.  I'll update the doc to clarify.
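
From userspace the fixed index scheme looks roughly like this (a sketch: it
assumes the vfio_irq_info struct and VFIO_DEVICE_GET_IRQ_INFO ioctl quoted
in this thread plus the vfio-pci index enum; error handling is trimmed):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void probe_irq_indexes(int device_fd)
{
	struct vfio_irq_info info = { .len = sizeof(info) };
	__u32 i;

	/* vfio-pci always reports VFIO_PCI_NUM_IRQS indexes; count == 0
	 * just means the device lacks that interrupt type. */
	for (i = 0; i < VFIO_PCI_NUM_IRQS; i++) {
		info.index = i;
		if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info))
			break;
		printf("index %u: count %u\n", i, info.count);
	}
}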


> > > > +
> > > > +When a level triggered interrupt is signaled, the interrupt is masked
> > > > +on the host.  This prevents an unresponsive userspace driver from
> > > > +continuing to interrupt the host system.  After servicing the 
> > > > interrupt,
> > > > +UNMASK_IRQ is used to allow the interrupt to retrigger.  Note that 
> > > > level
> > > > +triggered interrupts implicitly have a count of 1 per index.
> > > 
> > > So they are enabled automatically? Meaning you don't even hav to do
> > > SET_IRQ_EVENTFDS b/c the count is set to 1?
> > 
> > I suppose that should be "no more than 1 per index" (ie. PCI would
> > report a count of 0 for VFIO_PCI_INTX_IRQ_INDEX if the device doesn't
> > support INTx).  I think you might be confusing VFIO_DEVICE_GET_IRQ_INFO
> > which tells how many are available with VFIO_DEVICE_SET_IRQ_EVENTFDS
> > which does the enabling/disabling.  All interrupts are disabled by
> > default because userspace needs to give us a way to signal them via
> > eventfds.  It will be device dependent whether multiple index can be
> > enabled simultaneously.  Hmm, is that another flag on the irq_info
> > struct or do we expect drivers to implicitly have that kind of
> > knowledge?
> 
> Right, that was what I was wondering. Not sure how the PowerPC
> world works with this.

On second thought, I think an exclusive flag isn't appropriate.  VFIO is
not meant to abstract the device to the level that a user could write a
generic "vfio driver".  The user will always need to understand the type
of device, VFIO just provides the conduit to make use of it.  There's
too much left undefined with a simplistic exclusive flag.


> > > > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> > > 
> > > So only level triggered? Hmm, how do I know whether the device is
> > > level or edge? Or is it that edge (MSI) can also be unmasked using the
> > > eventfds?
> > 
> > Yes, only for level.  Isn't a device going to know what type of
> > int

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-16 Thread David Gibson
On Tue, Nov 15, 2011 at 11:01:28AM -0700, Alex Williamson wrote:
> On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> > > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > > new file mode 100644
> > > index 000..5866896
> > > --- /dev/null
> > > +++ b/Documentation/vfio.txt
> > > @@ -0,0 +1,304 @@
> > > +VFIO - "Virtual Function I/O"[1]
> > > +---
> > > +Many modern systems now provide DMA and interrupt remapping facilities
> > > +to help ensure I/O devices behave within the boundaries they've been
> > > +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> > > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > > +embedded powerpc systems (technology name unknown).  The VFIO driver
> > > +is an IOMMU/device agnostic framework for exposing direct device
> > > +access to userspace, in a secure, IOMMU protected environment.  In
> > > +other words, this allows safe, non-privileged, userspace drivers.
> > 
> > It's perhaps worth emphasisng that "safe" depends on the hardware
> > being sufficiently well behaved.  BenH, I know, thinks there are a
> > *lot* of cards that, e.g. have debug registers that allow a backdoor
> > to their own config space via MMIO, which would bypass vfio's
> > filtering of config space access.  And that's before we even get into
> > the varying degrees of completeness in the isolation provided by
> > different IOMMUs.
> 
> Fair enough.  I know Tom had emphasized "well behaved" in the original
> doc.  Virtual functions are probably the best indicator of well behaved.
> 
> > > +Why do we want that?  Virtual machines often make use of direct device
> > > +access ("device assignment") when configured for the highest possible
> > > +I/O performance.  From a device and host perspective, this simply turns
> > > +the VM into a userspace driver, with the benefits of significantly
> > > +reduced latency, higher bandwidth, and direct use of bare-metal device
> > > +drivers[2].
> > > +
> > > +Some applications, particularly in the high performance computing
> > > +field, also benefit from low-overhead, direct device access from
> > > +userspace.  Examples include network adapters (often non-TCP/IP based)
> > > +and compute accelerators.  Previous to VFIO, these drivers needed to
> > 
> > s/Previous/Prior/  although that may be a .us vs .au usage thing.
> 
> Same difference, AFAICT.
> 
> > > +go through the full development cycle to become a proper upstream driver,
> > > +be maintained out of tree, or make use of the UIO framework, which
> > > +has no notion of IOMMU protection, limited interrupt support, and
> > > +requires root privileges to access things like PCI configuration space.
> > > +
> > > +The VFIO driver framework intends to unify these, replacing both the
> > > +KVM PCI specific device assignment currently used as well as providing
> > > +a more secure, more featureful userspace driver environment than UIO.
> > > +
> > > +Groups, Devices, IOMMUs, oh my
> > > +---
> > > +
> > > +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> > > +can't always distinguish transactions from each individual device in
> > > +the system.  Sometimes this is because of the IOMMU design, such as with
> > > +PEs, other times it's caused by the I/O topology, for instance a
> > > +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> > > +devices created by these restrictions IOMMU groups (or just "groups" for
> > > +this document).
> > > +
> > > +The IOMMU cannot distinguish transactions between the individual devices
> > > +within the group, therefore the group is the basic unit of ownership for
> > > +a userspace process.  Because of this, groups are also the primary
> > > +interface to both devices and IOMMU domains in VFIO.
> > > +
> > > +The VFIO representation of groups is created as devices are added into
> > > +the framework by a VFIO bus driver.  The vfio-pci module is an example
> > > +of a bus driver.  This module registers devices along with a set of bus
> > > +specific callbacks with the VFIO core.  These callbacks provide the
> > > +interfaces later used for device access.  As each new group is created,
> > > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > > +character device.
> > 
> > Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
> > bus driver is per bus type, not per bus instance.   But grouping
> > constraints could be per bus instance, if you have a couple of
> > different models of PCI host bridge with IOMMUs of different
> > capabilities built in, for example.
> 
> Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
> instance.

Ok, how can that work?  vfio-pci is responsible for generating the
groupings, yes?

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-16 Thread Alex Williamson
On Tue, 2011-11-15 at 16:29 -0600, Scott Wood wrote:
> On 11/15/2011 03:40 PM, Aaron Fabbri wrote:
> > 
> > 
> > 
> > On 11/15/11 12:10 PM, "Scott Wood"  wrote:
> > 
> >> On 11/15/2011 12:34 AM, David Gibson wrote:
> >  
>  +static int allow_unsafe_intrs;
>  +module_param(allow_unsafe_intrs, int, 0);
>  +MODULE_PARM_DESC(allow_unsafe_intrs,
>  +"Allow use of IOMMUs which do not support interrupt remapping");
> >>>
> >>> This should not be a global option, but part of the AMD/Intel IOMMU
> >>> specific code.  In general it's a question of how strict the IOMMU
> >>> driver is about isolation when it determines what the groups are, and
> >>> only the IOMMU driver can know what the possibilities are for its
> >>> class of hardware.
> >>
> >> It's also a concern that is specific to MSIs.  In any case, I'm not sure
> >> that the ability to cause a spurious IRQ is bad enough to warrant
> >> disabling the entire subsystem by default on certain hardware.
> > 
> > I think the issue is more that the ability to create fake MSI interrupts can
> > lead to bigger exploits.
> > 
> > Originally we didn't have this parameter. It was added to reflect the
> > fact that MSI's triggered by guests are dangerous without the isolation that
> > interrupt remapping provides.
> > 
> > That is, it *should* be inconvenient to run without interrupt remapping HW
> > support.
> 
> A sysfs knob is sufficient inconvenience.  It should only affect whether
> you can use MSIs, and the relevant issue shouldn't be "has interrupt
> remapping" but "is there a hole".
> 
> Some systems might address the issue in ways other than IOMMU-level MSI
> translation.  Our interrupt controller provides enough separate 4K pages
> for MSI interrupt delivery for each PCIe IOMMU group to get its own (we
> currently only have 3, one per root complex) -- no special IOMMU feature
> required.
> 
> It doesn't help that the semantics of IOMMU_CAP_INTR_REMAP are
> undefined.  I shouldn't have to know how x86 IOMMUs work when
> implementing a driver for different hardware, just to know what the
> generic code is expecting.
> 
> As David suggests, if you want to do this it should be the x86 IOMMU
> driver that has a knob that controls how it forms groups in the absence
> of this support.

That is a possibility, we could push it down to the iommu driver which
could simply lump everything into a single groupid when interrupt
remapping is not supported.  Or more directly, when there is an exposure
that devices can trigger random MSIs in the host.  Then we wouldn't need
an option to override this in vfio, you'd just be stuck not being able
to use any devices if you can't bind everything to vfio.  That also
eliminates the possibility of flipping it on dynamically since we can't
handle groupids changing.  Then we'd need an iommu=group_unsafe_msi flag
to enable it.  Ok?  Thanks,

Alex
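
A sketch of that single-groupid fallback in an x86 IOMMU driver (names are
illustrative; intel_irq_remapping_enabled and pci_device_group_id are
stand-ins, not real symbols):

static int intel_iommu_device_group(struct device *dev, unsigned int *groupid)
{
	if (!intel_irq_remapping_enabled) {
		/* MSIs aren't isolated: every device shares fate */
		*groupid = 0;
		return 0;
	}
	return pci_device_group_id(dev, groupid);	/* normal isolation */
}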



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-16 Thread Scott Wood
On 11/11/2011 04:10 PM, Alex Williamson wrote:
> 
> Thanks Konrad!  Comments inline.
> 
> On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
>> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
>>> +When supported, as indicated by the device flags, reset the device.
>>> +
>>> +#define VFIO_DEVICE_RESET   _IO(';', 116)
>>
>> Does it disable the 'count'? Err, does it disable the IRQ on the
>> device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
>> to set new eventfds? Or does it re-use the eventfds and the device
>> is enabled after this?
> 
> It doesn't affect the interrupt programming.  Should it?

It should probably clear any currently pending interrupts, as if the
unmask IOCTL were called.

>>> +device tree properties of the device:
>>> +
>>> +struct vfio_dtpath {
>>> +__u32   len;/* length of structure */
>>> +__u32   index;
>>
>> 0 based I presume?
> 
> Everything else is, I would assume so.

Yes, it should be zero-based -- this matches how such indices are done
in the kernel device tree APIs.

>>> +__u64   flags;
>>> +#define VFIO_DTPATH_FLAGS_REGION   (1 << 0)
>>
>> What is region in this context?? Or would this make much more sense
>> if I knew what Device Tree actually is.
> 
> Powerpc guys, any comments?  This was their suggestion.  These are
> effectively the first device specific extension, available when
> VFIO_DEVICE_FLAGS_DT is set.

An assigned device may consist of an entire subtree of the device tree,
and both register banks and interrupts can come from any node in the
tree.  Region versus IRQ here indicates the context in which to
interpret index, in order to retrieve the path of the node that supplied
this particular region or IRQ.

>>> +};
>>> +#define VFIO_DEVICE_GET_DTPATH  _IOWR(';', 117, struct vfio_dtpath)
>>> +
>>> +struct vfio_dtindex {
>>> +__u32   len;/* length of structure */
>>> +__u32   index;
>>> +__u32   prop_type;
>>
>> Is that an enum type? Is this defined somewhere?
>>> +__u32   prop_index;
>>
>> What is the purpose of this field?
> 
> Need input from powerpc folks here

To identify what this resource (register bank or IRQ) this is, we need
both the path to the node and the index into the reg or interrupts
property within the node.

We also need to distinguish reg from ranges, and interrupts from
interrupt-map.  As you suggested elsewhere in the thread, the device
tree API should probably be left out for now, and added later along with
the device tree "bus" driver.

>>> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
>>> +   struct vfio_device *device)
>>> +{
>>> +   BUG_ON(!iommu->domain && device->attached);
>>
>> Whoa. Heavy hammer there.
>>
>> Perhaps WARN_ON as you do check it later on.
> 
> I think it's warranted, internal consistency is broken if we have a
> device that thinks it's attached to an iommu domain that doesn't exist.
> It should, of course, never happen and this isn't a performance path.
> 
[snip]
>>> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
>>> +  struct vfio_device *device)
>>> +{
>>> +   int ret;
>>> +
>>> +   BUG_ON(device->attached);
>>
>> How about:
>>
>> WARN_ON(device->attached, "The engineer who wrote the user-space device 
>> driver is trying to register
>> the device again! Tell him/her to stop please.\n");
> 
> I would almost demote this one to a WARN_ON, but userspace isn't in
> control of attaching and detaching devices from the iommu.  That's a
> side effect of getting the iommu or device file descriptor.  So again,
> this is an internal consistency check and it should never happen,
> regardless of userspace.

The rule isn't to use BUG for internal consistency checks and WARN for
stuff userspace can trigger, but rather to use BUG if you cannot
reasonably continue, WARN for "significant issues that need prompt
attention" that are reasonably recoverable.  Most instances of WARN are
internal consistency checks.

>From include/asm-generic/bug.h:
> If you're tempted to BUG(), think again:  is completely giving up
> really the *only* solution?  There are usually better options, where
> users don't need to reboot ASAP and can mostly shut down cleanly.

-Scott



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-16 Thread Konrad Rzeszutek Wilk
On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
> 
> Thanks Konrad!  Comments inline.
> 
> On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> > > VFIO provides a secure, IOMMU based interface for user space
> > > drivers, including device assignment to virtual machines.
> > > This provides the base management of IOMMU groups, devices,
> > > and IOMMU objects.  See Documentation/vfio.txt included in
> > > this patch for user and kernel API description.
> > > 
> > > Note, this implements the new API discussed at KVM Forum
> > > 2011, as represented by the driver version 0.2.  It's hoped
> > > that this provides a modular enough interface to support PCI
> > > and non-PCI userspace drivers across various architectures
> > > and IOMMU implementations.
> > > 
> > > Signed-off-by: Alex Williamson 
> > > ---
> > > 
> > > Fingers crossed, this is the last RFC for VFIO, but we need
> > > the iommu group support before this can go upstream
> > > (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> > > hoping this helps push that along.
> > > 
> > > Since the last posting, this version completely modularizes
> > > the device backends and better defines the APIs between the
> > > core VFIO code and the device backends.  I expect that we
> > > might also adopt a modular IOMMU interface as iommu_ops learns
> > > about different types of hardware.  Also many, many cleanups.
> > > Check the complete git history for details:
> > > 
> > > git://github.com/awilliam/linux-vfio.git vfio-ng
> > > 
> > > (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
> > > 
> > > This version, along with the supporting VFIO PCI backend can
> > > be found here:
> > > 
> > > git://github.com/awilliam/linux-vfio.git vfio-next-2003
> > > 
> > > I've held off on implementing a kernel->user signaling
> > > mechanism for now since the previous netlink version produced
> > > too many gag reflexes.  It's easy enough to set a bit in the
> > > group flags to indicate such support in the future, so I
> > > think we can move ahead without it.
> > > 
> > > Appreciate any feedback or suggestions.  Thanks,
> > > 
> > > Alex
> > > 
> > >  Documentation/ioctl/ioctl-number.txt |1 
> > >  Documentation/vfio.txt   |  304 +
> > >  MAINTAINERS  |8 
> > >  drivers/Kconfig  |2 
> > >  drivers/Makefile |1 
> > >  drivers/vfio/Kconfig |8 
> > >  drivers/vfio/Makefile|3 
> > >  drivers/vfio/vfio_iommu.c|  530 
> > >  drivers/vfio/vfio_main.c | 1151 
> > > ++
> > >  drivers/vfio/vfio_private.h  |   34 +
> > >  include/linux/vfio.h |  155 +
> > >  11 files changed, 2197 insertions(+), 0 deletions(-)
> > >  create mode 100644 Documentation/vfio.txt
> > >  create mode 100644 drivers/vfio/Kconfig
> > >  create mode 100644 drivers/vfio/Makefile
> > >  create mode 100644 drivers/vfio/vfio_iommu.c
> > >  create mode 100644 drivers/vfio/vfio_main.c
> > >  create mode 100644 drivers/vfio/vfio_private.h
> > >  create mode 100644 include/linux/vfio.h
> > > 
> > > diff --git a/Documentation/ioctl/ioctl-number.txt 
> > > b/Documentation/ioctl/ioctl-number.txt
> > > index 54078ed..59d01e4 100644
> > > --- a/Documentation/ioctl/ioctl-number.txt
> > > +++ b/Documentation/ioctl/ioctl-number.txt
> > > @@ -88,6 +88,7 @@ Code  Seq#(hex) Include FileComments
> > >   and kernel/power/user.c
> > >  '8'  all SNP8023 advanced NIC card
> > >   
> > > +';'  64-76   linux/vfio.h
> > >  '@'  00-0F   linux/radeonfb.hconflict!
> > >  '@'  00-0F   drivers/video/aty/aty128fb.cconflict!
> > >  'A'  00-1F   linux/apm_bios.hconflict!
> > > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > > new file mode 100644
> > > index 000..5866896
> > > --- /dev/null
> > > +++ b/Documentation/vfio.txt
> > > @@ -0,0 +1,304 @@
> > > +VFIO - "Virtual Function I/O"[1]
> > > +---
> > > +Many modern systems now provide DMA and interrupt remapping facilities
> > > +to help ensure I/O devices behave within the boundaries they've been
> > > +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> > > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > > +embedded powerpc systems (technology name unknown).  The VFIO driver
> > > +is an IOMMU/device agnostic framework for exposing direct device
> > > +access to userspace, in a secure, IOMMU protected environment.  In
> > > +other words, this allows safe, non-privileged, userspace drivers.
> > > +
> > > +Why do we want that?  Virtual machines often make use of direct device
> > > +access ("device assignment") when configured for the highest possible
> > > +I/O performance.

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-15 Thread Scott Wood
On 11/15/2011 03:40 PM, Aaron Fabbri wrote:
> 
> 
> 
> On 11/15/11 12:10 PM, "Scott Wood"  wrote:
> 
>> On 11/15/2011 12:34 AM, David Gibson wrote:
>  
 +static int allow_unsafe_intrs;
 +module_param(allow_unsafe_intrs, int, 0);
 +MODULE_PARM_DESC(allow_unsafe_intrs,
 +"Allow use of IOMMUs which do not support interrupt remapping");
>>>
>>> This should not be a global option, but part of the AMD/Intel IOMMU
>>> specific code.  In general it's a question of how strict the IOMMU
>>> driver is about isolation when it determines what the groups are, and
>>> only the IOMMU driver can know what the possibilities are for its
>>> class of hardware.
>>
>> It's also a concern that is specific to MSIs.  In any case, I'm not sure
>> that the ability to cause a spurious IRQ is bad enough to warrant
>> disabling the entire subsystem by default on certain hardware.
> 
> I think the issue is more that the ability to create fake MSI interrupts can
> lead to bigger exploits.
> 
> Originally we didn't have this parameter. It was added to reflect the
> fact that MSI's triggered by guests are dangerous without the isolation that
> interrupt remapping provides.
> 
> That is, it *should* be inconvenient to run without interrupt remapping HW
> support.

A sysfs knob is sufficient inconvenience.  It should only affect whether
you can use MSIs, and the relevant issue shouldn't be "has interrupt
remapping" but "is there a hole".

Some systems might address the issue in ways other than IOMMU-level MSI
translation.  Our interrupt controller provides enough separate 4K pages
for MSI interrupt delivery for each PCIe IOMMU group to get its own (we
currently only have 3, one per root complex) -- no special IOMMU feature
required.

It doesn't help that the semantics of IOMMU_CAP_INTR_REMAP are
undefined.  I shouldn't have to know how x86 IOMMUs work when
implementing a driver for different hardware, just to know what the
generic code is expecting.

As David suggests, if you want to do this it should be the x86 IOMMU
driver that has a knob that controls how it forms groups in the absence
of this support.

-Scott



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-15 Thread Aaron Fabbri



On 11/15/11 12:10 PM, "Scott Wood"  wrote:

> On 11/15/2011 12:34 AM, David Gibson wrote:
 
>>> +static int allow_unsafe_intrs;
>>> +module_param(allow_unsafe_intrs, int, 0);
>>> +MODULE_PARM_DESC(allow_unsafe_intrs,
>>> +"Allow use of IOMMUs which do not support interrupt remapping");
>> 
>> This should not be a global option, but part of the AMD/Intel IOMMU
>> specific code.  In general it's a question of how strict the IOMMU
>> driver is about isolation when it determines what the groups are, and
>> only the IOMMU driver can know what the possibilities are for its
>> class of hardware.
> 
> It's also a concern that is specific to MSIs.  In any case, I'm not sure
> that the ability to cause a spurious IRQ is bad enough to warrant
> disabling the entire subsystem by default on certain hardware.

I think the issue is more that the ability to create fake MSI interrupts can
lead to bigger exploits.

Originally we didn't have this parameter. It was added to reflect the
fact that MSI's triggered by guests are dangerous without the isolation that
interrupt remapping provides.

That is, it *should* be inconvenient to run without interrupt remapping HW
support.

-Aaron

> Probably best to just print a warning on module init if there are any
> known isolation holes, and let the admin decide whom (if anyone) to let
> use this.  If the hole is bad enough that it must be confirmed, it
> should require at most a sysfs poke.
> 
> -Scott
> 



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-15 Thread Scott Wood
On 11/15/2011 12:34 AM, David Gibson wrote:
> I think we need to pin exactly what "MAP_ANY" means down better.  Now,
> VFIO is pretty much a lost cause if you can't map any normal process
> memory page into the IOMMU, so I think the only thing that is really
> covered is IOVAs.  But saying "can map any IOVA" is not clear, because
> if you can't map it, it's not a (valid) IOVA.  Better to say that
> IOVAs can be any 64-bit value, which I think is what you really mean
> here.

It also means that there are no restrictions on what the IOVA can be
within that range (other than page alignment), which isn't true on our
IOMMU.

We'll also need a way to communicate the desired geometry of the overall
IOMMU table (for this group) to the kernel, which determines what the
restrictions will be (we can't determine it automatically until we know
what all the translation requests will be, and even then it's awkward).

> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
>> +When a level triggered interrupt is signaled, the interrupt is masked
>> +on the host.  This prevents an unresponsive userspace driver from
>> +continuing to interrupt the host system.  After servicing the interrupt,
>> +UNMASK_IRQ is used to allow the interrupt to retrigger.  Note that level
>> +triggered interrupts implicitly have a count of 1 per index.
> 
> This is a silly restriction.  Even PCI devices can have up to 4 LSIs
> on a function in theory, though no-one ever does.  Embedded devices
> can and do have multiple level interrupts.

Those interrupts would each have their own index.  This is necessary for
level-triggered interrupts since they'll need to be individually
identifiable to VFIO_DEVICE_UNMASK_IRQ -- doesn't seem worth adding
another parameter to UNMASK.
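
In userspace that per-index model looks something like this (a sketch;
assumes the eventfd signaling and the VFIO_DEVICE_UNMASK_IRQ ioctl
discussed in this thread, with an int index argument):

#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>

static void service_level_irq(int device_fd, int event_fd, int index)
{
	uint64_t count;

	read(event_fd, &count, sizeof(count));	/* host has masked the IRQ */

	/* ... quiesce the device's interrupt condition here ... */

	/* re-arm: allow this one index to retrigger */
	ioctl(device_fd, VFIO_DEVICE_UNMASK_IRQ, &index);
}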

>> +#ifdef CONFIG_COMPAT
>> +static long vfio_iommu_compat_ioctl(struct file *filep,
>> +unsigned int cmd, unsigned long arg)
>> +{
>> +arg = (unsigned long)compat_ptr(arg);
>> +return vfio_iommu_unl_ioctl(filep, cmd, arg);
> 
> Um, this only works if the structures are exactly compatible between
> 32-bit and 64-bit ABIs.  I don't think that is always true.

These are new structs, we can make it true.
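
E.g., a layout that is identical for 32- and 64-bit userspace (a sketch
with an illustrative struct name): only fixed-width types, no pointers or
longs, and a total size that is a multiple of 8 so tail padding can't
differ between ABIs:

struct vfio_dma_map_example {
	__u32	len;		/* length of structure */
	__u32	flags;
	__u64	vaddr;		/* user address as __u64, never a pointer */
	__u64	dmaaddr;
	__u64	size;
};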

>> +static int allow_unsafe_intrs;
>> +module_param(allow_unsafe_intrs, int, 0);
>> +MODULE_PARM_DESC(allow_unsafe_intrs,
>> +"Allow use of IOMMUs which do not support interrupt remapping");
> 
> This should not be a global option, but part of the AMD/Intel IOMMU
> specific code.  In general it's a question of how strict the IOMMU
> driver is about isolation when it determines what the groups are, and
> only the IOMMU driver can know what the possibilities are for its
> class of hardware.

It's also a concern that is specific to MSIs.  In any case, I'm not sure
that the ability to cause a spurious IRQ is bad enough to warrant
disabling the entire subsystem by default on certain hardware.

Probably best to just print a warning on module init if there are any
known isolation holes, and let the admin decide whom (if anyone) to let
use this.  If the hole is bad enough that it must be confirmed, it
should require at most a sysfs poke.

-Scott



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-15 Thread Alex Williamson
On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > new file mode 100644
> > index 000..5866896
> > --- /dev/null
> > +++ b/Documentation/vfio.txt
> > @@ -0,0 +1,304 @@
> > +VFIO - "Virtual Function I/O"[1]
> > +---
> > +Many modern systems now provide DMA and interrupt remapping facilities
> > +to help ensure I/O devices behave within the boundaries they've been
> > +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > +embedded powerpc systems (technology name unknown).  The VFIO driver
> > +is an IOMMU/device agnostic framework for exposing direct device
> > +access to userspace, in a secure, IOMMU protected environment.  In
> > +other words, this allows safe, non-privileged, userspace drivers.
> 
> It's perhaps worth emphasisng that "safe" depends on the hardware
> being sufficiently well behaved.  BenH, I know, thinks there are a
> *lot* of cards that, e.g. have debug registers that allow a backdoor
> to their own config space via MMIO, which would bypass vfio's
> filtering of config space access.  And that's before we even get into
> the varying degrees of completeness in the isolation provided by
> different IOMMUs.

Fair enough.  I know Tom had emphasized "well behaved" in the original
doc.  Virtual functions are probably the best indicator of well behaved.

> > +Why do we want that?  Virtual machines often make use of direct device
> > +access ("device assignment") when configured for the highest possible
> > +I/O performance.  From a device and host perspective, this simply turns
> > +the VM into a userspace driver, with the benefits of significantly
> > +reduced latency, higher bandwidth, and direct use of bare-metal device
> > +drivers[2].
> > +
> > +Some applications, particularly in the high performance computing
> > +field, also benefit from low-overhead, direct device access from
> > +userspace.  Examples include network adapters (often non-TCP/IP based)
> > +and compute accelerators.  Previous to VFIO, these drivers needed to
> 
> s/Previous/Prior/  although that may be a .us vs .au usage thing.

Same difference, AFAICT.

> > +go through the full development cycle to become a proper upstream driver,
> > +be maintained out of tree, or make use of the UIO framework, which
> > +has no notion of IOMMU protection, limited interrupt support, and
> > +requires root privileges to access things like PCI configuration space.
> > +
> > +The VFIO driver framework intends to unify these, replacing both the
> > +KVM PCI specific device assignment currently used as well as providing
> > +a more secure, more featureful userspace driver environment than UIO.
> > +
> > +Groups, Devices, IOMMUs, oh my
> > +---
> > +
> > +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> > +can't always distinguish transactions from each individual device in
> > +the system.  Sometimes this is because of the IOMMU design, such as with
> > +PEs, other times it's caused by the I/O topology, for instance a
> > +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> > +devices created by these restrictions IOMMU groups (or just "groups" for
> > +this document).
> > +
> > +The IOMMU cannot distinguish transactions between the individual devices
> > +within the group, therefore the group is the basic unit of ownership for
> > +a userspace process.  Because of this, groups are also the primary
> > +interface to both devices and IOMMU domains in VFIO.
> > +
> > +The VFIO representation of groups is created as devices are added into
> > +the framework by a VFIO bus driver.  The vfio-pci module is an example
> > +of a bus driver.  This module registers devices along with a set of bus
> > +specific callbacks with the VFIO core.  These callbacks provide the
> > +interfaces later used for device access.  As each new group is created,
> > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > +character device.
> 
> Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
> bus driver is per bus type, not per bus instance.   But grouping
> constraints could be per bus instance, if you have a couple of
> different models of PCI host bridge with IOMMUs of different
> capabilities built in, for example.

Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
instance.  IOMMUs also register drivers per bus type, not per bus
instance.  The IOMMU driver is free to impose any constraints it wants.

> > +In addition to the device enumeration and callbacks, the VFIO bus driver
> > +also provides a traditional device driver and is able to bind to devices
> > +on its bus.  When a device is bound to the bus driver it's available to
> > +VFIO.

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alex Williamson
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote:
> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
> > On 11/03/2011 03:12 PM, Alex Williamson wrote:
> > > + int (*get)(void *);
> > > + void(*put)(void *);
> > > + ssize_t (*read)(void *, char __user *,
> > > + size_t, loff_t *);
> > > + ssize_t (*write)(void *, const char __user *,
> > > +  size_t, loff_t *);
> > > + long(*ioctl)(void *, unsigned int, unsigned long);
> > > + int (*mmap)(void *, struct vm_area_struct *);
> > > +};
> > 
> > When defining an API, please do not omit parameter names.
> 
> ok
> 
> > Should specify what the driver is supposed to do with get/put -- I guess
> > not try to unbind when the count is nonzero?  Races could still lead the
> > unbinder to be blocked, but I guess it lets the driver know when it's
> > likely to succeed.
> 
> Right, for the pci bus driver, it's mainly for reference counting,
> including the module_get to prevent vfio-pci from being unloaded.  On
> the first get for a device, we also do a pci_enable() and pci_disable()
> on last put.  I'll try to clarify in the docs.

Looking at these again, I should just rename them to open/release.  That
matches the points when they're called.  I suspect I started with just
reference counting and it grew to more of a full blown open/release.
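
A sketch of what that pairing amounts to (the vfio_pci_device layout is
assumed, not taken from the patch, and locking is omitted):

static int vfio_pci_open(void *device_data)
{
	struct vfio_pci_device *vdev = device_data;
	int ret = 0;

	if (vdev->refcnt++ == 0)		/* first user */
		ret = pci_enable_device(vdev->pdev);
	if (ret)
		vdev->refcnt--;
	return ret;
}

static void vfio_pci_release(void *device_data)
{
	struct vfio_pci_device *vdev = device_data;

	if (--vdev->refcnt == 0)		/* last user */
		pci_disable_device(vdev->pdev);
}
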
Thanks,

Alex



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Benjamin Herrenschmidt
On Tue, 2011-11-15 at 11:05 +1100, David Gibson wrote:
> Being strict, or at least enforcing strictness, requires that the
> infrastructure track all the maps, so that the unmaps can be
> matching.  This is not a natural thing with the data structures you
> want for all IOMMUs.  For example on POWER, the IOMMU (aka TCE table)
> is a simple 1-level pagetable.  One pointer with a couple of
> permission bits per IOMMU page.  Handling oddly overlapping operations
> on that data structure is natural, enforcing strict matching of maps
> and unmaps is not and would require extra information to be stored by
> vfio.  On POWER, the IOMMU operations often *are* a hot path, so
> manipulating those structures would have a real cost, too. 

In fact they are a very hot path even. There's no way we can afford the
cost of tracking per page mapping/unmapping (other than bumping the page
count on a page that's currently mapped or via some debug-only feature).

Cheers,
Ben.




Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread David Gibson
On Mon, Nov 14, 2011 at 03:59:00PM -0700, Alex Williamson wrote:
> On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote:
[snip]

> > - the user either unmaps one specific mapping or 'all of them'.
> >   The 'all of them' case would also take care of those cases where
> >   the user does _not_ keep track of mappings and simply uses
> >   the "unmap from 0 to ~0" each time.
> > 
> > Because of this you could still provide an exact map/unmap logic
> > and allow such "unmap from 0 to ~0" by making the latter a special
> > case.
> > However, if we want to allow any arbitrary/inexact unmap request, then OK.
> 
> I can't think of any good reasons we shouldn't be more strict.  I think
> it was primarily just convenient to hit all the corner cases since we
> merge all the requests together for tracking and need to be able to
> split them back apart.  It does feel a little awkward to have a 0/~0
> special case though, but I don't think it's worth adding another ioctl
> to handle it.
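
For reference, the 0/~0 special case as userspace would issue it (a sketch;
the struct and ioctl names here are assumed from context in this thread,
not verified against the patch):

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void unmap_everything(int iommu_fd)
{
	struct vfio_dma_map unmap = {
		.len	 = sizeof(unmap),
		.dmaaddr = 0,
		.size	 = ~(__u64)0,	/* 0..~0: drop every mapping */
	};

	if (ioctl(iommu_fd, VFIO_IOMMU_UNMAP_DMA, &unmap))
		perror("VFIO_IOMMU_UNMAP_DMA");
}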

Being strict, or at least enforcing strictness, requires that the
infrastructure track all the maps, so that the unmaps can be
matching.  This is not a natural thing with the data structures you
want for all IOMMUs.  For example on POWER, the IOMMU (aka TCE table)
is a simple 1-level pagetable.  One pointer with a couple of
permission bits per IOMMU page.  Handling oddly overlapping operations
on that data structure is natural, enforcing strict matching of maps
and unmaps is not and would require extra information to be stored by
vfio.  On POWER, the IOMMU operations often *are* a hot path, so
manipulating those structures would have a real cost, too.
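
Illustrating the point (a sketch with made-up names, assuming 4K IOMMU
pages): a TCE table is a flat array with one entry per IOMMU page, so any
overlapping (ioba, npages) operation is a trivial loop that overwrites
whatever was there, with nothing extra to track:

#define TCE_PAGE_SHIFT	12
#define TCE_READ	0x1ULL
#define TCE_WRITE	0x2ULL

static void tce_set_range(__u64 *tce_table, unsigned long ioba_page,
			  unsigned long npages, __u64 hpa, __u64 perms)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		tce_table[ioba_page + i] =
			(hpa + ((__u64)i << TCE_PAGE_SHIFT)) | perms;
}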

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread David Gibson
On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
> Thanks Konrad!  Comments inline.
> On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
[snip]
> > > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > > +
> > > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int)
> > 
> > Don't want __u32?
> 
> It could be, not sure if it buys us anything, maybe even restricts us.
> We likely don't need 2^32 regions (famous last words?), so we could
> later define <0 to mean something?

As a rule, it's best to use explicit fixed width types for all ioctl()
arguments, to avoid compat hell for 32-bit userland on 64-bit kernel
setups.

[snip]
> > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > +type to index mapping).
> > 
> > I am not really sure what that means.
> 
> This is so PCI can expose:
> 
> enum {
> VFIO_PCI_INTX_IRQ_INDEX,
> VFIO_PCI_MSI_IRQ_INDEX,
> VFIO_PCI_MSIX_IRQ_INDEX,
> VFIO_PCI_NUM_IRQS
> };
> 
> So like regions it always exposes 3 IRQ indexes where count=0 if the
> device doesn't actually support that type of interrupt.  I just want to
> spell out that bus drivers have this kind of flexibility.

I knew what you were aiming for, so I could see what you meant here,
but I don't think the doco is very clearly expressed at all.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



RE: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alex Williamson
On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote:
> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Friday, November 11, 2011 10:04 AM
> > To: Christian Benvenuti (benve)
> > Cc: chr...@sous-sol.org; a...@au1.ibm.com; p...@au1.ibm.com;
> > d...@au1.ibm.com; joerg.roe...@amd.com; ag...@suse.de; Aaron Fabbri
> > (aafabbri); b08...@freescale.com; b07...@freescale.com; a...@redhat.com;
> > konrad.w...@oracle.com; kvm@vger.kernel.org; qemu-de...@nongnu.org;
> > io...@lists.linux-foundation.org; linux-...@vger.kernel.org
> > Subject: RE: [RFC PATCH] vfio: VFIO Driver core framework
> > 
> > On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
> > > Here are few minor comments on vfio_iommu.c ...
> > 
> > Sorry, I've been poking sticks at trying to figure out a clean way to
> > solve the force vfio driver attach problem.
> 
> Attach or detach?

Attach.  For the case when a new device appears that belongs to a group
that is already in use.  I'll probably add a claim() operation to the
vfio_device_ops that tells the driver to grab it.  I was hoping for pci
this would just add it to the dynamic ids, but that hits device lock
problems.

> > > > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > > > new file mode 100644
> > > > index 000..029dae3
> > > > --- /dev/null
> > > > +++ b/drivers/vfio/vfio_iommu.c
> > 
> > > > +
> > > > +#include "vfio_private.h"
> > >
> > > Doesn't the 'dma_'  prefix belong to the generic DMA code?
> > 
> > Sure, we could make these more vfio-centric.
> 
> Like vfio_dma_map_page?

Something like that, though _page doesn't seem appropriate as it tracks
a region.

> > 
> > > > +struct dma_map_page {
> > > > +   struct list_headlist;
> > > > +   dma_addr_t  daddr;
> > > > +   unsigned long   vaddr;
> > > > +   int npage;
> > > > +   int rdwr;
> > > > +};
> > > > +
> > > > +/*
> > > > + * This code handles mapping and unmapping of user data buffers
> > > > + * into DMA'ble space using the IOMMU
> > > > + */
> > > > +
> > > > +#define NPAGE_TO_SIZE(npage)   ((size_t)(npage) << PAGE_SHIFT)
> > > > +
> > > > +struct vwork {
> > > > +   struct mm_struct    *mm;
> > > > +   int npage;
> > > > +   struct work_struct  work;
> > > > +};
> > > > +
> > > > +/* delayed decrement for locked_vm */
> > > > +static void vfio_lock_acct_bg(struct work_struct *work)
> > > > +{
> > > > +   struct vwork *vwork = container_of(work, struct vwork, work);
> > > > +   struct mm_struct *mm;
> > > > +
> > > > +   mm = vwork->mm;
> > > > +   down_write(&mm->mmap_sem);
> > > > +   mm->locked_vm += vwork->npage;
> > > > +   up_write(&mm->mmap_sem);
> > > > +   mmput(mm);  /* unref mm */
> > > > +   kfree(vwork);
> > > > +}
> > > > +
> > > > +static void vfio_lock_acct(int npage)
> > > > +{
> > > > +   struct vwork *vwork;
> > > > +   struct mm_struct *mm;
> > > > +
> > > > +   if (!current->mm) {
> > > > +   /* process exited */
> > > > +   return;
> > > > +   }
> > > > +   if (down_write_trylock(&current->mm->mmap_sem)) {
> > > > +   current->mm->locked_vm += npage;
> > > > +   up_write(&current->mm->mmap_sem);
> > > > +   return;
> > > > +   }
> > > > +   /*
> > > > +* Couldn't get mmap_sem lock, so must setup to decrement
> > >   ^
> > >
> > > Increment?
> > 
> > Yep

Actually, side note, this is increment/decrement depending on the sign
of the parameter.  So "update" may be more appropriate.  I think Tom
originally used increment in one place and decrement in another to show
it's dual use.

> > 
> > > > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > > > start,
> >

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alexander Graf


On 14.11.2011 at 23:26, Scott Wood wrote:

> On 11/14/2011 02:54 PM, Alex Williamson wrote:
>> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
>>> What are the semantics of "desired and/or returned dma address"?
>> 
>> I believe the original intention was that a user could leave dmaaddr
>> clear and let the iommu layer provide an iova address.  The iommu api
>> has since evolved and that mapping scheme really isn't present anymore.
>> We'll currently fail if we can't map the requested address.  I'll update
>> the docs to make that be the definition.
> 
> OK... if there is any desire in the future to have the kernel pick an
> address (which could be useful for IOMMUs that don't set
> VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this,
> since zero could be a valid address to request (doesn't mean "clear").
> 
>>> Note that the "length of structure" approach means that ioctl numbers
>>> will change whenever this grows -- perhaps we should avoid encoding the
>>> struct size into these ioctls?
>> 
>> How so?  What's described here is effectively the base size.  If we
>> later add feature foo requiring additional fields, we set a flag, change
>> the size, and tack those fields onto the end.  The kernel side should
>> balk if the size doesn't match what it expects from the flags it
>> understands (which I think I probably need to be more strict about).
> 
> The size of the struct is encoded into the ioctl number via the _IOWR()
> macro.  If we want the struct to be growable in the future, we should
> leave that out and just use _IO().  Otherwise if the size of the struct
> changes, the ioctl number changes.  This is annoying for old userspace
> plus new kernel (have to add compat entries to the switch), and broken
> for old kernel plus new userspace.

Avi wanted to write up a patch for this to allow ioctls with arbitrary size, 
for exactly this purpose.

> 
>>> Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO?  It
>>> would still be useful for devices which don't do DMA, or where we accept
>>> the lack of protection/translation (e.g. we have a customer that wants
>>> to do KVM device assignment on one of our lower-end chips that lacks an
>>> IOMMU).
>> 
>> Ugh.  I'm not really onboard with it given that we're trying to sell
>> vfio as a secure user space driver interface with iommu-based
>> protection.
> 
> That's its main use case, but it doesn't make much sense to duplicate
> the non-iommu-related bits for other use cases.
> 
> This applies at runtime too, some devices don't do DMA at all (and thus
> may not be part of an IOMMU group, even if there is an IOMMU present for
> other devices -- could be considered a standalone group of one device,
> with a null IOMMU backend).  Support for such devices can wait, but it's
> good to keep the possibility in mind.

I agree. Potentially backing a device with a nop iommu also makes testing 
easier.

Alex

> 


Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Scott Wood
On 11/14/2011 02:54 PM, Alex Williamson wrote:
> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
>> What are the semantics of "desired and/or returned dma address"?
> 
> I believe the original intention was that a user could leave dmaaddr
> clear and let the iommu layer provide an iova address.  The iommu api
> has since evolved and that mapping scheme really isn't present anymore.
> We'll currently fail if we can't map the requested address.  I'll update
> the docs to make that be the definition.

OK... if there is any desire in the future to have the kernel pick an
address (which could be useful for IOMMUs that don't set
VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this,
since zero could be a valid address to request (doesn't mean "clear").

>> Note that the "length of structure" approach means that ioctl numbers
>> will change whenever this grows -- perhaps we should avoid encoding the
>> struct size into these ioctls?
> 
> How so?  What's described here is effectively the base size.  If we
> later add feature foo requiring additional fields, we set a flag, change
> the size, and tack those fields onto the end.  The kernel side should
> balk if the size doesn't match what it expects from the flags it
> understands (which I think I probably need to be more strict about).

The size of the struct is encoded into the ioctl number via the _IOWR()
macro.  If we want the struct to be growable in the future, we should
leave that out and just use _IO().  Otherwise if the size of the struct
changes, the ioctl number changes.  This is annoying for old userspace
plus new kernel (have to add compat entries to the switch), and broken
for old kernel plus new userspace.
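
For reference, the encoding in question, simplified from
include/asm-generic/ioctl.h:

#define _IOC(dir, type, nr, size) \
        (((dir) << _IOC_DIRSHIFT) | ((type) << _IOC_TYPESHIFT) | \
         ((nr) << _IOC_NRSHIFT) | ((size) << _IOC_SIZESHIFT))
#define _IOWR(type, nr, size) \
        _IOC(_IOC_READ | _IOC_WRITE, (type), (nr), sizeof(size))

/* so growing struct vfio_dma_map changes the value of
 * _IOWR(';', 106, struct vfio_dma_map), while _IO(';', 106)
 * encodes a size of zero and stays stable as the struct grows */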

>> Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO?  It
>> would still be useful for devices which don't do DMA, or where we accept
>> the lack of protection/translation (e.g. we have a customer that wants
>> to do KVM device assignment on one of our lower-end chips that lacks an
>> IOMMU).
> 
> Ugh.  I'm not really onboard with it given that we're trying to sell
> vfio as a secure user space driver interface with iommu-based
> protection.

That's its main use case, but it doesn't make much sense to duplicate
the non-iommu-related bits for other use cases.

This applies at runtime too, some devices don't do DMA at all (and thus
may not be part of an IOMMU group, even if there is an IOMMU present for
other devices -- could be considered a standalone group of one device,
with a null IOMMU backend).  Support for such devices can wait, but it's
good to keep the possibility in mind.

-Scott



Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alex Williamson
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote:
> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
> > On 11/03/2011 03:12 PM, Alex Williamson wrote: 
> > > + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > > + unsigned long pfn = 0;
> > > +
> > > + ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > > + if (ret) {
> > > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > > + return ret;
> > > + }
> > > +
> > > + /* Only add actual locked pages to accounting */
> > > + if (!is_invalid_reserved_pfn(pfn))
> > > + locked++;
> > > +
> > > + ret = iommu_map(iommu->domain, iova,
> > > + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> > > + if (ret) {
> > > + /* Back out mappings on error */
> > > + put_pfn(pfn, rdwr);
> > > + __vfio_dma_unmap(iommu, start, i, rdwr);
> > > + return ret;
> > > + }
> > > + }
> > 
> > There's no way to hand this stuff to the IOMMU driver in chunks larger
> > than a page?  That's going to be a problem for our IOMMU, which wants to
> > deal with large windows.
> 
> There is, this is just a simple implementation that maps individual
> pages.  We "just" need to determine physically contiguous chunks and
> mlock them instead of using get_user_pages.  The current implementation
> is much like how KVM maps iommu pages, but there shouldn't need to be a
> user API change to use larger chunks.  We want this for IOMMU large page
> support too.
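
The coalescing might look roughly like this (a sketch; the helper name
is invented, not in the patch, and the pfns are assumed to be pinned
already):

static int vfio_map_coalesced(struct iommu_domain *domain,
                              unsigned long iova, unsigned long *pfns,
                              long npage, int prot)
{
        long i = 0;

        while (i < npage) {
                long run = 1;
                int order;

                /* length of the physically contiguous run at i */
                while (i + run < npage && pfns[i + run] == pfns[i] + run)
                        run++;

                /* largest order keeping run, iova and phys aligned */
                order = ilog2(run);
                while (order && (((iova >> PAGE_SHIFT) | pfns[i]) &
                                 ((1UL << order) - 1)))
                        order--;

                if (iommu_map(domain, iova,
                              (phys_addr_t)pfns[i] << PAGE_SHIFT,
                              order, prot))
                        return -EFAULT; /* caller backs out as today */

                iova += (unsigned long)PAGE_SIZE << order;
                i += 1L << order;
        }
        return 0;
}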

Also, at one point intel-iommu didn't allow sub-ranges to be unmapped;
an unmap of a single page would unmap the entire original mapping that
contained that page.  That made it easier to map each page individually
for the flexibility it provided on unmap.  I need to see if we still
have that restriction.  Thanks,

Alex




Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-14 Thread Alex Williamson
On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
> On 11/03/2011 03:12 PM, Alex Williamson wrote:
> > +Many modern systems now provide DMA and interrupt remapping facilities
> > +to help ensure I/O devices behave within the boundaries they've been
> > +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > +embedded powerpc systems (technology name unknown).  
> 
> Maybe replace "(technology name unknown)" with "(such as Freescale chips
> with PAMU)" or similar?
> 
> Or just leave out the parenthetical.

I was hoping that comment would lead to an answer.  Thanks for the
info ;)

> > +As documented in linux/vfio.h, several ioctls are provided on the
> > +group chardev:
> > +
> > +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> > + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> > +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> > +#define VFIO_GROUP_UNMERGE  _IOW(';', 102, int)
> > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> > +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)
> 
> This suggests the argument to VFIO_GROUP_GET_DEVICE_FD is a pointer to a
> pointer to char rather than a pointer to an array of char (just as e.g.
> VFIO_GROUP_MERGE takes a pointer to an int, not just an int).

I believe I was following the UI_SET_PHYS ioctl as an example, which is
defined as a char *.  I'll change to char and verify.
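
I.e. something like this (the PCI address is just an example):

#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char)

/* the argument is the device name string itself */
int dev_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");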

> > +The IOMMU file descriptor provides this set of ioctls:
> > +
> > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> > + #define VFIO_IOMMU_FLAGS_MAP_ANY   (1 << 0)
> > +#define VFIO_IOMMU_MAP_DMA  _IOWR(';', 106, struct vfio_dma_map)
> > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)
> 
> What is the implication if VFIO_IOMMU_FLAGS_MAP_ANY is clear?  Is such
> an implementation supposed to add a new flag that describes its
> restrictions?

If MAP_ANY is clear then I would expect a new flag is set defining a new
mapping paradigm, probably with an ioctl to describe the
restrictions/parameters.  MAP_ANY effectively means there are no
restrictions.

> Can we get a way to turn DMA access off and on, short of unmapping
> everything, and then mapping it again?

iommu_ops doesn't support such an interface, so no, not currently.

> > +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> > +We currently only support IOMMU domains that are able to map any
> > +virtual address to any IOVA.  This is indicated by the MAP_ANY flag.
> > +
> > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > +and unmapping IOVAs to process virtual addresses:
> > +
> > +struct vfio_dma_map {
> > +__u64   len;/* length of structure */
> > +__u64   vaddr;  /* process virtual addr */
> > +__u64   dmaaddr;/* desired and/or returned dma address */
> > +__u64   size;   /* size in bytes */
> > +__u64   flags;
> > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> > +};
> 
> What are the semantics of "desired and/or returned dma address"?

I believe the original intention was that a user could leave dmaaddr
clear and let the iommu layer provide an iova address.  The iommu api
has since evolved and that mapping scheme really isn't present anymore.
We'll currently fail if we can't map the requested address.  I'll update
the docs to make that the definition.

> Are we always supposed to provide a desired address, but it may be
> different on return?  Or are there cases where we want to say "give me
> whatever you want" or "give me this or fail"?

Exactly, that's what it used to be, but we don't really implement that
any more.

> How much of this needs to be filled out for unmap?

dmaaddr & size, will update docs.

> Note that the "length of structure" approach means that ioctl numbers
> will change whenever this grows -- perhaps we should avoid encoding the
> struct size into these ioctls?

How so?  What's described here is effectively the base size.  If we
later add feature foo requiring additional fields, we set a flag, change
the size, and tack those fields onto the end.  The kernel side should
balk if the size doesn't match what it expects from the flags it
understands (which I think I probably need to be more strict about).
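
The stricter check would be roughly (a sketch for the base struct):

struct vfio_dma_map map;

if (copy_from_user(&map, (void __user *)arg, sizeof(map)))
        return -EFAULT;

/* no feature flags we don't understand... */
if (map.flags & ~VFIO_DMA_MAP_FLAG_WRITE)
        return -EINVAL;

/* ...and len must match the fields those flags imply */
if (map.len != sizeof(map))
        return -EINVAL;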

> > +struct vfio_region_info {
> > +__u32   len;/* length of structure */
> > +__u32   index;  /* region number */
> > +__u64   size;   /* size in bytes of region */
> > +__u64   offset; /* start offset of region */
> > +__u64   flags;
> > +#define VFIO_REGION_INFO_FLAG_MMAP  (1 << 0)
> > +#define VFIO_REGION_INFO_FLAG_RO(1 << 1)
> > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID   

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-11 Thread Scott Wood
On 11/03/2011 03:12 PM, Alex Williamson wrote:
> +Many modern systems now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> +well as POWER systems with Partitionable Endpoints (PEs) and even
> +embedded powerpc systems (technology name unknown).  

Maybe replace "(technology name unknown)" with "(such as Freescale chips
with PAMU)" or similar?

Or just leave out the parenthetical.

> +As documented in linux/vfio.h, several ioctls are provided on the
> +group chardev:
> +
> +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1)
> +#define VFIO_GROUP_MERGE _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE  _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *)

This suggests the argument to VFIO_GROUP_GET_DEVICE_FD is a pointer to a
pointer to char rather than a pointer to an array of char (just as e.g.
VFIO_GROUP_MERGE takes a pointer to an int, not just an int).

> +The IOMMU file descriptor provides this set of ioctls:
> +
> +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64)
> + #define VFIO_IOMMU_FLAGS_MAP_ANY   (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA  _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map)

What is the implication if VFIO_IOMMU_FLAGS_MAP_ANY is clear?  Is such
an implementation supposed to add a new flag that describes its
restrictions?

Can we get a way to turn DMA access off and on, short of unmapping
everything, and then mapping it again?

> +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> +We currently only support IOMMU domains that are able to map any
> +virtual address to any IOVA.  This is indicated by the MAP_ANY flag.
> +
> +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> +and unmapping IOVAs to process virtual addresses:
> +
> +struct vfio_dma_map {
> +__u64   len;/* length of structure */
> +__u64   vaddr;  /* process virtual addr */
> +__u64   dmaaddr;/* desired and/or returned dma address */
> +__u64   size;   /* size in bytes */
> +__u64   flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */
> +};

What are the semantics of "desired and/or returned dma address"?

Are we always supposed to provide a desired address, but it may be
different on return?  Or are there cases where we want to say "give me
whatever you want" or "give me this or fail"?

How much of this needs to be filled out for unmap?

Note that the "length of structure" approach means that ioctl numbers
will change whenever this grows -- perhaps we should avoid encoding the
struct size into these ioctls?

> +struct vfio_region_info {
> +__u32   len;/* length of structure */
> +__u32   index;  /* region number */
> +__u64   size;   /* size in bytes of region */
> +__u64   offset; /* start offset of region */
> +__u64   flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP  (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO(1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2)
> +__u64   phys;   /* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info)
> +
> +The offset indicates the offset into the device file descriptor which
> +accesses the given range (for read/write/mmap/seek).  Flags indicate the
> +available access types and validity of optional fields.  For instance
> +the phys field may only be valid for certain devices types.
> +
> +Interrupts are described using a similar interface.  GET_NUM_IRQS
> +reports the number of IRQ indexes for the device.
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> +__u32   len;/* length of structure */
> +__u32   index;  /* IRQ number */
> +__u32   count;  /* number of individual IRQs */
> +__u64   flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0)

Make sure flags is 64-bit aligned -- some 32-bit ABIs, such as x86, will
not do this, causing problems if the kernel is 64-bit and thus assumes a
different layout.
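
I.e. the three __u32 fields put flags at offset 12 on i386 but at
offset 16 on x86_64; explicit padding keeps the two layouts identical:

struct vfio_irq_info {
        __u32   len;    /* length of structure */
        __u32   index;  /* IRQ number */
        __u32   count;  /* number of individual IRQs */
        __u32   __pad;  /* keep flags 64-bit aligned on 32-bit ABIs */
        __u64   flags;
};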

> +Information about each index can be retrieved using the GET_IRQ_INFO
> +ioctl, used much like GET_REGION_INFO.
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info)
> +
> +Individual indexes can describe single or sets of IRQs.  This provides the
> +flexibility to describe PCI INTx, MSI, and MSI-X using a single inter

RE: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-11 Thread Christian Benvenuti (benve)
> -Original Message-
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Friday, November 11, 2011 10:04 AM
> To: Christian Benvenuti (benve)
> Cc: chr...@sous-sol.org; a...@au1.ibm.com; p...@au1.ibm.com;
> d...@au1.ibm.com; joerg.roe...@amd.com; ag...@suse.de; Aaron Fabbri
> (aafabbri); b08...@freescale.com; b07...@freescale.com; a...@redhat.com;
> konrad.w...@oracle.com; kvm@vger.kernel.org; qemu-de...@nongnu.org;
> io...@lists.linux-foundation.org; linux-...@vger.kernel.org
> Subject: RE: [RFC PATCH] vfio: VFIO Driver core framework
> 
> On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
> > Here are a few minor comments on vfio_iommu.c ...
> 
> Sorry, I've been poking sticks at trying to figure out a clean way to
> solve the force vfio driver attach problem.

Attach or detach?

> > > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > > new file mode 100644
> > > index 000..029dae3
> > > --- /dev/null
> > > +++ b/drivers/vfio/vfio_iommu.c
> 
> > > +
> > > +#include "vfio_private.h"
> >
> > Doesn't the 'dma_'  prefix belong to the generic DMA code?
> 
> Sure, we could make these more vfio-centric.

Like vfio_dma_map_page?

> 
> > > +struct dma_map_page {
> > > + struct list_headlist;
> > > + dma_addr_t  daddr;
> > > + unsigned long   vaddr;
> > > + int npage;
> > > + int rdwr;
> > > +};
> > > +
> > > +/*
> > > + * This code handles mapping and unmapping of user data buffers
> > > + * into DMA'ble space using the IOMMU
> > > + */
> > > +
> > > +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> > > +
> > > +struct vwork {
> > > + struct mm_struct*mm;
> > > + int npage;
> > > + struct work_struct  work;
> > > +};
> > > +
> > > +/* delayed decrement for locked_vm */
> > > +static void vfio_lock_acct_bg(struct work_struct *work)
> > > +{
> > > + struct vwork *vwork = container_of(work, struct vwork, work);
> > > + struct mm_struct *mm;
> > > +
> > > + mm = vwork->mm;
> > > + down_write(&mm->mmap_sem);
> > > + mm->locked_vm += vwork->npage;
> > > + up_write(&mm->mmap_sem);
> > > + mmput(mm);  /* unref mm */
> > > + kfree(vwork);
> > > +}
> > > +
> > > +static void vfio_lock_acct(int npage)
> > > +{
> > > + struct vwork *vwork;
> > > + struct mm_struct *mm;
> > > +
> > > + if (!current->mm) {
> > > + /* process exited */
> > > + return;
> > > + }
> > > + if (down_write_trylock(&current->mm->mmap_sem)) {
> > > + current->mm->locked_vm += npage;
> > > + up_write(&current->mm->mmap_sem);
> > > + return;
> > > + }
> > > + /*
> > > +  * Couldn't get mmap_sem lock, so must setup to decrement
> >   ^
> >
> > Increment?
> 
> Yep
> 
> 
> > > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > > start,
> > > + size_t size, struct dma_map_page *mlp)
> > > +{
> > > + struct dma_map_page *split;
> > > + int npage_lo, npage_hi;
> > > +
> > > + /* Existing dma region is completely covered, unmap all */
> >
> > This works. However, given how vfio_dma_map_dm implements the merging
> > logic, I think it is impossible to have
> >
> > (start < mlp->daddr &&
> >  start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))
> 
> It's quite possible.  This allows userspace to create a sparse mapping,
> then blow it all away with a single unmap from 0 to ~0.

I would prefer the user to use exact ranges in the unmap operations
because it would make it easier to detect bugs/leaks in the map/unmap
logic used by the callers.
My assumptions are that:

- the user always keeps track of the mappings

- the user either unmaps one specific mapping or 'all of them'.
  The 'all of them' case would also take care of those cases where
  the user does _not_ keep track of mappings and simply uses
  the "unmap from 0 to ~0" each time.

Because of this you could still provide an exact map/unmap logic
and allow such "unmap from 0 to ~0" by making the 

RE: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-11 Thread Alex Williamson
On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
> Here are a few minor comments on vfio_iommu.c ...

Sorry, I've been poking sticks at trying to figure out a clean way to
solve the force vfio driver attach problem.

> > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > new file mode 100644
> > index 000..029dae3
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu.c

> > +
> > +#include "vfio_private.h"
> 
> Doesn't the 'dma_'  prefix belong to the generic DMA code?

Sure, we could make these more vfio-centric.

> > +struct dma_map_page {
> > +   struct list_headlist;
> > +   dma_addr_t  daddr;
> > +   unsigned long   vaddr;
> > +   int npage;
> > +   int rdwr;
> > +};
> > +
> > +/*
> > + * This code handles mapping and unmapping of user data buffers
> > + * into DMA'ble space using the IOMMU
> > + */
> > +
> > +#define NPAGE_TO_SIZE(npage)   ((size_t)(npage) << PAGE_SHIFT)
> > +
> > +struct vwork {
> > +   struct mm_struct*mm;
> > +   int npage;
> > +   struct work_struct  work;
> > +};
> > +
> > +/* delayed decrement for locked_vm */
> > +static void vfio_lock_acct_bg(struct work_struct *work)
> > +{
> > +   struct vwork *vwork = container_of(work, struct vwork, work);
> > +   struct mm_struct *mm;
> > +
> > +   mm = vwork->mm;
> > +   down_write(&mm->mmap_sem);
> > +   mm->locked_vm += vwork->npage;
> > +   up_write(&mm->mmap_sem);
> > +   mmput(mm);  /* unref mm */
> > +   kfree(vwork);
> > +}
> > +
> > +static void vfio_lock_acct(int npage)
> > +{
> > +   struct vwork *vwork;
> > +   struct mm_struct *mm;
> > +
> > +   if (!current->mm) {
> > +   /* process exited */
> > +   return;
> > +   }
> > +   if (down_write_trylock(&current->mm->mmap_sem)) {
> > +   current->mm->locked_vm += npage;
> > +   up_write(&current->mm->mmap_sem);
> > +   return;
> > +   }
> > +   /*
> > +* Couldn't get mmap_sem lock, so must setup to decrement
>   ^
> 
> Increment?

Yep


> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > start,
> > +   size_t size, struct dma_map_page *mlp)
> > +{
> > +   struct dma_map_page *split;
> > +   int npage_lo, npage_hi;
> > +
> > +   /* Existing dma region is completely covered, unmap all */
> 
> This works. However, given how vfio_dma_map_dm implements the merging
> logic, I think it is impossible to have
> 
> (start < mlp->daddr &&
>  start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))

It's quite possible.  This allows userspace to create a sparse mapping,
then blow it all away with a single unmap from 0 to ~0.
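
From userspace that's a single call, something like this (a sketch;
size has to stay page aligned, so "~0" really means the largest
aligned value):

struct vfio_dma_map unmap = {
        .len     = sizeof(unmap),
        .dmaaddr = 0,
        .size    = ~(__u64)0 & ~((__u64)getpagesize() - 1),
};

ioctl(iommu_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);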

> > +   if (start <= mlp->daddr &&
> > +   start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +   vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > +   list_del(&mlp->list);
> > +   npage_lo = mlp->npage;
> > +   kfree(mlp);
> > +   return npage_lo;
> > +   }
> > +
> > +   /* Overlap low address of existing range */
> 
> Same as above (ie, '<' is impossible)

existing:   |<--- A --->|  |<--- B --->|
unmap:|<--- C --->|

Maybe not good practice from userspace, but we shouldn't count on
userspace to be well behaved.

> > +   if (start <= mlp->daddr) {
> > +   size_t overlap;
> > +
> > +   overlap = start + size - mlp->daddr;
> > +   npage_lo = overlap >> PAGE_SHIFT;
> > +   npage_hi = mlp->npage - npage_lo;
> > +
> > +   vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > +   mlp->daddr += overlap;
> > +   mlp->vaddr += overlap;
> > +   mlp->npage -= npage_lo;
> > +   return npage_lo;
> > +   }
> 
> Same as above (ie, '>' is impossible).

Same example as above.

> > +   /* Overlap high address of existing range */
> > +   if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +   size_t overlap;
> > +
> > +   overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > +   npage_hi = overlap >> PAGE_SHIFT;
> > +   npage_lo = mlp->npage - npage_hi;
> > +
> > +   vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > +   mlp->npage -= npage_hi;
> > +   return npage_hi;
> > +   }

> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map
> > *dmp)
> > +{
> > +   int npage;
> > +   struct dma_map_page *mlp, *mmlp = NULL;
> > +   dma_addr_t daddr = dmp->dmaaddr;
> > +   unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > +   size_t size = dmp->size;
> > +   int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > +
> > +   if (vaddr & (PAGE_SIZE-1))
> > +   return -EINVAL;
> > +   if (daddr & (PAGE_SIZE-1))
> > +   return -EINVAL;
> > +   if (size & (PAGE_SIZE-1))
> > +   return -EINVAL;
> > +
> > +   npage = si

RE: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-09 Thread Christian Benvenuti (benve)
Here are a few minor comments on vfio_iommu.c ...

> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c
> @@ -0,0 +1,530 @@
> +/*
> + * VFIO: IOMMU DMA mapping support
> + *
> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> + * Author: Alex Williamson 
> + *
> + * This program is free software; you can redistribute it and/or
> modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, p...@cisco.com
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "vfio_private.h"

Doesn't the 'dma_'  prefix belong to the generic DMA code?

> +struct dma_map_page {
> + struct list_headlist;
> + dma_addr_t  daddr;
> + unsigned long   vaddr;
> + int npage;
> + int rdwr;
> +};
> +
> +/*
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT)
> +
> +struct vwork {
> + struct mm_struct*mm;
> + int npage;
> + struct work_struct  work;
> +};
> +
> +/* delayed decrement for locked_vm */
> +static void vfio_lock_acct_bg(struct work_struct *work)
> +{
> + struct vwork *vwork = container_of(work, struct vwork, work);
> + struct mm_struct *mm;
> +
> + mm = vwork->mm;
> + down_write(&mm->mmap_sem);
> + mm->locked_vm += vwork->npage;
> + up_write(&mm->mmap_sem);
> + mmput(mm);  /* unref mm */
> + kfree(vwork);
> +}
> +
> +static void vfio_lock_acct(int npage)
> +{
> + struct vwork *vwork;
> + struct mm_struct *mm;
> +
> + if (!current->mm) {
> + /* process exited */
> + return;
> + }
> + if (down_write_trylock(&current->mm->mmap_sem)) {
> + current->mm->locked_vm += npage;
> + up_write(&current->mm->mmap_sem);
> + return;
> + }
> + /*
> +  * Couldn't get mmap_sem lock, so must setup to decrement
  ^

Increment?

> +  * mm->locked_vm later. If locked_vm were atomic, we wouldn't
> +  * need this silliness
> +  */
> + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> + if (!vwork)
> + return;
> + mm = get_task_mm(current);  /* take ref mm */
> + if (!mm) {
> + kfree(vwork);
> + return;
> + }
> + INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> + vwork->mm = mm;
> + vwork->npage = npage;
> + schedule_work(&vwork->work);
> +}
> +
> +/* Some mappings aren't backed by a struct page, for example an mmap'd
> + * MMIO range for our own or another device.  These use a different
> + * pfn conversion and shouldn't be tracked as locked pages. */
> +static int is_invalid_reserved_pfn(unsigned long pfn)
> +{
> + if (pfn_valid(pfn)) {
> + int reserved;
> + struct page *tail = pfn_to_page(pfn);
> + struct page *head = compound_trans_head(tail);
> + reserved = PageReserved(head);
> + if (head != tail) {
> + /* "head" is not a dangling pointer
> +  * (compound_trans_head takes care of that)
> +  * but the hugepage may have been split
> +  * from under us (and we may not hold a
> +  * reference count on the head page so it can
> +  * be reused before we run PageReferenced), so
> +  * we've to check PageTail before returning
> +  * what we just read.
> +  */
> + smp_rmb();
> + if (PageTail(tail))
> + return reserved;
> + }
> + return PageReserved(tail);
> + }
> +
> + return true;
> +}
> +
> +static int put_pfn(unsigned long pfn, int rdwr)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + struct page *page = pfn_to_page(pfn);
> + if (rdwr)
> + SetPageDirty(page);
> + put_page(page);
> + return 1;
> + }
> + return 0;
> +}
> +
> +/* Unmap DMA region */
> +/* dgate must be held */
> +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long
> iova,
> + int npage, int rdwr)
> +{
> + int i, unlocked = 0;
> +
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> + unsigned long pfn;
> +
> + pfn = iommu_iova_to_phys(iommu->dom

RE: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-09 Thread Alex Williamson
On Wed, 2011-11-09 at 15:08 -0600, Christian Benvenuti (benve) wrote:

> > > > +
> > > > +struct vfio_group {
> > > > +   dev_t   devt;
> > > > +   unsigned intgroupid;
> > >
> > > This groupid is returned by the device_group callback you recently
> > added
> > > with a separate (not yet in tree) IOMMU patch.
> > > Is it correct to say that the scope of this ID is the bus the iommu
> > > belongs too (but you use it as if it was global)?
> > > I believe there is nothing right now to ensure the uniqueness of such
> > > ID across bus types (assuming there will be other bus drivers in the
> > > future besides vfio-pci).
> > > If that's the case, the vfio.group_list global list and the
> > __vfio_lookup_dev
> > > routine should be changed to account for the bus too?
> > > Ops, I just saw the error msg in vfio_group_add_dev about the group
> > id conflict.
> > > Is that warning related to what I mentioned above?
> > 
> > Yeah, this is a concern, but I can't think of a system where we would
> > manifest a collision.  The IOMMU driver is expected to provide unique
> > groupids for all devices below them, but we could imagine a system that
> > implements two different bus_types, each with a different IOMMU driver
> > and we have no coordination between them.  Perhaps since we have
> > iommu_ops per bus, we should also expose the bus in the vfio group
> > path,
> > ie. /dev/vfio/%s/%u, dev->bus->name, iommu_device_group(dev,..).  This
> > means userspace would need to do a readlink of the subsystem entry
> > where
> > it finds the iommu_group to find the vfio group.  Reasonable?
> 
> Most probably we won't see use cases with multiple buses anytime soon, but
> this scheme you proposed (with the per-bus subdir) looks good to me. 

Ok, I think that's easier than any scheme of trying to organize globally
unique groupids instead of just bus_type unique.  That makes group
objects internally matched by the {groupid, bus} pair.
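
Group lookup then keys on the pair, roughly (a sketch against the
structs in this patch; the helper name is invented):

static struct vfio_group *__vfio_find_group(struct bus_type *bus,
                                            unsigned int groupid)
{
        struct vfio_group *group;

        /* groupids are only unique per bus_type, so match both */
        list_for_each_entry(group, &vfio.group_list, group_next)
                if (group->bus == bus && group->groupid == groupid)
                        return group;

        return NULL;
}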


> > >
> > > I looked at how you take care of ref counts ...
> > >
> > > This is how the tree of vfio_iommu/vfio_group/vfio_device data
> > > structures is organized (I'll use just iommu/group/dev to make
> > > the graph smaller):
> > >
> > > iommu
> > >/ \
> > >   /   \
> > > group   ... group
> > > /  \   /  \
> > >/\ /\
> > > dev  ..  dev   dev  ..  dev
> > >
> > > This is how you get a file descriptor for the three kind of objects:
> > >
> > > - group : open /dev/vfio/xxx for group xxx
> > > - iommu : group ioctl VFIO_GROUP_GET_IOMMU_FD
> > > - device: group ioctl VFIO_GROUP_GET_DEVICE_FD
> > >
> > > Given the above topology, I would assume that:
> > >
> > > (1) an iommu is 'inuse' if : a) iommu refcnt > 0, or
> > >  b) any of its groups is 'inuse'
> > >
> > > (2) a  group is 'inuse' if : a) group refcnt > 0, or
> > >  b) any of its devices is 'inuse'
> > >
> > > (3) a device is 'inuse' if : a) device refcnt > 0
> > 
> > (2) is a bit debatable.  I've wrestled with this one for a while.  The
> > vfio_iommu serves two purposes.  First, it is the object we use for
> > managing iommu domains, which includes allocating domains and attaching
> > devices to domains.  Groups objects aren't involved here, they just
> > manage the set of devices.  The second role is to manage merged groups,
> > because whether or not groups can be merged is a function of iommu
> > domain compatibility.
> > 
> > So if we look at "is the iommu in use?" ie. can I destroy the mapping
> > context, detach devices and free the domain, the reference count on the
> > group is irrelevant.  The user has to have a device or iommu file
> > descriptor opened somewhere, across the group or merged group, for that
> > context to be maintained.  A reasonable requirement, I think.
> 
> OK, then if you close all devices and the iommu, keeping the group open
> would not protect the iommu domain mapping. This means that if you (or
> a management application) need to close all devices+iommu and then
> reopen the same devices+iommu right away, you may get a failure on the
> iommu domain creation (supposing the system runs out of resources).
> Is this just a very unlikely scenario? 

Can you think of a use case that would require such?  I can't.

> I guess in this case you would simply have to avoid releasing the iommu
> fd, right?

Right.  We could also debate whether we should drop all iommu mappings
when the iommu refcnt goes to zero.  We don't currently do that, but it
might make sense.

> 
> > However, if we ask "is the group in use?" ie. can I not only destroy
> > the
> > mappings above, but also automatically tear apart merged groups, then I
> > think we need to look at the group refcnt.
> 
> Correct.
> 
> > There's also a symmetry factor, the group is a benign entry point to
> > device access.  It's only when device or iommu access is granted that
> > the group gains an

RE: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-09 Thread Alex Williamson
On Wed, 2011-11-09 at 02:11 -0600, Christian Benvenuti (benve) wrote:
> I have not gone through the all patch yet, but here are
> my first comments/questions about the code in vfio_main.c
> (and pci/vfio_pci.c).

Thanks!  Comments inline...

> > -Original Message-
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Thursday, November 03, 2011 1:12 PM
> > To: chr...@sous-sol.org; a...@au1.ibm.com; p...@au1.ibm.com;
> > d...@au1.ibm.com; joerg.roe...@amd.com; ag...@suse.de; Christian
> > Benvenuti (benve); Aaron Fabbri (aafabbri); b08...@freescale.com;
> > b07...@freescale.com; a...@redhat.com; konrad.w...@oracle.com;
> > kvm@vger.kernel.org; qemu-de...@nongnu.org; iommu@lists.linux-
> > foundation.org; linux-...@vger.kernel.org
> > Subject: [RFC PATCH] vfio: VFIO Driver core framework
> 
> 
> 
> > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > new file mode 100644
> > index 000..6169356
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_main.c
> > @@ -0,0 +1,1151 @@
> > +/*
> > + * VFIO framework
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> > + * Author: Alex Williamson 
> > + *
> > + * This program is free software; you can redistribute it and/or
> > modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > + * Author: Tom Lyon, p...@cisco.com
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include "vfio_private.h"
> > +
> > +#define DRIVER_VERSION "0.2"
> > +#define DRIVER_AUTHOR  "Alex Williamson "
> > +#define DRIVER_DESC"VFIO - User Level meta-driver"
> > +
> > +static int allow_unsafe_intrs;
> > +module_param(allow_unsafe_intrs, int, 0);
> > +MODULE_PARM_DESC(allow_unsafe_intrs,
> > +"Allow use of IOMMUs which do not support interrupt
> > remapping");
> > +
> > +static struct vfio {
> > +   dev_t   devt;
> > +   struct cdev cdev;
> > +   struct list_headgroup_list;
> > +   struct mutexlock;
> > +   struct kref kref;
> > +   struct class*class;
> > +   struct idr  idr;
> > +   wait_queue_head_t   release_q;
> > +} vfio;
> > +
> > +static const struct file_operations vfio_group_fops;
> > +extern const struct file_operations vfio_iommu_fops;
> > +
> > +struct vfio_group {
> > +   dev_t   devt;
> > +   unsigned intgroupid;
> 
> This groupid is returned by the device_group callback you recently added
> with a separate (not yet in tree) IOMMU patch.
> Is it correct to say that the scope of this ID is the bus the iommu
> belongs too (but you use it as if it was global)?
> I believe there is nothing right now to ensure the uniqueness of such
> ID across bus types (assuming there will be other bus drivers in the
> future besides vfio-pci).
> If that's the case, the vfio.group_list global list and the __vfio_lookup_dev
> routine should be changed to account for the bus too?
> Ops, I just saw the error msg in vfio_group_add_dev about the group id 
> conflict.
> Is that warning related to what I mentioned above?

Yeah, this is a concern, but I can't think of a system where we would
manifest a collision.  The IOMMU driver is expected to provide unique
groupids for all devices below them, but we could imagine a system that
implements two different bus_types, each with a different IOMMU driver
and we have no coordination between them.  Perhaps since we have
iommu_ops per bus, we should also expose the bus in the vfio group path,
ie. /dev/vfio/%s/%u, dev->bus->name, iommu_device_group(dev,..).  This
means userspace would need to do a readlink of the subsystem entry where
it finds the iommu_group to find the vfio group.  Reasonable?
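
Userspace side, roughly (a sketch; the example device and the
iommu_group sysfs attribute from the companion iommu patch are
assumptions, and error handling is omitted):

char link[PATH_MAX], path[PATH_MAX];
unsigned int groupid;
ssize_t len;
FILE *f;

/* groupid from the device's iommu_group sysfs attribute */
f = fopen("/sys/bus/pci/devices/0000:06:0d.0/iommu_group", "r");
fscanf(f, "%u", &groupid);
fclose(f);

/* bus name from a readlink of the subsystem entry */
len = readlink("/sys/bus/pci/devices/0000:06:0d.0/subsystem",
               link, sizeof(link) - 1);
link[len] = '\0';

snprintf(path, sizeof(path), "/dev/vfio/%s/%u",
         strrchr(link, '/') + 1, groupid);      /* e.g. /dev/vfio/pci/26 */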

> > +   struct bus_type *bus;
> > +   struct vfio_iommu   *iommu;
> > +   struct list_headdevice_list;
> > +   struct list_headiommu_next;
> > +   struct list_headgroup_next;
> > +   int refcnt;
> > +};
> > +
> > +struct vfio_device {
> > +   struct device   *dev;
> > +   const struct vfio_device_ops*ops;
> > +   struct vfio_iommu   *iommu;
> 
> I wonder if you need to have the 'iommu' field here.
> vfio_device.iommu is always set and reset together with
> vfio_group.iommu.
> Given that a vfio_device instance is always linked to a vfio_group
> instance, do we need this duplication? Is this duplication there
> because you do not want the double dereference device->group->iommu?

I think that was my initial goal in duplicating the pointer on the
device.  I believe I was also at one point passing a vfio_d

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-08 Thread Alex Williamson
On Tue, 2011-11-08 at 20:17 -0800, Aaron Fabbri wrote:
> I'm going to send out chunks of comments as I go over this stuff.  Below
> I've covered the documentation file and vfio_iommu.c.  More comments coming
> soon...
> 
> On 11/3/11 1:12 PM, "Alex Williamson"  wrote:
> 
> > VFIO provides a secure, IOMMU based interface for user space
> > drivers, including device assignment to virtual machines.
> > This provides the base management of IOMMU groups, devices,
> > and IOMMU objects.  See Documentation/vfio.txt included in
> > this patch for user and kernel API description.
> > 
> > Note, this implements the new API discussed at KVM Forum
> > 2011, as represented by the driver version 0.2.  It's hoped
> > that this provides a modular enough interface to support PCI
> > and non-PCI userspace drivers across various architectures
> > and IOMMU implementations.
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> 
> > +
> > +Groups, Devices, IOMMUs, oh my
> > +-------------------------------
> > +
> > +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> > +can't always distinguish transactions from each individual device in
> > +the system.  Sometimes this is because of the IOMMU design, such as with
> > +PEs, other times it's caused by the I/O topology, for instance a
> 
> Can you define this acronym the first time you use it, i.e.
> 
> + PEs (partitionable endpoints), ...

It was actually defined a bit further up in the doc:

... POWER systems with Partitionable Endpoints (PEs) ...

I tried to make sure I defined them, but let me know if anything else is
missing/non-obvious.

> > +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> > +devices created by these restictions IOMMU groups (or just "groups" for
> 
> restrictions

Ugh, lost w/o a spell checker.  Fixed all these.

> > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > new file mode 100644
> > index 000..029dae3
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu.c
> 
> > +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> > +  dma_addr_t start, size_t size)
> > +{
> > +struct list_head *pos;
> > +struct dma_map_page *mlp;
> > +
> > +list_for_each(pos, &iommu->dm_list) {
> > +mlp = list_entry(pos, struct dma_map_page, list);
> > +if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > +   start, size))
> > +return mlp;
> > +}
> > +return NULL;
> > +}
> > +
> 
> This function below should be static.

Fixed

> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> > +size_t size, struct dma_map_page *mlp)
> > +{
> > +struct dma_map_page *split;
> > +int npage_lo, npage_hi;
> > +
> > +/* Existing dma region is completely covered, unmap all */
> > +if (start <= mlp->daddr &&
> > +start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > +list_del(&mlp->list);
> > +npage_lo = mlp->npage;
> > +kfree(mlp);
> > +return npage_lo;
> > +}
> > +
> > +/* Overlap low address of existing range */
> > +if (start <= mlp->daddr) {
> > +size_t overlap;
> > +
> > +overlap = start + size - mlp->daddr;
> > +npage_lo = overlap >> PAGE_SHIFT;
> > +npage_hi = mlp->npage - npage_lo;
> 
> npage_hi not used.. Delete this line ^

Yep, and npage_lo in the next block.  I was setting them just for
symmetry, but they can be removed now.

> > +
> > +vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > +mlp->daddr += overlap;
> > +mlp->vaddr += overlap;
> > +mlp->npage -= npage_lo;
> > +return npage_lo;
> > +}
> > +
> > +/* Overlap high address of existing range */
> > +if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +size_t overlap;
> > +
> > +overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > +npage_hi = overlap >> PAGE_SHIFT;
> > +npage_lo = mlp->npage - npage_hi;
> > +
> > +vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > +mlp->npage -= npage_hi;
> > +return npage_hi;
> > +}
> > +
> > +/* Split existing */
> > +npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> > +npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> > +
> > +split = kzalloc(sizeof *split, GFP_KERNEL);
> > +if (!split)
> > +return -ENOMEM;
> > +
> > +vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> > +
> > +mlp->npage = npage_lo;
> > +
> > +split->npage = npage_hi;
> > +split->daddr = start + size;
> > +split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> > +split->rdwr = mlp->rdwr;
> > +list_add(&split->list, &iommu->dm_list);
> > +return size >> PAGE_SHIFT;
> > +}
> > +
> 

Re: [RFC PATCH] vfio: VFIO Driver core framework

2011-11-08 Thread Aaron Fabbri
I'm going to send out chunks of comments as I go over this stuff.  Below
I've covered the documentation file and vfio_iommu.c.  More comments coming
soon...

On 11/3/11 1:12 PM, "Alex Williamson"  wrote:

> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects.  See Documentation/vfio.txt included in
> this patch for user and kernel API description.
> 
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the driver version 0.2.  It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
> 
> Signed-off-by: Alex Williamson 
> ---

> +
> +Groups, Devices, IOMMUs, oh my
> +-------------------------------
> +
> +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> +can't always distinguish transactions from each individual device in
> +the system.  Sometimes this is because of the IOMMU design, such as with
> +PEs, other times it's caused by the I/O topology, for instance a

Can you define this acronym the first time you use it, i.e.

+ PEs (partitionable endpoints), ...

> +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> +devices created by these restictions IOMMU groups (or just "groups" for

restrictions

> +this document).
> +
> +The IOMMU cannot distiguish transactions between the individual devices

distinguish

> +within the group, therefore the group is the basic unit of ownership for
> +a userspace process.  Because of this, groups are also the primary
> +interface to both devices and IOMMU domains in VFIO.
> +

> +file descriptor referencing the same internal IOMMU object from either
> +X or Y).  Merged groups can be dissolved either explictly with UNMERGE

explicitly


> +
> +Device tree devices also invlude ioctls for further defining the

include


> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c

> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> +  dma_addr_t start, size_t size)
> +{
> +struct list_head *pos;
> +struct dma_map_page *mlp;
> +
> +list_for_each(pos, &iommu->dm_list) {
> +mlp = list_entry(pos, struct dma_map_page, list);
> +if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> +   start, size))
> +return mlp;
> +}
> +return NULL;
> +}
> +

This function below should be static.

> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> +size_t size, struct dma_map_page *mlp)
> +{
> +struct dma_map_page *split;
> +int npage_lo, npage_hi;
> +
> +/* Existing dma region is completely covered, unmap all */
> +if (start <= mlp->daddr &&
> +start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> +list_del(&mlp->list);
> +npage_lo = mlp->npage;
> +kfree(mlp);
> +return npage_lo;
> +}
> +
> +/* Overlap low address of existing range */
> +if (start <= mlp->daddr) {
> +size_t overlap;
> +
> +overlap = start + size - mlp->daddr;
> +npage_lo = overlap >> PAGE_SHIFT;
> +npage_hi = mlp->npage - npage_lo;

npage_hi not used.. Delete this line ^

> +
> +vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> +mlp->daddr += overlap;
> +mlp->vaddr += overlap;
> +mlp->npage -= npage_lo;
> +return npage_lo;
> +}
> +
> +/* Overlap high address of existing range */
> +if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +size_t overlap;
> +
> +overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> +npage_hi = overlap >> PAGE_SHIFT;
> +npage_lo = mlp->npage - npage_hi;
> +
> +vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> +mlp->npage -= npage_hi;
> +return npage_hi;
> +}
> +
> +/* Split existing */
> +npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> +npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> +
> +split = kzalloc(sizeof *split, GFP_KERNEL);
> +if (!split)
> +return -ENOMEM;
> +
> +vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> +
> +mlp->npage = npage_lo;
> +
> +split->npage = npage_hi;
> +split->daddr = start + size;
> +split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> +split->rdwr = mlp->rdwr;
> +list_add(&split->list, &iommu->dm_list);
> +return size >> PAGE_SHIFT;
> +}
> +

Function should be static.

> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_d