BUG: virtio_mmio multi-queue competely broken -- virtio *registers* considered harmful

2013-05-01 Thread Tom Lyon
Virtiio_mmio attempts to mimic the layout of some control registers from 
virtio_pci.  These registers, in particular VIRTIO_MMIO_QUEUE_SEL and 
VIRTIO_PCI_QUEUE_SEL,
are active in nature, and not just passive like a normal memory 
location.  Thus, the host side must react immediately upon write of 
these registers to map some other registers (queue address, size, etc) 
to queue-specific locations.  This is just not possible for mmio, and, I 
would argue, not desirable for PCI either.


Because the queue selector register doesn't work in mmio, it is clear 
that only single queue virtio devices can work.  This means no 
virtio_net - I've seen a few messages

complaining that it doesn't work but nothing so far on why.

It seems from some messages back in March that there is a register 
re-layout in the works for virtio_pci.  I think that virtio_pci could 
become just one of the
various ways to configure a virtio_mmio device and there would no need 
for any "registers", just memory locations acting like memory. The one 
gotcha is in
figuring out the kick/notify mechanism for the guest to notify the host 
when there is work on a queue.  For notify, using a hypervisor call 
could unify the pci and mmio

cases, but comes with the cost of leaving the pure pci domain.

I got into this code because I am looking at the possibility of using an 
off the shelf embedded processor sitting on a PCIe port to emulate the 
virtio pci interface.  The
notion of active registers makes this a non-starter, whereas if there 
was a purely memory based system like mmio (with mq fixes), a real PCI 
device could easily emulate it.
Excepting, of course, whatever the notify mechanism is.  If it were 
hypercall based, then the hypervisor could call a transport or device 
specific way of notifying and a small

notify driver could poke the PCI device is some way.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: virtio_mmio multi-queue competely broken -- virtio *registers* considered harmful

2013-05-02 Thread Tom Lyon

On 05/01/2013 10:25 PM, Michael S. Tsirkin wrote:

On Wed, May 01, 2013 at 08:40:54PM -0700, Tom Lyon wrote:

Virtiio_mmio attempts to mimic the layout of some control registers
from virtio_pci.  These registers, in particular
VIRTIO_MMIO_QUEUE_SEL and VIRTIO_PCI_QUEUE_SEL,
are active in nature, and not just passive like a normal memory
location.  Thus, the host side must react immediately upon write of
these registers to map some other registers (queue address, size,
etc) to queue-specific locations.  This is just not possible for
mmio, and, I would argue, not desirable for PCI either.

Because the queue selector register doesn't work in mmio, it is
clear that only single queue virtio devices can work.  This means no
virtio_net - I've seen a few messages
complaining that it doesn't work but nothing so far on why.

It seems from some messages back in March that there is a register
re-layout in the works for virtio_pci.  I think that virtio_pci
could become just one of the
various ways to configure a virtio_mmio device and there would no
need for any "registers", just memory locations acting like memory.
The one gotcha is in
figuring out the kick/notify mechanism for the guest to notify the
host when there is work on a queue.  For notify, using a hypervisor
call could unify the pci and mmio
cases, but comes with the cost of leaving the pure pci domain.

I got into this code because I am looking at the possibility of
using an off the shelf embedded processor sitting on a PCIe port to
emulate the virtio pci interface.  The
notion of active registers makes this a non-starter, whereas if
there was a purely memory based system like mmio (with mq fixes), a
real PCI device could easily emulate it.
Excepting, of course, whatever the notify mechanism is.  If it were
hypercall based, then the hypervisor could call a transport or
device specific way of notifying and a small
notify driver could poke the PCI device is some way.


This was discussed on this thread:
'[PATCH 16/22] virtio_pci: use separate notification offsets for each 
vq'
Please take a look there and confirm that this addresses your concern.
I'm working on making memory io as fast as pio on x86,
implemented on intel, once I do it on AMD too and assuming it's
as fast as PIO, we'll do mmio everywhere.
Then with a PCI card, you won't have exits for notification just normal
passthrough.


Yes, I had seen that thread. It addresses my concerns for pci, but not 
mmio, although I slightly favor a hypercall notify mechanism over a pci 
write.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: virtio_mmio multi-queue competely broken -- virtio *registers* considered harmful

2013-05-02 Thread Tom Lyon

On 05/02/2013 04:37 AM, Pawel Moll wrote:

Hi Tom,

On Thu, 2013-05-02 at 04:40 +0100, Tom Lyon wrote:

Virtiio_mmio attempts to mimic the layout of some control registers from
virtio_pci.  These registers, in particular VIRTIO_MMIO_QUEUE_SEL and
VIRTIO_PCI_QUEUE_SEL,
are active in nature, and not just passive like a normal memory
location.  Thus, the host side must react immediately upon write of
these registers to map some other registers (queue address, size, etc)
to queue-specific locations.  This is just not possible for mmio, and, I
would argue, not desirable for PCI either.

Could you, please, elaborate more about the environment you are talking
about here?

The intention of the MMIO device is to behave like a normal memory
mapped peripheral, say serial port. In the world of architecture without
separate I/O address space (like ARM, MIPS, SH-4 to name only those I
know anything about), such peripherals are usually mapped into the
virtual address space with special attributes, eg. guaranteeing
transactions order. That's why the host can "react immediately" and to
my knowledge multi-queue virtio devices work just fine.

I'd love see comments from someone with x86 expertise in such areas.
Maybe we are missing some memory barriers here? So the host
implementation would have a chance to react to the QUEUE_SEL access
before servicing other transactions?

Regards

Paweł


Ah, my mistake. I was assuming that mmio just used shared memory 
regions, not that there was emulated IO registers.  I was driven to that
assumption by looking at the rpmsg use case in which a main processor 
talks to satellite processors - there seems to be no
hypervisor and therefore no emulated IO in that case.  However, looking 
deeper, it seems that rpmsg has its own notify mechanism anyways, so it
is not so cleanly layered on virtio.  Also, for my needs, I want rpmsg 
and virtio_net.


In my desire to use a PCIe card to emulate virtio I am also talking 
about a non-hypervisor use case - just  leveraging the virtio framework 
for inter-processor
communication.  Virtio is so close to being the answer it woud be a 
shame if it wasn't. So I would propose that virtio be restructured such 
that the configuration
layout is identical among virtio transports and acts like memory, the 
configuration location is transport-specific, and the notification 
mechanism is transport specific - but I
could live writing to some queue specific memory/io location (for both 
pci and mmio).



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


change of email address: p...@cisco.com -> p...@ieee.org

2010-12-16 Thread Tom Lyon
eom
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] VFIO V6 & public VFIO repositories

2010-12-21 Thread Tom Lyon
On Monday, December 20, 2010 09:37:33 pm Benjamin Herrenschmidt wrote:
> Hi Tom, just wrote that to linux-pci in reply to your VFIO annouce,
> but your email bounced. Alex gave me your ieee one instead, I'm sending
> this copy to you, please feel free to reply on the list !
> 
> Cheers,
> Ben.
> 
> On Tue, 2010-12-21 at 16:29 +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2010-11-22 at 15:21 -0800, Tom Lyon wrote:
> > > VFIO "driver" development has moved to a publicly accessible
> > > respository
> > > 
> > > on github:
> > >   git://github.com/pugs/vfio-linux-2.6.git
> > > 
> > > This is a clone of the Linux-2.6 tree with all VFIO changes on the vfio
> > > branch (which is the default). There is a tag 'vfio-v6' marking the
> > > latest "release" of VFIO.
> > > 
> > > In addition, I am open-sourcing my user level code which uses VFIO.
> > > It is a simple UDP/IP/Ethernet stack supporting 3 different VFIO based
> > > 
> > > hardware drivers. This code is available at:
> > >   git://github.com/pugs/vfio-user-level-drivers.git
> > 
> > So I do have some concerns about this...
> > 
> > So first, before I go into the meat of my issues, let's just drop a
> > quick one about the interface: why netlink ? I find it horrible
> > myself... Just confuses everything and adds overhead. ioctl's would have
> > been a better choice imho.
> > 
> > Now, my actual issues, which in fact extend to the whole "generic" iommu
> > APIs that have been added to drivers/pci for "domains", and that in
> > turns "stains" VFIO in ways that I'm not sure I can use on POWER...
> > 
> > I would appreciate your input on how you think is the best way for me to
> > solve some of these "mismatches" between our HW and this design.
> > 
> > Basically, the whole iommu domain stuff has been entirely designed
> > around the idea that you can create those "domains" which are each an
> > entire address space, and put devices in there.
> > 
> > This is sadly not how the IBM iommus work on POWER today...
> > 
> > I have currently one "shared" DMA address space (per host bridge), but I
> > can assign regions of it to different devices (and I have limited
> > filtering capabilities so basically, a bus per region, a device per
> > region or a function per region).
> > 
> > That means essentially that I cannot just create a mapping for the DMA
> > addresses I want, but instead, need to have some kind of "allocator" for
> > DMA translations (which we have in the kernel, ie, dma_map/unmap use a
> > bitmap allocator).
> > 
> > I generally have 2 regions per device, one in 32-bit space of quite
> > limited size (some times as small as 128M window) and one in 64-bit
> > space that I can make quite large if I need to, enough to map all of
> > memory if that's really desired, using large pages or something like
> > that).
> > 
> > Now that has various consequences vs. the interfaces betweem iommu
> > 
> > domains and qemu, and VFIO:
> >  - I don't quite see how I can translate the concept of domains and
> > 
> > attaching devices to such domains. The basic idea won't work. The
> > domains in my case are essentially pre-existing, not created on-the-fly,
> > and may contain multiple devices tho I suppose I can assume for now that
> > we only support KVM pass-through with 1 device == 1 domain.
> > 
> > I don't know how to sort that one out if the userspace or kvm code
> > assumes it can put multiple devices in one domain and they start to
> > magically share the translations...
> > 
> > Not sure what the right approach here is. I could make the "Linux"
> > domain some artifical SW construct that contains a list of the real
> > iommu's it's "bound" to and establish translations in all of them... but
> > that isn't very efficient. If the guest kernel explicitely use some
> > iommu PV ops targeting a device, I need to only setup translations for
> > -that- device, not everything in the "domain".
> > 
> >  - The code in virt/kvm/iommu.c that assumes it can map the entire guest
> > 
> > memory 1:1 in the IOMMU is just not usable for us that way. We -might-
> > be able to do that for 64-bit capable devices as we can create quite
> > large regions in the 64-bit space, but at the very least we need some
> > kind of offset, and the guest must know about

Re: [PATCH V2] drivers/uio/uio.c: DMA mapping, interrupt extensions, etc.

2010-04-17 Thread Tom Lyon
The current uio and uio_pci_generic allow multiple opens; I was just preserving 
that behavior.

On Saturday 17 April 2010 03:43:09 am Joerg Roedel wrote:
> On Thu, Apr 15, 2010 at 01:55:29PM -0700, Tom Lyon wrote:
> 
> > +   down(&idev->gate);
> > +   if (idev->listeners == 0) { /* first open */
> > +   if (idev->pmaster && !iommu_found() && !capable(CAP_SYS_RAWIO)) 
> > {
> > +   up(&idev->gate);
> > +   return -EPERM;
> > +   }
> > +   /* reset to known state if we can */
> > +   if (idev->pdev)
> > +   (void) pci_reset_function(idev->pdev);
> > +   }
> > +   idev->listeners++;
> > +   up(&idev->gate);
> 
> Why do you want to allow multiple opens for a device? Is it not
> sufficient to allow only one?
> 
> 
> > +   if (idev->domain == NULL) {
> > +   if (!list_empty(&listener->dm_list))/* no mix with anywhere 
> > */
> > +   return -EINVAL;
> > +   if (!iommu_found())
> > +   return -EINVAL;
> > +   idev->domain = iommu_domain_alloc();
> > +   if (idev->domain == NULL)
> > +   return -ENXIO;
> > +   idev->cachec = iommu_domain_has_cap(idev->domain,
> > +   IOMMU_CAP_CACHE_COHERENCY);
> > +   ret = iommu_attach_device(idev->domain, idev->dev->parent);
> > +   if (ret) {
> > +   iommu_domain_free(idev->domain);
> > +   idev->domain = NULL;
> > +   printk(KERN_ERR "%s: device_attach failed %d\n", 
> > __func__, ret);
> > +   return ret;
> > +   }
> > +   }
> 
> If userspace calls this path this will make all the addresses mapped
> with DMA-API paths unusable by the device. This doesn't look like a sane
> userspace interface.
> 
> For better and more in-depth review I suggest that you split up this
> large patch into a series of smaler which implement specific aspects of
> your work.
> 
>   Joerg
> 
> P.S.: I got these warning when applying your patches ...
> 
> Applying: drivers/uio/uio_pci_generic.c: allow access for non-privileged 
> processes
> /home/joro/src/linux.trees.git/.git/rebase-apply/patch:86: trailing 
> whitespace.
> info->mem[j].name = name; 
> /home/joro/src/linux.trees.git/.git/rebase-apply/patch:87: trailing 
> whitespace.
> info->mem[j].addr = pci_resource_start(pdev, i); 
> /home/joro/src/linux.trees.git/.git/rebase-apply/patch:89: trailing 
> whitespace.
> info->mem[j].memtype = UIO_MEM_PHYS; 
> warning: 3 lines add whitespace errors.
> Applying: drivers/uio/uio.c: DMA mapping, interrupt extensions, etc.
> /home/joro/src/linux.trees.git/.git/rebase-apply/patch:315: trailing 
> whitespace.
> ret = iommu_map_range(idev->domain, iova, 
> /home/joro/src/linux.trees.git/.git/rebase-apply/patch:325: trailing 
> whitespace.
> } 
> /home/joro/src/linux.trees.git/.git/rebase-apply/patch:366: trailing 
> whitespace.
>  * adjacent pages, but noone seems to really do that. So we squash 
> /home/joro/src/linux.trees.git/.git/rebase-apply/patch:368: trailing 
> whitespace.
>  * This works if (a) there is an iommu, or (b) the user allocates 
> /home/joro/src/linux.trees.git/.git/rebase-apply/patch:578: trailing 
> whitespace.
> unsigned int cmd, unsigned long arg) 
> warning: squelched 1 whitespace error
> warning: 6 lines add whitespace errors.
> 
> And there are also some coding-style issues.
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V3] drivers/uio/uio_pci_generic.c: allow access for non-privileged processes

2010-04-19 Thread Tom Lyon

These are changes to uio_pci_generic.c to allow better use of the driver by
non-privileged processes.
1. Add back old code which allowed interrupt re-enablement through uio fd.
2. Translate PCI bards to uio mmap regions, to allow mmap through uio fd.
3. Allow devices which support MSI or MSI-X, but not IRQ.
Signed-off-by: Tom Lyon 
---
checkpatch.pl is happy with this one.

--- linux-2.6.33/drivers/uio/uio_pci_generic.c  2010-02-24 10:52:17.0 
-0800
+++ mylinux-2.6.33/drivers/uio/uio_pci_generic.c2010-04-19 
14:57:21.0 -0700
@@ -14,9 +14,10 @@
  * # ls -l /sys/bus/pci/devices/:00:19.0/driver
  * .../:00:19.0/driver -> ../../../bus/pci/drivers/uio_pci_generic
  *
- * Driver won't bind to devices which do not support the Interrupt Disable Bit
- * in the command register. All devices compliant to PCI 2.3 (circa 2002) and
- * all compliant PCI Express devices should support this bit.
+ * Driver won't bind to devices which do not support MSI, MSI-x, or the
+ * Interrupt Disable Bit in the command register. All devices compliant
+ * to PCI 2.3 (circa 2002) and all compliant PCI Express devices should
+ * support one of these.
  */
 
 #include 
@@ -27,7 +28,7 @@
 
 #define DRIVER_VERSION "0.01.0"
 #define DRIVER_AUTHOR  "Michael S. Tsirkin "
-#define DRIVER_DESC"Generic UIO driver for PCI 2.3 devices"
+#define DRIVER_DESC"Generic UIO driver for PCIe & PCI 2.3 devices"
 
 struct uio_pci_generic_dev {
struct uio_info info;
@@ -41,6 +42,39 @@ to_uio_pci_generic_dev(struct uio_info *
return container_of(info, struct uio_pci_generic_dev, info);
 }
 
+/* Read/modify/write command register to disable interrupts.
+ * Note: we could cache the value and optimize the read if there was a way to
+ * get notified of user changes to command register through sysfs.
+ * */
+static void irqtoggle(struct uio_pci_generic_dev *gdev, int irq_on)
+{
+   struct pci_dev *pdev = gdev->pdev;
+   unsigned long flags;
+   u16 orig, new;
+
+   spin_lock_irqsave(&gdev->lock, flags);
+   pci_block_user_cfg_access(pdev);
+   pci_read_config_word(pdev, PCI_COMMAND, &orig);
+   new = irq_on ? (orig & ~PCI_COMMAND_INTX_DISABLE)
+: (orig | PCI_COMMAND_INTX_DISABLE);
+   if (new != orig)
+   pci_write_config_word(gdev->pdev, PCI_COMMAND, new);
+   pci_unblock_user_cfg_access(pdev);
+   spin_unlock_irqrestore(&gdev->lock, flags);
+}
+
+/* irqcontrol is use by userspace to enable/disable interrupts. */
+/* A privileged app can write the PCI_COMMAND register directly,
+ * but we need this for normal apps
+ */
+static int irqcontrol(struct uio_info *info, s32 irq_on)
+{
+   struct uio_pci_generic_dev *gdev = to_uio_pci_generic_dev(info);
+
+   irqtoggle(gdev, irq_on);
+   return 0;
+}
+
 /* Interrupt handler. Read/modify/write the command register to disable
  * the interrupt. */
 static irqreturn_t irqhandler(int irq, struct uio_info *info)
@@ -89,7 +123,7 @@ done:
 /* Verify that the device supports Interrupt Disable bit in command register,
  * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
  * in PCI 2.2. */
-static int __devinit verify_pci_2_3(struct pci_dev *pdev)
+static int verify_pci_2_3(struct pci_dev *pdev)
 {
u16 orig, new;
int err = 0;
@@ -121,17 +155,52 @@ err:
return err;
 }
 
-static int __devinit probe(struct pci_dev *pdev,
+/* we could've used the generic pci sysfs stuff for mmap,
+ * but this way we can allow non-privileged users as long
+ * as /dev/uio* has the right permissions
+ */
+static void uio_do_maps(struct uio_pci_generic_dev *gdev)
+{
+   struct pci_dev *pdev = gdev->pdev;
+   struct uio_info *info = &gdev->info;
+   int i, j;
+   char *name;
+
+   for (i = 0, j = 0; i < PCI_STD_RESOURCE_END && j < MAX_UIO_MAPS; i++) {
+   if (pci_resource_flags(pdev, i) & IORESOURCE_MEM) {
+   name = kmalloc(8, GFP_KERNEL);
+   if (name == NULL)
+   break;
+   sprintf(name, "membar%d", i);
+   info->mem[j].name = name;
+   info->mem[j].addr = pci_resource_start(pdev, i);
+   info->mem[j].size = pci_resource_len(pdev, i);
+   info->mem[j].memtype = UIO_MEM_PHYS;
+   j++;
+   }
+   }
+   for (i = 0, j = 0; i < PCI_STD_RESOURCE_END &&
+  j < MAX_UIO_PORT_REGIONS; i++) {
+   if (pci_resource_flags(pdev, i) & IORESOURCE_IO) {
+   name = kmalloc(8, GFP_KERNEL);
+   if (name == NULL)
+   break;
+   sprintf(name, "iobar%d"

Re: [PATCH V3] drivers/uio/uio_pci_generic.c: allow access for non-privileged processes

2010-04-30 Thread Tom Lyon
Michael, et al - sorry for the delay, but I've been digesting the comments and 
researching new approaches.

I think the plan for V4 will be to take things entirely out of the UIO 
framework, and instead have a driver which supports user mode use of 
"well-behaved" PCI devices. I would like to use read and write to support 
access to memory regions, IO regions,  or PCI config space. Config space is a 
bitch because not everything is safe to read or write, but I've come up with a 
table driven approach which can be run-time extended for non-compliant devices 
(under root control) which could then enable non-privileged users. For 
instance, OHCI 1394 devices use a dword in config space which is not formatted 
as a PCI capability, root can use sysfs to enable access:
echo> 
/sys/dev/pci/devices/:xx:xx.x//config_permit


A "well-behaved" PCI device must have memory BARs >= 4K for mmaping, must have 
separate memory space for MSI-X that does not need mmaping
by the user driver, must support the PCI 2.3 interrupt masking, and must not go 
totally crazy with PCI config space (tg3 is real ugly, e1000 is fine).

Again, my primary usage model is for direct user-level access to network 
devices, not for virtualization, but I think both will work.

So, I will go outside UIO because:
1 - it doesn't allow reads and writes to sub-drivers, just irqcontrol
2 - it doesn't have ioctls
3 - it has its own interrupt model which doesn't use eventfds
4 - it's ugly doing the new stuff and maintaining backwards compat.

I hereby solicit comments on the name and location for the new driver.

Michael - some of your comments below imply you didn't look at the companion 
changes to uio.c, which had the eventfd interrupts and effectively the same 
iommu handling - but see my inline comments below.

On Wednesday 21 April 2010 02:38:49 am Michael S. Tsirkin wrote:
> On Mon, Apr 19, 2010 at 03:05:35PM -0700, Tom Lyon wrote:
> > 
> > These are changes to uio_pci_generic.c to allow better use of the driver by
> > non-privileged processes.
> > 1. Add back old code which allowed interrupt re-enablement through uio fd.
> > 2. Translate PCI bards to uio mmap regions, to allow mmap through uio fd.
> 
> Since it's common for drivers to need configuration cycles
> for device control, the above 2 are not sufficient for generic devices.
> And if you fix the above, you won't need irqcontrol,
> which IMO we are better off saving for stuff like eventfd mapping.
I will handle config access for well-behaved devices.

> 
> Also - this modifies a kernel/userspace interface in a way
> that makes an operation that was always safe previously
> potentially unsafe.
Not sure what you meant, but probably irrelevant in new scheme.
> 
> Also, BAR regions could be less than 1 page in size,
> mapping these to unpriveledged process is a security problem.
Agreed, no mmaping, just r/w.

> Also, for a generic driver, we likely need write combining
> support in the interface.
Given that many system platforms don't have it, doesn't seem like a big deal.
But I'll look into it.

> Also, io space often can not be mmaped. We need read/write
> for that.
Agreed.

> 
> > 3. Allow devices which support MSI or MSI-X, but not IRQ.
> 
> If the device supports MSI or MSI-X, it can perform
> PCI writes upstream, and MSI-X vectors are controlled
> through memory. So with MSI-X + mmap to an unpriveledged
> process you can easily cause the device to modify any memory.
Yes, will "virtualize" this in the driver. User level will not be allowed to
mmap real MSI-X region (if MSI-X in use); MSI config writes will be intercepted.
>
> With MSI it's hard to be sure, maybe some devices might make guarantees
> not to do writes except for MSI, but there's no generic way to declare
> that: bus master needs to be enabled for MSI to work, and once bus
> master is enabled, nothing seems to prevent the device from corrupting
> host memory.
The code already requires iommu protection for masters, I will make sure this
includes MSI and MSI-X devices.

As an aside, the IOMMU is supposed to be able to do interrupt translation also,
but the format for vectors changes, so it doesn't really help with 
virtualization.

> So the patch doesn't look like generic enough or safe enough
> for users I have in mind (virtualization). What users/devices
> do you have in mind?
Non-virt, just new user level drivers for special cases.

> For virtualization, I've been thinking about unpriviledged access and
> msi as well, and here's a plan I thought might work:
> 
> - add a uio_iommu character device that controls an iommu domain
> - uio_iommu would make sure iommu is programmed in a safe way
> - use irqcontrol to bind pci device to i

Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Tom Lyon
On Tuesday 01 June 2010 03:46:51 am Michael S. Tsirkin wrote:
> On Tue, Jun 01, 2010 at 01:28:48PM +0300, Avi Kivity wrote:
> > On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
> >>
> >>>   It can't program the iommu.
> >>> What
> >>> the patch proposes is that userspace tells vfio about the needed
> >>> mappings, and vfio programs the iommu.
> >>>  
> >> There seems to be some misunderstanding.  The userspace interface
> >> proposed forces a separate domain per device and forces userspace to
> >> repeat iommu programming for each device.  We are better off sharing a
> >> domain between devices and programming the iommu once.
> >>
> >
> >   iommufd = open(/dev/iommu);
> >   ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
> >   ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
> >
> > ?
> 
> Yes.
> 
> > If so, I agree.
> 
> Good.

I'm not really opposed to multiple devices per domain, but let me point out how 
I
ended up here.  First, the driver has two ways of mapping pages, one based on 
the
iommu api and one based on the dma_map_sg api.  With the latter, the system
already allocates a domain per device and there's no way to control it. This was
presumably done to help isolation between drivers.  If there are multiple 
drivers
in the user level, do we not want the same isoation to apply to them?  
Also, domains are not a very scarce resource - my little core i5 has 256, 
and the intel architecture goes to 64K.
And then there's the fact that it is possible to have multiple disjoint iommus 
on a system,
so it may not even be possible to bring 2 devices under one domain. 

Given all that, I am inclined to leave it alone until someone has a real 
problem.
Note that not sharing iommu domains doesn't mean you can't share device memory,
just that you have to do multiple mappings
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Tom Lyon
On Monday 31 May 2010 10:17:35 am Alan Cox wrote:
> 
> Does look like it needs a locking audit, some memory and error checks
> reviewing and some further review of the ioctl security and
> overflows/trusted values.
Yes. Thanks for the detailed look.
> 
> Rather a nice way of attacking the user space PCI problem.
And thanks for that!

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-01 Thread Tom Lyon
On Tuesday 01 June 2010 09:29:47 pm Alex Williamson wrote:
> On Tue, 2010-06-01 at 13:28 +0300, Avi Kivity wrote:
> > On 06/01/2010 12:55 PM, Michael S. Tsirkin wrote:
> > >
> > >>   It can't program the iommu.
> > >> What
> > >> the patch proposes is that userspace tells vfio about the needed
> > >> mappings, and vfio programs the iommu.
> > >>  
> > > There seems to be some misunderstanding.  The userspace interface
> > > proposed forces a separate domain per device and forces userspace to
> > > repeat iommu programming for each device.  We are better off sharing a
> > > domain between devices and programming the iommu once.
> > >
> > 
> >iommufd = open(/dev/iommu);
> >ioctl(iommufd, IOMMUFD_ASSIGN_RANGE, ...)
> >ioctl(vfiofd, VFIO_SET_IOMMU, iommufd)
> 
> It seems part of the annoyance of the current KVM device assignment is
> that we have multiple files open, we mmap here, read there, write over
> there, maybe, if it's not emulated.  I quite like Tom's approach that we
> have one stop shopping with /dev/vfio, including config space
> emulation so each driver doesn't have to try to write their own.  So
> continuing with that, shouldn't we be able to add a GET_IOMMU/SET_IOMMU
> ioctl to vfio so that after we setup one device we can bind the next to
> the same domain?

This is just what I was thinking.  But rather than a get/set, just use two fds.

ioctl(vfio_fd1, VFIO_SET_DOMAIN, vfio_fd2);

This may fail if there are really 2 different IOMMUs, so user code must be
prepared for failure,  In addition, this is strictlyupwards compatible with 
what is there now, so maybe we can add it later.


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-02 Thread Tom Lyon
On Wednesday 02 June 2010 10:46:15 am Chris Wright wrote:
> * Joerg Roedel (j...@8bytes.org) wrote:
> > On Wed, Jun 02, 2010 at 02:21:00PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jun 02, 2010 at 01:12:25PM +0200, Joerg Roedel wrote:
> > 
> > > > Even if it is bound to a domain the userspace driver could program the
> > > > device to do dma to unmapped regions causing io-page-faults. The kernel
> > > > can't do anything about it.
> > > 
> > > It can always corrupt its own memory directly as well :)
> > > But that is not a reason not to detect errors if we can,
> > > and not to make APIs hard to misuse.
> > 
> > Changing the domain of a device while dma can happen is the same type of
> > bug as unmapping potential dma target addresses. We can't catch this
> > kind of misuse.
> > 
> > > > > With 10 devices you have 10 extra ioctls.
> > > > 
> > > > And this works implicitly with your proposal?
> > > 
> > > Yes.  so you do:
> > > iommu = open
> > > ioctl(dev1, BIND, iommu)
> > > ioctl(dev2, BIND, iommu)
> > > ioctl(dev3, BIND, iommu)
> > > ioctl(dev4, BIND, iommu)
> > > 
> > > No need to add a SHARE ioctl.
> > 
> > In my proposal this looks like:
> > 
> > 
> > dev1 = open();
> > ioctl(dev2, SHARE, dev1);
> > ioctl(dev3, SHARE, dev1);
> > ioctl(dev4, SHARE, dev1);
> > 
> > So we actually save an ioctl.
> 
> This is not any hot path, so saving an ioctl shouldn't be a consideration.
> Only important consideration is a good API.  I may have lost context here,
> but the SHARE API is limited to the vfio fd.  The BIND API expects a new
> iommu object.  Are there other uses for this object?  Tom's current vfio
> driver exposes a dma mapping interface, would the iommu object expose
> one as well?  Current interface is device specific DMA interface for
> host device drivers typically mapping in-flight dma buffers, and IOMMU
> specific interface for assigned devices typically mapping entire virtual
> address space.

Actually, it a domain object - which may be usable among iommus (Joerg?).
However, you can't really do the dma mapping with just the domain because
every device supports a different size address space as a master, i.e.,
the dma_mask.

And I don't know how kvm would deal with devices with varying dma mask support,
or why they'd be in the same domain.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-03 Thread Tom Lyon
OK, in the interest of making progress, I am about to embark on the following:

1. Create a user-iommu-domain driver - opening it will give a new empty domain.
Ultimately this can also populate sysfs with the state of its world, which 
would
also be a good addition to the base iommu stuff.
If someone closes the fd while in use, the domain stays valid anyway until 
users
drop off.

2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
   a domain be set before using the VFIO_DMA_MAP_IOVA ioctl (this is the one
   that KVM wants).  However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
   which uses the dma_sg interface which has no expicit control of domains. I
   intend to keep it the way it is, but expect only non-hypervisor programs 
would
   want to use it.

3. Clean up the docs and other nits that folks have found.

Comments? 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] VFIO driver: Non-privileged user level PCI drivers

2010-06-07 Thread Tom Lyon
On Sunday 06 June 2010 02:54:51 am Michael S. Tsirkin wrote:
> On Thu, Jun 03, 2010 at 02:41:38PM -0700, Tom Lyon wrote:
> > OK, in the interest of making progress, I am about to embark on the 
> > following:
> > 
> > 1. Create a user-iommu-domain driver - opening it will give a new empty 
> > domain.
> > Ultimately this can also populate sysfs with the state of its world, 
> > which would
> > also be a good addition to the base iommu stuff.
> > If someone closes the fd while in use, the domain stays valid anyway 
> > until users
> > drop off.
> > 
> > 2. Add DOMAIN_SET and DOMAIN_UNSET ioctls to the vfio driver.  Require that
> >a domain be set before using the VFIO_DMA_MAP_IOVA ioctl
> 
> Require domain to be set before you allow any access to the device:
> mmap, write, read.  IMO this is the only safe way to make sure userspace
> does not corrupt memory, and this removes the need to special-case
> MSI memory, play with bus master enable and hope it can be cleared without
> reset, etc.

Michael - the light bulb finally lit for me and I now understand what you've 
been
saying the past few weeks.  Of course you're right - we need iommu set before 
any
register access.  I had thought that was done by default but now I see that the 
dma_map_sg routine only attaches to the iommu on demand.

So I will torpedo the MAP_ANYWHERE stuff. I'd like to keep the MAP_IOVA ioctl
with the vfio fd so that the user can still do everything with one fd. I'm 
thinking the
fd opens and iommu bindings could be done in a program before spinning out the
program with the user driver.
> 
> > (this is the one
> >that KVM wants).
> 
> Not sure I understand. I think that MAP should be done on the domain,
> not the device, this handles pinning pages correctly and
> this way you don't need any special checks.
> 
> >However, the VFIO_DMA_MAP_ANYWHERE ioctl is the one
> >which uses the dma_sg interface which has no expicit control of domains. 
> > I
> >intend to keep it the way it is, but expect only non-hypervisor programs 
> > would
> >want to use it.
> 
> If we support MAP_IOVA, why is MAP_ANYWHERE useful? Can't
> non-hypervisors just pick an address?
> 
> > 3. Clean up the docs and other nits that folks have found.
> > 
> > Comments? 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers

2010-06-08 Thread Tom Lyon
On Tuesday 08 June 2010 03:38:44 pm Michael S. Tsirkin wrote:
> On Tue, Jun 08, 2010 at 02:21:52PM -0700, Tom Lyon wrote:
> > The VFIO "driver" is used to allow privileged AND non-privileged processes 
> > to 
> > implement user-level device drivers for any well-behaved PCI, PCI-X, and 
> > PCIe
> > devices.
> > Signed-off-by: Tom Lyon 
> 
> Some general comments:
> - Please pass this through ./scripts/checkpatch.pl to fix some formatting.
I did. What did you find?

> - Lots of hard-coded constants. Please try using pci_regs.h much more,
>   where not possible please add named enums.
This is mostly for lengths not specified in pci_regs, but given in standards 
docs.

> - There are places where you get parameters from userspace and pass them
>   on to kmalloc etc. Everything you get from userspace needs to be
>   validated.
I  thought I had. Thats what more eyeballs are for.

> - You play non-standard tricks with minor numbers.
>   Won't it be easier to just make udev create a node
>   for the device in the way everyone does it? The name could be
>   descriptive including e.g. bus/dev/fn so userspace can find
>   your device.
I just copied what uio was doing. What is "the way everyone does it?"
> 
> - I note that if we exclude the iommu mapping, the rest conceptually could 
> belong
>   in pci_generic driver in uio. So if we move these ioctls to the iommu 
> driver,
>   as Avi suggested, then vfio can be part of the uio framework.
But the interrupt handling is different in uio; uio doesn't support read or 
write calls
to read and write registers or memory, and it doesn't support ioctls at all for 
other
misc stuff.  If we could blow off backwards compatibility with uio, then, sure 
we
could have a nice unified solution.

> > ---
> > This version now requires an IOMMU domain to be set before any access to
> > device registers is granted (except that config space may be read).  In
> > addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
> > which does not have sufficient controls around IOMMU usage.  The IOMMU 
> > domain
> > is obtained from the 'uiommu' driver which is included in this patch.
> > 
> > Various locking, security, and documentation issues have also been fixed.
> > 
> > Please commit - it or me!
> > But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?
> > 
> > Blurb from previous patch version:
> > 
> > This patch is the evolution of code which was first proposed as a patch to
> > uio/uio_pci_generic, then as a more generic uio patch. Now it is taken 
> > entirely
> > out of the uio framework, and things seem much cleaner. Of course, there is
> > a lot of functional overlap with uio, but the previous version just seemed
> > like a giant mode switch in the uio code that did not lead to clarity for
> > either the new or old code.
> > 
> > [a pony for avi...]
> > The major new functionality in this version is the ability to deal with
> > PCI config space accesses (through read & write calls) - but includes table
> > driven code to determine whats safe to write and what is not. Also, some
> > virtualization of the config space to allow drivers to think they're writing
> > some registers when they're not.  Also, IO space accesses are also allowed.
> > Drivers for devices which use MSI-X are now prevented from directly writing
> > the MSI-X vector area.
> 
> This table adds a lot of complexity to the code,
> and I don't really understand why we need this code in
> kernel: isn't the point of iommu that it protects us
> from buggy devices? If yes, we should be able to
> just ask userspace to be careful and avoid doing silly things
> like overwriting MSI-X vectors, and if it's not careful,
> no one gets hurt :)
> 
> If some registers absolutely must be protected,
> I think we should document this carefully in code.
> 
> > +/*
> > + * MSI and MSI-X Interrupt handler.
> > + * Just signal an event
> > + */
> > +static irqreturn_t msihandler(int irq, void *arg)
> > +{
> > +   struct eventfd_ctx *ctx = arg;
> > +
> > +   eventfd_signal(ctx, 1);
> > +   return IRQ_HANDLED;
> > +}
> > +
> > +void vfio_disable_msi(struct vfio_dev *vdev)
> > +{
> > +   struct pci_dev *pdev = vdev->pdev;
> > +
> > +   if (vdev->ev_msi) {
> > +   eventfd_ctx_put(vdev->ev_msi);
> > +   free_irq(pdev->irq, vdev->ev_msi);
> > +   vdev->ev_msi = NULL;
> > +   }
> > +   pci_disable_msi(pdev);
> > +}
&g

Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers

2010-06-10 Thread Tom Lyon
On Thursday 10 June 2010 10:27:36 am Konrad Rzeszutek Wilk wrote:
> > +EXPORT_SYMBOL(uiommu_fdget);
> 
> EXPORT_SYMBOL_GPL
> .. snip
> > +EXPORT_SYMBOL(uiommu_put);
> 
> ditto.
> 

Is there a definitive explanation somewhere of when to use each?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers

2010-06-11 Thread Tom Lyon
The inline comments are getting pretty hard to wade through, so I'm deleting 
some
of the lesser stuff - but I am incorporating into the code.

On Tuesday 08 June 2010 10:45:57 pm Michael S. Tsirkin wrote:
> On Tue, Jun 08, 2010 at 04:54:43PM -0700, Tom Lyon wrote:
> > On Tuesday 08 June 2010 03:38:44 pm Michael S. Tsirkin wrote:
> > > On Tue, Jun 08, 2010 at 02:21:52PM -0700, Tom Lyon wrote:
...
> > > > +   /* reset to known state if we can */
> > > > +   (void) pci_reset_function(vdev->pdev);
> > > 
> > > We are opening the device - how can it not be in a known state?
> > If an alternate driver left it in a weird state.
> 
> Don't we care if it fails then? I think we do ...
> > > And we should make sure (at open time) we *can* reset on close, fail 
> > > binding if we can't.
> > How do you propose?
> 
> Fail open if reset fails on open?
OK, it'll now fail open if reset fails.

[ bunch of stuff about MSI-X checking and IOMMUs and config registers...]

OK, here's the thing.  The IOMMU API today does not do squat about dealing with
interrupts. Interrupts are special because the APIC  addresses are not each in 
their own page.  Yes, the IOMMU hardware supports it (at least Intel), and 
there's
some Intel intr remapping code (not AMD), but it doesn't look like it is enough.

Therefore, we must not allow the user level driver to diddle the MSI or MSI-X
areas - either in config space or in the device memory space.  If the device
doesn't have its MSI-X registers in nice page aligned areas, then it is not
"well-behaved" and it is S.O.L. The SR-IOV spec recommends that devices be
designed the well-behaved way.

When the code in vfio_pci_config speaks of "virtualization" it means that there
are fake registers which the user driver can read or write, but do not affect 
the
real registers. BARs are one case, MSI regs another. The PCI vendor and device
ID are virtual because SR-IOV doesn't supply them but I wanted the user driver
to find them in the same old place.

> > > > +   case VFIO_DMA_MASK: /* set master mode and DMA mask */
> > > > +   if (copy_from_user(&mask, uarg, sizeof mask))
> > > > +   return -EFAULT;
> > > > +   pci_set_master(pdev);
> > > 
> > > This better be done by userspace when it sees fit.
> > > Otherwise device might corrupt userspace memory.
This ioctl is gone now - it was vestigial from the dma_sg_map interface.

[ Re: Hotplug and Suspend/Resume]
There are *plenty* of real drivers - brand new ones - which don't bother
with these today.  Yeah, I can see adding them to the framework someday -
but if there's no urgent need then it is way down the priority list. Meanwhile,
the other uses beckon.  And I never heard the Infiniband users complaining
about not having these things.


> > > > +   pages = kzalloc(npage * sizeof(struct page *), GFP_KERNEL);
> > > 
> > > If you map a 4G region, this will try to allocate 8Mbytes?
> > Yes. Ce la vie.
> 
> First of all, this will fail - and the request is quite real with
> decent sized guests. Second, with appropriately sized allocs before failing
> it will stress the system pretty hard. Split it in chunks of 4K or something.
Changed to use vmalloc/vfree - don't need physical contiguity.
 



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers

2010-06-17 Thread Tom Lyon
On Sunday 13 June 2010 03:23:39 am Michael S. Tsirkin wrote:
> On Fri, Jun 11, 2010 at 03:15:53PM -0700, Tom Lyon wrote:
> > [ bunch of stuff about MSI-X checking and IOMMUs and config registers...]
> > 
> > OK, here's the thing.  The IOMMU API today does not do squat about
> > dealing with interrupts. Interrupts are special because the APIC
> > addresses are not each in their own page.  Yes, the IOMMU hardware
> > supports it (at least Intel), and there's some Intel intr remapping
> > code (not AMD), but it doesn't look like it is enough.
> 
> The iommu book from AMD seems to say that interrupt remapping table
> address is taken from the device table entry.  So hardware support seems
> to be there, and to me it looks like it should be enough.
> Need to look at the iommu/msi code some more to figure out
> whether what linux does is handling this correctly -
> if it doesn't we need to fix that.
> 
> > Therefore, we must not allow the user level driver to diddle the MSI
> > or MSI-X areas - either in config space or in the device memory space.
> 
> It won't help.
> Consider that you want to let a userspace driver control
> the device with DMA capabilities.
> 
> So if there is a range of addresses that device
> can write into that can break host, these writes
> can be triggered by userspace. Limiting
> userspace access to MSI registers won't help:
> you need a way to protect host from the device.

OK, after more investigation, I realize you are right.
We definitely need the IOMMU protection for interrupts, and
if we have it, a lot of the code for config space protection is pointless.
It does seem that the Intel  intr_remapping code does what we want
(accidentally) but that the AMD iommu code does not yet do any
interrupt remapping.  Joerg - can you comment? On the roadmap?

I should have an AMD system w IOMMU in a couple of days, so I
can check this out.

> 
> >  If the device doesn't have its MSI-X registers in nice page aligned
> >  areas, then it is not "well-behaved" and it is S.O.L. The SR-IOV spec
> >  recommends that devices be designed the well-behaved way.
> > 
> > When the code in vfio_pci_config speaks of "virtualization" it means
> > that there are fake registers which the user driver can read or write,
> > but do not affect the real registers. BARs are one case, MSI regs
> > another. The PCI vendor and device ID are virtual because SR-IOV
> > doesn't supply them but I wanted the user driver to find them in the
> > same old place.
> 
> Sorry, I still don't understand why do we bother.  All this is already
> implemented in userspace.  Why can't we just use this existing userspace
> implementation?  It seems that all kernel needs to do is prevent
> userspace from writing BARs.

I assume the userspace of which you speak is qemu?  This is not what I'm
doing with vfio - I'm interested in the HPC networking model of direct 
user space access to the network. 

> Why can't we replace all this complexity with basically:
> 
> if (addr <= PCI_BASE_ADDRESS_5 && addr + len >= PCI_BASE_ADDRESS_0)
>   return -ENOPERM;
> 
> And maybe another register or two. Most registers should be fine.
> 
> > [ Re: Hotplug and Suspend/Resume]
> > There are *plenty* of real drivers - brand new ones - which don't
> > bother with these today.  Yeah, I can see adding them to the framework
> > someday - but if there's no urgent need then it is way down the
> > priority list.
> 
> Well, for kernel drivers everything mostly works out of the box, it is
> handled by the PCI subsystem.  So some kind of framework will need to be
> added for userspace drivers as well.  And I suspect this issue won't be
> fixable later without breaking applications.

Whatever works out of the box for the kernel drivers which don't implement
suspend/resume will work for the user level drivers which don't.
> 
> > Meanwhile, the other uses beckon.
> 
> Which other uses? I thought the whole point was fixing
> what's broken with current kvm implementation.
> So it seems to be we should not rush it ignoring existing issues such as
> hotplug.
Non-kvm cases.  That don't care about suspend/resume.

 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-03-31 Thread Tom Lyon
uio_pci_generic has previously been discussed on the KVM list, but this patch 
has nothing to do with KVM, so it is also going to LKML.

The point of this patch is to beef up the uio_pci_generic driver so that a 
non-privileged user process can run a user level driver for most PCIe 
devices. This can only be safe if there is an IOMMU in the system with 
per-device domains.  Privileged users (CAP_SYS_RAWIO) are allowed if there is 
no IOMMU.

Specifically, I seek to allow low-latency user level network drivers (non 
tcp/ip) which directly access SR-IOV style virtual network adapters, for use 
with packages such as OpenMPI.

Key areas of change:
- ioctl extensions to allow registration and dma mapping of memory regions, 
with lock accounting
- support for mmu notifier driven de-mapping
- support for MSI and MSI-X interrupts (the intel 82599 VFs support only 
MSI-X)
- allowing interrupt enabling and device register mapping all 
through /dev/uio* so that  permissions may be granted just by chmod 
on /dev/uio*
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-03-31 Thread Tom Lyon

diff -rupN linux-2.6.33/drivers/uio/uio.c uio-2.6.33/drivers/uio/uio.c
--- linux-2.6.33/drivers/uio/uio.c  2010-02-24 10:52:17.0 -0800
+++ uio-2.6.33/drivers/uio/uio.c2010-03-31 12:26:24.0 -0700
@@ -730,12 +730,24 @@ static int uio_mmap(struct file *filep,
}
 }
 
+static int uio_ioctl(struct inode *inode, struct file *filep,
+   unsigned int cmd, unsigned long arg) 
+{
+   struct uio_listener *listener = filep->private_data;
+   struct uio_device *idev = listener->dev;
+
+   if (idev == NULL || idev->info == NULL || idev->info->ioctl == NULL)
+   return -EINVAL;
+   return idev->info->ioctl(idev->info, cmd, arg);
+}
+
 static const struct file_operations uio_fops = {
.owner  = THIS_MODULE,
.open   = uio_open,
.release= uio_release,
.read   = uio_read,
.write  = uio_write,
+   .ioctl  = uio_ioctl,
.mmap   = uio_mmap,
.poll   = uio_poll,
.fasync = uio_fasync,
diff -rupN linux-2.6.33/drivers/uio/uio_pci_generic.c 
uio-2.6.33/drivers/uio/uio_pci_generic.c
--- linux-2.6.33/drivers/uio/uio_pci_generic.c  2010-02-24 
10:52:17.0 -0800
+++ uio-2.6.33/drivers/uio/uio_pci_generic.c2010-03-31 
16:28:33.0 -0700
@@ -1,4 +1,7 @@
-/* uio_pci_generic - generic UIO driver for PCI 2.3 devices
+/* uio_pci_generic - generic UIO driver for PCI 2.3 and PCIe devices 
+ *
+ * Copyright (C) 2010 Cisco Systems, Inc.
+ * Extensions by Tom Lyon 
  *
  * Copyright (C) 2009 Red Hat, Inc.
  * Author: Michael S. Tsirkin 
@@ -14,25 +17,35 @@
  * # ls -l /sys/bus/pci/devices/:00:19.0/driver
  * .../:00:19.0/driver -> ../../../bus/pci/drivers/uio_pci_generic
  *
- * Driver won't bind to devices which do not support the Interrupt Disable 
Bit
+ * Driver won't bind to devices which do not support MSI, MSI-x, or the 
Interrupt Disable Bit
  * in the command register. All devices compliant to PCI 2.3 (circa 2002) and
- * all compliant PCI Express devices should support this bit.
+ * all compliant PCI Express devices should support one of these.
  */
 
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
+#include 
 
-#define DRIVER_VERSION "0.01.0"
+#define DRIVER_VERSION "0.02.0"
 #define DRIVER_AUTHOR  "Michael S. Tsirkin "
-#define DRIVER_DESC"Generic UIO driver for PCI 2.3 devices"
+#define DRIVER_DESC"Generic UIO driver for PCI devices"
 
 struct uio_pci_generic_dev {
struct uio_info info;
struct pci_dev *pdev;
spinlock_t lock; /* guards command register accesses */
+   int msi;
+   struct msix_entry *msix;
+   int nvec;
+   struct mm_struct *mm;
+   struct mmu_notifier mmu_notifier;
+   struct list_head dm_list;
 };
 
 static inline struct uio_pci_generic_dev *
@@ -41,6 +54,51 @@ to_uio_pci_generic_dev(struct uio_info *
return container_of(info, struct uio_pci_generic_dev, info);
 }
 
+/* Read/modify/write command register to disable interrupts.
+ * Note: we could cache the value and optimize the read if there was a way to
+ * get notified of user changes to command register through sysfs.
+ * */
+static void irqtoggle(struct uio_pci_generic_dev *gdev, int irq_on)
+{
+   struct pci_dev *pdev = gdev->pdev;
+   unsigned long flags;
+   u16 orig, new;
+
+   spin_lock_irqsave(&gdev->lock, flags);
+   pci_block_user_cfg_access(pdev);
+   pci_read_config_word(pdev, PCI_COMMAND, &orig);
+   new = irq_on ? (orig & ~PCI_COMMAND_INTX_DISABLE)
+   : (orig | PCI_COMMAND_INTX_DISABLE);
+   if (new != orig)
+   pci_write_config_word(gdev->pdev, PCI_COMMAND, new);
+   pci_unblock_user_cfg_access(pdev);
+   spin_unlock_irqrestore(&gdev->lock, flags);
+}
+
+/* irqcontrol is use by userspace to enable/disable interrupts. */
+/* A privileged app can write the PCI_COMMAND register directly,
+ * but we need this for normal apps
+ */
+static int irqcontrol(struct uio_info *info, s32 irq_on)
+{
+   struct uio_pci_generic_dev *gdev = to_uio_pci_generic_dev(info);
+
+   irqtoggle(gdev, irq_on);
+   return 0;
+}
+
+/* MSI and MSI-X Interrupt handler.
+ * For bad devices, we may get an interrupt per event/packet, but most
+ * devices will self-throttle until user driver wants more
+ */
+static irqreturn_t msihandler(int irq, void *arg)
+{
+   struct uio_info *info = arg;
+
+   uio_event_notify(info);
+   return IRQ_HANDLED;
+}
+
 /* Interrupt handler. Read/modify/write the command register to disable
  * the interrupt. */
 static irqreturn_t irqhandler(int irq, struct uio_info *info)
@@ -89,7 +147,7 @@ done:
 /* Verify that the device supports Interrupt Disable bit in command register,
  * per PCI 2.3, by fl

Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-01 Thread Tom Lyon
On Thursday 01 April 2010 02:09:09 am Avi Kivity wrote:
> On 04/01/2010 03:08 AM, Tom Lyon wrote:
> > uio_pci_generic has previously been discussed on the KVM list, but this
> > patch has nothing to do with KVM, so it is also going to LKML.
>
> (needs to go to lkml even if it was for kvm)
>
> > The point of this patch is to beef up the uio_pci_generic driver so that
> > a non-privileged user process can run a user level driver for most PCIe
> > devices. This can only be safe if there is an IOMMU in the system with
> > per-device domains.  Privileged users (CAP_SYS_RAWIO) are allowed if
> > there is no IOMMU.
> >
> > Specifically, I seek to allow low-latency user level network drivers (non
> > tcp/ip) which directly access SR-IOV style virtual network adapters, for
> > use with packages such as OpenMPI.
> >
> > Key areas of change:
> > - ioctl extensions to allow registration and dma mapping of memory
> > regions, with lock accounting
> > - support for mmu notifier driven de-mapping
>
> Note that current iommus/devices don't support restart-on-fault dma, so
> userspace drivers will have to lock memory so that it is not swapped
> out.  I don't think this prevents page migration, though.
The driver provides a way to lock memory for DMA; the mmu notifier support is 
to catch things when the user accidentally frees locked pages.

> > - support for MSI and MSI-X interrupts (the intel 82599 VFs support only
> > MSI-X)
>
> How does a userspace program receive those interrupts?
Same as other UIO drivers - by read()ing an event counter.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-01 Thread Tom Lyon
On Thursday 01 April 2010 05:52:18 am Joerg Roedel wrote:
> On Wed, Mar 31, 2010 at 05:08:38PM -0700, Tom Lyon wrote:
> > uio_pci_generic has previously been discussed on the KVM list, but this
> > patch has nothing to do with KVM, so it is also going to LKML.
>
> But since you send it to the KVM list it should be suitable for KVM too,
> no?
I know not.

>
> > The point of this patch is to beef up the uio_pci_generic driver so that
> > a non-privileged user process can run a user level driver for most PCIe
> > devices. This can only be safe if there is an IOMMU in the system with
> > per-device domains.  Privileged users (CAP_SYS_RAWIO) are allowed if
> > there is no IOMMU.
>
> If you rely on an IOMMU you can use the IOMMU-API instead of the DMA-API
> for dma mappings. This change makes this driver suitable for KVM use
> too. If the interface is designed clever enough we can even use it for
> IOMMU emulation for pass-through devices.
The use with privileged processes and no IOMMUs is still quite useful, so I'd 
rather stick with the DMA interface.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-01 Thread Tom Lyon
On Thursday 01 April 2010 07:25:04 am Michael S. Tsirkin wrote:
> On Wed, Mar 31, 2010 at 05:08:38PM -0700, Tom Lyon wrote:
> > uio_pci_generic has previously been discussed on the KVM list, but this
> > patch has nothing to do with KVM, so it is also going to LKML.
> >
> > The point of this patch is to beef up the uio_pci_generic driver so that
> > a non-privileged user process can run a user level driver for most PCIe
> > devices. This can only be safe if there is an IOMMU in the system with
> > per-device domains.
>
> Why? Per-guest domain should be safe enough.
I'm not sure what 'per-guest' means in an ordinary process context.
>
> >  Privileged users (CAP_SYS_RAWIO) are allowed if there is
> > no IOMMU.
>
> qemu does not support it, I doubt this last option is worth having.
This is extremely useful in non IOMMU systems - again, we're talking ordinary 
processes, nothing to do with VMs. As long as the program can be trusted, 
e.g., in embedded apps.

>
> > Specifically, I seek to allow low-latency user level network drivers (non
> > tcp/ip) which directly access SR-IOV style virtual network adapters, for
> > use with packages such as OpenMPI.
> >
> > Key areas of change:
> > - ioctl extensions to allow registration and dma mapping of memory
> > regions, with lock accounting
> > - support for mmu notifier driven de-mapping
> > - support for MSI and MSI-X interrupts (the intel 82599 VFs support only
> > MSI-X)
> > - allowing interrupt enabling and device register mapping all
> > through /dev/uio* so that  permissions may be granted just by chmod
> > on /dev/uio*
>
> For non-priveledged users, we need a way to enforce that
> device is bound to an iommu.
Right now I just use iommu_found - assuming that if we have one, it is in use. 
Something better would be nice.

> Further, locking really needs to be scoped with iommu domain existance
> and with iommu mappings: as long as a page is mapped in iommu,
> it must be locked. This patch does not seem to enforce that.
Sure it does. The DMA API - get_user_pages and dma_map_sg lock pages into the 
MMU and the IOMMU. The MMU notifier unlocks if the user forgets to do it 
explicitly.

> Also note that what we really want is a single iommu domain per guest,
> not per device.
For my networking applications, I will need the ability to talk to multiple 
devices on potentially separate IOMMUs. What would per-guest mean then?
>

> For this reason, I think we should address the problem somwwhat
> differently:
> - Create a character device to represent the iommu
> - This device will handle memory locking etc
> - Allow binding this device to iommu
> - Allow other operations only after iommu is bound
There are still per-device issues with locking - in particular the size of the 
device's DMA address space. The DMA API already handles this - why not use 
it? It would be nice to have a way to test whether a device is truly covered 
by an IOMMU, but today it appears that if an IOMMU exists, then it covers all 
devices (at least as far as I can see for x86).




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-01 Thread Tom Lyon
On Thursday 01 April 2010 08:54:14 am Avi Kivity wrote:
> On 04/01/2010 06:39 PM, Tom Lyon wrote:
> >>> - support for MSI and MSI-X interrupts (the intel 82599 VFs support
> >>> only MSI-X)
> >>
> >> How does a userspace program receive those interrupts?
> >
> > Same as other UIO drivers - by read()ing an event counter.
>
> IIRC the usual event counter is /dev/uioX, what's your event counter now?
Exact same mechanism.

>
> kvm really wants the event counter to be an eventfd, that allows hooking
> it directly to kvm (which can inject an interrupt on an eventfd_signal),
> can you adapt your patch to do this?

My patch does not currently go anywhere near the read/fd logic of /dev/uioX.
I think a separate patch would be appropriate.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-01 Thread Tom Lyon
On Thursday 01 April 2010 09:07:47 am Joerg Roedel wrote:
> On Thu, Apr 01, 2010 at 08:40:34AM -0700, Tom Lyon wrote:
> > On Thursday 01 April 2010 05:52:18 am Joerg Roedel wrote:
> > > > The point of this patch is to beef up the uio_pci_generic driver so
> > > > that a non-privileged user process can run a user level driver for
> > > > most PCIe devices. This can only be safe if there is an IOMMU in the
> > > > system with per-device domains.  Privileged users (CAP_SYS_RAWIO) are
> > > > allowed if there is no IOMMU.
> > >
> > > If you rely on an IOMMU you can use the IOMMU-API instead of the
> > > DMA-API for dma mappings. This change makes this driver suitable for
> > > KVM use too. If the interface is designed clever enough we can even use
> > > it for IOMMU emulation for pass-through devices.
> >
> > The use with privileged processes and no IOMMUs is still quite useful, so
> > I'd rather stick with the DMA interface.
>
> For the KVM use-case we need to be able to specify the io virtual
> address for a given process virtual address. This is not possible with
> the dma-api interface. So if we want to have uio-dma without an hardware
> iommu we need two distinct interfaces for userspace to cover all
> use-cases. I don't think its worth it to have two interfaces.
>
>   Joerg

I started to add that capability but then realized that the IOMMU API also 
doesn't allow it. The map function allows a range of physically contiguous 
pages, not virtual.

My preferred approach would be to add a DMA_ATTR that would request allocation 
of DMA at a specific device/iommu address.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-01 Thread Tom Lyon
On Thursday 01 April 2010 09:10:57 am Avi Kivity wrote:
> On 04/01/2010 07:06 PM, Tom Lyon wrote:
> > On Thursday 01 April 2010 08:54:14 am Avi Kivity wrote:
> >> On 04/01/2010 06:39 PM, Tom Lyon wrote:
> >>>>> - support for MSI and MSI-X interrupts (the intel 82599 VFs support
> >>>>> only MSI-X)
> >>>>
> >>>> How does a userspace program receive those interrupts?
> >>>
> >>> Same as other UIO drivers - by read()ing an event counter.
> >>
> >> IIRC the usual event counter is /dev/uioX, what's your event counter
> >> now?
> >
> > Exact same mechanism.
>
> But there are multiple msi-x interrupts, how do you know which one
> triggered?

You don't. This would suck for KVM, I guess, but we'd need major rework of the 
generic UIO stuff to have a separate event channel for each MSI-X.

For my purposes, collapsing all the MSI-Xs into one MSI-look-alike is fine, 
because I'd be using MSI anyways if I could. The weird Intel 82599 VF only 
supports MSI-X.

So one big question is - do we expand the whole UIO framework for KVM 
requirements, or do we split off either KVM or non-VM into a separate driver?
Hans or Greg - care to opine?




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-01 Thread Tom Lyon
On Thursday 01 April 2010 08:54:14 am Avi Kivity wrote:
> On 04/01/2010 06:39 PM, Tom Lyon wrote:
> >>> - support for MSI and MSI-X interrupts (the intel 82599 VFs support
> >>> only MSI-X)
> >>
> >> How does a userspace program receive those interrupts?
> >
> > Same as other UIO drivers - by read()ing an event counter.
>
> IIRC the usual event counter is /dev/uioX, what's your event counter now?
>
> kvm really wants the event counter to be an eventfd, that allows hooking
> it directly to kvm (which can inject an interrupt on an eventfd_signal),
> can you adapt your patch to do this?

I looked further into eventfds - they seem the perfect solution for the 
MSI/MSI-X interrupts. Will include in V2.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-09 Thread Tom Lyon
On Friday 09 April 2010 02:58:19 am Avi Kivity wrote:
> On 04/02/2010 08:05 PM, Greg KH wrote:
> >
> >> Currently kvm does device assignment with its own code, I'd like to unify
> >> it with uio, not split it off.
> >>
> >> Separate notifications for msi-x interrupts are just as useful for uio as
> >> they are for kvm.
> >>  
> > I agree, there should not be a difference here for KVM vs. the "normal"
> > version.
> >
> 
> Just so you know what you got into, here are the kvm requirements:
> 
> - msi interrupts delivered via eventfd (these allow us to inject 
> interrupts from uio to a guest without going through userspace)
Check.
> - nonlinear iommu mapping (i.e. map discontiguous ranges of the device 
> address space into ranges of the virtual address space)
Check.
> - dynamic iommu mapping (support guest memory hotplug)
Check.
> - unprivileged operation once an admin has assigned a device (my 
> preferred implementation is to have all operations go through an fd, 
> which can be passed via SCM_RIGHTS from a privileged application that 
> opens the file)
Check.
> - access to all config space, but BARs must be translated so userspace 
> cannot attack the host
Please elaborate. All of PCI config? All of PCIe config? Seems like a huge mess.
> - some mechanism which allows us to affine device interrupts with their 
> target vcpus (eventually, this is vague)
Do-able.
> - anything mst might add
mst?
> - a pony
Rainbow or glitter?

The 'check' items are already done, not fully tested; probably available next 
week.
Can we leave the others for future patches? Please? And I definitely need help 
with 
the PCI config stuff.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/1] uio_pci_generic: extensions to allow access for non-privileged processes

2010-04-09 Thread Tom Lyon
Mea culpa. 

On Friday 09 April 2010 02:08:55 am Joerg Roedel wrote:
> Btw. This patch posting is broken. It suffers from line-wraps which make
> it impossible to apply as-is. I was able to fix it but please consider
> this in your next posting.
> 
> On Wed, Mar 31, 2010 at 05:12:35PM -0700, Tom Lyon wrote:
> 
> > --- linux-2.6.33/drivers/uio/uio_pci_generic.c  2010-02-24 
> > 10:52:17.0 -0800
> ^
> Unexpected line-wrap.
> 
> I also got some whitespace warnings when trying to apply it. Please make
> sure you fix this in the next version too.
> 
> Thanks,
> 
>   Joerg
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2] drivers/uio/uio_pci_generic.c: allow access for non-privileged processes

2010-04-15 Thread Tom Lyon
This is the firt of 2 related, but independent, patches. This is for 
uio_pci_generic.c, the next is for uio.c.

The 2 patches were previously one large patch. Changes for this version:
- uio_pci_generic.c just gets extensions so that a single fd can be used
  by non-privileged processes for interrupt control and mmaps
- All of the DMA and IOMMU related stuff move to uio.c; no longer a need
  to pass ioctls to individual uio drivers. It turns out that the code 
  is not PCI specific anyways.
- A new ioctl to pin DMA buffers to certain IO virtual addresses for KVM.
- New eventfd based interrupt notifications, including support for PCI
  specific MSI and MSI-X interrupts.
- PCI specific code to reset PCI functions before and after use

--- linux-2.6.33/drivers/uio/uio_pci_generic.c  2010-02-24 10:52:17.0 
-0800
+++ uio-2.6.33/drivers/uio/uio_pci_generic.c2010-04-15 13:44:25.0 
-0700
@@ -14,9 +14,9 @@
  * # ls -l /sys/bus/pci/devices/:00:19.0/driver
  * .../:00:19.0/driver -> ../../../bus/pci/drivers/uio_pci_generic
  *
- * Driver won't bind to devices which do not support the Interrupt Disable Bit
+ * Driver won't bind to devices which do not support MSI, MSI-x, or the 
Interrupt Disable Bit
  * in the command register. All devices compliant to PCI 2.3 (circa 2002) and
- * all compliant PCI Express devices should support this bit.
+ * all compliant PCI Express devices should support one of these.
  */
 
 #include 
@@ -41,6 +41,39 @@
return container_of(info, struct uio_pci_generic_dev, info);
 }
 
+/* Read/modify/write command register to disable interrupts.
+ * Note: we could cache the value and optimize the read if there was a way to
+ * get notified of user changes to command register through sysfs.
+ * */
+static void irqtoggle(struct uio_pci_generic_dev *gdev, int irq_on)
+{
+   struct pci_dev *pdev = gdev->pdev;
+   unsigned long flags;
+   u16 orig, new;
+
+   spin_lock_irqsave(&gdev->lock, flags);
+   pci_block_user_cfg_access(pdev);
+   pci_read_config_word(pdev, PCI_COMMAND, &orig);
+   new = irq_on ? (orig & ~PCI_COMMAND_INTX_DISABLE)
+   : (orig | PCI_COMMAND_INTX_DISABLE);
+   if (new != orig)
+   pci_write_config_word(gdev->pdev, PCI_COMMAND, new);
+   pci_unblock_user_cfg_access(pdev);
+   spin_unlock_irqrestore(&gdev->lock, flags);
+}
+
+/* irqcontrol is use by userspace to enable/disable interrupts. */
+/* A privileged app can write the PCI_COMMAND register directly,
+ * but we need this for normal apps
+ */
+static int irqcontrol(struct uio_info *info, s32 irq_on)
+{
+   struct uio_pci_generic_dev *gdev = to_uio_pci_generic_dev(info);
+
+   irqtoggle(gdev, irq_on);
+   return 0;
+}
+
 /* Interrupt handler. Read/modify/write the command register to disable
  * the interrupt. */
 static irqreturn_t irqhandler(int irq, struct uio_info *info)
@@ -89,7 +122,7 @@
 /* Verify that the device supports Interrupt Disable bit in command register,
  * per PCI 2.3, by flipping this bit and reading it back: this bit was readonly
  * in PCI 2.2. */
-static int __devinit verify_pci_2_3(struct pci_dev *pdev)
+static int verify_pci_2_3(struct pci_dev *pdev)
 {
u16 orig, new;
int err = 0;
@@ -121,17 +154,51 @@
return err;
 }
 
-static int __devinit probe(struct pci_dev *pdev,
+/* we could've used the generic pci sysfs stuff for mmap,
+ * but this way we can allow non-privileged users as long
+ * as /dev/uio* has the right permissions
+ */
+static void uio_do_maps(struct uio_pci_generic_dev *gdev)
+{
+   struct pci_dev *pdev = gdev->pdev;
+   struct uio_info *info = &gdev->info;
+   int i, j;
+   char *name;
+
+   for (i=0, j=0; imem[j].name = name; 
+   info->mem[j].addr = pci_resource_start(pdev, i); 
+   info->mem[j].size = pci_resource_len(pdev, i);
+   info->mem[j].memtype = UIO_MEM_PHYS; 
+   j++;
+   }
+   }
+   for (i=0, j=0; iport[j].name = name;
+   info->port[j].start = pci_resource_start(pdev, i);
+   info->port[j].size = pci_resource_len(pdev, i);
+   info->port[j].porttype = UIO_PORT_X86;
+   j++;
+   }
+   }
+}
+
+static int probe(struct pci_dev *pdev,
   const struct pci_device_id *id)
 {
struct uio_pci_generic_dev *gdev;
int err;
-
-   if (!pdev->irq) {
-   dev_warn(&pdev->dev, "No IRQ assigned to device: "
-"no support for interrupts?\n");
-   return -ENODEV;
-   }
+   int msi=0;
 
err = pci_enable_device(pdev);
if (err) {
@@ -140,9 +207,26 @@
return err;
}
 
-   err = verify_pci_2_3(pdev);
-   if (err)
-   goto err_verify;
+   if (pci_find_capability(pdev, PCI_CAP_ID_MSI)) {
+  

[PATCH V2] drivers/uio/uio.c: DMA mapping, interrupt extensions, etc.

2010-04-15 Thread Tom Lyon
This is the second of 2 related, but independent, patches. This is for 
uio.c, the previous is for uio_pci_generic.c.

The 2 patches were previously one large patch. Changes for this version:
- uio_pci_generic.c just gets extensions so that a single fd can be used
  by non-privileged processes for interrupt control and mmaps
- All of the DMA and IOMMU related stuff move to uio.c; no longer a need
  to pass ioctls to individual uio drivers. It turns out that the code 
  is not PCI specific anyways.
- A new ioctl to pin DMA buffers to certain IO virtual addresses for KVM.
- New eventfd based interrupt notifications, including support for PCI
  specific MSI and MSI-X interrupts.
- PCI specific code to reset PCI functions before and after use

diff -ruNP linux-2.6.33/drivers/uio/uio.c uio-2.6.33/drivers/uio/uio.c
--- linux-2.6.33/drivers/uio/uio.c  2010-02-24 10:52:17.0 -0800
+++ uio-2.6.33/drivers/uio/uio.c2010-04-15 12:39:02.0 -0700
@@ -23,6 +23,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 #define UIO_MAX_DEVICES 255
 
@@ -37,8 +42,37 @@
struct uio_info *info;
struct kobject  *map_dir;
struct kobject  *portio_dir;
+   struct pci_dev  *pdev;
+   int pmaster;
+   struct semaphoregate;
+   int listeners;
+   atomic_tmapcount;
+   struct msix_entry   *msix;
+   int nvec;
+   struct iommu_domain *domain;
+   int cachec;
+   struct eventfd_ctx  *ev_irq;
+   struct eventfd_ctx  *ev_msi;
+   struct eventfd_ctx  **ev_msix;
 };
 
+/*
+ * Structure for keeping track of memory nailed down by the
+ * user for DMA
+ */
+struct dma_map_page {
+struct list_head list;
+struct page **pages;
+   struct scatterlist *sg;
+dma_addr_t  daddr;
+   unsigned long   vaddr;
+   int npage;
+   int rdwr;
+};
+
+static void uio_disable_msi(struct uio_device *);
+static void uio_disable_msix(struct uio_device *);
+
 static int uio_major;
 static DEFINE_IDR(uio_idr);
 static const struct file_operations uio_fops;
@@ -440,17 +474,38 @@
struct uio_device *idev = (struct uio_device *)dev_id;
irqreturn_t ret = idev->info->handler(irq, idev->info);
 
-   if (ret == IRQ_HANDLED)
+   if (ret != IRQ_HANDLED)
+   return ret;
+   if (idev->ev_irq)
+   eventfd_signal(idev->ev_irq, 1);
+   else
uio_event_notify(idev->info);
 
return ret;
 }
 
+/*
+ * MSI and MSI-X Interrupt handler.
+ * Just record an event
+ */
+static irqreturn_t msihandler(int irq, void *arg)
+{
+   struct eventfd_ctx *ctx = arg;
+
+   eventfd_signal(ctx, 1);
+   return IRQ_HANDLED;
+}
+
 struct uio_listener {
struct uio_device *dev;
s32 event_count;
+   struct mm_struct*mm;
+   struct mmu_notifier mmu_notifier;
+   struct list_headdm_list;
 };
 
+static void uio_dma_unmapall(struct uio_listener *);
+
 static int uio_open(struct inode *inode, struct file *filep)
 {
struct uio_device *idev;
@@ -470,7 +525,7 @@
goto out;
}
 
-   listener = kmalloc(sizeof(*listener), GFP_KERNEL);
+   listener = kzalloc(sizeof(*listener), GFP_KERNEL);
if (!listener) {
ret = -ENOMEM;
goto err_alloc_listener;
@@ -478,8 +533,22 @@
 
listener->dev = idev;
listener->event_count = atomic_read(&idev->event);
+   INIT_LIST_HEAD(&listener->dm_list);
filep->private_data = listener;
 
+   down(&idev->gate);
+   if (idev->listeners == 0) { /* first open */
+   if (idev->pmaster && !iommu_found() && !capable(CAP_SYS_RAWIO)) 
{
+   up(&idev->gate);
+   return -EPERM;
+   }
+   /* reset to known state if we can */
+   if (idev->pdev)
+   (void) pci_reset_function(idev->pdev);
+   }
+   idev->listeners++;
+   up(&idev->gate);
+
if (idev->info->open) {
ret = idev->info->open(idev->info, inode);
if (ret)
@@ -514,6 +583,34 @@
if (idev->info->release)
ret = idev->info->release(idev->info, inode);
 
+   uio_dma_unmapall(listener);
+   if (listener->mm) {
+   mmu_notifier_unregister(&listener->mmu_notifier, listener->mm);
+   listener->mm = NULL;
+   }
+
+   down(&idev->gate);
+   if (--idev->listeners <= 0) {
+   if (idev->msix) {
+   uio_disable_msix(idev);
+   }
+   if (idev->ev_msi) {
+   uio_disable_msi(idev);
+   }
+   if (idev->ev_irq) {
+  

[PATCH 0/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers

2010-09-22 Thread Tom Lyon
After a long summer break, it's tanned, it's rested, and it's ready to rumble!

In this version:*** REBASE to 2.6.35 ***

There's new code using generic netlink messages which allows the kernel
to notify the user level of weird events and allows the user level to 
respond. This is currently used to handle device removal (whether software
or hardware driven), PCI error events, and system suspend & hibernate.

The driver now supports devices which use multiple MSI interrupts, reflecting
the actual number of interrupts allocated by the system to the user level.

PCI config accesses are now done through the pci_user_{read,write)_config
routines from drivers/pci/access.c.

Numerous other tweaks and cleanups.

Blurb from version 3:

There are lots of bug fixes and cleanups in this version, but the main
change is to check to make sure that the IOMMU has interrupt remapping
enabled, which is necessary to prevent user level code from triggering
spurious interrupts for other devices.  Since most platforms today
do not have the necessary hardware and/or software for this, a module
option can override this check, thus making vfio useful (but not safe)
on many more platforms.

In the next version I plan to add kernel to user messaging using the
generic netlink mechanism to allow the user driver to react to hot add
and remove, and power management requests.

Blurb from version 2:

This version now requires an IOMMU domain to be set before any access to
device registers is granted (except that config space may be read).  In
addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
which does not have sufficient controls around IOMMU usage.  The IOMMU domain
is obtained from the 'uiommu' driver which is included in this patch.

Various locking, security, and documentation issues have also been fixed.

Please commit - it or me!
But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?

Blurb from version 1:

This patch is the evolution of code which was first proposed as a patch to
uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
out of the uio framework, and things seem much cleaner. Of course, there is
a lot of functional overlap with uio, but the previous version just seemed
like a giant mode switch in the uio code that did not lead to clarity for
either the new or old code.

[a pony for avi...]
The major new functionality in this version is the ability to deal with
PCI config space accesses (through read & write calls) - but includes table
driven code to determine whats safe to write and what is not. Also, some
virtualization of the config space to allow drivers to think they're writing
some registers when they're not.  Also, IO space accesses are also allowed.
Drivers for devices which use MSI-X are now prevented from directly writing
the MSI-X vector area.

All interrupts are now handled using eventfds, which makes things very simple.

The name VFIO refers to the Virtual Function capabilities of SR-IOV devices
but the driver does support many more types of devices.  I was none too sure
what driver directory this should live in, so for now I made up my own under
drivers/vfio. As a new driver/new directory, who makes the commit decision?

I currently have user level drivers working for 3 different network adapters
- the Cisco "Palo" enic, the Intel 82599 VF, and the Intel 82576 VF (but the
whole user level framework is a long ways from release).  This driver could
also clearly replace a number of other drivers written just to give user
access to certain devices - but that will take time.

Tom Lyon (3):
  VFIO V4: export pci_user_{read,write}_config
  VFIO V4: uiommu driver - allow user progs to manipulate iommu domains
  VFIO V4: VFIO driver: Non-privileged user level PCI drivers

 Documentation/ioctl/ioctl-number.txt |1 +
 Documentation/vfio.txt   |  183 
 MAINTAINERS  |8 +
 drivers/Kconfig  |2 +
 drivers/Makefile |2 +
 drivers/pci/access.c |6 +-
 drivers/pci/pci.h|7 -
 drivers/vfio/Kconfig |   18 +
 drivers/vfio/Makefile|   11 +
 drivers/vfio/uiommu.c|  126 ++
 drivers/vfio/vfio_dma.c  |  346 +++
 drivers/vfio/vfio_intrs.c|  257 
 drivers/vfio/vfio_main.c |  768 ++
 drivers/vfio/vfio_netlink.c  |  459 
 drivers/vfio/vfio_pci_config.c   |  698 ++
 drivers/vfio/vfio_rdwr.c |  158 +++
 drivers/vfio/vfio_sysfs.c|  118 ++
 include/linux/Kbuild |1 +
 include/linux/pci.h  |8 +
 include/linux/uiommu.h   |   76 
 include/linux/vfio.h |  267 +

[PATCH 1/3] VFIO V4: export pci_user_{read,write}_config

2010-09-22 Thread Tom Lyon

Signed-off-by: Tom Lyon 
---
 drivers/pci/access.c |6 --
 drivers/pci/pci.h|7 ---
 include/linux/pci.h  |8 
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 531bc69..96ed449 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -157,7 +157,8 @@ int pci_user_read_config_##size 
\
raw_spin_unlock_irq(&pci_lock); \
*val = (type)data;  \
return ret; \
-}
+}  \
+EXPORT_SYMBOL_GPL(pci_user_read_config_##size);
 
 #define PCI_USER_WRITE_CONFIG(size,type)   \
 int pci_user_write_config_##size   \
@@ -171,7 +172,8 @@ int pci_user_write_config_##size
\
pos, sizeof(type), val);\
raw_spin_unlock_irq(&pci_lock); \
return ret; \
-}
+}  \
+EXPORT_SYMBOL_GPL(pci_user_write_config_##size);
 
 PCI_USER_READ_CONFIG(byte, u8)
 PCI_USER_READ_CONFIG(word, u16)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index f8077b3..76c3ed3 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -67,13 +67,6 @@ static inline bool pci_is_bridge(struct pci_dev *pci_dev)
return !!(pci_dev->subordinate);
 }
 
-extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
-extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
-extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 
*val);
-extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
-extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
-extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 
val);
-
 struct pci_vpd_ops {
ssize_t (*read)(struct pci_dev *dev, loff_t pos, size_t count, void 
*buf);
ssize_t (*write)(struct pci_dev *dev, loff_t pos, size_t count, const 
void *buf);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index f26fda7..36eacbf 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -754,6 +754,14 @@ static inline int pci_write_config_dword(struct pci_dev 
*dev, int where,
return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
 }
 
+/* user-space driven config access */
+extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
+extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
+extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 
*val);
+extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
+extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
+extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 
val);
+
 int __must_check pci_enable_device(struct pci_dev *dev);
 int __must_check pci_enable_device_io(struct pci_dev *dev);
 int __must_check pci_enable_device_mem(struct pci_dev *dev);
-- 
1.6.0.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] VFIO V4: uiommu driver - allow user progs to manipulate iommu domains

2010-09-22 Thread Tom Lyon

Signed-off-by: Tom Lyon 
---
 drivers/Kconfig|2 +
 drivers/Makefile   |1 +
 drivers/vfio/Kconfig   |8 +++
 drivers/vfio/Makefile  |1 +
 drivers/vfio/uiommu.c  |  126 
 include/linux/uiommu.h |   76 +
 6 files changed, 214 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/uiommu.c
 create mode 100644 include/linux/uiommu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index a2b902f..711c1cb 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
 source "drivers/staging/Kconfig"
 
 source "drivers/platform/Kconfig"
+
+source "drivers/vfio/Kconfig"
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index 91874e0..e496fc3 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_FUSION)  += message/
 obj-$(CONFIG_FIREWIRE) += firewire/
 obj-y  += ieee1394/
 obj-$(CONFIG_UIO)  += uio/
+obj-$(CONFIG_UIOMMU)   += vfio/
 obj-y  += cdrom/
 obj-y  += auxdisplay/
 obj-$(CONFIG_PCCARD)   += pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
new file mode 100644
index 000..3ab9af3
--- /dev/null
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,8 @@
+menuconfig UIOMMU
+   tristate "User level manipulation of IOMMU"
+   help
+ Device driver to allow user level programs to
+ manipulate IOMMU domains.
+
+ If you don't know what to do here, say N.
+
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
new file mode 100644
index 000..556f3c1
--- /dev/null
+++ b/drivers/vfio/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_UIOMMU) += uiommu.o
diff --git a/drivers/vfio/uiommu.c b/drivers/vfio/uiommu.c
new file mode 100644
index 000..eec1759
--- /dev/null
+++ b/drivers/vfio/uiommu.c
@@ -0,0 +1,126 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, p...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+*/
+
+/*
+ * uiommu driver - issue fd handles for IOMMU domains
+ * so they may be passed to vfio (and others?)
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tom Lyon ");
+MODULE_DESCRIPTION("User IOMMU driver");
+
+static struct uiommu_domain *uiommu_domain_alloc(void)
+{
+   struct iommu_domain *domain;
+   struct uiommu_domain *udomain;
+
+   domain = iommu_domain_alloc();
+   if (domain == NULL)
+   return NULL;
+   udomain = kzalloc(sizeof *udomain, GFP_KERNEL);
+   if (udomain == NULL) {
+   iommu_domain_free(domain);
+   return NULL;
+   }
+   udomain->domain = domain;
+   atomic_inc(&udomain->refcnt);
+   return udomain;
+}
+
+static int uiommu_open(struct inode *inode, struct file *file)
+{
+   struct uiommu_domain *udomain;
+
+   udomain = uiommu_domain_alloc();
+   if (udomain == NULL)
+   return -ENOMEM;
+   file->private_data = udomain;
+   return 0;
+}
+
+static int uiommu_release(struct inode *inode, struct file *file)
+{
+   struct uiommu_domain *udomain;
+
+   udomain = file->private_data;
+   uiommu_put(udomain);
+   return 0;
+}
+
+static const struct file_operations uiommu_fops = {
+   .owner  = THIS_MODULE,
+   .open   = uiommu_open,
+   .release= uiommu_release,
+};
+
+static struct miscdevice uiommu_dev = {
+   .name   = "uiommu",
+   .minor  = MISC_DYNAMIC_MINOR,
+   .fops   = &uiommu_fops,
+};
+
+struct uiommu_domain *uiommu_fdget(int fd)
+{
+   struct file *file;
+   struct uiommu_domain *udomain;
+
+   file = fget(fd);
+   if (!file)
+   return ERR_PTR(-EBADF);
+   if (file->f_op != &uiommu_fops) {
+   fput(file);
+   return ERR_PTR(-EINVAL);
+   }
+   udomain = file->private_data;
+   atomic_inc(&udo

Re: [PATCH 3/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers

2010-09-30 Thread Tom Lyon
On Monday, September 27, 2010 04:54:21 am Michael S. Tsirkin wrote:
> On Wed, Sep 22, 2010 at 02:18:24PM -0700, Tom Lyon wrote:
> > Signed-off-by: Tom Lyon 
> 
> Some comments on the pci bits:
> 
> After going over them for the Nth time - something needs to be done
> with the rvirt/write tables. I doubt anyone besides me and you
> has gone over them: one has to pull
> up the spec just to understand which bits are set here. Why can't we
> have a module init routine and use the macros in pci_regs.h for this
> purpose, as all other drivers do? We will also get to use the nice
> endian-ness macros linux has for portability ...

I tried a couple of approaches before settling on this one. Yes, its dense, 
perhaps its ugly, but a table approach beats open coding of rules.
And I absolutely don't expect it to verifiable to someone without a PCI spec;
I had to have the PCI spec to write it! This is an example of where 
correctness and readability pull to different places. And the defines in 
pci_regs are incomplete and not real self-consistent anyways.

Perhaps its time for someone to code up a different approach? Code talks!

> 
> > +static struct perm_bits pci_cap_basic_perm[] = {
> 
> What endian-ness is this? Native?
> You pass this to user as is but pci config is little endian...
> Also -are all these readonly? So const? read mostly?
Yes, I need some consts.
Endianness - I haven't really thought much and I don't have  any big-endian 
machines to test with.  However, for now everything is byte-at-a-time which 
makes things easier. I'll take a pass at cleaning it up.
> 
> > +   { 0x,   0, },   /* 0x00 vendor & device id - RO */
> > +   { 0x0003,   0x, },  /* 0x04 cmd - mem & io bits virt */
> 
> Interrupt disable bit is under host control.
Yes?

> 
> > +   { 0,0, },   /* 0x08 class code & revision id */
> > +   { 0,0xFF00, },  /* 0x0c bist, htype, lat, cache */
> > +   { 0x,   0x, },  /* 0x10 bar */
> > +   { 0x,   0x, },  /* 0x14 bar */
> > +   { 0x,   0x, },  /* 0x18 bar */
> > +   { 0x,   0x, },  /* 0x1c bar */
> > +   { 0x,   0x, },  /* 0x20 bar */
> > +   { 0x,   0x, },  /* 0x24 bar */
> > +   { 0,0, },   /* 0x28 cardbus - not yet */
> > +   { 0,0, },   /* 0x2c subsys vendor & dev */
> > +   { 0x,   0x, },  /* 0x30 rom bar */
> > +   { 0,0, },   /* 0x34 capability ptr & resv */
> > +   { 0,0, },   /* 0x38 resv */
> > +   { 0x00FF,   0x00FF, },  /* 0x3c max_lat ... irq */
> > +};
> > +
> > +static struct perm_bits pci_cap_pm_perm[] = {
> > +   { 0,0, },   /* 0x00 PM capabilities */
> > +   { 0,0x, },  /* 0x04 PM control/status */
> > +};
> > +
> > +static struct perm_bits pci_cap_vpd_perm[] = {
> > +   { 0,0x, },  /* 0x00 address */
> > +   { 0,0x, },  /* 0x04 data */
> > +};
> > +
> > +static struct perm_bits pci_cap_slotid_perm[] = {
> > +   { 0,0, },   /* 0x00 all read only */
> > +};
> > +
> > +/* 4 different possible layouts of MSI capability */
> > +static struct perm_bits pci_cap_msi_10_perm[] = {
> > +   { 0x00FF,   0x00FF, },  /* 0x00 MSI message control */
> > +   { 0x,   0x, },  /* 0x04 MSI message address */
> > +   { 0x,   0x, },  /* 0x08 MSI message data */
> > +};
> > +static struct perm_bits pci_cap_msi_14_perm[] = {
> > +   { 0x00FF,   0x00FF, },  /* 0x00 MSI message control */
> > +   { 0x,   0x, },  /* 0x04 MSI message address */
> > +   { 0x,   0x, },  /* 0x08 MSI message upper addr */
> > +   { 0x,   0x, },  /* 0x0c MSI message data */
> > +};
> > +static struct perm_bits pci_cap_msi_20_perm[] = {
> > +   { 0x00FF,   0x00FF, },  /* 0x00 MSI message control */
> > +   { 0x,   0x, },  /* 0x04 MSI message address */
> > +   { 0x,   0x, },  /* 0x08 MSI message data */
> > +   { 0,0x, },  /* 0x0c MSI mask bits */
> > +   { 0,0x, },  /* 0x10 MSI pending bits */
> > +};
> > +static struct perm_bits pci_cap_msi_24_perm[] = {
> > +   { 0x00FF,   0x00FF, },  /* 0x00 MSI message control */
> > +   { 0x,   0x, },  /* 0x04 MSI message address */
> > +   { 0x,   0x, },  /* 0x08 MSI message u

Re: [PATCH 3/3] VFIO V4: VFIO driver: Non-privileged user level PCI drivers

2010-10-15 Thread Tom Lyon
Michael & Alex et al -
Sorry for going quiet; I've been digesting the comments and researching 
a 
lot more stuff.  I plan to release V5 shortly after 2.6.36 is out, highlights 
will be:

1. Re-written pci config tables - using approach suggested by MST to clean 
things up. Looking much better. Also fixed endian issues.
2. Clean up ROM bar handling. Now only read() system call can access rom bar; 
it is always disabled afterwards.
3. Clean up pci_iomap and pci_request_regions handling.  Allocates on demand 
but frees all on close.
4. Resets device on open. Disables, but does not do full reset on close due to 
lock problem with remove() callers.
5. Fixed up memlock rlimit accounting.
6. Lots of small cleanups.

Stay tuned.
I'm also looking at adding PCIe extended capabilities - got a request from a 
cisco project.



--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] VFIO V5: Non-privileged user level PCI drivers

2010-10-28 Thread Tom Lyon
[Rebased to 2.6.36]
Just in time for Halloween - be afraid!

This version adds support for PCIe extended capabilities, including Advanced
Error Reporting.  All of the config table initialization has been rewritten to
be much more readable. All config accesses are byte-at-a-time and endian issues
have been resolved. Reading of PCI ROMs is now handled properly. Problems with
usage if pci_iomap and pci_request_regions have now been resolved.  Devices are
now reset upon first open. Races in rmMemlock rlimit accounting have been
cleaned up. Lots of other tweaks and comments.

Blurb from version 4:

After a long summer break, it's tanned, it's rested, and it's ready to rumble!

In this version:*** REBASE to 2.6.35 ***

There's new code using generic netlink messages which allows the kernel
to notify the user level of weird events and allows the user level to 
respond. This is currently used to handle device removal (whether software
or hardware driven), PCI error events, and system suspend & hibernate.

The driver now supports devices which use multiple MSI interrupts, reflecting
the actual number of interrupts allocated by the system to the user level.

PCI config accesses are now done through the pci_user_{read,write)_config
routines from drivers/pci/access.c.

Numerous other tweaks and cleanups.

Blurb from version 3:

There are lots of bug fixes and cleanups in this version, but the main
change is to check to make sure that the IOMMU has interrupt remapping
enabled, which is necessary to prevent user level code from triggering
spurious interrupts for other devices.  Since most platforms today
do not have the necessary hardware and/or software for this, a module
option can override this check, thus making vfio useful (but not safe)
on many more platforms.

In the next version I plan to add kernel to user messaging using the
generic netlink mechanism to allow the user driver to react to hot add
and remove, and power management requests.

Blurb from version 2:

This version now requires an IOMMU domain to be set before any access to
device registers is granted (except that config space may be read).  In
addition, the VFIO_DMA_MAP_ANYWHERE is dropped - it used the dma_map_sg API
which does not have sufficient controls around IOMMU usage.  The IOMMU domain
is obtained from the 'uiommu' driver which is included in this patch.

Various locking, security, and documentation issues have also been fixed.

Please commit - it or me!
But seriously, who gets to commit this? Avi for KVM? or GregKH for drivers?

Blurb from version 1:

This patch is the evolution of code which was first proposed as a patch to
uio/uio_pci_generic, then as a more generic uio patch. Now it is taken entirely
out of the uio framework, and things seem much cleaner. Of course, there is
a lot of functional overlap with uio, but the previous version just seemed
like a giant mode switch in the uio code that did not lead to clarity for
either the new or old code.

[a pony for avi...]
The major new functionality in this version is the ability to deal with
PCI config space accesses (through read & write calls) - but includes table
driven code to determine whats safe to write and what is not. Also, some
virtualization of the config space to allow drivers to think they're writing
some registers when they're not.  Also, IO space accesses are also allowed.
Drivers for devices which use MSI-X are now prevented from directly writing
the MSI-X vector area.

All interrupts are now handled using eventfds, which makes things very simple.

The name VFIO refers to the Virtual Function capabilities of SR-IOV devices
but the driver does support many more types of devices.  I was none too sure
what driver directory this should live in, so for now I made up my own under
drivers/vfio. As a new driver/new directory, who makes the commit decision?

I currently have user level drivers working for 3 different network adapters
- the Cisco "Palo" enic, the Intel 82599 VF, and the Intel 82576 VF (but the
whole user level framework is a long ways from release).  This driver could
also clearly replace a number of other drivers written just to give user
access to certain devices - but that will take time.

Tom Lyon (4):
  VFIO V5: export pci_user_{read,write}_config
  VFIO V5: additions to include/linux/pci_regs.h
  VFIO V5: uiommu driver - allow user progs to manipulate iommu domains
  VFIO V5: Non-privileged user level PCI drivers

 Documentation/ioctl/ioctl-number.txt |1 +
 Documentation/vfio.txt   |  182 ++
 MAINTAINERS  |8 +
 drivers/Kconfig  |2 +
 drivers/Makefile |2 +
 drivers/pci/access.c |6 +-
 drivers/pci/pci.h|7 -
 drivers/vfio/Kconfig |   18 +
 drivers/vfio/Makefile|   11 +
 drivers/vfio/uiommu.c|  126 
 drivers

[PATCH 1/4] VFIO V5: export pci_user_{read,write}_config

2010-10-28 Thread Tom Lyon

Acked-by:   Jesse Barnes 
Signed-off-by: Tom Lyon 
---
 drivers/pci/access.c |6 --
 drivers/pci/pci.h|7 ---
 include/linux/pci.h  |8 
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 531bc69..96ed449 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -157,7 +157,8 @@ int pci_user_read_config_##size 
\
raw_spin_unlock_irq(&pci_lock); \
*val = (type)data;  \
return ret; \
-}
+}  \
+EXPORT_SYMBOL_GPL(pci_user_read_config_##size);
 
 #define PCI_USER_WRITE_CONFIG(size,type)   \
 int pci_user_write_config_##size   \
@@ -171,7 +172,8 @@ int pci_user_write_config_##size
\
pos, sizeof(type), val);\
raw_spin_unlock_irq(&pci_lock); \
return ret; \
-}
+}  \
+EXPORT_SYMBOL_GPL(pci_user_write_config_##size);
 
 PCI_USER_READ_CONFIG(byte, u8)
 PCI_USER_READ_CONFIG(word, u16)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6beb11b..e1db481 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -77,13 +77,6 @@ static inline bool pci_is_bridge(struct pci_dev *pci_dev)
return !!(pci_dev->subordinate);
 }
 
-extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
-extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
-extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 
*val);
-extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
-extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
-extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 
val);
-
 struct pci_vpd_ops {
ssize_t (*read)(struct pci_dev *dev, loff_t pos, size_t count, void 
*buf);
ssize_t (*write)(struct pci_dev *dev, loff_t pos, size_t count, const 
void *buf);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index c8d95e3..7f22c8a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -756,6 +756,14 @@ static inline int pci_write_config_dword(struct pci_dev 
*dev, int where,
return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
 }
 
+/* user-space driven config access */
+extern int pci_user_read_config_byte(struct pci_dev *dev, int where, u8 *val);
+extern int pci_user_read_config_word(struct pci_dev *dev, int where, u16 *val);
+extern int pci_user_read_config_dword(struct pci_dev *dev, int where, u32 
*val);
+extern int pci_user_write_config_byte(struct pci_dev *dev, int where, u8 val);
+extern int pci_user_write_config_word(struct pci_dev *dev, int where, u16 val);
+extern int pci_user_write_config_dword(struct pci_dev *dev, int where, u32 
val);
+
 int __must_check pci_enable_device(struct pci_dev *dev);
 int __must_check pci_enable_device_io(struct pci_dev *dev);
 int __must_check pci_enable_device_mem(struct pci_dev *dev);
-- 
1.6.0.2

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] VFIO V5: uiommu driver - allow user progs to manipulate iommu domains

2010-10-28 Thread Tom Lyon

Signed-off-by: Tom Lyon 
---
 drivers/Kconfig|2 +
 drivers/Makefile   |1 +
 drivers/vfio/Kconfig   |8 +++
 drivers/vfio/Makefile  |1 +
 drivers/vfio/uiommu.c  |  126 
 include/linux/uiommu.h |   76 +
 6 files changed, 214 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/uiommu.c
 create mode 100644 include/linux/uiommu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index a2b902f..711c1cb 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -111,4 +111,6 @@ source "drivers/xen/Kconfig"
 source "drivers/staging/Kconfig"
 
 source "drivers/platform/Kconfig"
+
+source "drivers/vfio/Kconfig"
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index a2aea53..c445440 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -53,6 +53,7 @@ obj-$(CONFIG_FUSION)  += message/
 obj-y  += firewire/
 obj-y  += ieee1394/
 obj-$(CONFIG_UIO)  += uio/
+obj-$(CONFIG_UIOMMU)   += vfio/
 obj-y  += cdrom/
 obj-y  += auxdisplay/
 obj-$(CONFIG_PCCARD)   += pcmcia/
diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
new file mode 100644
index 000..3ab9af3
--- /dev/null
+++ b/drivers/vfio/Kconfig
@@ -0,0 +1,8 @@
+menuconfig UIOMMU
+   tristate "User level manipulation of IOMMU"
+   help
+ Device driver to allow user level programs to
+ manipulate IOMMU domains.
+
+ If you don't know what to do here, say N.
+
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
new file mode 100644
index 000..556f3c1
--- /dev/null
+++ b/drivers/vfio/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_UIOMMU) += uiommu.o
diff --git a/drivers/vfio/uiommu.c b/drivers/vfio/uiommu.c
new file mode 100644
index 000..5c17c5a
--- /dev/null
+++ b/drivers/vfio/uiommu.c
@@ -0,0 +1,126 @@
+/*
+ * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
+ * Author: Tom Lyon, p...@cisco.com
+ *
+ * This program is free software; you may redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; version 2 of the License.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+*/
+
+/*
+ * uiommu driver - issue fd handles for IOMMU domains
+ * so they may be passed to vfio (and others?)
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Tom Lyon ");
+MODULE_DESCRIPTION("User IOMMU driver");
+
+static struct uiommu_domain *uiommu_domain_alloc(void)
+{
+   struct iommu_domain *domain;
+   struct uiommu_domain *udomain;
+
+   domain = iommu_domain_alloc();
+   if (!domain)
+   return NULL;
+   udomain = kzalloc(sizeof *udomain, GFP_KERNEL);
+   if (!udomain) {
+   iommu_domain_free(domain);
+   return NULL;
+   }
+   udomain->domain = domain;
+   atomic_inc(&udomain->refcnt);
+   return udomain;
+}
+
+static int uiommu_open(struct inode *inode, struct file *file)
+{
+   struct uiommu_domain *udomain;
+
+   udomain = uiommu_domain_alloc();
+   if (!udomain)
+   return -ENOMEM;
+   file->private_data = udomain;
+   return 0;
+}
+
+static int uiommu_release(struct inode *inode, struct file *file)
+{
+   struct uiommu_domain *udomain;
+
+   udomain = file->private_data;
+   uiommu_put(udomain);
+   return 0;
+}
+
+static const struct file_operations uiommu_fops = {
+   .owner  = THIS_MODULE,
+   .open   = uiommu_open,
+   .release= uiommu_release,
+};
+
+static struct miscdevice uiommu_dev = {
+   .name   = "uiommu",
+   .minor  = MISC_DYNAMIC_MINOR,
+   .fops   = &uiommu_fops,
+};
+
+struct uiommu_domain *uiommu_fdget(int fd)
+{
+   struct file *file;
+   struct uiommu_domain *udomain;
+
+   file = fget(fd);
+   if (!file)
+   return ERR_PTR(-EBADF);
+   if (file->f_op != &uiommu_fops) {
+   fput(file);
+   return ERR_PTR(-EINVAL);
+   }
+   udomain = file->private_data;
+   atomic_inc(&udomain->refcnt);
+  

[PATCH 2/4] VFIO V5: additions to include/linux/pci_regs.h

2010-10-28 Thread Tom Lyon

Signed-off-by: Tom Lyon 
---
 include/linux/pci_regs.h |  107 ++
 1 files changed, 98 insertions(+), 9 deletions(-)

diff --git a/include/linux/pci_regs.h b/include/linux/pci_regs.h
index 455b9cc..70addc9 100644
--- a/include/linux/pci_regs.h
+++ b/include/linux/pci_regs.h
@@ -26,6 +26,7 @@
  * Under PCI, each device has 256 bytes of configuration address space,
  * of which the first 64 bytes are standardized as follows:
  */
+#definePCI_STD_HEADER_SIZEOF   64
 #define PCI_VENDOR_ID  0x00/* 16 bits */
 #define PCI_DEVICE_ID  0x02/* 16 bits */
 #define PCI_COMMAND0x04/* 16 bits */
@@ -209,9 +210,12 @@
 #define  PCI_CAP_ID_SHPC   0x0C/* PCI Standard Hot-Plug Controller */
 #define  PCI_CAP_ID_SSVID  0x0D/* Bridge subsystem vendor/device ID */
 #define  PCI_CAP_ID_AGP3   0x0E/* AGP Target PCI-PCI bridge */
+#define  PCI_CAP_ID_SECDEV 0x0F/* Secure Device */
 #define  PCI_CAP_ID_EXP0x10/* PCI Express */
 #define  PCI_CAP_ID_MSIX   0x11/* MSI-X */
+#define  PCI_CAP_ID_SATA   0x12/* SATA Data/Index Conf. */
 #define  PCI_CAP_ID_AF 0x13/* PCI Advanced Features */
+#define  PCI_CAP_ID_MAXPCI_CAP_ID_AF
 #define PCI_CAP_LIST_NEXT  1   /* Next capability in the list */
 #define PCI_CAP_FLAGS  2   /* Capability defined flags (16 bits) */
 #define PCI_CAP_SIZEOF 4
@@ -276,6 +280,7 @@
 #define  PCI_VPD_ADDR_MASK 0x7fff  /* Address mask */
 #define  PCI_VPD_ADDR_F0x8000  /* Write 0, 1 indicates 
completion */
 #define PCI_VPD_DATA   4   /* 32-bits of data returned here */
+#definePCI_CAP_VPD_SIZEOF  8
 
 /* Slot Identification */
 
@@ -297,8 +302,10 @@
 #define PCI_MSI_ADDRESS_HI 8   /* Upper 32 bits (if 
PCI_MSI_FLAGS_64BIT set) */
 #define PCI_MSI_DATA_328   /* 16 bits of data for 32-bit 
devices */
 #define PCI_MSI_MASK_3212  /* Mask bits register for 
32-bit devices */
+#define PCI_MSI_PENDING_32 16  /* Pending intrs for 32-bit devices */
 #define PCI_MSI_DATA_6412  /* 16 bits of data for 64-bit 
devices */
 #define PCI_MSI_MASK_6416  /* Mask bits register for 
64-bit devices */
+#define PCI_MSI_PENDING_64 20  /* Pending intrs for 64-bit devices */
 
 /* MSI-X registers (these are at offset PCI_MSIX_FLAGS) */
 #define PCI_MSIX_FLAGS 2
@@ -306,6 +313,7 @@
 #define  PCI_MSIX_FLAGS_ENABLE (1 << 15)
 #define  PCI_MSIX_FLAGS_MASKALL(1 << 14)
 #define PCI_MSIX_FLAGS_BIRMASK (7 << 0)
+#definePCI_CAP_MSIX_SIZEOF 12  /* size of MSIX registers */
 
 /* CompactPCI Hotswap Register */
 
@@ -328,6 +336,7 @@
 #define  PCI_AF_CTRL_FLR   0x01
 #define PCI_AF_STATUS  5
 #define  PCI_AF_STATUS_TP  0x01
+#definePCI_CAP_AF_SIZEOF   6   /* size of AF registers */
 
 /* PCI-X registers */
 
@@ -364,6 +373,9 @@
 #define  PCI_X_STATUS_SPL_ERR  0x2000  /* Rcvd Split Completion Error 
Msg */
 #define  PCI_X_STATUS_266MHZ   0x4000  /* 266 MHz capable */
 #define  PCI_X_STATUS_533MHZ   0x8000  /* 533 MHz capable */
+#definePCI_X_ECC_CSR   8   /* ECC control and status */
+#define PCI_CAP_PCIX_SIZEOF_V0 8   /* size of registers for Version 0 */
+#define PCI_CAP_PCIX_SIZEOF_V1224  /* size for Version 1 & 2 */
 
 /* PCI Bridge Subsystem ID registers */
 
@@ -451,6 +463,7 @@
 #define  PCI_EXP_LNKSTA_DLLLA  0x2000  /* Data Link Layer Link Active */
 #define  PCI_EXP_LNKSTA_LBMS   0x4000  /* Link Bandwidth Management Status */
 #define  PCI_EXP_LNKSTA_LABS   0x8000  /* Link Autonomous Bandwidth Status */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V1 20  /* v1 endpoints end here */
 #define PCI_EXP_SLTCAP 20  /* Slot Capabilities */
 #define  PCI_EXP_SLTCAP_ABP0x0001 /* Attention Button Present */
 #define  PCI_EXP_SLTCAP_PCP0x0002 /* Power Controller Present */
@@ -498,6 +511,7 @@
 #define  PCI_EXP_DEVCAP2_ARI   0x20/* Alternative Routing-ID */
 #define PCI_EXP_DEVCTL240  /* Device Control 2 */
 #define  PCI_EXP_DEVCTL2_ARI   0x20/* Alternative Routing-ID */
+#define PCI_CAP_EXP_ENDPOINT_SIZEOF_V2 44  /* v2 endpoints end here */
 #define PCI_EXP_LNKCTL248  /* Link Control 2 */
 #define PCI_EXP_SLTCTL256  /* Slot Control 2 */
 
@@ -506,20 +520,42 @@
 #define PCI_EXT_CAP_VER(header)((header >> 16) & 0xf)
 #define PCI_EXT_CAP_NEXT(header)   ((header >> 20) & 0xffc)
 
-#define PCI_EXT_CAP_ID_ERR 1
-#define PCI_EXT_CAP_ID_VC  2
-#define PCI_EXT_CAP_ID_DSN 3
-#define PCI_EXT_CAP_ID_PWR 4
-#define PCI_EXT_CAP_ID_VNDR11
-#define PCI_EXT_CAP_ID_ACS 13
-#define PCI_EXT_CAP_ID_ARI 14
-#define PCI_EXT_CAP_ID_ATS

Re: [PATCH 0/5] Fixes, non-PCI-2.3 support, EOI enhancements

2010-11-01 Thread Tom Lyon
I've applied all your patches. Thanks!

On Saturday, October 30, 2010 09:58:55 am Alex Williamson wrote:
> Hi Tom,
> 
> I've updated some patches I've been working on to v5 and wanted to
> see what you think.  I also found a couple minor bugs, fixed in this
> series.
> 
> The main idea is that since the VFIO interface defines that the INTx
> interrupt is disabled when it fires, we should provide an interface
> to re-enable it.  We currently have to do a read-modify-write to PCI
> config space, requring two ioctls.  I introduce an ioctl to do this
> for us.  An important aspect of this is that now we can support
> non-PCI-2.3 devices since re-enabling the interrupt is abstracted
> (with the same caveat as current KVM device assignment, that they
> must get an exclusive interrupt).  The real trick though is that we
> can then also create an irqfd-like mechanism to trigger re-enabling
> interrupts from and eventfd.  This allows qemu to do all the setup
> for connecting interrupts from VFIO into KVM and EOIs from KVM to
> VFIO, then lets userspace get completely out of the way.
> 
> Hope you like it.  Thanks,
> 
> Alex
> ---
> 
> Alex Williamson (5):
>   vfio: Add a new ioctl to support EOI via eventfd
>   vfio: Add support for non-PCI 2.3 compliant devices
>   vfio: Add ioctl to re-enable interrupts
>   vfio: Fix requested regions
>   vfio: Fix the ROM mask
> 
> 
>  drivers/vfio/vfio_intrs.c  |  254
> +--- drivers/vfio/vfio_main.c   | 
>  37 +-
>  drivers/vfio/vfio_pci_config.c |2
>  drivers/vfio/vfio_rdwr.c   |4 -
>  include/linux/vfio.h   |   14 ++
>  5 files changed, 279 insertions(+), 32 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio: Extended capability fixes

2010-11-02 Thread Tom Lyon
Applied. Thanks!

On Monday, November 01, 2010 10:08:35 pm Alex Williamson wrote:
> - Virtual channel position gets truncated as a u8
>  - Print the ecap that's unknown, not the last cap we saw
>  - Print actual config offset, which provides enough info to make
>some sense of the error.
> 
> Signed-off-by: Alex Williamson 
> ---
> 
>  drivers/vfio/vfio_pci_config.c |8 
>  1 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_pci_config.c
> b/drivers/vfio/vfio_pci_config.c index 8af995d..8304316 100644
> --- a/drivers/vfio/vfio_pci_config.c
> +++ b/drivers/vfio/vfio_pci_config.c
> @@ -410,7 +410,7 @@ static int vfio_msi_cap_len(struct vfio_dev *vdev, u8
> pos) * Determine extended capability length for VC (2 & 9) and
>   * MFVC capabilities
>   */
> -static int vfio_vc_cap_len(struct vfio_dev *vdev, u8 pos)
> +static int vfio_vc_cap_len(struct vfio_dev *vdev, u16 pos)
>  {
>   struct pci_dev *pdev = vdev->pdev;
>   u32 dw;
> @@ -580,7 +580,7 @@ int vfio_build_config_map(struct vfio_dev *vdev)
>   printk(KERN_WARNING
>   "%s: pci config conflict at %x, "
>   "caps %x %x\n",
> - __func__, i, map[pos+i], cap);
> + __func__, pos+i, map[pos+i], cap);
>   map[pos+i] = cap;
>   }
>   ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
> @@ -683,7 +683,7 @@ int vfio_build_config_map(struct vfio_dev *vdev)
>   if (len == 0 || len == 0xFF) {
>   printk(KERN_WARNING
>   "%s: unknown length for pci ext cap %x\n",
> - __func__, cap);
> + __func__, ecap);
>   len = PCI_CAP_SIZEOF;
>   }
>   for (i = 0; i < len; i++) {
> @@ -691,7 +691,7 @@ int vfio_build_config_map(struct vfio_dev *vdev)
>   printk(KERN_WARNING
>   "%s: pci config conflict at %x, "
>   "caps %x %x\n",
> - __func__, i, map[epos+i], ecap);
> + __func__, epos+i, map[epos+i], ecap);
>   map[epos+i] = ecap;
>   }
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio: Extended capability fixes

2010-11-02 Thread Tom Lyon
On Tuesday, November 02, 2010 12:11:08 pm Michael S. Tsirkin wrote:
> On Mon, Nov 01, 2010 at 11:08:35PM -0600, Alex Williamson wrote:
> > - Virtual channel position gets truncated as a u8
> > 
> >  - Print the ecap that's unknown, not the last cap we saw
> >  - Print actual config offset, which provides enough info to make
> >  
> >some sense of the error.
> > 
> > Signed-off-by: Alex Williamson 
> > ---
> > 
> >  drivers/vfio/vfio_pci_config.c |8 
> >  1 files changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_pci_config.c
> > b/drivers/vfio/vfio_pci_config.c index 8af995d..8304316 100644
> > --- a/drivers/vfio/vfio_pci_config.c
> > +++ b/drivers/vfio/vfio_pci_config.c
> > @@ -410,7 +410,7 @@ static int vfio_msi_cap_len(struct vfio_dev *vdev, u8
> > pos)
> > 
> >   * Determine extended capability length for VC (2 & 9) and
> >   * MFVC capabilities
> >   */
> > 
> > -static int vfio_vc_cap_len(struct vfio_dev *vdev, u8 pos)
> > +static int vfio_vc_cap_len(struct vfio_dev *vdev, u16 pos)
> > 
> >  {
> >  
> > struct pci_dev *pdev = vdev->pdev;
> > u32 dw;
> > 
> > @@ -580,7 +580,7 @@ int vfio_build_config_map(struct vfio_dev *vdev)
> > 
> > printk(KERN_WARNING
> > 
> > "%s: pci config conflict at %x, "
> > "caps %x %x\n",
> > 
> > -   __func__, i, map[pos+i], cap);
> > +   __func__, pos+i, map[pos+i], cap);
> > 
> > map[pos+i] = cap;
> > 
> > }
> > ret = pci_read_config_byte(pdev, pos + PCI_CAP_LIST_NEXT, &pos);
> > 
> > @@ -683,7 +683,7 @@ int vfio_build_config_map(struct vfio_dev *vdev)
> > 
> > if (len == 0 || len == 0xFF) {
> > 
> > printk(KERN_WARNING
> > 
> > "%s: unknown length for pci ext cap %x\n",
> > 
> > -   __func__, cap);
> > +   __func__, ecap);
> > 
> > len = PCI_CAP_SIZEOF;
> > 
> > }
> > for (i = 0; i < len; i++) {
> > 
> > @@ -691,7 +691,7 @@ int vfio_build_config_map(struct vfio_dev *vdev)
> > 
> > printk(KERN_WARNING
> > 
> > "%s: pci config conflict at %x, "
> > "caps %x %x\n",
> > 
> > -   __func__, i, map[epos+i], ecap);
> > +   __func__, epos+i, map[epos+i], ecap);
> > 
> > map[epos+i] = ecap;
> 
> Not related to this patch, but I am surprised checkpatch does not
> complain about lack of spaces around + here and elsewhere.
> Or does it?
It did not complain.

> 
> > }
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel"
> > in the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vfio: Fix PCI 2.3 shared interrupt

2010-11-03 Thread Tom Lyon
Applied.

On Wednesday, November 03, 2010 01:17:33 pm Alex Williamson wrote:
> Trying to be too clever with setting the irq_disabled flag.  PCI 2.3
> disabled devices can still share IRQs, which can lead to clearing
> the irq_disabled flag, preventing the EOI from registering, and leaving
> the device without interrupts.  Interrupt handler should only ever
> set irq_disabled and we can exit earlier to avoid the config space
> access if we know we're disabled.
> 
> Signed-off-by: Alex Williamson 
> ---
> 
>  drivers/vfio/vfio_intrs.c |9 -
>  1 files changed, 8 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_intrs.c b/drivers/vfio/vfio_intrs.c
> index 604082c..73e3deb 100644
> --- a/drivers/vfio/vfio_intrs.c
> +++ b/drivers/vfio/vfio_intrs.c
> @@ -57,6 +57,12 @@ irqreturn_t vfio_interrupt(int irq, void *dev_id)
> 
>   spin_lock_irq(&vdev->irqlock);
> 
> + /* INTX disabled interrupts can still be shared */
> + if (vdev->irq_disabled) {
> + spin_unlock_irq(&vdev->irqlock);
> + return ret;
> + }
> +
>   if (vdev->pci_2_3) {
>   pci_block_user_cfg_access(pdev);
> 
> @@ -87,7 +93,8 @@ done:
>   ret = IRQ_HANDLED;
>   }
> 
> - vdev->irq_disabled = (ret == IRQ_HANDLED);
> + if (ret == IRQ_HANDLED)
> + vdev->irq_disabled = true;
> 
>   spin_unlock_irq(&vdev->irqlock);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/2] vfio: virtualize INTX_DISABLE

2010-11-09 Thread Tom Lyon
Alex - I am rejecting these 2 patches.

For patch 1/2, I started with yours and found a couple of problems, but then I 
got into the spirit and did a buinch more cleaning up. My patch to follow.

For patch 2/2, the INTX stuff, I don't really see the problem.  If the user 
turns on the bit, it'll result in at most one more interrupt, right?  If he 
turns off the bit, then he doesn't want interrupts.

On Friday, November 05, 2010 10:30:15 am Alex Williamson wrote:
> Tom,
> 
> Since we use INTX_DISABLE internally for PCI 2.3 devices, it's probably
> not a good idea to allow vfio users direct access to it.  In trying to
> virtualize it, I stumbled on some config space virtualization issues.
> I think we want vconfig and the value written to hardware to be different
> when there are virtualized bits, but we lump them together.  I think this
> is why I've never been able to rely on the memory enable bit of config
> space through vfio.  With this first patch, all virtualized bits that
> are writable are sticky in vconfig.  Non-writable bits come from either
> the old value or the physical hardware depending on whether it's
> virtualized.  This means we can get rid of a lot of duplicated setting
> and reading of vconfig, and only add code for more complicated
> behaviors.  This is tricky code, to please double check that I'm not
> doing something stupid.  Thanks,
> 
> Alex
> 
> ---
> 
> Alex Williamson (2):
>   vfio: Virtualize PCI_COMMAND_INTX_DISABLE
>   vfio: Fix config space virtualization
> 
> 
>  drivers/vfio/vfio_intrs.c  |   38 
>  drivers/vfio/vfio_main.c   |   14 
>  drivers/vfio/vfio_pci_config.c |  131
>  include/linux/vfio.h   | 
>   4 +
>  4 files changed, 84 insertions(+), 103 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] vfio: fix config virtualization, esp command byte

2010-11-09 Thread Tom Lyon
Cleans up config space virtualization, especialy handling of bytes
which have some virtual and some real bits, like PCI_COMMAND.

Alex, I hope you can test this with your setups.

Signed-off-by: Tom Lyon 
---
 drivers/vfio/vfio_pci_config.c |  166 +---
 1 files changed, 53 insertions(+), 113 deletions(-)

diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
index 8304316..7132ac4 100644
--- a/drivers/vfio/vfio_pci_config.c
+++ b/drivers/vfio/vfio_pci_config.c
@@ -745,6 +745,8 @@ static int vfio_virt_init(struct vfio_dev *vdev)
  */
 static void vfio_bar_restore(struct vfio_dev *vdev)
 {
+   if (vdev->pdev->is_virtfn)
+   return;
printk(KERN_WARNING "%s: reset recovery - restoring bars\n", __func__);
 
 #define do_bar(off, which) \
@@ -815,26 +817,15 @@ static inline int vfio_read_config_byte(struct vfio_dev 
*vdev,
 static inline int vfio_write_config_byte(struct vfio_dev *vdev,
int pos, u8 val)
 {
-   vdev->vconfig[pos] = val;
return pci_user_write_config_byte(vdev->pdev, pos, val);
 }
 
 /* handle virtualized fields in the basic config space */
-static u8 vfio_virt_basic(struct vfio_dev *vdev, int write,
-   u16 pos, u16 off, u8 val, u8 newval)
+static void vfio_virt_basic(struct vfio_dev *vdev, int write, u16 pos, u8 *rbp)
 {
-   switch (off) {
-   /*
-* vendor and device are virt because they don't
-* show up otherwise for sr-iov vfs
-*/
-   case PCI_VENDOR_ID:
-   case PCI_VENDOR_ID + 1:
-   case PCI_DEVICE_ID:
-   case PCI_DEVICE_ID + 1:
-   /* read only */
-   val = vdev->vconfig[pos];
-   break;
+   u8 val;
+
+   switch (pos) {
case PCI_COMMAND:
/*
 * If the real mem or IO enable bits are zero
@@ -842,100 +833,58 @@ static u8 vfio_virt_basic(struct vfio_dev *vdev, int 
write,
 * Restore the real BARs before allowing those
 * bits to re-enable
 */
+   val = vdev->vconfig[pos];
if (vdev->pdev->is_virtfn)
val |= PCI_COMMAND_MEMORY;
if (write) {
-   int upd = 0;
-
-   upd = (newval & PCI_COMMAND_MEMORY) >
- (val & PCI_COMMAND_MEMORY);
-   upd += (newval & PCI_COMMAND_IO) >
-  (val & PCI_COMMAND_IO);
-   if (upd)
-   vfio_bar_restore(vdev);
-   vfio_write_config_byte(vdev, pos, newval);
-   }
-   break;
-   case PCI_BASE_ADDRESS_0:
-   case PCI_BASE_ADDRESS_0+1:
-   case PCI_BASE_ADDRESS_0+2:
-   case PCI_BASE_ADDRESS_0+3:
-   case PCI_BASE_ADDRESS_1:
-   case PCI_BASE_ADDRESS_1+1:
-   case PCI_BASE_ADDRESS_1+2:
-   case PCI_BASE_ADDRESS_1+3:
-   case PCI_BASE_ADDRESS_2:
-   case PCI_BASE_ADDRESS_2+1:
-   case PCI_BASE_ADDRESS_2+2:
-   case PCI_BASE_ADDRESS_2+3:
-   case PCI_BASE_ADDRESS_3:
-   case PCI_BASE_ADDRESS_3+1:
-   case PCI_BASE_ADDRESS_3+2:
-   case PCI_BASE_ADDRESS_3+3:
-   case PCI_BASE_ADDRESS_4:
-   case PCI_BASE_ADDRESS_4+1:
-   case PCI_BASE_ADDRESS_4+2:
-   case PCI_BASE_ADDRESS_4+3:
-   case PCI_BASE_ADDRESS_5:
-   case PCI_BASE_ADDRESS_5+1:
-   case PCI_BASE_ADDRESS_5+2:
-   case PCI_BASE_ADDRESS_5+3:
-   case PCI_ROM_ADDRESS:
-   case PCI_ROM_ADDRESS+1:
-   case PCI_ROM_ADDRESS+2:
-   case PCI_ROM_ADDRESS+3:
-   if (write) {
-   vdev->vconfig[pos] = newval;
-   vdev->bardirty = 1;
-   } else {
-   if (vdev->bardirty)
-   vfio_bar_fixup(vdev);
-   val = vdev->vconfig[pos];
+
+   if (((val & PCI_COMMAND_MEMORY) >
+   (*rbp & PCI_COMMAND_MEMORY)) ||
+   ((val & PCI_COMMAND_IO) >
+   (*rbp & PCI_COMMAND_IO)))
+   vfio_bar_restore(vdev);
+   *rbp &= ~(PCI_COMMAND_MEMORY + PCI_COMMAND_IO);
+   *rbp |= val & (PCI_COMMAND_MEMORY + PCI_COMMAND_IO);
}
+   vdev->vconfig[pos] = val;
break;
-   default:
+   case PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5 + 3:
+   case PCI_ROM_ADDRESS ... PCI_ROM_ADDRESS + 3:
if (write)
-   vdev->vconfig[pos] = newval;
-   else
-   val = vdev->vconfig[pos];
+   vdev->bardirt

Re: [PATCH] vfio: fix config virtualization, esp command byte

2010-11-16 Thread Tom Lyon
Applied.

On Tuesday, November 16, 2010 10:57:27 am Alex Williamson wrote:
> On Tue, 2010-11-16 at 10:54 -0700, Alex Williamson wrote:
> > On Tue, 2010-11-09 at 17:09 -0800, Tom Lyon wrote:
> > > Cleans up config space virtualization, especialy handling of bytes
> > > which have some virtual and some real bits, like PCI_COMMAND.
> > > 
> > > Alex, I hope you can test this with your setups.
> > 
> > Sorry for the delay.  FWIW, I'm not having much luck with this, I'll try
> > to debug the problem.  Thanks,
> 
> This seems to be the bug.  Thanks,
> 
> Alex
> 
> vfio: Don't write random bits on read
> 
> Signed-off-by: Alex Williamson 
> ---
> 
> diff --git a/drivers/vfio/vfio_pci_config.c
> b/drivers/vfio/vfio_pci_config.c index 7132ac4..422d7b1 100644
> --- a/drivers/vfio/vfio_pci_config.c
> +++ b/drivers/vfio/vfio_pci_config.c
> @@ -964,11 +964,6 @@ static int vfio_config_rwbyte(int write,
>   return 0;
>   }
> 
> - if (write) {
> - if (copy_from_user(&newval, buf, 1))
> - return -EFAULT;
> - }
> -
>   if (~virt) {/* mix of real and virt bits */
>   /* update vconfig with latest hw bits */
>   ret = vfio_read_config_byte(vdev, pos, &realbits);
> @@ -978,9 +973,14 @@ static int vfio_config_rwbyte(int write,
>   (vdev->vconfig[pos] & virt) | (realbits & ~virt);
>   }
> 
> - /* update vconfig with writeable bits */
> - vdev->vconfig[pos] =
> - (vdev->vconfig[pos] & ~wr) | (newval & wr);
> + if (write) {
> + if (copy_from_user(&newval, buf, 1))
> + return -EFAULT;
> +
> + /* update vconfig with writeable bits */
> + vdev->vconfig[pos] =
> + (vdev->vconfig[pos] & ~wr) | (newval & wr);
> +}
> 
>   /*
>* Now massage virtual fields
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[ANNOUNCE] VFIO V6 & public VFIO repositories

2010-11-22 Thread Tom Lyon
VFIO "driver" development has moved to a publicly accessible respository
on github:

git://github.com/pugs/vfio-linux-2.6.git

This is a clone of the Linux-2.6 tree with all VFIO changes on the vfio 
branch (which is the default). There is a tag 'vfio-v6' marking the latest
"release" of VFIO.

In addition, I am open-sourcing my user level code which uses VFIO.
It is a simple UDP/IP/Ethernet stack supporting 3 different VFIO based
hardware drivers. This code is available at:

git://github.com/pugs/vfio-user-level-drivers.git
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers

2010-06-30 Thread Tom Lyon
Thanks, Alex!
Am incorporating...

On Tuesday 29 June 2010 11:14:12 pm Alex Williamson wrote:
> On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > The VFIO "driver" is used to allow privileged AND non-privileged processes 
> > to 
> > implement user-level device drivers for any well-behaved PCI, PCI-X, and 
> > PCIe
> > devices.
> 
> Hi Tom,
> 
> I found a few bugs.  Patch below.  The first chunk clears the
> pci_config_map on close, otherwise we end up passing virtualized state
> from one user to the next.  The second is an off by one in the basic
> perms.  Finally, vfio_bar_fixup() needs an overhaul.  It wasn't setting
> the lower bits right and is allowing virtual writes of bits that aren't
> aligned to the size.  This section probably needs another pass or two of
> refinement.  Thanks,
> 
> Alex
> 
> Signed-off-by: Alex Williamson 
> ---
> 
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> index 96639e5..a0e8227 100644
> --- a/drivers/vfio/vfio_main.c
> +++ b/drivers/vfio/vfio_main.c
> @@ -129,6 +129,10 @@ static int vfio_release(struct inode *inode, struct file 
> *filep)
>   eventfd_ctx_put(vdev->ev_msi);
>   vdev->ev_irq = NULL;
>   }
> + if (vdev->pci_config_map) {
> + kfree(vdev->pci_config_map);
> + vdev->pci_config_map = NULL;
> + }
>   vfio_domain_unset(vdev);
>   /* reset to known state if we can */
>   (void) pci_reset_function(vdev->pdev);
> diff --git a/drivers/vfio/vfio_pci_config.c b/drivers/vfio/vfio_pci_config.c
> index c821b5d..f6e26b1 100644
> --- a/drivers/vfio/vfio_pci_config.c
> +++ b/drivers/vfio/vfio_pci_config.c
> @@ -79,18 +79,18 @@ struct perm_bits {
>  static struct perm_bits pci_cap_basic_perm[] = {
>   { 0x,   0, },   /* 0x00 vendor & device id - RO */
>   { 0,0xFFFC, },  /* 0x04 cmd & status except mem/io */
> - { 0,0xFF00, },  /* 0x08 bist, htype, lat, cache */
> - { 0x,   0x, },  /* 0x0c bar */
> + { 0,0, },   /* 0x08 class code & revision id */
> + { 0,0xFF00, },  /* 0x0c bist, htype, lat, cache */
>   { 0x,   0x, },  /* 0x10 bar */
>   { 0x,   0x, },  /* 0x14 bar */
>   { 0x,   0x, },  /* 0x18 bar */
>   { 0x,   0x, },  /* 0x1c bar */
>   { 0x,   0x, },  /* 0x20 bar */
> - { 0,0, },   /* 0x24 cardbus - not yet */
> - { 0,0, },   /* 0x28 subsys vendor & dev */
> - { 0x,   0x, },  /* 0x2c rom bar */
> - { 0,0, },   /* 0x30 capability ptr & resv */
> - { 0,0, },   /* 0x34 resv */
> + { 0x,   0x, },  /* 0x24 bar */
> + { 0,0, },   /* 0x28 cardbus - not yet */
> + { 0,0, },   /* 0x2c subsys vendor & dev */
> + { 0x,   0x, },  /* 0x30 rom bar */
> + { 0,0, },   /* 0x34 capability ptr & resv */
>   { 0,0, },   /* 0x38 resv */
>   { 0x00FF,   0x00FF, },  /* 0x3c max_lat ... irq */
>  };
> @@ -318,30 +318,55 @@ static void vfio_virt_init(struct vfio_dev *vdev)
>  static void vfio_bar_fixup(struct vfio_dev *vdev)
>  {
>   struct pci_dev *pdev = vdev->pdev;
> - int bar;
> - u32 *lp;
> - u32 len;
> + int bar, mem64 = 0;
> + u32 *lp = NULL;
> + u64 len = 0;
>  
>   for (bar = 0; bar <= 5; bar++) {
> - len = pci_resource_len(pdev, bar);
> - lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> - if (len == 0) {
> - *lp = 0;
> - } else if (pci_resource_flags(pdev, bar) & IORESOURCE_MEM) {
> - *lp &= ~0x1;
> - *lp = (*lp & ~(len-1)) |
> - (*lp & ~PCI_BASE_ADDRESS_MEM_MASK);
> - if (*lp & PCI_BASE_ADDRESS_MEM_TYPE_64)
> - bar++;
> - } else if (pci_resource_flags(pdev, bar) & IORESOURCE_IO) {
> + if (!mem64) {
> + len = pci_resource_len(pdev, bar);
> + lp = (u32 *)&vdev->vinfo.bar[bar * 4];
> + if (len == 0) {
> + *lp = 0;
> + continue;
> + }
> +
> + 

Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers

2010-06-30 Thread Tom Lyon
On Wednesday 30 June 2010 03:32:56 pm Michael S. Tsirkin wrote:
> On Wed, Jun 30, 2010 at 03:17:55PM -0700, Tom Lyon wrote:
> > Thanks, Alex!
> > Am incorporating...
> 
> I get it there's no chance you'll drop the "virtualization"
> from the driver then?
> 
I think it'll get a whole lot simpler by depending on iommu interrupt remapping,
but  I thinke some of it is still useful.

I am now stuck trying to get access to a system with interrupt remapping.
Turns out my Intel IOMMU doesn't have it.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] 2.6.34: simple IOMMU API extension to check safe interrupt remapping

2010-06-30 Thread Tom Lyon
This patch allows IOMMU users to determine whether the hardware and software
support safe, isolated interrupt remapping.  Not all Intel IOMMUs have the
hardware, and the software for AMD is not there yet.
Signed-off-by: Tom Lyon 
---

MST has convinced me that any user level driver for PCI master devices can't 
be safe unless there is an IOMMU protecting the APIC MSI/MSI-X interrupt 
addresses from device writes.  This interrupt remapping is not present in all
Intel IOMMUs and the code for the interrupt mapping in the AMD IOMMUs is not
implemented yet.

Needed by not-yet-accepted VFIO driver.

diff -uprN linux-2.6.34/drivers/pci/intel-iommu.c 
iommuapi-linux-2.6.34/drivers/pci/intel-iommu.c
--- linux-2.6.34/drivers/pci/intel-iommu.c  2010-05-16 14:17:36.0 
-0700
+++ iommuapi-linux-2.6.34/drivers/pci/intel-iommu.c 2010-06-30 
15:47:10.0 -0700
@@ -3705,6 +3705,10 @@ static int intel_iommu_domain_has_cap(st
 
if (cap == IOMMU_CAP_CACHE_COHERENCY)
return dmar_domain->iommu_snooping;
+#ifdef CONFIG_INTR_REMAP
+   if (cap == IOMMU_CAP_SAFE_INTR_REMAP)
+   return intr_remapping_enabled;
+#endif
 
return 0;
 }
diff -uprN linux-2.6.34/include/linux/iommu.h 
iommuapi-linux-2.6.34/include/linux/iommu.h
--- linux-2.6.34/include/linux/iommu.h  2010-05-16 14:17:36.0 -0700
+++ iommuapi-linux-2.6.34/include/linux/iommu.h 2010-06-30 15:47:34.0 
-0700
@@ -30,6 +30,7 @@ struct iommu_domain {
 };
 
 #define IOMMU_CAP_CACHE_COHERENCY  0x1
+#define IOMMU_CAP_SAFE_INTR_REMAP  0x2 /* isolates device intrs */
 
 struct iommu_ops {
int (*domain_init)(struct iommu_domain *domain);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers

2010-06-30 Thread Tom Lyon
On Wednesday 30 June 2010 09:16:23 pm Alex Williamson wrote:
> On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > +int vfio_dma_unmap_dm(struct vfio_listener *listener, struct vfio_dma_map 
> > *dmp)
> > +{
> > +   unsigned long start, npage;
> > +   struct dma_map_page *mlp;
> > +   struct list_head *pos, *pos2;
> > +   int ret;
> > +
> > +   start = dmp->vaddr & ~PAGE_SIZE;
> > +   npage = dmp->size >> PAGE_SHIFT;
> > +
> > +   ret = -ENXIO;
> > +   mutex_lock(&listener->vdev->dgate);
> > +   list_for_each_safe(pos, pos2, &listener->dm_list) {
> > +   mlp = list_entry(pos, struct dma_map_page, list);
> > +   if (dmp->vaddr != mlp->vaddr || mlp->npage != npage)
> > +   continue;
> > +   ret = 0;
> > +   vfio_dma_unmap(listener, mlp);
> > +   break;
> > +   }
> 
> Hi Tom,
> 
> Shouldn't we be matching the mlp based on daddr instead of vaddr?  We
> can have multiple dma address pointing at the same virtual address, so
> dma address is the unique element.  I'm also nervous about this dm_list.
> For qemu device assignment, we're potentially statically mapping many GB
> of iova space.  It seems like this could get incredibly bloated and
> slow.  Thanks,
> 
> Alex

In weird circumstances, differing user vaddrs could reolve to the same physical 
address,
so the uniqueness of any mapping is the .

Yes, a linear list is slow, but does qemu need a lot of mappings, or just big 
ones?
> 
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] VFIO driver: Non-privileged user level PCI drivers

2010-07-01 Thread Tom Lyon
On Thursday 01 July 2010 08:48:41 am Alex Williamson wrote:
> On Thu, 2010-07-01 at 18:31 +0300, Michael S. Tsirkin wrote:
> > On Thu, Jul 01, 2010 at 09:29:04AM -0600, Alex Williamson wrote:
> > > On Tue, 2010-06-08 at 14:21 -0700, Tom Lyon wrote:
> > > > +The VFIO_DMA_MASK ioctl is used to set the maximum permissible DMA 
> > > > address
> > > > +(device dependent). It takes a single unsigned 64 bit integer as an 
> > > > argument.
> > > > +This call also has the side effect of enabling PCI bus mastership.
> > > 
> > > Hi Tom,
> > > 
> > > This interface doesn't make sense for the MAP_IOVA user.  Especially in
> > > qemu, we have no idea what the DMA mask is for the device we're
> > > assigning.  It doesn't really matter though because the guest will use
> > > bounce buffers internally once it loads the device specific drivers and
> > > discovers the DMA mask.  This only seems relevant if we're using a
> > > DMA_MAP call that gets to pick the dmaaddr, so I'd propose we only make
> > > this a required call for that interface, and create a separate ioctl for
> > > actually enabling bus master.  Thanks,
> > > 
> > > Alex
> > 
> > I expect there's no need for a separate ioctl to do this:
> > you can do this by write to the control register.
> 
> Nope, vfio only allows direct writes to the memory and io space bits of
> the command register, all other bits are virtualized.  I wonder if
> that's necessary though since we require the device to be attached to an
> iommu domain before we allow config space access.
> 
I had already gotten rid of the mask setting & master mode ioctl. It was a 
remnant
of using the dma_map_sg API which is no longer in there.  And I tweaked the
config stuff to allow writing the master bit.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V2] 2.6.34: simple IOMMU API extension to check safe interrupt remapping

2010-07-01 Thread Tom Lyon
This patch allows IOMMU users to determine whether the hardware and software
support safe, isolated interrupt remapping.  Not all Intel IOMMUs have the
hardware, and the software for AMD is not there yet.
Signed-off-by: Tom Lyon 
---

Version 2: previous ifdefs not needed.

MST has convinced me that any user level driver for PCI master devices can't 
be safe unless there is an IOMMU protecting the APIC MSI/MSI-X interrupt 
addresses from device writes.  This interrupt remapping is not present in all
Intel IOMMUs and the code for the interrupt mapping in the AMD IOMMUs is not
implemented yet.

Needed by not-yet-accepted VFIO driver.

diff -uprN linux-2.6.34/drivers/pci/intel-iommu.c 
iommuapi-linux-2.6.34/drivers/pci/intel-iommu.c
--- linux-2.6.34/drivers/pci/intel-iommu.c  2010-05-16 14:17:36.0 
-0700
+++ iommuapi-linux-2.6.34/drivers/pci/intel-iommu.c 2010-06-30 
15:47:10.0 -0700
@@ -3705,6 +3705,10 @@ static int intel_iommu_domain_has_cap(st
 
if (cap == IOMMU_CAP_CACHE_COHERENCY)
return dmar_domain->iommu_snooping;
+   if (cap == IOMMU_CAP_SAFE_INTR_REMAP)
+   return intr_remapping_enabled;
 
return 0;
 }
diff -uprN linux-2.6.34/include/linux/iommu.h 
iommuapi-linux-2.6.34/include/linux/iommu.h
--- linux-2.6.34/include/linux/iommu.h  2010-05-16 14:17:36.0 -0700
+++ iommuapi-linux-2.6.34/include/linux/iommu.h 2010-06-30 15:47:34.0 
-0700
@@ -30,6 +30,7 @@ struct iommu_domain {
 };
 
 #define IOMMU_CAP_CACHE_COHERENCY  0x1
+#define IOMMU_CAP_SAFE_INTR_REMAP  0x2 /* isolates device intrs */
 
 struct iommu_ops {
int (*domain_init)(struct iommu_domain *domain);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V2] 2.6.34: simple IOMMU API extension to check safe interrupt remapping

2010-07-02 Thread Tom Lyon
On Friday 02 July 2010 02:26:46 am Roedel, Joerg wrote:
> On Thu, Jul 01, 2010 at 05:24:32PM -0400, Tom Lyon wrote:
> > This patch allows IOMMU users to determine whether the hardware and software
> > support safe, isolated interrupt remapping.  Not all Intel IOMMUs have the
> > hardware, and the software for AMD is not there yet.
> > Signed-off-by: Tom Lyon 
> 
> It does not make a lot of sense to check for this feature in the
> IOMMU-API currently because there is no support for intr-remapping in
> there. But it will be there when intr-remapping support for AMD IOMMU is
> implemented. So this change can be considered as a first step in that
> direction. Please repost with the change requested below an I'll add
> this one to my tree.
OK, but I'm not sure what you mean by the first sentence.  The Intel stuff
today does intr remapping as a side effect of X2APIC support.

> 
> > Version 2: previous ifdefs not needed.
> > 
> > MST has convinced me that any user level driver for PCI master devices 
> > can't 
> > be safe unless there is an IOMMU protecting the APIC MSI/MSI-X interrupt 
> > addresses from device writes.  This interrupt remapping is not present in 
> > all
> > Intel IOMMUs and the code for the interrupt mapping in the AMD IOMMUs is not
> > implemented yet.
> > 
> > Needed by not-yet-accepted VFIO driver.
> > 
> > diff -uprN linux-2.6.34/drivers/pci/intel-iommu.c 
> > iommuapi-linux-2.6.34/drivers/pci/intel-iommu.c
> > --- linux-2.6.34/drivers/pci/intel-iommu.c  2010-05-16 14:17:36.0 
> > -0700
> > +++ iommuapi-linux-2.6.34/drivers/pci/intel-iommu.c 2010-06-30 
> > 15:47:10.0 -0700
> > @@ -3705,6 +3705,10 @@ static int intel_iommu_domain_has_cap(st
> >  
> > if (cap == IOMMU_CAP_CACHE_COHERENCY)
> > return dmar_domain->iommu_snooping;
> > +   if (cap == IOMMU_CAP_SAFE_INTR_REMAP)
> > +   return intr_remapping_enabled;
> >  
> > return 0;
> >  }
> > diff -uprN linux-2.6.34/include/linux/iommu.h 
> > iommuapi-linux-2.6.34/include/linux/iommu.h
> > --- linux-2.6.34/include/linux/iommu.h  2010-05-16 14:17:36.0 
> > -0700
> > +++ iommuapi-linux-2.6.34/include/linux/iommu.h 2010-06-30 
> > 15:47:34.0 -0700
> > @@ -30,6 +30,7 @@ struct iommu_domain {
> >  };
> >  
> >  #define IOMMU_CAP_CACHE_COHERENCY  0x1
> > +#define IOMMU_CAP_SAFE_INTR_REMAP  0x2 /* isolates device intrs */
> 
> I think the SAFE_ is not necessary in the name. It is misleading because
> it indicates that there is an unsafe variant of intr-remapping available
> when this capability is not set. Just call it IOMMU_CAP_INTR_REMAPPING.
> 
> 
>   Joerg
> 


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH V3] 2.6.34: simple IOMMU API extension to check safe interrupt remapping

2010-07-02 Thread Tom Lyon
This patch allows IOMMU users to determine whether the hardware and software
support safe, isolated interrupt remapping.  Not all Intel IOMMUs have the
hardware, and the software for AMD is not there yet.
Signed-off-by: Tom Lyon 
---
Version 3: shorter name requested by Joerg.
Version 2: previous ifdefs not needed.

MST has convinced me that any user level driver for PCI master devices can't 
be safe unless there is an IOMMU protecting the APIC MSI/MSI-X interrupt 
addresses from device writes.  This interrupt remapping is not present in all
Intel IOMMUs and the code for the interrupt mapping in the AMD IOMMUs is not
implemented yet.

Needed by not-yet-accepted VFIO driver.

diff -uprN linux-2.6.34/drivers/pci/intel-iommu.c 
iommuapi-linux-2.6.34/drivers/pci/intel-iommu.c
--- linux-2.6.34/drivers/pci/intel-iommu.c  2010-05-16 14:17:36.0 
-0700
+++ iommuapi-linux-2.6.34/drivers/pci/intel-iommu.c 2010-07-02 
12:39:19.0 -0700
@@ -3705,6 +3705,8 @@ static int intel_iommu_domain_has_cap(st
 
if (cap == IOMMU_CAP_CACHE_COHERENCY)
return dmar_domain->iommu_snooping;
+   if (cap == IOMMU_CAP_INTR_REMAP)
+   return intr_remapping_enabled;
 
return 0;
 }
diff -uprN linux-2.6.34/include/linux/iommu.h 
iommuapi-linux-2.6.34/include/linux/iommu.h
--- linux-2.6.34/include/linux/iommu.h  2010-05-16 14:17:36.0 -0700
+++ iommuapi-linux-2.6.34/include/linux/iommu.h 2010-07-02 12:38:45.0 
-0700
@@ -30,6 +30,7 @@ struct iommu_domain {
 };
 
 #define IOMMU_CAP_CACHE_COHERENCY  0x1
+#define IOMMU_CAP_INTR_REMAP   0x2 /* isolates device intrs */
 
 struct iommu_ops {
int (*domain_init)(struct iommu_domain *domain);
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

2010-07-27 Thread Tom Lyon
[ Sorry for the long hiatus, I've been wrapped up in other issues.]

I think the fundamental issue to resolve is to decide on the model which the 
VFIO driver presents to its users.

Fundamentally, VFIO as part of the OS must protect the system from its users 
and also protect the users from each other.  No disagreement here.

But another fundamental purpose of an OS to to present an abstract model of 
the underlying hardware to its users, so that the users don't have to deal 
with the full complexity of the hardware.

So I think VFIO should present a 'virtual', abstracted PCI device to its users 
whereas Michael has argued for a simpler model of presenting the real PCI
device config registers while preventing writes only to the registers which 
would clearly disrupt the system.

Now, the virtual model *could* look little like the real hardware, and use 
bunches of ioctls for everything it needs, or it could look a lot like PCI and 
use reads and writes of the virtual PCI config registers to trigger its 
actions.  The latter makes things more amenable to those porting drivers from 
other environments.

I realize that to date the VFIO driver has been a  bit of a mish-mash between 
the ioctl and config based techniques; I intend to clean that up.  And, yes, 
the abstract model presented by VFIO will need plenty of documentation.

Since KVM/qemu already has its own notion of a virtual PCI device which it 
presents to the guest OS, we either need to reconcile VFIO and qemu, or 
provide a bypass of the VFIO virtual model.  This could be direct access 
through sysfs, or else an ioctl to VFIO.  Since I have no internals knowledge 
of qemu, I look to others to choose.

Other little things:
1. Yes, I can share some code with sysfs if I can get the right EXPORTs there.
2. I'll add multiple MSI support, but I wish to point out that even though the 
PCI MSI API supports it, none of the architectures do.
3. FLR needs work.  I was foolish enough to assume that FLR wouldn't reset 
BARs; now I know better.
4. I'll get rid of the vfio config_map in sysfs; it was there for debugging.
5. I'm still looking to support hotplug/unplug and power management stuff via 
generic netlink notifications.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3] VFIO driver: Non-privileged user level PCI drivers

2010-07-28 Thread Tom Lyon
On Tuesday, July 27, 2010 04:53:22 pm Michael S. Tsirkin wrote:
> On Tue, Jul 27, 2010 at 03:13:14PM -0700, Tom Lyon wrote:
> > [ Sorry for the long hiatus, I've been wrapped up in other issues.]
> > 
> > I think the fundamental issue to resolve is to decide on the model which
> > the VFIO driver presents to its users.
> > 
> > Fundamentally, VFIO as part of the OS must protect the system from its
> > users and also protect the users from each other.  No disagreement here.
> > 
> > But another fundamental purpose of an OS to to present an abstract model
> > of the underlying hardware to its users, so that the users don't have to
> > deal with the full complexity of the hardware.
> > 
> > So I think VFIO should present a 'virtual', abstracted PCI device to its
> > users whereas Michael has argued for a simpler model of presenting the
> > real PCI device config registers while preventing writes only to the
> > registers which would clearly disrupt the system.
> 
> In fact, there is no contradiction. I am all for an abstracted
> API *and* I think the virtualization concept is a bad way
> to build this API.
> 
> The 'virtual' interface you present is very complex and hardware specific:
> you do not hide literally *anything*. Deciding which functionality
> userspace needs, and exposing it to userspace as a set of APIs would be
> abstract. Instead you ask people to go read the PCI spec, the device spec,
> and bang on PCI registers, little-endian-ness and all, then try to
> interpret what do the virtual values mean.

Exactly! The PCI bus is far better *specified*, *documented*, and widely 
implemented than a Linux driver could ever hope to be.  And there are lots of 
current Linux drivers which bang around in pci config space simply because the 
authors were not aware of some api call buried deep in linux which would do 
the work for them - or - got tired of using OS-specific APIs when porting a 
driver and decided to just ask the hardware.

> Example:
> 
> How do I find # of MSI-X vectors? Sure, scan the capability list,
> find the capability, read the value, convert from little endian
> at each step.
> A page or two of code, and let's hope I have a ppc to test on.
> And note no driver has this code - they all use OS routines.
> 
> So why wouldn't
>   ioctl(dev, VFIO_GET_MSIX_VECTORS, &n);
> better serve the declared goal of presenting an abstracted PCI device to
> users?

By and large, the user drivers just know how many because the hardware is 
constant.

And inventing 20 or 30 ioctls to do a bunch of random stuff is gross when you 
can instead use normal read and write calls to a well defined structure.
> 
> > Now, the virtual model *could* look little like the real hardware, and
> > use bunches of ioctls for everything it needs,
> 
> Or reads/writes at special offsets, or sysfs attributes.
> 
> > or it could look a lot like PCI and
> > use reads and writes of the virtual PCI config registers to trigger its
> > actions.  The latter makes things more amenable to those porting drivers
> > from other environments.
> 
> I really doubt this helps at all. Drivers typically use OS-specific
> APIs. It is very uncommon for them to touch standard registers,
> which is 100% of what your patch seem to be dealing with.
> 
> And again, how about a small userspace library that would wrap vfio and
> add the abstractions for drivers that do need them?

Yes, there will be userspace libraries - I already have a vfio backend for 
libpci.
> 
> > I realize that to date the VFIO driver has been a  bit of a mish-mash
> > between the ioctl and config based techniques; I intend to clean that
> > up.  And, yes, the abstract model presented by VFIO will need plenty of
> > documentation.
> 
> And, it will need to be maintained forever, bugs and all.
> For example, if you change some register you emulated
> to fix a bug, to the driver this looks like a hardware change,
> and it will crash.

The changes will come only to allow for a more-perfect emulation, so I doubt 
that  will cause driver problems.  No different than discovering and fixing 
bugs in the ioctls needed in you scenario.

> 
> The PCI spec has some weak versioning support, but it
> is mostly not a problem in that space: a specific driver needs to
> only deal with a specific device.  We have a generic driver so PCI
> configuration space is a bad interface to use.

PCI has great versioning. Damn near every change made in 16+ years has been 
upwards compatible.  BIOS and OS writers don't have trouble with generic PCI, 
why should vfio?

> 
> > Since KVM/qemu already has its own notion of a virtual PCI device which
> > it