[Qemu-devel] Fwd: Direct guest device access from nested guest

2013-08-29 Thread Aaron Fabbri
Sorry. Resending in plain text.  (Gmail).

-- Forwarded message --

Has anyone considered a paravirt approach?  That is:

Guest kernel:  Write a new IOMMU API back end which does KVM
hypercalls.  Exposes VFIO to guest user processes (nested VMs) as
usual.

Host kernel: KVM does things like collapse {guest_va -> guest_pa ->
host_pa} mappings to {guest_va -> host_pa}, and call through to
underlying IOMMU.
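
To make that concrete, here is a rough sketch of what the host-side handler
might look like.  Names, types, and signatures are illustrative only, not a
real patch:

#include <linux/iommu.h>
#include <linux/kvm_host.h>

/*
 * Sketch only: a hypothetical KVM hypercall that installs a nested
 * guest's IOVA mapping into the real hardware IOMMU.  The guest's
 * paravirt IOMMU backend passes (iova, guest_pa, size, prot); KVM
 * resolves guest_pa -> host_pa and programs the collapsed mapping
 * into the IOMMU domain backing the assigned device.
 */
static int kvm_pv_iommu_map(struct kvm *kvm, struct iommu_domain *domain,
                            unsigned long iova, gpa_t guest_pa,
                            size_t size, int prot)
{
        unsigned long pfn;

        /* Collapse guest_pa -> host_pa via KVM's memslot tracking. */
        pfn = gfn_to_pfn(kvm, gpa_to_gfn(guest_pa));
        if (is_error_pfn(pfn))
                return -EFAULT;

        /* Program iova -> host_pa into the hardware IOMMU. */
        return iommu_map(domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
                         size, prot);
}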

Opinions?



On Wed, Aug 28, 2013 at 12:18 PM, Lluís Vilanova  wrote:
>
> Jan Kiszka writes:
>
> > On 2013-08-28 20:12, Lluís Vilanova wrote:
> >> Jan Kiszka writes:
> >> [...]
>  Is it possible to give a nested guest direct access to a device on the 
>  guest?
>  (more specifically, an AHCI controller).
> >>
> >>> Nope, we are lacking support for emulating or (securely) forwarding
> >>> VT-d/IOMMU features to the first level guest. Would be cool to have,
> >>> just not yet there. But I've talked to Intel people recently, and they
> >>> are considering to support some nested VT-d with KVM.
> >>
> >> Thanks a lot. I've been told there's some patches floating around to add 
> >> such
> >> support, but I suppose they've been long outdated and only work as POCs.
>
> > I haven't seen anything in public.
>
> This is what I've found:
>
>   https://lists.nongnu.org/archive/html/qemu-devel/2011-04/msg01970.html
>
>
> > PS: You have Mail-Followup-To set in your answers - people will drop you
> > from CC this way.
>
> Yes, my mail client tried to be clever but didn't know I'm not subscribed to 
> the
> KVM list :)
>
>
> Lluis
>
> --
>  "And it's much the same thing with knowledge, for whenever you learn
>  something new, the whole world becomes that much richer."
>  -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
>  Tollbooth
>



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-20 Thread Aaron Fabbri



On 12/20/11 8:30 PM, "Alex Williamson"  wrote:

> On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
>> On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
>>> On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:

>>> 
>>> Well, the iommu-api was designed for amd-vi and vt-d. But its concepts
> >>> turn out to be more general and are in no way x86-centric anymore.
>> 
>> It's improving, but there are still plenty of x86isms there.
> 
> Having worked on ia64 for a while, it's interesting to see this x86
> bashing from the other side.  Everyone is more than willing to make
> architecture neutral interfaces (jeez, look at the extent of the vfio
> reworks), but it's not fair to throw away interfaces as x86-centric if
> you're not pushing your requirements and making use of the code.
> 
> It seems like we'd be better served today to start with the vfio code we
> have and let that be the catalyst to drive an iommu api that better
> serves non-x86.  I don't see how this group management tangent is really
> getting us anywhere.  Thanks,

I'd agree that an incremental approach here is key.  VFIO has already seen a
ton of rework to accommodate all architectures.  Let's not bite off a bunch
of these other subsystem rewrites in the same chunk as our VFIO effort.

-Aaron




Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework

2011-11-15 Thread Aaron Fabbri



On 11/15/11 12:10 PM, "Scott Wood"  wrote:

> On 11/15/2011 12:34 AM, David Gibson wrote:
 
>>> +static int allow_unsafe_intrs;
>>> +module_param(allow_unsafe_intrs, int, 0);
>>> +MODULE_PARM_DESC(allow_unsafe_intrs,
>>> +"Allow use of IOMMUs which do not support interrupt remapping");
>> 
>> This should not be a global option, but part of the AMD/Intel IOMMU
>> specific code.  In general it's a question of how strict the IOMMU
>> driver is about isolation when it determines what the groups are, and
>> only the IOMMU driver can know what the possibilities are for its
>> class of hardware.
> 
> It's also a concern that is specific to MSIs.  In any case, I'm not sure
> that the ability to cause a spurious IRQ is bad enough to warrant
> disabling the entire subsystem by default on certain hardware.

I think the issue is more that the ability to create fake MSI interrupts can
lead to bigger exploits.

Originally we didn't have this parameter.  It was added to reflect the fact
that MSIs triggered by guests are dangerous without the isolation that
interrupt remapping provides.

That is, it *should* be inconvenient to run without interrupt remapping HW
support.
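
Roughly the kind of gate I have in mind, as a sketch of a helper that could
live in vfio_iommu.c.  The capability check and the iommu->domain field are
stand-ins for whatever the IOMMU driver and the vfio code actually expose;
only allow_unsafe_intrs is from the patch itself:

static int vfio_check_msi_isolation(struct vfio_iommu *iommu)
{
        /* Domains with interrupt remapping can isolate MSIs; nothing to do. */
        if (iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP))
                return 0;

        /* Admin explicitly opted in to running without MSI isolation. */
        if (allow_unsafe_intrs) {
                printk(KERN_WARNING
                       "vfio: no interrupt remapping; MSIs are not isolated\n");
                return 0;
        }

        /* Default: refuse to hand the domain to userspace. */
        return -EPERM;
}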

-Aaron

> Probably best to just print a warning on module init if there are any
> known isolation holes, and let the admin decide whom (if anyone) to let
> use this.  If the hole is bad enough that it must be confirmed, it
> should require at most a sysfs poke.
> 
> -Scott
> 




Re: [Qemu-devel] [RFC PATCH] vfio: VFIO Driver core framework

2011-11-08 Thread Aaron Fabbri
I'm going to send out chunks of comments as I go over this stuff.  Below
I've covered the documentation file and vfio_iommu.c.  More comments coming
soon...

On 11/3/11 1:12 PM, "Alex Williamson"  wrote:

> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects.  See Documentation/vfio.txt included in
> this patch for user and kernel API description.
> 
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the driver version 0.2.  It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
> 
> Signed-off-by: Alex Williamson 
> ---

> +
> +Groups, Devices, IOMMUs, oh my
> +------------------------------
> +
> +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> +can't always distinguish transactions from each individual device in
> +the system.  Sometimes this is because of the IOMMU design, such as with
> +PEs, other times it's caused by the I/O topology, for instance a

Can you define this acronym the first time you use it, i.e.

+ PEs (partitionable endpoints), ...

> +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> +devices created by these restictions IOMMU groups (or just "groups" for

restrictions

> +this document).
> +
> +The IOMMU cannot distiguish transactions between the individual devices

distinguish

> +within the group, therefore the group is the basic unit of ownership for
> +a userspace process.  Because of this, groups are also the primary
> +interface to both devices and IOMMU domains in VFIO.
> +

> +file descriptor referencing the same internal IOMMU object from either
> +X or Y).  Merged groups can be dissolved either explictly with UNMERGE

explicitly


> +
> +Device tree devices also invlude ioctls for further defining the

include


> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c

> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> +  dma_addr_t start, size_t size)
> +{
> +struct list_head *pos;
> +struct dma_map_page *mlp;
> +
> +list_for_each(pos, &iommu->dm_list) {
> +mlp = list_entry(pos, struct dma_map_page, list);
> +if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> +   start, size))
> +return mlp;
> +}
> +return NULL;
> +}
> +

This function below should be static.

> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> +size_t size, struct dma_map_page *mlp)
> +{
> +struct dma_map_page *split;
> +int npage_lo, npage_hi;
> +
> +/* Existing dma region is completely covered, unmap all */
> +if (start <= mlp->daddr &&
> +start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> +list_del(&mlp->list);
> +npage_lo = mlp->npage;
> +kfree(mlp);
> +return npage_lo;
> +}
> +
> +/* Overlap low address of existing range */
> +if (start <= mlp->daddr) {
> +size_t overlap;
> +
> +overlap = start + size - mlp->daddr;
> +npage_lo = overlap >> PAGE_SHIFT;
> +npage_hi = mlp->npage - npage_lo;

npage_hi is not used.  Delete this line ^

> +
> +vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> +mlp->daddr += overlap;
> +mlp->vaddr += overlap;
> +mlp->npage -= npage_lo;
> +return npage_lo;
> +}
> +
> +/* Overlap high address of existing range */
> +if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +size_t overlap;
> +
> +overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> +npage_hi = overlap >> PAGE_SHIFT;
> +npage_lo = mlp->npage - npage_hi;
> +
> +vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> +mlp->npage -= npage_hi;
> +return npage_hi;
> +}
> +
> +/* Split existing */
> +npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> +npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> +
> +split = kzalloc(sizeof *split, GFP_KERNEL);
> +if (!split)
> +return -ENOMEM;
> +
> +vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> +
> +mlp->npage = npage_lo;
> +
> +split->npage = npage_hi;
> +split->daddr = start + size;
> +split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> +split->rdwr = mlp->rdwr;
> +list_add(&split->list, &iommu->dm_list);
> +return size >> PAGE_SHIFT;
> +}
> +

Function should be static.

> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_d

Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-26 Thread Aaron Fabbri



On 8/26/11 12:35 PM, "Chris Wright"  wrote:

> * Aaron Fabbri (aafab...@cisco.com) wrote:
>> On 8/26/11 7:07 AM, "Alexander Graf"  wrote:
>>> Forget the KVM case for a moment and think of a user space device driver. I
>>> as
>>> a user am not root. But I as a user when having access to /dev/vfioX want to
>>> be able to access the device and manage it - and only it. The admin of that
>>> box needs to set it up properly for me to be able to access it.
>>> 
>>> So having two steps is really the correct way to go:
>>> 
>>>   * create VFIO group
>>>   * use VFIO group
>>> 
>>> because the two are done by completely different users.
>> 
>> This is not the case for my userspace drivers using VFIO today.
>> 
>> Each process will open vfio devices on the fly, and they need to be able to
>> share IOMMU resources.
> 
> How do you share IOMMU resources w/ multiple processes, are the processes
> sharing memory?

Sorry, bad wording.  I share IOMMU domains *within* each process.

E.g. If one process has 3 devices and another has 10, I can get by with two
iommu domains (and can share buffers among devices within each process).

If I ever need to share devices across processes, the shared memory case
might be interesting.
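
For reference, the way my drivers do this today looks roughly like the sketch
below.  The device paths, the ioctl name, and the request number are
placeholders for the existing uiommu/vfio interface, not its literal
definitions:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

#define VFIO_SET_UIOMMU_DOMAIN  0   /* placeholder; real value comes from the headers */

/* One process binds several vfio devices to a single uiommu domain so
 * they share IO page tables (and hence ioTLB resources). */
static void bind_devices_to_one_domain(int ndevices)
{
        int uiommu_fd = open("/dev/uiommu", O_RDWR);    /* one IOMMU domain */
        int i;

        for (i = 0; i < ndevices; i++) {
                char path[32];
                int dev_fd;

                snprintf(path, sizeof(path), "/dev/vfio%d", i);
                dev_fd = open(path, O_RDWR);
                /* Every device attached here shares the same IO page tables. */
                ioctl(dev_fd, VFIO_SET_UIOMMU_DOMAIN, &uiommu_fd);
        }
}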

> 
>> So I need the ability to dynamically bring up devices and assign them to a
>> group.  The number of actual devices and how they map to iommu domains is
>> not known ahead of time.  We have a single piece of silicon that can expose
>> hundreds of pci devices.
> 
> This does not seem fundamentally different from the KVM use case.
> 
> We have 2 kinds of groupings.
> 
> 1) low-level system or topology grouping
> 
>Some may have multiple devices in a single group
> 
>* the PCIe-PCI bridge example
>* the POWER partitionable endpoint
> 
>Many will not
> 
>* singleton group, e.g. typical x86 PCIe function (majority of
>  assigned devices)
> 
>Not sure it makes sense to have these administratively defined as
>opposed to system defined.
> 
> 2) logical grouping
> 
>* multiple low-level groups (singleton or otherwise) attached to same
>  process, allowing things like single set of io page tables where
>  applicable.
> 
>These are nominally administratively defined.  In the KVM case, there
>is likely a privileged task (i.e. libvirtd) involved w/ making the
>device available to the guest and can do things like group merging.
>In your userspace case, perhaps it should be directly exposed.

Yes.  In essence, I'd rather not have to run any other admin processes.
Doing things programmatically, on the fly, from each process, is the
cleanest model right now.

> 
>> In my case, the only administrative task would be to give my processes/users
>> access to the vfio groups (which are initially singletons), and the
>> application actually opens them and needs the ability to merge groups
>> together to conserve IOMMU resources (assuming we're not going to expose
>> uiommu).
> 
> I agree, we definitely need to expose _some_ way to do this.
> 
> thanks,
> -chris




Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-26 Thread Aaron Fabbri



On 8/26/11 7:07 AM, "Alexander Graf"  wrote:

> 

> 
> Forget the KVM case for a moment and think of a user space device driver. I as
> a user am not root. But I as a user when having access to /dev/vfioX want to
> be able to access the device and manage it - and only it. The admin of that
> box needs to set it up properly for me to be able to access it.
> 
> So having two steps is really the correct way to go:
> 
>   * create VFIO group
>   * use VFIO group
> 
> because the two are done by completely different users.

This is not the case for my userspace drivers using VFIO today.

Each process will open vfio devices on the fly, and they need to be able to
share IOMMU resources.

So I need the ability to dynamically bring up devices and assign them to a
group.  The number of actual devices and how they map to iommu domains is
not known ahead of time.  We have a single piece of silicon that can expose
hundreds of pci devices.

In my case, the only administrative task would be to give my processes/users
access to the vfio groups (which are initially singletons), and the
application actually opens them and needs the ability to merge groups
together to conserve IOMMU resources (assuming we're not going to expose
uiommu).

-Aaron




Re: [Qemu-devel] kvm PCI assignment & VFIO ramblings

2011-08-23 Thread Aaron Fabbri



On 8/23/11 10:01 AM, "Alex Williamson"  wrote:

> On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote:
>> On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote:
>> 
>>> I'm not following you.
>>> 
>>> You have to enforce group/iommu domain assignment whether you have the
>>> existing uiommu API, or if you change it to your proposed
>>> ioctl(inherit_iommu) API.
>>> 
>>> The only change needed to VFIO here should be to make uiommu fd assignment
>>> happen on the groups instead of on device fds.  That operation fails or
>>> succeeds according to the group semantics (all-or-none assignment/same
>>> uiommu).
>> 
>> Ok, so I missed that part where you change uiommu to operate on group
>> fd's rather than device fd's, my apologies if you actually wrote that
>> down :-) It might be obvious ... bear with me I just flew back from the
>> US and I am badly jet lagged ...
> 
> I missed it too, the model I'm proposing entirely removes the uiommu
> concept.
> 
>> So I see what you mean, however...
>> 
>>> I think the question is: do we force 1:1 iommu/group mapping, or do we allow
>>> arbitrary mapping (satisfying group constraints) as we do today.
>>> 
>>> I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
>>> ability and definitely think the uiommu approach is cleaner than the
>>> ioctl(inherit_iommu) approach.  We considered that approach before but it
>>> seemed less clean so we went with the explicit uiommu context.
>> 
>> Possibly, the question that interests me the most is what interface KVM
>> will end up using. I'm also not a big fan of the (perceived) discrepancy
>> between using uiommu to create groups but using the group fd to actually
>> do the mappings, at least if that is still the plan.
> 
> Current code: uiommu creates the domain, we bind a vfio device to that
> domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do
> mappings via MAP_DMA on the vfio device (affecting all the vfio devices
> bound to the domain)
> 
> My current proposal: "groups" are predefined.  groups ~= iommu domain.

This is my main objection.  I'd rather not lose the ability to have multiple
devices (which are all predefined as singleton groups on x86 w/o PCI
bridges) share IOMMU resources.  Otherwise, 20 devices sharing buffers would
require 20x the IOMMU/ioTLB resources.  KVM doesn't care about this case?

> The iommu domain would probably be allocated when the first device is
> bound to vfio.  As each device is bound, it gets attached to the group.
> DMAs are done via an ioctl on the group.
> 
> I think group + uiommu leads to effectively reliving most of the
> problems with the current code.  The only benefit is the group
> assignment to enforce hardware restrictions.  We still have the problem
> that uiommu open() = iommu_domain_alloc(), whose properties are
> meaningless without attached devices (groups).  Which I think leads to
> the same awkward model of attaching groups to define the domain, then we
> end up doing mappings via the group to enforce ordering.

Is there a better way to allow groups to share an IOMMU domain?

Maybe, instead of having an ioctl to allow a group A to inherit the same
iommu domain as group B, we could have an ioctl to fully merge two groups
(could be what Ben was thinking):

A.ioctl(MERGE_TO_GROUP, B)

The group A now goes away and its devices join group B.  If A ever had an
iommu domain assigned (and buffers mapped?) we fail.

Groups cannot get smaller (they are defined as minimum granularity of an
IOMMU, initially).  They can get bigger if you want to share IOMMU
resources, though.
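
In userspace, that would look something like the sketch below.  The ioctl
name follows the proposal above; the request number, paths, and error
handling are illustrative only:

#include <fcntl.h>
#include <sys/ioctl.h>

#define VFIO_GROUP_MERGE_TO  0   /* placeholder; real value would come from vfio headers */

/* Merge group A into group B: A's devices join B, and A goes away.
 * Fails if A already has an IOMMU domain (or mappings) of its own. */
static int merge_groups(const char *a_path, const char *b_path)
{
        int a = open(a_path, O_RDWR);
        int b = open(b_path, O_RDWR);

        if (ioctl(a, VFIO_GROUP_MERGE_TO, b) < 0)
                return -1;

        /* From here on, A's devices are managed through B's group fd. */
        return b;
}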

Any downsides to this approach?

-AF

> 
>> If the separate uiommu interface is kept, then anything that wants to be
>> able to benefit from the ability to put multiple devices (or existing
>> groups) into such a "meta group" would need to be explicitly modified to
>> deal with the uiommu APIs.
>> 
>> I tend to prefer such "meta groups" as being something you create
>> statically using a configuration interface, either via sysfs, netlink or
>> ioctl's to a "control" vfio device driven by a simple command line tool
>> (which can have the configuration stored in /etc and re-apply it at
>> boot).
> 
> I cringe anytime there's a mention of "static".  IMHO, we have to
> support hotplug.  That means "meta groups" change dynamically.  Maybe
> this supports the idea that we should be able to retrieve a new fd from
> the group to do mappings.  Any groups bound together will return the
> same fd and the fd will persist so long as any member of the group is
> open.
> 
>> That way, any program capable of exploiting VFIO "groups" will
>> automatically be able to exploit those "meta groups" (or groups of
>> groups) as well as long as they are supported on the system.
>> 
>> If we ever have system specific constraints as to how such groups can be
>> created, then it can all be handled at the level of that configuration
>> tool without impact on whatever programs know how to exploit t