Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2012-01-30 Thread David Gibson
On Wed, Jan 25, 2012 at 04:44:53PM -0700, Alex Williamson wrote:
> On Wed, 2012-01-25 at 14:13 +1100, David Gibson wrote:
> > On Tue, Dec 20, 2011 at 09:30:37PM -0700, Alex Williamson wrote:
> > > On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
> > > > On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
> > > > > On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
> > > > > > Well.. that's not where it is in Alex's code either.  The iommu layer
> > > > > > (to the extent that there is such a "layer") supplies the group info,
> > > > > > but the group management is in vfio, not the iommu layer.  With mine
> > > > > > it is in the driver core because the struct device seemed the logical
> > > > > > place for the group id.
> > > > > 
> > > > > Okay, seems we have different ideas of what the 'grouping code' is. I
> > > > > talked about the group enumeration code only. But group handling code is
> > > > > certainly important to some degree too. But before we argue about the
> > > > > right place of the code we should agree on the semantics such code
> > > > > should provide.
> > > > > 
> > > > > For me it is fine when the code is in VFIO for now, since VFIO is the
> > > > > only user at the moment. When more users pop up we can easily move it
> > > > > out to somewhere else. But the semantics influence the interface to
> > > > > user-space too, so it is more important now. It splits up into a number
> > > > > of sub problems:
> > > > > 
> > > > >   1) How userspace detects the device<->group relationship?
> > > > >   2) Do we want group-binding/unbinding to device drivers?
> > > > >   3) Group attach/detach to iommu-domains?
> > > > >   4) What to do with hot-added devices?
> > > > > 
> > > > > For 1) I think the current solution with the iommu_group file is fine.
> > > > > It is somewhat expensive for user-space to figure out the per-group
> > > > > device-sets, but that is a one-time effort so it doesn't really matter.
> > > > > Probably we can rename 'iommu_group' to 'isolation_group' or
> > > > > something.
> > > > 
> > > > Hrm.  Alex's group code also provides no in-kernel way to enumerate a
> > > > group, short of walking every device in the system.  And it provides
> > > > no way to attach information to a group.  It just seems foolish to me
> > > > to have this concept without some kind of in-kernel handle on it, and
> > > 
> > > Who else needs to enumerate groups right now?  Who else needs to attach
> > > data to a group.  We seem to be caught in this loop of arguing that we
> > > need driver core based group management, but we don't have any plans to
> > > make use of it, so it just bloats the kernel for most of the users that
> > > don't care about it.
> > 
> > So, Ben and I discussed this with David Woodhouse during
> > linux.conf.au.  He does want to update the core iommu_ops and dma_ops
> > handling to be group aware, and is willing to do the work for that.
> > So I will be doing another spin of my isolation code as a basis for
> > that work of his.  So there's our other user.
> 
> Hmm...
> 
> > > > if you're managing the in-kernel representation you might as well
> > > > expose it to userspace there as well.
> > > 
> > > Unfortunately this is just the start of peeling back layers of the onion.
> > > We manage groups in the driver core, so the driver core should expose
> > > them to userspace.  The driver core exposes them to userspace, so now it
> > > needs to manage permissions for userspace.
> > 
> > That doesn't necessarily follow.  The current draft has the core group
> > code export character devices on which permissions are managed, but
> > I'm also considering options where it only exports sysfs and something
> > else does the character device and permissions.
> 
> Let's start to define it then.  A big problem I have with the isolation
> infrastructure you're proposing is that it doesn't have well defined
> bounds or interfaces.  It pulls devices out of the normal driver
> model,

It doesn't pull them out of the driver model.  The struct devices are
all there just as usual; it's just a flag which inhibits driver binding.

> and even goes so far as to start exposing chardevs for groups to
> userspace.

I'm planning to split the patch so the character devices (and maybe
even the concept of binders) aren't in the core group part, which
should be all dwmw2 needs.

>  The former feels like completely the wrong approach and the
> latter oversteps the bounds of a core interface for a very niche usage,
> IMO.
> 
> Your argument for making groups more pervasive is that the iommu drivers
> need to know about them already and we should make use of groups in the
> dma_ops and iommu_ops interfaces.  Fine, that seems reasonable and I
> assume is what Woodhouse has agreed to work on.  I don't however see
> that as an argument for creating a group driver class or isolation
> API.

No, it's n

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2012-01-25 Thread Alex Williamson
On Wed, 2012-01-25 at 14:13 +1100, David Gibson wrote:
> On Tue, Dec 20, 2011 at 09:30:37PM -0700, Alex Williamson wrote:
> > On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
> > > On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
> > > > On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
> > > > > Well.. that's not where it is in Alex's code either.  The iommu layer
> > > > > (to the extent that there is such a "layer") supplies the group info,
> > > > > but the group management is in vfio, not the iommu layer.  With mine
> > > > > it is in the driver core because the struct device seemed the logical
> > > > > place for the group id.
> > > > 
> > > > Okay, seems we have different ideas of what the 'grouping code' is. I
> > > > talked about the group enumeration code only. But group handling code is
> > > > certainly important to some degree too. But before we argue about the
> > > > right place of the code we should agree on the semantics such code
> > > > should provide.
> > > > 
> > > > For me it is fine when the code is in VFIO for now, since VFIO is the
> > > > only user at the moment. When more users pop up we can easily move it
> > > > out to somewhere else. But the semantics influence the interface to
> > > > user-space too, so it is more important now. It splits up into a number
> > > > of sub problems:
> > > > 
> > > > 1) How userspace detects the device<->group relationship?
> > > > 2) Do we want group-binding/unbinding to device drivers?
> > > > 3) Group attach/detach to iommu-domains?
> > > > 4) What to do with hot-added devices?
> > > > 
> > > > For 1) I think the current solution with the iommu_group file is fine.
> > > > It is somewhat expensive for user-space to figure out the per-group
> > > > device-sets, but that is a one-time effort so it doesn't really matter.
> > > > Probably we can rename 'iommu_group' to 'isolation_group' or
> > > > something.
> > > 
> > > Hrm.  Alex's group code also provides no in-kernel way to enumerate a
> > > group, short of walking every device in the system.  And it provides
> > > no way to attach information to a group.  It just seems foolish to me
> > > to have this concept without some kind of in-kernel handle on it, and
> > 
> > Who else needs to enumerate groups right now?  Who else needs to attach
> > data to a group.  We seem to be caught in this loop of arguing that we
> > need driver core based group management, but we don't have any plans to
> > make use of it, so it just bloats the kernel for most of the users that
> > don't care about it.
> 
> So, Ben and I discussed this with David Woodhouse during
> linux.conf.au.  He does want to update the core iommu_ops and dma_ops
> handling to be group aware, and is willing to do the work for that.
> So I will be doing another spin of my isolation code as a basis for
> that work of his.  So there's our other user.

Hmm...

> > > if you're managing the in-kernel representation you might as well
> > > expose it to userspace there as well.
> > 
> > Unfortunately this is just the start of peeling back layers of the onion.
> > We manage groups in the driver core, so the driver core should expose
> > them to userspace.  The driver core exposes them to userspace, so now it
> > needs to manage permissions for userspace.
> 
> That doesn't necessarily follow.  The current draft has the core group
> code export character devices on which permissions are managed, but
> I'm also considering options where it only exports sysfs and something
> else does the character device and permissions.

Let's start to define it then.  A big problem I have with the isolation
infrastructure you're proposing is that it doesn't have well defined
bounds or interfaces.  It pulls devices out of the normal driver model,
and even goes so far as to start exposing chardevs for groups to
userspace.  The former feels like completely the wrong approach and the
latter oversteps the bounds of a core interface for a very niche usage,
IMO.

Your argument for making groups more pervasive is that the iommu drivers
need to know about them already and we should make use of groups in the
dma_ops and iommu_ops interfaces.  Fine, that seems reasonable and I
assume is what Woodhouse has agreed to work on.  I don't however see
that as an argument for creating a group driver class or isolation API.
In fact, we could probably layer a fairly similar version of vfio on top
of "group enlightened" driver core with minimal changes.

> >  Then we add permissions and
> > now we need to provide group access, then we need a channel to an actual
> > userspace device driver, zing! we add a whole API there,
> 
> And the other option I was looking at had the core providing the char
> device but having its fops get passed straight through to the binder.

I'm not a fan of that idea.  Core driver code should not be creating
chardevs and I doubt anyone is onboard with the idea of a chardev with
an arbitrary API bound

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2012-01-24 Thread David Gibson
On Tue, Dec 20, 2011 at 09:30:37PM -0700, Alex Williamson wrote:
> On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
> > On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
> > > On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
> > > > Well.. that's not where it is in Alex's code either.  The iommu layer
> > > > (to the extent that there is such a "layer") supplies the group info,
> > > > but the group management is in vfio, not the iommu layer.  With mine
> > > > it is in the driver core because the struct device seemed the logical
> > > > place for the group id.
> > > 
> > > Okay, seems we have different ideas of what the 'grouping code' is. I
> > > talked about the group enumeration code only. But group handling code is
> > > certainly important to some degree too. But before we argue about the
> > > right place of the code we should agree on the semantics such code
> > > should provide.
> > > 
> > > For me it is fine when the code is in VFIO for now, since VFIO is the
> > > only user at the moment. When more users pop up we can easily move it
> > > out to somewhere else. But the semantics influence the interface to
> > > user-space too, so it is more important now. It splits up into a number
> > > of sub problems:
> > > 
> > >   1) How userspace detects the device<->group relationship?
> > >   2) Do we want group-binding/unbinding to device drivers?
> > >   3) Group attach/detach to iommu-domains?
> > >   4) What to do with hot-added devices?
> > > 
> > > For 1) I think the current solution with the iommu_group file is fine.
> > > It is somewhat expensive for user-space to figure out the per-group
> > > device-sets, but that is a one-time effort so it doesn't really matter.
> > > Probably we can rename 'iommu_group' to 'isolation_group' or
> > > something.
> > 
> > Hrm.  Alex's group code also provides no in-kernel way to enumerate a
> > group, short of walking every device in the system.  And it provides
> > no way to attach information to a group.  It just seems foolish to me
> > to have this concept without some kind of in-kernel handle on it, and
> 
> Who else needs to enumerate groups right now?  Who else needs to attach
> data to a group.  We seem to be caught in this loop of arguing that we
> need driver core based group management, but we don't have any plans to
> make use of it, so it just bloats the kernel for most of the users that
> don't care about it.

So, Ben and I discussed this with David Woodhouse during
linux.conf.au.  He does want to update the core iommu_ops and dma_ops
handling to be group aware, and is willing to do the work for that.
So I will be doing another spin of my isolation code as a basis for
that work of his.  So there's our other user.

> > if you're managing the in-kernel representation you might as well
> > expose it to userspace there as well.
> 
> Unfortunately this is just the start of peeling back layers of the onion.
> We manage groups in the driver core, so the driver core should expose
> them to userspace.  The driver core exposes them to userspace, so now it
> needs to manage permissions for userspace.

That doesn't necessarily follow.  The current draft has the core group
code export character devices on which permissions are managed, but
I'm also considering options where it only exports sysfs and something
else does the character device and permissions.

>  Then we add permissions and
> now we need to provide group access, then we need a channel to an actual
> userspace device driver, zing! we add a whole API there,

And the other option I was looking at had the core providing the char
device but having its fops get passed straight through to the binder.

> then we need
> group driver binding, then we need group device driver binding, blam!
> another API, then we need...  I don't see a clear end marker that
> doesn't continue to bloat the core and add functionality that nobody
> else needs and we don't even have plans of integrating more pervasively.
> This appears to end with 80 to 90% of the vfio core code moving into the
> driver core.

I don't agree.  Working out the right boundary isn't totally obvious,
certainly, but that doesn't mean a reasonable boundary can't be found.

> 
> > > Regarding 2), I think providing user-space a way to unbind groups of
> > > devices from their drivers is a horrible idea.
> > 
> > Well, I'm not wed to unbinding all the drivers at once.  But what I
> > *do* think is essential is that we can atomically switch off automatic
> > driver matching for the whole group.  Without that nothing really
> > stops a driver reattaching to the things you've unbound, so you end up
> > bailing a leaky boat.
> 
> Huh?  There is no issue with removing a device from one driver and
> attaching it to another.  This happens all the time.  If you're talking
> about hotplug, all we have to do is sit on the bus notifier chain and we
> get called when devices are added, before the driver has a chance to
> attach.  We ca

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-21 Thread Joerg Roedel
On Wed, Dec 21, 2011 at 02:32:35PM +1100, David Gibson wrote:
> On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
> > For 1) I think the current solution with the iommu_group file is fine.
> > It is somewhat expensive for user-space to figure out the per-group
> > device-sets, but that is a one-time effort so it doesn't really matter.
> > Probably we can rename 'iommu_group' to 'isolation_group' or
> > something.
> 
> Hrm.  Alex's group code also provides no in-kernel way to enumerate a
> group, short of walking every device in the system.  And it provides
> no way to attach information to a group.  It just seems foolish to me
> to have this concept without some kind of in-kernel handle on it, and
> if you're managing the in-kernel representation you might as well
> expose it to userspace there as well.

I agree that we should have an in-kernel handle for groups in the
future. This would be generic code the iommu-drivers can use to manage
their groups and where the pci-quirks can chime in. But we don't need
that immediately, let's add this incrementally.

From the implementation side I would prefer to introduce a dev->iommu
pointer (a generalization of the dev->archdata->iommu pointer we have on
x86, ia64 and arm already) and link this to the handle of the group.
This makes it also easy to walk all devices in a group within the
kernel.
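
As a very rough sketch of the data-structure relationship -- every name
below is made up for illustration, none of it is existing kernel code:

#include <linux/device.h>
#include <linux/list.h>
#include <linux/mutex.h>

/* Hypothetical group handle, owned by generic iommu code. */
struct iommu_group {
        int                     id;
        struct mutex            lock;
        struct list_head        devices;        /* all devices in this group */
        void                    *iommu_data;    /* driver private, e.g. default domain */
};

/* What a generic dev->iommu pointer could point to. */
struct dev_iommu {
        struct device           *dev;
        struct iommu_group      *group;         /* back-pointer to the group handle */
        struct list_head        group_node;     /* entry in group->devices */
};

/* With that in place, walking a group inside the kernel is trivial. */
static int iommu_group_for_each_dev(struct iommu_group *group, void *data,
                                    int (*fn)(struct device *dev, void *data))
{
        struct dev_iommu *di;
        int ret = 0;

        mutex_lock(&group->lock);
        list_for_each_entry(di, &group->devices, group_node) {
                ret = fn(di->dev, data);
                if (ret)
                        break;
        }
        mutex_unlock(&group->lock);
        return ret;
}
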
I still disagree that we need to expose this to user-space in any other
way than what we currently have. It is just not worth the effort when it
is used that rarely. It is perfectly fine if user-space has to scan the
sysfs device tree to build its own group data structures.

> > Regarding 2), I think providing user-space a way to unbind groups of
> > devices from their drivers is a horrible idea.
> 
> Well, I'm not wed to unbinding all the drivers at once.  But what I
> *do* think is essential is that we can atomically switch off automatic
> driver matching for the whole group.  Without that nothing really
> stops a driver reattaching to the things you've unbound, so you end up
> bailing a leaky boat.

Okay, makes sense for hotadd to have this. What we need from the
driver-core here is a method to disable driver probing for a device. The
iommu-layer can then call this method on the DEVICE_ADD notifier if
required. This also can be done incrementally to the current VFIO
implementation.
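
Roughly like this -- the notifier plumbing is the existing driver-core
mechanism, but device_disable_probe() and the group lookup are
hypothetical, they are exactly the hooks being asked for here:

#include <linux/device.h>
#include <linux/notifier.h>
#include <linux/pci.h>

static int iommu_bus_notify(struct notifier_block *nb,
                            unsigned long action, void *data)
{
        struct device *dev = data;

        if (action != BUS_NOTIFY_ADD_DEVICE)
                return NOTIFY_DONE;

        /* Hypothetical: is this device's group currently handed out? */
        if (iommu_group_is_claimed(dev))
                device_disable_probe(dev);      /* hypothetical driver-core method */

        return NOTIFY_OK;
}

static struct notifier_block iommu_bus_nb = {
        .notifier_call = iommu_bus_notify,
};

/* registered once at init, e.g.: bus_register_notifier(&pci_bus_type, &iommu_bus_nb); */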

> Ok, that's not the usage model I had in mind.  What I'm thinking here
> is that the admin removes groups that will be used in guests from the
> host kernel (probably via boot-time scripts).  The guests will pick
> them up later, so that when a guest quits then restarts, we don't have
> devices appearing transiently in the host.

This 'remove-groups-from-the-host' thing can easily be done (also at
boot time) by just binding the devices in question to the VFIO driver.
And the devices can stay with VFIO for the lifetime of the system (or
until the host wants to make use of them). I don't think we need any
additional logic for that.

> > For the remaining two questions I think the concept of a default-domain
> > is helpful.  The default-domain is a per-group domain which is created
> > by the iommu-driver at initialization time. It is the domain each device
> > is assigned to when it is not assigned to any other domain (which means
> > that each device/group is always attached to a domain). The default
> > domain will be used by the DMA-API layer. This implicitly means, that a
> > device which is not in the default-domain can't be used with the
> > dma-api. The dma_supported() function will return false for those
> > devices.
> 
> But.. by definition every device in the group must belong to the same
> domain.  So how is this "default domain" in any way different from
> "current domain"?

The default domain is created by the iommu-drivers for all groups it
finds on the system and the devices are automatically assigned to it at
iommu-driver initialization time. This domain will be used for mapping
dma-api calls.
The "current domain" on the other side can also be a domain created by
VFIO (or KVM as it is today) for mapping the device dma space into a
guest.

> The domain_{attach,detach} functions absolutely should be group based
> not device based.  That's what it means to be a group.

I disagree with that. This would force all iommu-drivers that have no
group-concept (basically all current ARM iommu implementations) to work
on groups too. That makes the iommu-api much harder to use.

We could do instead:

1) Force users of the iommu-api to bind each device in the group
   individually.
2) Attach all devices in the group automatically to the same
   domain if a single device is attached to it.

I have no strong opinion on which way we go, but have a slight preference
for way 2. To make it harder to misuse, we can require that way 2 only
works if the group is in its default-domain and there are no pending
dma-allocations.
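
Way 2 would boil down to something of this shape in the attach path
(sketch only, all of the group helpers are assumed rather than existing
code):

static int attach_one(struct device *dev, void *data)
{
        struct iommu_domain *domain = data;

        /* low-level, per-device attach done by the iommu driver */
        return __iommu_attach_dev(domain, dev);
}

int iommu_attach_device(struct iommu_domain *domain, struct device *dev)
{
        struct iommu_group *group = iommu_group_of(dev);        /* assumed lookup */

        /* iommu drivers without a group concept: behave as today */
        if (!group)
                return __iommu_attach_dev(domain, dev);

        /* only allowed while the group still sits in its default domain */
        if (!iommu_group_in_default_domain(group))
                return -EBUSY;

        /* attaching one device drags all of its siblings along */
        return iommu_group_for_each_dev(group, domain, attach_one);
}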

> > Question 4) is also solved with the defaul

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-20 Thread Aaron Fabbri



On 12/20/11 8:30 PM, "Alex Williamson" wrote:

> On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
>> On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
>>> On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:

>>> 
>>> Well, the iommu-api was designed for amd-vi and vt-d. But its concepts
>>> turn out to be more general and in no way x86-centric anymore.
>> 
>> It's improving, but there are still plenty of x86isms there.
> 
> Having worked on ia64 for a while, it's interesting to see this x86
> bashing from the other side.  Everyone is more than willing to make
> architecture neutral interfaces (jeez, look at the extent of the vfio
> reworks), but it's not fair to throw away interfaces as x86-centric if
> you're not pushing your requirements and making use of the code.
> 
> It seems like we'd be better served today to start with the vfio code we
> have and let that be the catalyst to drive an iommu api that better
> serves non-x86.  I don't see how this group management tangent is really
> getting us anywhere.  Thanks,

I'd agree that incremental approach here is key.  VFIO has already seen a
ton of rework to accommodate all architectures.  Let's not bite off a bunch
of these other subsystem rewrites in the same chunk as our VFIO effort.

-Aaron




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-20 Thread Alex Williamson
On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
> On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
> > On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
> > > Well.. that's not where it is in Alex's code either.  The iommu layer
> > > (to the extent that there is such a "layer") supplies the group info,
> > > but the group management is in vfio, not the iommu layer.  With mine
> > > it is in the driver core because the struct device seemed the logical
> > > place for the group id.
> > 
> > Okay, seems we have different ideas of what the 'grouping code' is. I
> > talked about the group enumeration code only. But group handling code is
> > certainly important to some degree too. But before we argue about the
> > right place of the code we should agree on the semantics such code
> > should provide.
> > 
> > For me it is fine when the code is in VFIO for now, since VFIO is the
> > only user at the moment. When more users pop up we can easily move it
> > out to somewhere else. But the semantics influence the interface to
> > user-space too, so it is more important now. It splits up into a number
> > of sub problems:
> > 
> > 1) How userspace detects the device<->group relationship?
> > 2) Do we want group-binding/unbinding to device drivers?
> > 3) Group attach/detach to iommu-domains?
> > 4) What to do with hot-added devices?
> > 
> > For 1) I think the current solution with the iommu_group file is fine.
> > It is somewhat expensive for user-space to figure out the per-group
> > device-sets, but that is a one-time effort so it doesn't really matter.
> > Probably we can rename 'iommu_group' to 'isolation_group' or
> > something.
> 
> Hrm.  Alex's group code also provides no in-kernel way to enumerate a
> group, short of walking every device in the system.  And it provides
> no way to attach information to a group.  It just seems foolish to me
> to have this concept without some kind of in-kernel handle on it, and

Who else needs to enumerate groups right now?  Who else needs to attach
data to a group.  We seem to be caught in this loop of arguing that we
need driver core based group management, but we don't have any plans to
make use of it, so it just bloats the kernel for most of the users that
don't care about it.

> if you're managing the in-kernel representation you might as well
> expose it to userspace there as well.

Unfortunately this is just the start of peeling back layers of the onion.
We manage groups in the driver core, so the driver core should expose
them to userspace.  The driver core exposes them to userspace, so now it
needs to manage permissions for userspace.  Then we add permissions and
now we need to provide group access, then we need a channel to an actual
userspace device driver, zing! we add a whole API there, then we need
group driver binding, then we need group device driver binding, blam!
another API, then we need...  I don't see a clear end marker that
doesn't continue to bloat the core and add functionality that nobody
else needs and we don't even have plans of integrating more pervasively.
This appears to end with 80 to 90% of the vfio core code moving into the
driver core.

> > Regarding 2), I think providing user-space a way to unbind groups of
> > devices from their drivers is a horrible idea.
> 
> Well, I'm not wed to unbinding all the drivers at once.  But what I
> *do* think is essential is that we can atomically switch off automatic
> driver matching for the whole group.  Without that nothing really
> stops a driver reattaching to the things you've unbound, so you end up
> bailing a leaky boat.

Huh?  There is no issue with removing a device from one driver and
attaching it to another.  This happens all the time.  If you're talking
about hotplug, all we have to do is sit on the bus notifier chain and we
get called when devices are added, before the driver has a chance to
attach.  We can then force a vfio driver to attach when needed.  Hell,
we can just set dev->driver to something to prevent standard driver
probing.

> > It makes it too easy for
> > the user to shoot himself in the foot. For example when the user wants
> > to assign a network card to a guest, but that card is in the same group
> > as the GPU and the screen goes blank when the guest is started.
> > Requiring devices to be unbound one-by-one is better because this way
> > the user always needs to know what he is doing.
> 
> Ok, that's not the usage model I had in mind.  What I'm thinking here
> is that the admin removes groups that will be used in guests from the
> host kernel (probably via boot-time scripts).  The guests will pick
> them up later, so that when a guest quits then restarts, we don't have
> devices appearing transiently in the host.

I don't think that model is dynamic enough for our existing use cases.
A user shouldn't need to decide at boot time which devices are going to
be used for what purpose.  It's entirely valid for a user to start up a
V

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-20 Thread David Gibson
On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
> On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
> > Well.. that's not where it is in Alex's code either.  The iommu layer
> > (to the extent that there is such a "layer") supplies the group info,
> > but the group management is in vfio, not the iommu layer.  With mine
> > it is in the driver core because the struct device seemed the logical
> > place for the group id.
> 
> Okay, seems we have different ideas of what the 'grouping code' is. I
> talked about the group enumeration code only. But group handling code is
> certainly important to some degree too. But before we argue about the
> right place of the code we should agree on the semantics such code
> should provide.
> 
> For me it is fine when the code is in VFIO for now, since VFIO is the
> only user at the moment. When more users pop up we can easily move it
> out to somewhere else. But the semantics influence the interface to
> user-space too, so it is more important now. It splits up into a number
> of sub problems:
> 
>   1) How userspace detects the device<->group relationship?
>   2) Do we want group-binding/unbinding to device drivers?
>   3) Group attach/detach to iommu-domains?
>   4) What to do with hot-added devices?
> 
> For 1) I think the current solution with the iommu_group file is fine.
> It is somewhat expensive for user-space to figure out the per-group
> device-sets, but that is a one-time effort so it doesn't really matter.
> Probably we can rename 'iommu_group' to 'isolation_group' or
> something.

Hrm.  Alex's group code also provides no in-kernel way to enumerate a
group, short of walking every device in the system.  And it provides
no way to attach information to a group.  It just seems foolish to me
to have this concept without some kind of in-kernel handle on it, and
if you're managing the in-kernel representation you might as well
expose it to userspace there as well.

> Regarding 2), I think providing user-space a way to unbind groups of
> devices from their drivers is a horrible idea.

Well, I'm not wed to unbinding all the drivers at once.  But what I
*do* think is essential is that we can atomically switch off automatic
driver matching for the whole group.  Without that nothing really
stops a driver reattaching to the things you've unbound, so you end up
bailing a leaky boat.

> It makes it too easy for
> the user to shoot himself in the foot. For example when the user wants
> to assign a network card to a guest, but that card is in the same group
> as the GPU and the screen goes blank when the guest is started.
> Requiring devices to be unbound one-by-one is better because this way
> the user always needs to know what he is doing.

Ok, that's not the usage model I had in mind.  What I'm thinking here
is that the admin removes groups that will be used in guests from the
host kernel (probably via boot-time scripts).  The guests will pick
them up later, so that when a guest quits then restarts, we don't have
devices appearing transiently in the host.

> For the remaining two questions I think the concept of a default-domain
> is helpful.  The default-domain is a per-group domain which is created
> by the iommu-driver at initialization time. It is the domain each device
> is assigned to when it is not assigned to any other domain (which means
> that each device/group is always attached to a domain). The default
> domain will be used by the DMA-API layer. This implicitly means, that a
> device which is not in the default-domain can't be used with the
> dma-api. The dma_supported() function will return false for those
> devices.

But.. by definition every device in the group must belong to the same
domain.  So how is this "default domain" in any way different from
"current domain"?

In addition, making dma_supported() return false doesn't seem like a
strong enough constraint.  With this, a kernel driver which does not use
DMA, or which is initializing and hasn't yet hit a dma_supported() check,
could be accessing a device which is in the same group as something a guest
is simultaneously accessing.  Since there's no DMA (on the kernel
side) we can't get DMA conflicts but there are other forms of
isolation that the group could be enforcing which would make that
unsafe. e.g. irqs from the two devices can't be reliably separated,
debug registers on one device let config space be altered to move it
on top of the other, one can cause a bus error which will mess up the
other.

> So what does this mean for point 3? I think we can implement attaching
> and detaching groups in the iommu-api. This interface is not exposed to
> userspace and can help VFIO and possible future users. Semantic is, that
> domain_attach_group() only works when all devices in the group are in
> their default domain and domain_detach_group() puts them back into the
> default domain.

The domain_{attach,detach} functions absolutely should be group based
not device based.  That's wha

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread David Gibson
On Mon, Dec 19, 2011 at 10:56:40PM +, David Woodhouse wrote:
> On Tue, 2011-12-20 at 09:31 +1100, David Gibson wrote:
> > When we're running paravirtualized under pHyp, it's impossible to
> > merge multiple PEs into one domain per se.  We could fake it rather
> > nastily by replicating all map/unmaps across multiple PEs.  When
> > running bare metal, we could do so a bit more nicely by assigning
> > multiple PEs the same TCE pointer, but we have no mechanism to do so
> > at present. 
> 
> VT-d does share the page tables, as you could on bare metal. But it's an
> implementation detail — there's nothing *fundamentally* wrong with
> having to do the map/unmap for each PE, is there? It's only at VM setup
> time, so it doesn't really matter if it's slow.
> 
> Surely that's the only way you're going to present the guest with the
> illusion of having no IOMMU; so that DMA to any given guest physical
> address "just works".
> 
> On the other hand, perhaps you don't want to do that at all. Perhaps
> you're better off presenting a virtualised IOMMU to the guest and
> *insisting* that it fully uses it in order to do any DMA at all?

Not only do we want to, we more or less *have* to.  Existing kernels,
which are used to being paravirt under pHyp, expect and need a paravirt
iommu.  DMA without iommu setup just doesn't happen.  And the
map/unmap hypercalls are frequently a hot path, so slow does matter.

The other problem is that each domain's IOVA window is often fairly
small, a limitation that would get even worse if we try to put too
many devices in there.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread David Woodhouse
On Tue, 2011-12-20 at 09:31 +1100, David Gibson wrote:
> When we're running paravirtualized under pHyp, it's impossible to
> merge multiple PEs into one domain per se.  We could fake it rather
> nastily by replicating all map/unmaps across multiple PEs.  When
> running bare metal, we could do so a bit more nicely by assigning
> multiple PEs the same TCE pointer, but we have no mechanism to do so
> at present. 

VT-d does share the page tables, as you could on bare metal. But it's an
implementation detail — there's nothing *fundamentally* wrong with
having to do the map/unmap for each PE, is there? It's only at VM setup
time, so it doesn't really matter if it's slow.

Surely that's the only way you're going to present the guest with the
illusion of having no IOMMU; so that DMA to any given guest physical
address "just works".

On the other hand, perhaps you don't want to do that at all. Perhaps
you're better off presenting a virtualised IOMMU to the guest and
*insisting* that it fully uses it in order to do any DMA at all?

-- 
dwmw2




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread David Gibson
On Mon, Dec 19, 2011 at 03:46:38PM +, David Woodhouse wrote:
> On Mon, 2011-12-19 at 11:11 +1100, David Gibson wrote:
> >   They have no inbuilt concept
> > of domains (though we could fake it in software in some circumstances).
> 
> That sentence doesn't make much sense to me.
> 
> Either you're saying that every device behind a given IOMMU is in *one*
> domain (i.e. there's one domain per PCI host bridge), or you're saying
> that each device has its *own* domain (maximum isolation, but still
> perhaps not really true if you end up with PCIe-to-PCI bridges or broken
> hardware such as the ones we've been discovering, where multifunction
> devices do their DMA from the wrong function).
> 
> Either way, you *do* have domains. You just might not have thought about
> it before.

Right, sorry, what I mean is that there's no concept of runtime
assignment of devices to domains.  The concept used in the
documentation is a "Partitionable Endpoint" (PE) - which would
correspond to the isolation groups I'm proposing.  These are generally
assigned by firmware based on various hardware dependent isolation
constraints.

When we're running paravirtualized under pHyp, it's impossible to
merge multiple PEs into one domain per se.  We could fake it rather
nastily by replicating all map/unmaps across multiple PEs.  When
running bare metal, we could do so a bit more nicely by assigning
multiple PEs the same TCE pointer, but we have no mechanism to do so
at present.
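
The "nasty" replication option would look something like this -- pure
sketch, with pe_tce_map() standing in for whatever the per-PE
H_PUT_TCE-style operation ends up being:

/* A faked multi-PE domain: just a list of PEs, with every map/unmap
 * replayed against each of them.  All names are hypothetical. */
struct fake_domain {
        struct list_head        pes;
};

struct pe {
        struct list_head        node;
        unsigned long           liobn;  /* identifies this PE's TCE table */
};

static int fake_domain_map(struct fake_domain *dom, unsigned long iova,
                           unsigned long paddr, unsigned long prot)
{
        struct pe *pe;
        int ret;

        list_for_each_entry(pe, &dom->pes, node) {
                ret = pe_tce_map(pe->liobn, iova, paddr, prot); /* placeholder */
                if (ret)
                        return ret;     /* real code would unwind the earlier PEs */
        }
        return 0;
}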

Older hardware usually does have just one PE per host bridge, but it
also often has only one slot per host bridge, so in practice it is often
both one domain per host bridge _and_ one device per host bridge.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread David Woodhouse
On Mon, 2011-12-19 at 11:11 +1100, David Gibson wrote:
>   They have no inbuilt concept
> of domains (though we could fake it in software in some circumstances).

That sentence doesn't make much sense to me.

Either you're saying that every device behind a given IOMMU is in *one*
domain (i.e. there's one domain per PCI host bridge), or you're saying
that each device has its *own* domain (maximum isolation, but still
perhaps not really true if you end up with PCIe-to-PCI bridges or broken
hardware such as the ones we've been discovering, where multifunction
devices do their DMA from the wrong function).

Either way, you *do* have domains. You just might not have thought about
it before.

-- 
dwmw2




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread Joerg Roedel
On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
> Well.. that's not where it is in Alex's code either.  The iommu layer
> (to the extent that there is such a "layer") supplies the group info,
> but the group management is in vfio, not the iommu layer.  With mine
> it is in the driver core because the struct device seemed the logical
> place for the group id.

Okay, seems we have different ideas of what the 'grouping code' is. I
talked about the group enumeration code only. But group handling code is
certainly important to some degree too. But before we argue about the
right place of the code we should agree on the semantics such code
should provide.

For me it is fine when the code is in VFIO for now, since VFIO is the
only user at the moment. When more users pop up we can easily move it
out to somewhere else. But the semantics influence the interface to
user-space too, so it is more important now. It splits up into a number
of sub problems:

1) How userspace detects the device<->group relationship?
2) Do we want group-binding/unbinding to device drivers?
3) Group attach/detach to iommu-domains?
4) What to do with hot-added devices?

For 1) I think the current solution with the iommu_group file is fine.
It is somewhat expensive for user-space to figure out the per-group
device-sets, but that is a one-time effort so it doesn't really matter.
Probably we can rename 'iommu_group' to 'isolation_group' or something.
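
For illustration, the user-space side of 1) is just a scan like the one
below.  This assumes the per-device attribute ends up as a symlink whose
target name is the group id; the exact form of the attribute is still up
in the air, so treat it as a sketch:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <limits.h>
#include <dirent.h>

int main(void)
{
        DIR *dir = opendir("/sys/bus/pci/devices");
        struct dirent *de;
        char path[PATH_MAX], target[PATH_MAX], *base;
        ssize_t len;

        if (!dir)
                return 1;

        while ((de = readdir(dir)) != NULL) {
                if (de->d_name[0] == '.')
                        continue;
                snprintf(path, sizeof(path),
                         "/sys/bus/pci/devices/%s/iommu_group", de->d_name);
                len = readlink(path, target, sizeof(target) - 1);
                if (len < 0)
                        continue;       /* no group exposed for this device */
                target[len] = '\0';
                /* the basename of the link target is the group id */
                base = strrchr(target, '/');
                printf("%s -> group %s\n", de->d_name, base ? base + 1 : target);
        }
        closedir(dir);
        return 0;
}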

Regarding 2), I think providing user-space a way to unbind groups of
devices from their drivers is a horrible idea. It makes it too easy for
the user to shoot himself in the foot. For example when the user wants
to assign a network card to a guest, but that card is in the same group
as the GPU and the screen goes blank when the guest is started.
Requiring devices to be unbound one-by-one is better because this way
the user always needs to know what he is doing.

For the remaining two questions I think the concept of a default-domain
is helpful.  The default-domain is a per-group domain which is created
by the iommu-driver at initialization time. It is the domain each device
is assigned to when it is not assigned to any other domain (which means
that each device/group is always attached to a domain). The default
domain will be used by the DMA-API layer. This implicitly means, that a
device which is not in the default-domain can't be used with the
dma-api. The dma_supported() function will return false for those
devices.
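
Concretely, the dma-api end of that rule can be as small as this (sketch;
the group and default-domain lookups are assumed helpers, and the real
check would live in each dma_map_ops implementation):

static int example_dma_supported(struct device *dev, u64 mask)
{
        struct iommu_group *group = iommu_group_of(dev);        /* assumed */

        /*
         * Group attached to something other than its default domain
         * (e.g. handed to VFIO/KVM): refuse dma-api usage entirely.
         */
        if (group && group->domain != group->default_domain)
                return 0;

        return example_check_hw_mask(dev, mask);        /* normal per-device check */
}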

So what does this mean for point 3? I think we can implement attaching
and detaching groups in the iommu-api. This interface is not exposed to
userspace and can help VFIO and possible future users. Semantic is, that
domain_attach_group() only works when all devices in the group are in
their default domain and domain_detach_group() puts them back into the
default domain.
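
In sketch form, the proposed semantics are simply (hypothetical helpers,
not existing code):

int domain_attach_group(struct iommu_domain *domain, struct iommu_group *group)
{
        /* only legal while the whole group still sits in its default domain */
        if (!iommu_group_in_default_domain(group))
                return -EBUSY;

        return group_move_to_domain(group, domain);
}

void domain_detach_group(struct iommu_domain *domain, struct iommu_group *group)
{
        /* devices are never left domain-less: fall back to the default */
        group_move_to_domain(group, group->default_domain);
}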

Question 4) is also solved with the default-domain concept. A hot-added
device is put in the domain of its group automatically. If the group is
owned by VFIO and another driver attaches to the device, dma_supported()
will return false and initialization will fail.

> Right, so, the other problem is that a well boundaried "iommu-driver"
> is something that only exists on x86 at present, and the "iommu api"
> is riddled with x86-centric thinking.  Or more accurately, design
> based on how the current intel and amd iommus work.  On systems like
> POWER, use of the iommu is not optional - it's built into the PCI host
> bridge and must be initialized when the bridge is probed, much earlier
> than iommu driver initialization on x86.  They have no inbuilt concept
> of domains (though we could fake it in software in some circumstances).

Well, the iommu-api was designed for amd-vi and vt-d. But its concepts
turn out to be more general and in no way x86-centric anymore. We
support a couple of ARM platforms too for example. More to come. With
small extensions to the API we will also support GART-like IOMMUs in the
future.
For your hardware the domain-concept will work too. In terms of the
iommu-api a domain is nothing more than an address space. As far as I
understand there is a 1-1 mapping between a hardware iommu and a domain
in your case. The easiest solution then is to share the data structures
which describe the address space to the hardware between all iommus in a
particular domain.


Regards,

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-18 Thread David Gibson
On Fri, Dec 16, 2011 at 03:53:53PM +0100, Joerg Roedel wrote:
> On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
> > Starting with it in the core and hand waving some future use that we
> > don't plan to implement right now seems like the wrong direction.
> 
> I agree with Alex. First of all, I haven't seen any real vfio problem
> that can't be solved with the current approach, and it has the great
> advantage of simplicity. It doesn't require a re-implementation of the
> driver-core based on groups.

I'm not re-implementing the driver core in terms of groups, just
adding the concept of groups to the driver core.

> I agree that we need some improvements to
> Alex' code for the dma-api layer to solve the problem with broken devices
> using the wrong requestor-id. But that can be done incrementally with
> the current (current == in the iommu-tree) approach implemented by Alex.
> 
> I also think that all this does not belong into the driver core for two
> reasons:
> 
>   1) The information for building the device groups is provided
>  by the iommu-layer

Yes.. no change there.

>   2) The group information is provided to vfio by the iommu-api

Um.. huh?  Currently, the iommu-api supplies the info to vfio, therefore
it should?  I assume there's a non-circular argument you're trying to
make here, but I can't figure out what it is.

> This makes the iommu-layer the logical point to place the grouping
> code.

Well.. that's not where it is in Alex's code either.  The iommu layer
(to the extent that there is such a "layer") supplies the group info,
but the group management is in vfio, not the iommu layer.  With mine
it is in the driver core because the struct device seemed the logical
place for the group id.

Moving the group management into the iommu code itself probably does
make more sense, although I think that would be a change more of code
location than any actual content change.

> There are some sources outside of the iommu-layer that may influence
> grouping (like pci-quirks), but most of the job is done by the
> iommu-drivers.

Right, so, the other problem is that a well boundaried "iommu-driver"
is something that only exists on x86 at present, and the "iommu api"
is riddled with x86-centric thinking.  Or more accurately, design
based on how the current intel and amd iommus work.  On systems like
POWER, use of the iommu is not optional - it's built into the PCI host
bridge and must be initialized when the bridge is probed, much earlier
than iommu driver initialization on x86.  They have no inbuilt concept
of domains (though we could fake it in software in some circumstances).

Now, that is something that needs to be fixed longer term.  I'm just
not sure how to deal with that while also sorting out some sort of device
isolation / passthrough system.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-16 Thread Joerg Roedel
On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
> Starting with it in the core and hand waving some future use that we
> don't plan to implement right now seems like the wrong direction.

I agree with Alex. First of all, I haven't seen any real vfio problem
that can't be solved with the current approach, and it has the great
advantage of simplicity. It doesn't require a re-implementation of the
driver-core based on groups. I agree that we need some improvements to
Alex' code for the dma-api layer to solve the problem with broken devices
using the wrong requestor-id. But that can be done incrementally with
the current (current == in the iommu-tree) approach implemented by Alex.

I also think that all this does not belong into the driver core for two
reasons:

1) The information for building the device groups is provided
   by the iommu-layer
2) The group information is provided to vfio by the iommu-api

This makes the iommu-layer the logical point to place the grouping code.
There are some sources outside of the iommu-layer that may influence
grouping (like pci-quirks), but most of the job is done by the
iommu-drivers.

Thanks,

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread David Gibson
On Thu, Dec 15, 2011 at 09:49:05PM -0700, Alex Williamson wrote:
> On Fri, 2011-12-16 at 12:40 +1100, David Gibson wrote:
> > On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
> > > On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
> > > > Here's the second spin of my preferred approach to handling grouping
> > > > of devices for safe assignment to guests.
> > > > 
> > > > Changes since v1:
> > > >  * Many name changes and file moves for improved consistency
> > > >  * Bugfixes and cleanups
> > > >  * The interface to the next layer up is considerably fleshed out,
> > > >although it still needs work.
> > > >  * Example initialization of groups for p5ioc2 and p7ioc.
> > > > 
> > > > TODO:
> > > >  * Need sample initialization of groups for intel and/or amd iommus
> > > 
> > > I think this very well might impose significant bloat for those
> > > implementations.  On POWER you typically don't have singleton groups,
> > > while it's the norm on x86.  I don't know that either intel or amd iommu
> > > code have existing structures that they can simply tack the group
> > > pointer to.
> > 
> > Actually, I think they can probably just use the group pointer in the
> > struct device.  Each PCI function will typically allocate a new group
> > and put the pointer in the struct device and nowhere else.  Devices
> > hidden under bridges copy the pointer from the bridge parent instead.
> > I will have to check the unplug path to ensure we can manage the group
> > lifetime properly, of course.
> > 
> > >  Again, this is one of the reasons that I think the current
> > > vfio implementation is the right starting point.  We keep groups within
> > > vfio, imposing zero overhead for systems not making use of it and only
> > > require iommu drivers to implement a trivial function to opt-in.  As we
> > > start to make groups more pervasive in the dma layer, independent of
> > > userspace driver exposure, we can offload pieces to the core.  Starting
> > > with it in the core and hand waving some future use that we don't plan
> > > to implement right now seems like the wrong direction.
> > 
> > Well, I think we must agree to disagree here; I think treating groups
> > as identifiable objects is worthwhile.  That said, I am looking for
> > ways to whittle down the overhead when they're not in use.
> > 
> > > >  * Use of sysfs attributes to control group permission is probably a
> > > >mistake.  Although it seems a bit odd, registering a chardev for
> > > >each group is probably better, because perms can be set from udev
> > > >rules, just like everything else.
> > > 
> > > I agree, this is a horrible mistake.  Reinventing file permissions via
> > > sysfs is bound to be broken and doesn't account for selinux permissions.
> > > Again, I know you don't like aspects of the vfio group management, but
> > > it gets this right imho.
> > 
> > Yeah.  I came up with this because I was trying to avoid registering a
> > device whose only purpose was to act as a permissioned "handle" on the
> > group.  But it is a better approach, despite that.  I just wanted to
> > send out the new patches for comment without waiting to do that
> > rework.
> 
> So we end up with a chardev created by the core, whose only purpose is
> setting the group access permissions for userspace usage, which only
> becomes useful with something like vfio?  It's an odd conflict that
> isolation groups would get involved with userspace permissions to access
> the group, but leave enforcement of isolation via iommu groups to the
> "binder" driver

Hm, perhaps.  I'll think about it.

> (where it seems like vfio is still going to need some
> kind of merge interface to share a domain between isolation groups).

That was always going to be the case, but I wish we could stop
thinking of it as the "merge" interface, since I think that term is
distorting thinking about how the interface works.

For example, I think opening a new domain, then adding / removing
groups provides a much cleaner model than "merging" groups without a
separate handle on the iommu domain we're building.

> Is this same chardev going to be a generic conduit for
> read/write/mmap/ioctl to the "binder" driver or does vfio need to create
> its own chardev for that?

Right, I was thinking that the binder could supply its own fops or
something for the group chardev once the group is bound.
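
i.e. the group chardev would just trampoline into whatever the binder
registered, along these lines (sketch; the isolation_group/binder field
names are guesses at the RFC structures, not the actual patches):

static long group_chardev_ioctl(struct file *filp, unsigned int cmd,
                                unsigned long arg)
{
        struct isolation_group *group = filp->private_data;

        /* nothing bound yet: the group chardev has no behaviour of its own */
        if (!group->binder || !group->binder->fops->unlocked_ioctl)
                return -ENOTTY;

        /* forward to the binder's own implementation (vfio, etc.) */
        return group->binder->fops->unlocked_ioctl(filp, cmd, arg);
}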

>  In the former case, are we ok with a chardev
> that has an entirely modular API behind it, or maybe you're planning to
> define some basic API infrastructure, in which case this starts smelling
> like implementing vfio in the core.

I think it can work, but I do need to look closer.

>  In the latter case (isolation
> chardev + vfio chardev) coordinating permissions sounds like a mess.

Absolutely.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _wa

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread Alex Williamson
On Fri, 2011-12-16 at 12:40 +1100, David Gibson wrote:
> On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
> > On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
> > > Here's the second spin of my preferred approach to handling grouping
> > > of devices for safe assignment to guests.
> > > 
> > > Changes since v1:
> > >  * Many name changes and file moves for improved consistency
> > >  * Bugfixes and cleanups
> > >  * The interface to the next layer up is considerably fleshed out,
> > >although it still needs work.
> > >  * Example initialization of groups for p5ioc2 and p7ioc.
> > > 
> > > TODO:
> > >  * Need sample initialization of groups for intel and/or amd iommus
> > 
> > I think this very well might impose significant bloat for those
> > implementations.  On POWER you typically don't have singleton groups,
> > while it's the norm on x86.  I don't know that either intel or amd iommu
> > code have existing structures that they can simply tack the group
> > pointer to.
> 
> Actually, I think they can probably just use the group pointer in the
> struct device.  Each PCI function will typically allocate a new group
> and put the pointer in the struct device and nowhere else.  Devices
> hidden under bridges copy the pointer from the bridge parent instead.
> I will have to check the unplug path to ensure we can manage the group
> lifetime properly, of course.
> 
> >  Again, this is one of the reasons that I think the current
> > vfio implementation is the right starting point.  We keep groups within
> > vfio, imposing zero overhead for systems not making use of it and only
> > require iommu drivers to implement a trivial function to opt-in.  As we
> > start to make groups more pervasive in the dma layer, independent of
> > userspace driver exposure, we can offload pieces to the core.  Starting
> > with it in the core and hand waving some future use that we don't plan
> > to implement right now seems like the wrong direction.
> 
> Well, I think we must agree to disagree here; I think treating groups
> as identifiable objects is worthwhile.  That said, I am looking for
> ways to whittle down the overhead when they're not in use.
> 
> > >  * Use of sysfs attributes to control group permission is probably a
> > >mistake.  Although it seems a bit odd, registering a chardev for
> > >each group is probably better, because perms can be set from udev
> > >rules, just like everything else.
> > 
> > I agree, this is a horrible mistake.  Reinventing file permissions via
> > sysfs is bound to be broken and doesn't account for selinux permissions.
> > Again, I know you don't like aspects of the vfio group management, but
> > it gets this right imho.
> 
> Yeah.  I came up with this because I was trying to avoid registering a
> device whose only purpose was to act as a permissioned "handle" on the
> group.  But it is a better approach, despite that.  I just wanted to
> send out the new patches for comment without waiting to do that
> rework.

So we end up with a chardev created by the core, whose only purpose is
setting the group access permissions for userspace usage, which only
becomes useful with something like vfio?  It's an odd conflict that
isolation groups would get involved with userspace permissions to access
the group, but leave enforcement of isolation via iommu groups to the
"binder" driver (where it seems like vfio is still going to need some
kind of merge interface to share a domain between isolation groups).

Is this same chardev going to be a generic conduit for
read/write/mmap/ioctl to the "binder" driver or does vfio need to create
its own chardev for that?  In the former case, are we ok with a chardev
that has an entirely modular API behind it, or maybe you're planning to
define some basic API infrastructure, in which case this starts smelling
like implementing vfio in the core.  In the latter case (isolation
chardev + vfio chardev) coordinating permissions sounds like a mess.

Alex




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread David Gibson
On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
> On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
> > Here's the second spin of my preferred approach to handling grouping
> > of devices for safe assignment to guests.
> > 
> > Changes since v1:
> >  * Many name changes and file moves for improved consistency
> >  * Bugfixes and cleanups
> >  * The interface to the next layer up is considerably fleshed out,
> >although it still needs work.
> >  * Example initialization of groups for p5ioc2 and p7ioc.
> > 
> > TODO:
> >  * Need sample initialization of groups for intel and/or amd iommus
> 
> I think this very well might impose significant bloat for those
> implementations.  On POWER you typically don't have singleton groups,
> while it's the norm on x86.  I don't know that either intel or amd iommu
> code have existing structures that they can simply tack the group
> pointer to.

Actually, I think they can probably just use the group pointer in the
struct device.  Each PCI function will typically allocate a new group
and put the pointer in the struct device and nowhere else.  Devices
hidden under bridges copy the pointer from the bridge parent instead.
I will have to check the unplug path to ensure we can manage the group
lifetime properly, of course.
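
To illustrate the intel case, something roughly like the sketch below
should do.  The isolation_* names and the isolation_group pointer in
struct device are placeholders rather than any final API from this
series; pci_find_upstream_pcie_bridge() is the helper intel-iommu
already uses:

#include <linux/err.h>
#include <linux/pci.h>

/* Placeholder declarations, standing in for the real isolation API: */
struct isolation_group;
extern struct isolation_group *isolation_create_group(void);
extern int isolation_group_add_dev(struct isolation_group *group,
                                   struct device *dev);

static int intel_iommu_init_device_group(struct pci_dev *pdev)
{
        struct pci_dev *bridge = pci_find_upstream_pcie_bridge(pdev);
        struct isolation_group *group = NULL;

        /* A function hidden behind a PCIe-to-PCI bridge is not isolated
         * from its siblings, so it shares the bridge's group (assuming
         * the pointer lives in struct device under this name). */
        if (bridge)
                group = bridge->dev.isolation_group;

        /* The common x86 case: the function is fully isolated and gets
         * a singleton group of its own (assuming ERR_PTR() on failure). */
        if (!group) {
                group = isolation_create_group();
                if (IS_ERR(group))
                        return PTR_ERR(group);
        }

        /* Sets pdev->dev.isolation_group and nothing else. */
        return isolation_group_add_dev(group, &pdev->dev);
}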

>  Again, this is one of the reasons that I think the current
> vfio implementation is the right starting point.  We keep groups within
> vfio, imposing zero overhead for systems not making use of it and only
> require iommu drivers to implement a trivial function to opt-in.  As we
> start to make groups more pervasive in the dma layer, independent of
> userspace driver exposure, we can offload pieces to the core.  Starting
> with it in the core and hand waving some future use that we don't plan
> to implement right now seems like the wrong direction.

Well, I think we must agree to disagree here; I think treating groups
as identifiable objects is worthwhile.  That said, I am looking for
ways to whittle down the overhead when they're not in use.

> >  * Use of sysfs attributes to control group permission is probably a
> >   mistake.  Although it seems a bit odd, registering a chardev for
> >   each group is probably better, because perms can be set from udev
> >   rules, just like everything else.
> 
> I agree, this is a horrible mistake.  Reinventing file permissions via
> sysfs is bound to be broken and doesn't account for selinux permissions.
> Again, I know you don't like aspects of the vfio group management, but
> it gets this right imho.

Yeah.  I came up with this because I was trying to avoid registering a
device whose only purpose was to act as a permissioned "handle" on the
group.  But it is a better approach, despite that.  I just wanted to
send out the new patches for comment without waiting to do that
rework.
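
For the rework, I'm thinking of something like the sketch below: a
chardev per group, registered through the normal driver core so that
udev sees the node and can apply permissions (and selinux labels) like
any other device.  Only cdev_init()/cdev_add()/device_create() are the
real kernel API here; the isolation_* names and struct fields are
placeholders:

#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/err.h>
#include <linux/fs.h>

struct isolation_group {                /* placeholder fields only */
        struct cdev cdev;
        struct device *dev;
};

static struct class *isolation_class;   /* created at core init */
static dev_t isolation_devt;            /* base from alloc_chrdev_region() */

static int isolation_group_create_chardev(struct isolation_group *group,
                                          int group_num,
                                          const struct file_operations *fops)
{
        dev_t devt = MKDEV(MAJOR(isolation_devt), group_num);
        int ret;

        cdev_init(&group->cdev, fops);
        ret = cdev_add(&group->cdev, devt, 1);
        if (ret)
                return ret;

        /* device_create() fires the uevent, so a plain udev rule (or
         * selinux policy) can set owner/group/mode on the node. */
        group->dev = device_create(isolation_class, NULL, devt, group,
                                   "isolation%d", group_num);
        if (IS_ERR(group->dev)) {
                cdev_del(&group->cdev);
                return PTR_ERR(group->dev);
        }

        return 0;
}

The node itself is then the only handle userspace needs on the group;
everything behind it is up to the binder.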

> >  * Need more details of what the binder structure will need to
> >   contain.
> >  * Handle complete removal of groups.
> >  * Clarify what will need to happen on the hot unplug path.
> 
> We're still also removing devices from the driver model, which means
> drivers like vfio need to re-implement a lot of the driver core for
> driving each individual device in the group,

Ah, so, that's a bit I've yet to flesh out.  The subtle distinction is
that we prevent driver _matching_, not driver _binding_.  It's
intentionally still possible to explicitly bind drivers to devices in
the group by bypassing the automatic match mechanism.

I'm intending that when a group is bound, the binder struct will
(optionally) specify a driver (or several, for different bus types)
which will be "force bound" to all the devices in the group.
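
The binder structure details are still a TODO, but roughly what I have
in mind is below; the field names are illustrative only:

struct isolation_binder {
        const char *name;

        /* Driver to force-bind to PCI members of the group once the
         * group is bound; NULL leaves them unbound.  Other bus types
         * would get their own slot or a per-bus-type table. */
        struct pci_driver *pci_driver;

        /* Callbacks so the binder can react as devices join or leave
         * the group, e.g. on hot (un)plug. */
        int (*add_dev)(struct isolation_binder *binder, struct device *dev);
        void (*remove_dev)(struct isolation_binder *binder,
                           struct device *dev);
};

Binding the group would then just walk the member devices and explicitly
bind the named driver to each of them, rather than re-running normal
driver matching.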

> and as you indicate, it's
> unclear what happens in the hotplug path

It's clear enough in concept.  I just have to work out exactly where
we need to call d_i_dev_remove(), and whether we'll need some sort of
callback to the bridge / iommu code.

> and I wonder if things like
> suspend/resume will also require non-standard support.  I really prefer
> attaching individual bus drivers to devices using the standard
> bind/unbind mechanisms.  I have a hard time seeing how this is an
> improvement over vfio.  Thanks,
> 
> Alex
> 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread Alex Williamson
On Thu, 2011-12-15 at 11:05 -0700, Alex Williamson wrote:
> On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
> > Here's the second spin of my preferred approach to handling grouping
> > of devices for safe assignment to guests.
> > 
> > Changes since v1:
> >  * Many name changes and file moves for improved consistency
> >  * Bugfixes and cleanups
> >  * The interface to the next layer up is considerably fleshed out,
> >   although it still needs work.
> >  * Example initialization of groups for p5ioc2 and p7ioc.
> > 
> > TODO:
> >  * Need sample initialization of groups for intel and/or amd iommus
> 
> I think this very well might impose significant bloat for those
> implementations.  On POWER you typically don't have singleton groups,
> while they're the norm on x86.  I don't know that either the intel or amd
> iommu code has existing structures that they can simply tack the group
> pointer onto.  Again, this is one of the reasons that I think the current
> vfio implementation is the right starting point.  We keep groups within
> vfio, imposing zero overhead for systems not making use of it and only
> require iommu drivers to implement a trivial function to opt-in.  As we
> start to make groups more pervasive in the dma layer, independent of
> userspace driver exposure, we can offload pieces to the core.  Starting
> with it in the core and hand waving some future use that we don't plan
> to implement right now seems like the wrong direction.
> 
> >  * Use of sysfs attributes to control group permission is probably a
> >   mistake.  Although it seems a bit odd, registering a chardev for
> >   each group is probably better, because perms can be set from udev
> >   rules, just like everything else.
> 
> I agree, this is a horrible mistake.  Reinventing file permissions via
> sysfs is bound to be broken and doesn't account for selinux permissions.
> Again, I know you don't like aspects of the vfio group management, but
> it gets this right imho.
> 
> >  * Need more details of what the binder structure will need to
> >   contain.
> >  * Handle complete removal of groups.
> >  * Clarify what will need to happen on the hot unplug path.
> 
> We're still also removing devices from the driver model, which means
> drivers like vfio need to re-implement a lot of the driver core for
> driving each individual device in the group, and as you indicate, it's
> unclear what happens in the hotplug path and I wonder if things like
> suspend/resume will also require non-standard support.  I really prefer
> attaching individual bus drivers to devices using the standard
> bind/unbind mechanisms.  I have a hard time seeing how this is an
> improvement over vfio.  Thanks,

I should also mention that I just pushed a new version of the vfio
series out to github; it can be found here:

git://github.com/awilliam/linux-vfio.git vfio-next-20111215

This fixes many bugs, including PCI config space access sizes and the
todo item of actually preventing user access to the MSI-X table area.  I
think the vfio-pci driver is much closer to being ready to submit now.
I think I've addressed all the previous review comments.  Still pending
are a documentation refresh and a decision about what and how we
expose iommu information.

As usual, the matching qemu tree is here:

git://github.com/awilliam/qemu-vfio.git vfio-ng

I've tested this with an Intel 82576 (both PF and VF), Broadcom BCM5755,
Intel HD Audio controller, and legacy PCI SB Live.  Thanks,

Alex




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread Alex Williamson
On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
> Here's the second spin of my preferred approach to handling grouping
> of devices for safe assignment to guests.
> 
> Changes since v1:
>  * Many name changes and file moves for improved consistency
>  * Bugfixes and cleanups
>  * The interface to the next layer up is considerably fleshed out,
>   although it still needs work.
>  * Example initialization of groups for p5ioc2 and p7ioc.
> 
> TODO:
>  * Need sample initialization of groups for intel and/or amd iommus

I think this very well might impose significant bloat for those
implementations.  On POWER you typically don't have singleton groups,
while they're the norm on x86.  I don't know that either the intel or amd
iommu code has existing structures that they can simply tack the group
pointer onto.  Again, this is one of the reasons that I think the current
vfio implementation is the right starting point.  We keep groups within
vfio, imposing zero overhead for systems not making use of it and only
require iommu drivers to implement a trivial function to opt-in.  As we
start to make groups more pervasive in the dma layer, independent of
userspace driver exposure, we can offload pieces to the core.  Starting
with it in the core and hand waving some future use that we don't plan
to implement right now seems like the wrong direction.

>  * Use of sysfs attributes to control group permission is probably a
>   mistake.  Although it seems a bit odd, registering a chardev for
>   each group is probably better, because perms can be set from udev
>   rules, just like everything else.

I agree, this is a horrible mistake.  Reinventing file permissions via
sysfs is bound to be broken and doesn't account for selinux permissions.
Again, I know you don't like aspects of the vfio group management, but
it gets this right imho.

>  * Need more details of what the binder structure will need to
>   contain.
>  * Handle complete removal of groups.
>  * Clarify what will need to happen on the hot unplug path.

We're still also removing devices from the driver model, which means
drivers like vfio need to re-implement a lot of the driver core for
driving each individual device in the group, and as you indicate, it's
unclear what happens in the hotplug path and I wonder if things like
suspend/resume will also require non-standard support.  I really prefer
attaching individual bus drivers to devices using the standard
bind/unbind mechanisms.  I have a hard time seeing how this is an
improvement over vfio.  Thanks,

Alex




[Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-14 Thread David Gibson
Here's the second spin of my preferred approach to handling grouping
of devices for safe assignment to guests.

Changes since v1:
 * Many name changes and file moves for improved consistency
 * Bugfixes and cleanups
 * The interface to the next layer up is considerably fleshed out,
   although it still needs work.
 * Example initialization of groups for p5ioc2 and p7ioc.

TODO:
 * Need sample initialization of groups for intel and/or amd iommus
 * Use of sysfs attributes to control group permission is probably a
   mistake.  Although it seems a bit odd, registering a chardev for
   each group is probably better, because perms can be set from udev
   rules, just like everything else.
 * Need more details of what the binder structure will need to
   contain.
 * Handle complete removal of groups.
 * Clarify what will need to happen on the hot unplug path.