Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2012-01-30 Thread David Gibson
On Wed, Jan 25, 2012 at 04:44:53PM -0700, Alex Williamson wrote:
 On Wed, 2012-01-25 at 14:13 +1100, David Gibson wrote:
  On Tue, Dec 20, 2011 at 09:30:37PM -0700, Alex Williamson wrote:
   On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
 On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
  Well.. that's not where it is in Alex's code either.  The iommu layer
  (to the extent that there is such a layer) supplies the group info,
  but the group management is in vfio, not the iommu layer.  With mine
  it is in the driver core because the struct device seemed the logical
  place for the group id.
 
 Okay, seems we have different ideas of what the 'grouping code' is. I
 talked about the group enumeration code only. But group handling code is
 certainly important to some degree too. But before we argue about the
 right place of the code we should agree on the semantics such code
 should provide.
 
 For me it is fine when the code is in VFIO for now, since VFIO is the
 only user at the moment. When more users pop up we can easily move it
 out to somewhere else. But the semantics influence the interface to
 user-space too, so it is more important now. It splits up into a number
 of sub problems:
 
   1) How userspace detects the device-group relationship?
   2) Do we want group-binding/unbinding to device drivers?
   3) Group attach/detach to iommu-domains?
   4) What to do with hot-added devices?
 
 For 1) I think the current solution with the iommu_group file is fine.
 It is somewhat expensive for user-space to figure out the per-group
 device-sets, but that is a one-time effort so it doesn't really matter.
 Probably we can rename 'iommu_group' to 'isolation_group' or
 something.

Hrm.  Alex's group code also provides no in-kernel way to enumerate a
group, short of walking every device in the system.  And it provides
no way to attach information to a group.  It just seems foolish to me
to have this concept without some kind of in-kernel handle on it, and
   
   Who else needs to enumerate groups right now?  Who else needs to attach
    data to a group?  We seem to be caught in this loop of arguing that we
   need driver core based group management, but we don't have any plans to
   make use of it, so it just bloats the kernel for most of the users that
   don't care about it.
  
  So, Ben and I discussed this with David Woodhouse during
  linux.conf.au.  He does want to update the core iommu_ops and dma_ops
  handling to be group aware, and is willing to do the work for that.
  So I will be doing another spin of my isolation code as a basis for
  that work of his.  So there's our other user.
 
 Hmm...
 
if you're managing the in-kernel representation you might as well
expose it to userspace there as well.
   
    Unfortunately this is just the start of peeling back layers of the onion.
   We manage groups in the driver core, so the driver core should expose
   them to userspace.  The driver core exposes them to userspace, so now it
   needs to manage permissions for userspace.
  
  That doesn't necessarily follow.  The current draft has the core group
  code export character devices on which permissions are managed, but
  I'm also considering options where it only exports sysfs and something
  else does the character device and permissions.
 
 Let's start to define it then.  A big problem I have with the isolation
 infrastructure you're proposing is that it doesn't have well defined
 bounds or interfaces.  It pulls devices out of the normal driver
 model,

It doesn't pull them out of the driver model.  The struct devices are
all there just like usual, it's just a flag which inhibits driver binding.

 and even goes so far as to start exposing chardevs for groups to
 userspace.

I'm planning to split the patch so the character devices (and maybe
even the concept of binders) aren't in the core group part, which
should be all dwmw2 needs.

  The former feels like completely the wrong approach and the
 latter oversteps the bounds of a core interface for a very niche usage,
 IMO.
 
 Your argument for making groups more pervasive is that the iommu drivers
 need to know about them already and we should make use of groups in the
 dma_ops and iommu_ops interfaces.  Fine, that seems reasonable and I
 assume is what Woodhouse has agreed to work on.  I don't however see
 that as an argument for creating a group driver class or isolation
 API.

No, it's not, hence the split.

 In fact, we could probably layer a fairly similar version of vfio on top
 of a group-enlightened driver core with minimal changes.

Yes, you could.

Then we add permissions and
   now we need to provide group access, then we need a channel to an actual
   userspace device 

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2012-01-25 Thread Alex Williamson
On Wed, 2012-01-25 at 14:13 +1100, David Gibson wrote:
 On Tue, Dec 20, 2011 at 09:30:37PM -0700, Alex Williamson wrote:
  On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
   On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
 Well.. that's not where it is in Alex's code either.  The iommu layer
 (to the extent that there is such a layer) supplies the group info,
 but the group management is in vfio, not the iommu layer.  With mine
 it is in the driver core because the struct device seemed the logical
 place for the group id.

Okay, seems we have different ideas of what the 'grouping code' is. I
talked about the group enumeration code only. But group handling code is
certainly important to some degree too. But before we argue about the
right place of the code we should agree on the semantics such code
should provide.

For me it is fine when the code is in VFIO for now, since VFIO is the
only user at the moment. When more users pop up we can easily move it
out to somewhere else. But the semantics influence the interface to
user-space too, so it is more important now. It splits up into a number
of sub problems:

1) How userspace detects the device-group relationship?
2) Do we want group-binding/unbinding to device drivers?
3) Group attach/detach to iommu-domains?
4) What to do with hot-added devices?

For 1) I think the current solution with the iommu_group file is fine.
It is somewhat expensive for user-space to figure out the per-group
device-sets, but that is a one-time effort so it doesn't really matter.
Probably we can rename 'iommu_group' to 'isolation_group' or
something.
   
   Hrm.  Alex's group code also provides no in-kernel way to enumerate a
   group, short of walking every device in the system.  And it provides
   no way to attach information to a group.  It just seems foolish to me
   to have this concept without some kind of in-kernel handle on it, and
  
  Who else needs to enumerate groups right now?  Who else needs to attach
  data to a group?  We seem to be caught in this loop of arguing that we
  need driver core based group management, but we don't have any plans to
  make use of it, so it just bloats the kernel for most of the users that
  don't care about it.
 
 So, Ben and I discussed this with David Woodhouse during
 linux.conf.au.  He does want to update the core iommu_ops and dma_ops
 handling to be group aware, and is willing to do the work for that.
 So I will be doing another spin of my isolation code as a basis for
 that work of his.  So there's our other user.

Hmm...

   if you're managing the in-kernel representation you might as well
   expose it to userspace there as well.
  
  Unfortunately this is just the start of peeling back layers of the onion.
  We manage groups in the driver core, so the driver core should expose
  them to userspace.  The driver core exposes them to userspace, so now it
  needs to manage permissions for userspace.
 
 That doesn't necessarily follow.  The current draft has the core group
 code export character devices on which permissions are managed, but
 I'm also considering options where it only exports sysfs and something
 else does the character device and permissions.

Let's start to define it then.  A big problem I have with the isolation
infrastructure you're proposing is that it doesn't have well defined
bounds or interfaces.  It pulls devices out of the normal driver model,
and even goes so far as to start exposing chardevs for groups to
userspace.  The former feels like completely the wrong approach and the
latter oversteps the bounds of a core interface for a very niche usage,
IMO.

Your argument for making groups more pervasive is that the iommu drivers
need to know about them already and we should make use of groups in the
dma_ops and iommu_ops interfaces.  Fine, that seems reasonable and I
assume is what Woodhouse has agreed to work on.  I don't however see
that as an argument for creating a group driver class or isolation API.
In fact, we could probably layer a fairly similar version of vfio on top
of a group-enlightened driver core with minimal changes.

   Then we add permissions and
  now we need to provide group access, then we need a channel to an actual
  userspace device driver, zing! we add a whole API there,
 
 And the other option I was looking at had the core providing the char
 device but having its fops get passed straight through to the binder.

I'm not a fan of that idea.  Core driver code should not be creating
chardevs and I doubt anyone is onboard with the idea of a chardev with
an arbitrary API bound behind it.

  then we need
  group driver binding, then we need group device driver binding, blam!
  another API, then we need...  I don't see a clear end marker that
  doesn't continue to bloat 

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2012-01-24 Thread David Gibson
On Tue, Dec 20, 2011 at 09:30:37PM -0700, Alex Williamson wrote:
 On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
  On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
   On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
Well.. that's not where it is in Alex's code either.  The iommu layer
(to the extent that there is such a layer) supplies the group info,
but the group management is in vfio, not the iommu layer.  With mine
it is in the driver core because the struct device seemed the logical
place for the group id.
   
   Okay, seems we have different ideas of what the 'grouping code' is. I
   talked about the group enumeration code only. But group handling code is
   certainly important to some degree too. But before we argue about the
   right place of the code we should agree on the semantics such code
   should provide.
   
   For me it is fine when the code is in VFIO for now, since VFIO is the
   only user at the moment. When more users pop up we can easily move it
   out to somewhere else. But the semantics influence the interface to
   user-space too, so it is more important now. It splits up into a number
   of sub problems:
   
 1) How userspace detects the device-group relationship?
 2) Do we want group-binding/unbinding to device drivers?
 3) Group attach/detach to iommu-domains?
 4) What to do with hot-added devices?
   
   For 1) I think the current solution with the iommu_group file is fine.
   It is somewhat expensive for user-space to figure out the per-group
   device-sets, but that is a one-time effort so it doesn't really matter.
   Probably we can rename 'iommu_group' to 'isolation_group' or
   something.
  
  Hrm.  Alex's group code also provides no in-kernel way to enumerate a
  group, short of walking every device in the system.  And it provides
  no way to attach information to a group.  It just seems foolish to me
  to have this concept without some kind of in-kernel handle on it, and
 
 Who else needs to enumerate groups right now?  Who else needs to attach
 data to a group?  We seem to be caught in this loop of arguing that we
 need driver core based group management, but we don't have any plans to
 make use of it, so it just bloats the kernel for most of the users that
 don't care about it.

So, Ben and I discussed this with David Woodhouse during
linux.conf.au.  He does want to update the core iommu_ops and dma_ops
handling to be group aware, and is willing to do the work for that.
So I will be doing another spin of my isolation code as a basis for
that work of his.  So there's our other user.

  if you're managing the in-kernel representation you might as well
  expose it to userspace there as well.
 
 Unfortunately this is just the start of peeling back layers of the onion.
 We manage groups in the driver core, so the driver core should expose
 them to userspace.  The driver core exposes them to userspace, so now it
 needs to manage permissions for userspace.

That doesn't necessarily follow.  The current draft has the core group
code export character devices on which permissions are managed, but
I'm also considering options where it only exports sysfs and something
else does the character device and permissions.

  Then we add permissions and
 now we need to provide group access, then we need a channel to an actual
 userspace device driver, zing! we add a whole API there,

And the other option I was looking at had the core providing the char
device but having its fops get passed straight through to the binder.

 then we need
 group driver binding, then we need group device driver binding, blam!
 another API, then we need...  I don't see a clear end marker that
 doesn't continue to bloat the core and add functionality that nobody
 else needs and we don't even have plans of integrating more pervasively.
 This appears to end with 80 to 90% of the vfio core code moving into the
 driver core.

I don't agree.  Working out the right boundary isn't totally obvious,
certainly, but that doesn't mean a reasonable boundary can't be found.

 
   Regarding 2), I think providing user-space a way to unbind groups of
   devices from their drivers is a horrible idea.
  
  Well, I'm not wed to unbinding all the drivers at once.  But what I
  *do* think is essential is that we can atomically switch off automatic
  driver matching for the whole group.  Without that nothing really
  stops a driver reattaching to the things you've unbound, so you end up
  bailing a leaky boat.
 
 Huh?  There is no issue with removing a device from one driver and
 attaching it to another.  This happens all the time.  If you're talking
 about hotplug, all we have to do is sit on the bus notifier chain and we
 get called when devices are added, before the driver has a chance to
 attach.  We can then force a vfio driver to attach when needed.

But what is the transition point at which you know force attaching a
vfio driver is the right thing?  My 

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-21 Thread Joerg Roedel
On Wed, Dec 21, 2011 at 02:32:35PM +1100, David Gibson wrote:
 On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
  For 1) I think the current solution with the iommu_group file is fine.
  It is somewhat expensive for user-space to figure out the per-group
  device-sets, but that is a one-time effort so it doesn't really matter.
  Probably we can rename 'iommu_group' to 'isolation_group' or
  something.
 
 Hrm.  Alex's group code also provides no in-kernel way to enumerate a
 group, short of walking every device in the system.  And it provides
 no way to attach information to a group.  It just seems foolish to me
 to have this concept without some kind of in-kernel handle on it, and
 if you're managing the in-kernel representation you might as well
 expose it to userspace there as well.

I agree that we should have an in-kernel handle for groups in the
future. This would be generic code the iommu-drivers can use to manage
their groups and where the pci-quirks can chime in. But we don't need
that immediately, let's add this incrementally.

From the implementation side I would prefer to introduce a dev->iommu
pointer (a generalization of the dev->archdata.iommu pointer we have on
x86, ia64 and arm already) and link this to the handle of the group.
This makes it also easy to walk all devices in a group within the
kernel.

I still disagree that we need to expose this to user-space in any other
way than what we currently have. It is just not worth the effort when it
is used that rarely. It is perfectly fine if user-space has to scan the
sysfs device tree to build its own group data structures.
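
As a purely illustrative sketch of what such an in-kernel handle could
look like (none of these names come from a posted patch; the proposed
dev->iommu pointer would reference the per-device info):

    #include <linux/device.h>
    #include <linux/list.h>
    #include <linux/mutex.h>

    /* Hypothetical generic group handle the iommu-drivers would fill in. */
    struct device_group {
            unsigned int            id;             /* number shown in sysfs */
            struct list_head        devices;        /* all members of the group */
            struct mutex            lock;
            void                    *iommu_data;    /* per-group iommu-driver data */
    };

    /* Hypothetical per-device info hung off the proposed dev->iommu pointer. */
    struct device_iommu_info {
            struct device           *dev;
            struct device_group     *group;
            struct list_head        group_node;     /* entry in group->devices */
    };

    /* Walk every device in a group, e.g. from pci-quirks or vfio. */
    #define for_each_group_device(info, grp) \
            list_for_each_entry(info, &(grp)->devices, group_node)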

  Regarding 2), I think providing user-space a way to unbind groups of
  devices from their drivers is a horrible idea.
 
 Well, I'm not wed to unbinding all the drivers at once.  But what I
 *do* think is essential is that we can atomically switch off automatic
 driver matching for the whole group.  Without that nothing really
 stops a driver reattaching to the things you've unbound, so you end up
 bailing a leaky boat.

Okay, makes sense for hotadd to have this. What we need from the
driver-core here is a method to disable driver probing for a device. The
iommu-layer can then call this method on the DEVICE_ADD notifier if
required. This also can be done incrementally to the current VFIO
implementation.
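
Something along these lines, as a sketch only: device_disable_probe()
and group_owned_by_userspace() are made-up stand-ins for the
driver-core method and the ownership check described above, not
existing API.

    #include <linux/device.h>
    #include <linux/notifier.h>

    /* Assumed helpers, declared only to make the sketch hang together. */
    static bool group_owned_by_userspace(struct device *dev);
    static void device_disable_probe(struct device *dev);

    static int iommu_group_bus_notify(struct notifier_block *nb,
                                      unsigned long action, void *data)
    {
            struct device *dev = data;

            if (action != BUS_NOTIFY_ADD_DEVICE)
                    return NOTIFY_DONE;

            /* Hot-added device whose group is handed out: keep drivers away. */
            if (group_owned_by_userspace(dev))
                    device_disable_probe(dev);

            return NOTIFY_OK;
    }

    static struct notifier_block iommu_group_nb = {
            .notifier_call = iommu_group_bus_notify,
    };

    /* Registered once per bus, e.g.
     * bus_register_notifier(&pci_bus_type, &iommu_group_nb); */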

 Ok, that's not the usage model I had in mind.  What I'm thinking here
 is that the admin removes groups that will be used in guests from the
 host kernel (probably via boot-time scripts).  The guests will pick
 them up later, so that when a guest quits then restarts, we don't have
 devices appearing transiently in the host.

This 'remove-groups-from-the-host' thing can easily be done (also at
boot time) by just binding the devices in question to the VFIO driver.
And the devices can stay with VFIO for the lifetime of the system (or
until the host wants to make use of them). I don't think we need any
additional logic for that.

  For the remaining two questions I think the concept of a default-domain
  is helpful.  The default-domain is a per-group domain which is created
  by the iommu-driver at initialization time. It is the domain each device
  is assigned to when it is not assigned to any other domain (which means
  that each device/group is always attached to a domain). The default
  domain will be used by the DMA-API layer. This implicitly means, that a
  device which is not in the default-domain can't be used with the
  dma-api. The dma_supported() function will return false for those
  devices.
 
 But.. by definition every device in the group must belong to the same
 domain.  So how is this default domain in any way different from
 the current domain?

The default domain is created by the iommu-drivers for all groups it
finds on the system and the devices are automatically assigned to it at
iommu-driver initialization time. This domain will be used for mapping
dma-api calls.
The current domain, on the other hand, can also be a domain created by
VFIO (or KVM as it is today) for mapping the device dma space into a
guest.

 The domain_{attach,detach} functions absolutely should be group based
 not device based.  That's what it means to be a group.

I disagree with that. This would force all iommu-drivers that have no
group-concept (basically all current ARM iommu implementations) to work
on groups too. That makes the iommu-api much harder to use.

We could do instead:

1) Force users of the iommu-api to bind each device in the group
   individually.
2) Attach all devices in the group automatically to the same
   domain if a single device is attached to it.

I have no strong opinion which way we go, but have a slight favour for
way 2. To make it harder to misuse, we can require that way 2 only works
if the group is in its default-domain and there are no pending
dma-allocations.
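
A minimal sketch of way 2, using illustrative stand-ins only
(iommu_group_of(), group_in_default_domain(), group_has_dma_allocations()
and for_each_group_device() are made up; iommu_attach_device() is the
existing per-device iommu-api call):

    static int iommu_attach_device_grouped(struct iommu_domain *domain,
                                           struct device *dev)
    {
            struct device_group *group = iommu_group_of(dev);
            struct device_iommu_info *info;
            int ret;

            /* Only allow the switch from a clean default-domain state. */
            if (!group_in_default_domain(group) ||
                group_has_dma_allocations(group))
                    return -EBUSY;

            for_each_group_device(info, group) {
                    ret = iommu_attach_device(domain, info->dev);
                    if (ret)
                            return ret;  /* a real version would unwind the partial attach */
            }
            return 0;
    }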

  Question 4) is also solved with the default-domain concept. A hot-added
  device is put in the domain of its 

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-20 Thread David Gibson
On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
 On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
  Well.. that's not where it is in Alex's code either.  The iommu layer
  (to the extent that there is such a layer) supplies the group info,
  but the group management is in vfio, not the iommu layer.  With mine
  it is in the driver core because the struct device seemed the logical
  place for the group id.
 
 Okay, seems we have different ideas of what the 'grouping code' is. I
 talked about the group enumeration code only. But group handling code is
 certainly important to some degree too. But before we argue about the
 right place of the code we should agree on the semantics such code
 should provide.
 
 For me it is fine when the code is in VFIO for now, since VFIO is the
 only user at the moment. When more users pop up we can easily move it
 out to somewhere else. But the semantics influence the interface to
 user-space too, so it is more important now. It splits up into a number
 of sub problems:
 
   1) How userspace detects the device-group relationship?
   2) Do we want group-binding/unbinding to device drivers?
   3) Group attach/detach to iommu-domains?
   4) What to do with hot-added devices?
 
 For 1) I think the current solution with the iommu_group file is fine.
 It is somewhat expensive for user-space to figure out the per-group
 device-sets, but that is a one-time effort so it doesn't really matter.
 Probably we can rename 'iommu_group' to 'isolation_group' or
 something.

Hrm.  Alex's group code also provides no in-kernel way to enumerate a
group, short of walking every device in the system.  And it provides
no way to attach information to a group.  It just seems foolish to me
to have this concept without some kind of in-kernel handle on it, and
if you're managing the in-kernel representation you might as well
expose it to userspace there as well.

 Regarding 2), I think providing user-space a way to unbind groups of
 devices from their drivers is a horrible idea.

Well, I'm not wed to unbinding all the drivers at once.  But what I
*do* think is essential is that we can atomically switch off automatic
driver matching for the whole group.  Without that nothing really
stops a driver reattaching to the things you've unbound, so you end up
bailing a leaky boat.

 It makes it too easy for
 the user to shoot himself in the foot. For example when the user wants
 to assign a network card to a guest, but that card is in the same group
 as the GPU and the screen goes blank when the guest is started.
 Requiring devices to be unbound one-by-one is better because this way
 the user always needs to know what he is doing.

Ok, that's not the usage model I had in mind.  What I'm thinking here
is that the admin removes groups that will be used in guests from the
host kernel (probably via boot-time scripts).  The guests will pick
them up later, so that when a guest quits then restarts, we don't have
devices appearing transiently in the host.

 For the remaining two questions I think the concept of a default-domain
 is helpful.  The default-domain is a per-group domain which is created
 by the iommu-driver at initialization time. It is the domain each device
 is assigned to when it is not assigned to any other domain (which means
 that each device/group is always attached to a domain). The default
 domain will be used by the DMA-API layer. This implicitly means, that a
 device which is not in the default-domain can't be used with the
 dma-api. The dma_supported() function will return false for those
 devices.

But.. by definition every device in the group must belong to the same
domain.  So how is this default domain in any way different from
the current domain?

In addition, making dma_supported() return false doesn't seem like a strong enough
constraint.  With this a kernel driver which does not use DMA, or
which is initializing and hasn't yet hit a dma_supported() check could
be accessing a device which is in the same group as something a guest
is simultaneously accessing.  Since there's no DMA (on the kernel
side) we can't get DMA conflicts but there are other forms of
isolation that the group could be enforcing which would make that
unsafe. e.g. irqs from the two devices can't be reliably separated,
debug registers on one device let config space be altered to move it
on top of the other, one can cause a bus error which will mess up the
other.

 So what does this mean for point 3? I think we can implement attaching
 and detaching groups in the iommu-api. This interface is not exposed to
 userspace and can help VFIO and possible future users. The semantics are that
 domain_attach_group() only works when all devices in the group are in
 their default domain and domain_detach_group() puts them back into the
 default domain.

The domain_{attach,detach} functions absolutely should be group based
not device based.  That's what it means to be a group.

 Question 4) is also solved with the 

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-20 Thread Alex Williamson
On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
 On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
  On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
   Well.. that's not where it is in Alex's code either.  The iommu layer
   (to the extent that there is such a layer) supplies the group info,
   but the group management is in vfio, not the iommu layer.  With mine
   it is in the driver core because the struct device seemed the logical
   place for the group id.
  
  Okay, seems we have different ideas of what the 'grouping code' is. I
  talked about the group enumeration code only. But group handling code is
  certainly important to some degree too. But before we argue about the
  right place of the code we should agree on the semantics such code
  should provide.
  
  For me it is fine when the code is in VFIO for now, since VFIO is the
  only user at the moment. When more users pop up we can easily move it
  out to somewhere else. But the semantics influence the interface to
  user-space too, so it is more important now. It splits up into a number
  of sub problems:
  
  1) How userspace detects the device-group relationship?
  2) Do we want group-binding/unbinding to device drivers?
  3) Group attach/detach to iommu-domains?
  4) What to do with hot-added devices?
  
  For 1) I think the current solution with the iommu_group file is fine.
  It is somewhat expensive for user-space to figure out the per-group
  device-sets, but that is a one-time effort so it doesn't really matter.
  Probably we can rename 'iommu_group' to 'isolation_group' or
  something.
 
 Hrm.  Alex's group code also provides no in-kernel way to enumerate a
 group, short of walking every device in the system.  And it provides
 no way to attach information to a group.  It just seems foolish to me
 to have this concept without some kind of in-kernel handle on it, and

Who else needs to enumerate groups right now?  Who else needs to attach
data to a group?  We seem to be caught in this loop of arguing that we
need driver core based group management, but we don't have any plans to
make use of it, so it just bloats the kernel for most of the users that
don't care about it.

 if you're managing the in-kernel representation you might as well
 expose it to userspace there as well.

Unfortunately this is just the start of peeling back layers of the onion.
We manage groups in the driver core, so the driver core should expose
them to userspace.  The driver core exposes them to userspace, so now it
needs to manage permissions for userspace.  Then we add permissions and
now we need to provide group access, then we need a channel to an actual
userspace device driver, zing! we add a whole API there, then we need
group driver binding, then we need group device driver binding, blam!
another API, then we need...  I don't see a clear end marker that
doesn't continue to bloat the core and add functionality that nobody
else needs and we don't even have plans of integrating more pervasively.
This appears to end with 80 to 90% of the vfio core code moving into the
driver core.

  Regarding 2), I think providing user-space a way to unbind groups of
  devices from their drivers is a horrible idea.
 
 Well, I'm not wed to unbinding all the drivers at once.  But what I
 *do* think is essential is that we can atomically switch off automatic
 driver matching for the whole group.  Without that nothing really
 stops a driver reattaching to the things you've unbound, so you end up
 bailing a leaky boat.

Huh?  There is no issue with removing a device from one driver and
attaching it to another.  This happens all the time.  If you're talking
about hotplug, all we have to do is sit on the bus notifier chain and we
get called when devices are added, before the driver has a chance to
attach.  We can then force a vfio driver to attach when needed.  Hell,
we can just set dev->driver to something to prevent standard driver
probing.

  It makes it too easy for
  the user to shoot himself in the foot. For example when the user wants
  to assign a network card to a guest, but that card is in the same group
  as the GPU and the screen goes blank when the guest is started.
  Requiring devices to be unbound one-by-one is better because this way
  the user always needs to know what he is doing.
 
 Ok, that's not the usage model I had in mind.  What I'm thinking here
 is that the admin removes groups that will be used in guests from the
 host kernel (probably via boot-time scripts).  The guests will pick
 them up later, so that when a guest quits then restarts, we don't have
 devices appearing transiently in the host.

I don't think that model is dynamic enough for our existing use cases.
A user shouldn't need to decide at boot time which devices are going to
be used for what purpose.  It's entirely valid for a user to start up a
VM, decide they want more network performance and reassign a NIC from
host use to guest use.  When they 

Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-20 Thread Aaron Fabbri



On 12/20/11 8:30 PM, Alex Williamson alex.william...@redhat.com wrote:

 On Wed, 2011-12-21 at 14:32 +1100, David Gibson wrote:
 On Mon, Dec 19, 2011 at 04:41:56PM +0100, Joerg Roedel wrote:
 On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
snip
 
 Well, the iommu-api was designed for amd-vi and vt-d. But its concepts
 turn out to be more general and in no way x86-centric anymore.
 
 It's improving, but there are still plenty of x86isms there.
 
 Having worked on ia64 for a while, it's interesting to see this x86
 bashing from the other side.  Everyone is more than willing to make
 architecture neutral interfaces (jeez, look at the extent of the vfio
 reworks), but it's not fair to throw away interfaces as x86-centric if
 you're not pushing your requirements and making use of the code.
 
 It seems like we'd be better served today to start with the vfio code we
 have and let that be the catalyst to drive an iommu api that better
 serves non-x86.  I don't see how this group management tangent is really
 getting us anywhere.  Thanks,

I'd agree that an incremental approach here is key.  VFIO has already seen a
ton of rework to accommodate all architectures.  Let's not bite off a bunch
of these other subsystem rewrites in the same chunk as our VFIO effort.

-Aaron




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread Joerg Roedel
On Mon, Dec 19, 2011 at 11:11:25AM +1100, David Gibson wrote:
 Well.. that's not where it is in Alex's code either.  The iommu layer
 (to the extent that there is such a layer) supplies the group info,
 but the group management is in vfio, not the iommu layer.  With mine
 it is in the driver core because the struct device seemed the logical
 place for the group id.

Okay, seems we have different ideas of what the 'grouping code' is. I
talked about the group enumeration code only. But group handling code is
certainly important to some degree too. But before we argue about the
right place of the code we should agree on the semantics such code
should provide.

For me it is fine when the code is in VFIO for now, since VFIO is the
only user at the moment. When more users pop up we can easily move it
out to somewhere else. But the semantics influence the interface to
user-space too, so it is more important now. It splits up into a number
of sub problems:

1) How userspace detects the device-group relationship?
2) Do we want group-binding/unbinding to device drivers?
3) Group attach/detach to iommu-domains?
4) What to do with hot-added devices?

For 1) I think the current solution with the iommu_group file is fine.
It is somewhat expensive for user-space to figure out the per-group
device-sets, but that is a one-time effort so it doesn't really matter.
Probably we can rename 'iommu_group' to 'isolation_group' or something.
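
For reference, that one-time scan really is small; a user-space sketch,
assuming iommu_group is a plain sysfs attribute containing a group
number as in the patches being discussed (minimal error handling):

    #include <stdio.h>
    #include <string.h>
    #include <dirent.h>

    int main(void)
    {
            DIR *dir = opendir("/sys/bus/pci/devices");
            struct dirent *de;
            char path[512], buf[64];

            if (!dir)
                    return 1;

            while ((de = readdir(dir)) != NULL) {
                    FILE *f;

                    if (de->d_name[0] == '.')
                            continue;
                    snprintf(path, sizeof(path),
                             "/sys/bus/pci/devices/%s/iommu_group", de->d_name);
                    f = fopen(path, "r");
                    if (!f)
                            continue;       /* no iommu or no group attribute */
                    if (fgets(buf, sizeof(buf), f))
                            printf("group %s device %s\n",
                                   strtok(buf, "\n"), de->d_name);
                    fclose(f);
            }
            closedir(dir);
            return 0;
    }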

Regarding 2), I think providing user-space a way to unbind groups of
devices from their drivers is a horrible idea. It makes it too easy for
the user to shoot himself in the foot. For example when the user wants
to assign a network card to a guest, but that card is in the same group
as the GPU and the screen goes blank when the guest is started.
Requiring devices to be unbound one-by-one is better because this way
the user always needs to know what he is doing.

For the remaining two questions I think the concept of a default-domain
is helpful.  The default-domain is a per-group domain which is created
by the iommu-driver at initialization time. It is the domain each device
is assigned to when it is not assigned to any other domain (which means
that each device/group is always attached to a domain). The default
domain will be used by the DMA-API layer. This implicitly means, that a
device which is not in the default-domain can't be used with the
dma-api. The dma_supported() function will return false for those
devices.

So what does this mean for point 3? I think we can implement attaching
and detaching groups in the iommu-api. This interface is not exposed to
userspace and can help VFIO and possible future users. The semantics are that
domain_attach_group() only works when all devices in the group are in
their default domain and domain_detach_group() puts them back into the
default domain.
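
A sketch of those semantics, with the group/default-domain helpers as
illustrative stand-ins only; iommu_attach_device() and
iommu_detach_device() are the existing per-device calls:

    /* Detaching returns every member to the group's default domain. */
    void iommu_domain_detach_group(struct iommu_domain *domain,
                                   struct device_group *group)
    {
            struct device_iommu_info *info;

            for_each_group_device(info, group)
                    iommu_detach_device(domain, info->dev);
    }

    /* Attach only succeeds while the whole group sits in its default domain. */
    int iommu_domain_attach_group(struct iommu_domain *domain,
                                  struct device_group *group)
    {
            struct device_iommu_info *info;
            int ret;

            if (!group_in_default_domain(group))
                    return -EBUSY;

            for_each_group_device(info, group) {
                    ret = iommu_attach_device(domain, info->dev);
                    if (ret) {
                            /* sloppy unwind: a real version would stop at
                             * the member that failed to attach */
                            iommu_domain_detach_group(domain, group);
                            return ret;
                    }
            }
            return 0;
    }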

Question 4) is also solved with the default-domain concept. A hot-added
device is put in the domain of its group automatically. If the group is
owned by VFIO and another driver attaches to the device, dma_supported()
will return false and initialization will fail.

 Right, so, the other problem is that a well-boundaried 'iommu driver'
 is something that only exists on x86 at present, and the iommu api
 is riddled with x86-centric thinking.  Or more accurately, design
 based on how the current intel and amd iommus work.  On systems like
 POWER, use of the iommu is not optional - it's built into the PCI host
 bridge and must be initialized when the bridge is probed, much earlier
 than iommu driver initialization on x86.  They have no inbuilt concept
 of domains (though we could fake it in software in some circumstances).

Well, the iommu-api was designed for amd-vi and vt-d. But its concepts
turn out to be more general and in no way x86-centric anymore. We
support a couple of ARM platforms too for example. More to come. With
small extensions to the API we will also support GART-like IOMMUs in the
future.
For your hardware the domain-concept will work too. In terms of the
iommu-api a domain is nothing more than an address space. As far as I
understand there is a 1-1 mapping between a hardware iommu and a domain
in your case. The easiest solution then is to share the datastructures
which describe the address space to the hardware between all iommus in a
particular domain.


Regards,

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread David Woodhouse
On Mon, 2011-12-19 at 11:11 +1100, David Gibson wrote:
   They have no inbuilt concept
 of domains (though we could fake it in software in some circumstances).

That sentence doesn't make much sense to me.

Either you're saying that every device behind a given IOMMU is in *one*
domain (i.e. there's one domain per PCI host bridge), or you're saying
that each device has its *own* domain (maximum isolation, but still
perhaps not really true if you end up with PCIe-to-PCI bridges or broken
hardware such as the ones we've been discovering, where multifunction
devices do their DMA from the wrong function).

Either way, you *do* have domains. You just might not have thought about
it before.

-- 
dwmw2




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread David Gibson
On Mon, Dec 19, 2011 at 03:46:38PM +0000, David Woodhouse wrote:
 On Mon, 2011-12-19 at 11:11 +1100, David Gibson wrote:
They have no inbuilt concept
  of domains (though we could fake it in software in some circumstances).
 
 That sentence doesn't make much sense to me.
 
 Either you're saying that every device behind a given IOMMU is in *one*
 domain (i.e. there's one domain per PCI host bridge), or you're saying
 that each device has its *own* domain (maximum isolation, but still
 perhaps not really true if you end up with PCIe-to-PCI bridges or broken
 hardware such as the ones we've been discovering, where multifunction
 devices do their DMA from the wrong function).
 
 Either way, you *do* have domains. You just might not have thought about
 it before.

Right, sorry, what I mean is that there's no concept of runtime
assignment of devices to domains.  The concept used in the
documentation is a Partitionable Endpoint (PE) - which would
correspond to the isolation groups I'm proposing.  These are generally
assigned by firmware based on various hardware dependent isolation
constraints.

When we're running paravirtualized under pHyp, it's impossible to
merge multiple PEs into one domain per se.  We could fake it rather
nastily by replicating all map/unmaps across multiple PEs.  When
running bare metal, we could do so a bit more nicely by assigning
multiple PEs the same TCE pointer, but we have no mechanism to do so
at present.

Older hardware usually does have just one PE per host bridge, but it
also often has only one slot per host bridge, so in practice is often
both one domain per host bridge _and_ one device per host bridge.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread David Woodhouse
On Tue, 2011-12-20 at 09:31 +1100, David Gibson wrote:
 When we're running paravirtualized under pHyp, it's impossible to
 merge multiple PEs into one domain per se.  We could fake it rather
 nastily by replicating all map/unmaps across multiple PEs.  When
 running bare metal, we could do so a bit more nicely by assigning
 multiple PEs the same TCE pointer, but we have no mechanism to do so
 at present. 

VT-d does share the page tables, as you could on bare metal. But it's an
implementation detail — there's nothing *fundamentally* wrong with
having to do the map/unmap for each PE, is there? It's only at VM setup
time, so it doesn't really matter if it's slow.

Surely that's the only way you're going to present the guest with the
illusion of having no IOMMU; so that DMA to any given guest physical
address just works.

On the other hand, perhaps you don't want to do that at all. Perhaps
you're better off presenting a virtualised IOMMU to the guest and
*insisting* that it fully uses it in order to do any DMA at all?

-- 
dwmw2




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-19 Thread David Gibson
On Mon, Dec 19, 2011 at 10:56:40PM +0000, David Woodhouse wrote:
 On Tue, 2011-12-20 at 09:31 +1100, David Gibson wrote:
  When we're running paravirtualized under pHyp, it's impossible to
  merge multiple PEs into one domain per se.  We could fake it rather
  nastily by replicating all map/unmaps across multiple PEs.  When
  running bare metal, we could do so a bit more nicely by assigning
  multiple PEs the same TCE pointer, but we have no mechanism to do so
  at present. 
 
 VT-d does share the page tables, as you could on bare metal. But it's an
 implementation detail — there's nothing *fundamentally* wrong with
 having to do the map/unmap for each PE, is there? It's only at VM setup
 time, so it doesn't really matter if it's slow.
 
 Surely that's the only way you're going to present the guest with the
 illusion of having no IOMMU; so that DMA to any given guest physical
 address just works.
 
 On the other hand, perhaps you don't want to do that at all. Perhaps
 you're better off presenting a virtualised IOMMU to the guest and
 *insisting* that it fully uses it in order to do any DMA at all?

Not only do we want to, we more or less *have* to.  Existing kernels,
which are used to being paravirt under phyp expect and need a paravirt
iommu.  DMA without iommu setup just doesn't happen.  And the
map/unmap hypercalls are frequently a hot path, so slow does matter.

The other problem is that each domain's IOVA window is often fairly
small, a limitation that would get even worse if we try to put too
many devices in there.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-18 Thread David Gibson
On Fri, Dec 16, 2011 at 03:53:53PM +0100, Joerg Roedel wrote:
 On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
  Starting with it in the core and hand waving some future use that we
  don't plan to implement right now seems like the wrong direction.
 
 I agree with Alex. First of all, I haven't seen any real vfio problem
 that can't be solved with the current approach, and it has the great
 advantage of simplicity. It doesn't require a re-implementation of the
 driver-core based on groups.

I'm not re-implementing the driver core in terms of groups, just
adding the concept of groups to the driver core.

 I agree that we need some improvements to
 Alex' code for the dma-api layer to solve the problem with broken devices
 using the wrong requestor-id. But that can be done incrementally with
 the current (current == in the iommu-tree) approach implemented by Alex.
 
 I also think that all this does not belong into the driver core for two
 reasons:
 
   1) The information for building the device groups is provided
  by the iommu-layer

Yes.. no change there.

   2) The group information is provided to vfio by the iommu-api

Um.. huh?  Currently, the iommu-api supplies the info to vfio, therefore
it should?  I assume there's a non-circular argument you're trying to
make here, but I can't figure out what it is.

 This makes the iommu-layer the logical point to place the grouping
 code.

Well.. that's not where it is in Alex's code either.  The iommu layer
(to the extent that there is such a layer) supplies the group info,
but the group management is in vfio, not the iommu layer.  With mine
it is in the driver core because the struct device seemed the logical
place for the group id.

Moving the group management into the iommu code itself probably does
make more sense, although I think that would be a change more of code
location than any actual content change.

 There are some sources outside of the iommu-layer that may influence
 grouping (like pci-quirks), but most of the job is done by the
 iommu-drivers.

Right, so, the other problem is that a well-boundaried 'iommu driver'
is something that only exists on x86 at present, and the iommu api
is riddled with x86-centric thinking.  Or more accurately, design
based on how the current intel and amd iommus work.  On systems like
POWER, use of the iommu is not optional - it's built into the PCI host
bridge and must be initialized when the bridge is probed, much earlier
than iommu driver initialization on x86.  They have no inbuilt concept
of domains (though we could fake it in software in some circumstances).

Now, that is something that needs to be fixed longer term.  I'm just
not sure how to deal with that and sorting out some sort of device
isolation / passthrough system.

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-16 Thread Joerg Roedel
On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
 Starting with it in the core and hand waving some future use that we
 don't plan to implement right now seems like the wrong direction.

I agree with Alex. First of all, I haven't seen any real vfio problem
that can't be solved with the current approach, and it has the great
advantage of simplicity. It doesn't require a re-implementation of the
driver-core based on groups. I agree that we need some improvements to
Alex' code for the dma-api layer to solve the problem with broken devices
using the wrong requestor-id. But that can be done incrementally with
the current (current == in the iommu-tree) approach implemented by Alex.

I also think that all this does not belong into the driver core for two
reasons:

1) The information for building the device groups is provided
   by the iommu-layer
2) The group information is provided to vfio by the iommu-api

This makes the iommu-layer the logical point to place the grouping code.
There are some sources outside of the iommu-layer that may influence
grouping (like pci-quirks), but most of the job is done by the
iommu-drivers.

Thanks,

Joerg

-- 
AMD Operating System Research Center

Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread Alex Williamson
On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
 Here's the second spin of my preferred approach to handling grouping
 of devices for safe assignment to guests.
 
 Changes since v1:
  * Many name changes and file moves for improved consistency
  * Bugfixes and cleanups
  * The interface to the next layer up is considerably fleshed out,
although it still needs work.
  * Example initialization of groups for p5ioc2 and p7ioc.
 
 TODO:
  * Need sample initialization of groups for intel and/or amd iommus

I think this very well might impose significant bloat for those
implementations.  On POWER you typically don't have singleton groups,
while it's the norm on x86.  I don't know that either intel or amd iommu
code have existing structures that they can simply tack the group
pointer to.  Again, this is one of the reasons that I think the current
vfio implementation is the right starting point.  We keep groups within
vfio, imposing zero overhead for systems not making use of it and only
require iommu drivers to implement a trivial function to opt-in.  As we
start to make groups more pervasive in the dma layer, independent of
userspace driver exposure, we can offload pieces to the core.  Starting
with it in the core and hand waving some future use that we don't plan
to implement right now seems like the wrong direction.

  * Use of sysfs attributes to control group permission is probably a
mistake.  Although it seems a bit odd, registering a chardev for
each group is probably better, because perms can be set from udev
rules, just like everything else.

I agree, this is a horrible mistake.  Reinventing file permissions via
sysfs is bound to be broken and doesn't account for selinux permissions.
Again, I know you don't like aspects of the vfio group management, but
it gets this right imho.

  * Need more details of what the binder structure will need to
contain.
  * Handle complete removal of groups.
  * Clarify what will need to happen on the hot unplug path.

We're still also removing devices from the driver model, this means
drivers like vfio need to re-implement a lot of the driver core for
driving each individual device in the group, and as you indicate, it's
unclear what happens in the hotplug path and I wonder if things like
suspend/resume will also require non-standard support.  I really prefer
attaching individual bus drivers to devices using the standard
bind/unbind mechanisms.  I have a hard time seeing how this is an
improvement from vfio.  Thanks,

Alex




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread Alex Williamson
On Thu, 2011-12-15 at 11:05 -0700, Alex Williamson wrote:
 On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
  Here's the second spin of my preferred approach to handling grouping
  of devices for safe assignment to guests.
  
  Changes since v1:
   * Many name changes and file moves for improved consistency
   * Bugfixes and cleanups
   * The interface to the next layer up is considerably fleshed out,
 although it still needs work.
   * Example initialization of groups for p5ioc2 and p7ioc.
  
  TODO:
   * Need sample initialization of groups for intel and/or amd iommus
 
 I think this very well might impose significant bloat for those
 implementations.  On POWER you typically don't have singleton groups,
 while it's the norm on x86.  I don't know that either intel or amd iommu
 code have existing structures that they can simply tack the group
 pointer to.  Again, this is one of the reasons that I think the current
 vfio implementation is the right starting point.  We keep groups within
 vfio, imposing zero overhead for systems not making use of it and only
 require iommu drivers to implement a trivial function to opt-in.  As we
 start to make groups more pervasive in the dma layer, independent of
 userspace driver exposure, we can offload pieces to the core.  Starting
 with it in the core and hand waving some future use that we don't plan
 to implement right now seems like the wrong direction.
 
   * Use of sysfs attributes to control group permission is probably a
 mistake.  Although it seems a bit odd, registering a chardev for
 each group is probably better, because perms can be set from udev
 rules, just like everything else.
 
 I agree, this is a horrible mistake.  Reinventing file permissions via
 sysfs is bound to be broken and doesn't account for selinux permissions.
 Again, I know you don't like aspects of the vfio group management, but
 it gets this right imho.
 
   * Need more details of what the binder structure will need to
 contain.
   * Handle complete removal of groups.
   * Clarify what will need to happen on the hot unplug path.
 
 We're still also removing devices from the driver model, this means
 drivers like vfio need to re-implement a lot of the driver core for
 driving each individual device in the group, and as you indicate, it's
 unclear what happens in the hotplug path and I wonder if things like
 suspend/resume will also require non-standard support.  I really prefer
 attaching individual bus drivers to devices using the standard
 bind/unbind mechanisms.  I have a hard time seeing how this is an
 improvement from vfio.  Thanks,

I should also mention that I just pushed a new version of the vfio
series out to github, it can be found here:

git://github.com/awilliam/linux-vfio.git vfio-next-20111215

This fixes many bugs, including PCI config space access sizes and the
todo item of actually preventing user access to the MSI-X table area.  I
think the vfio-pci driver is much closer to being ready to submit now.
I think I've addressed all the previous review comments.  Still pending
is a documentation refresh and some decision about what and how we
expose iommu information.

As usual, the matching qemu tree is here:

git://github.com/awilliam/qemu-vfio.git vfio-ng

I've tested this with an Intel 82576 (both PF and VF), Broadcom BCM5755,
Intel HD Audio controller, and legacy PCI SB Live.  Thanks,

Alex




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread David Gibson
On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
 On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
  Here's the second spin of my preferred approach to handling grouping
  of devices for safe assignment to guests.
  
  Changes since v1:
   * Many name changes and file moves for improved consistency
   * Bugfixes and cleanups
   * The interface to the next layer up is considerably fleshed out,
 although it still needs work.
   * Example initialization of groups for p5ioc2 and p7ioc.
  
  TODO:
   * Need sample initialization of groups for intel and/or amd iommus
 
 I think this very well might impose significant bloat for those
 implementations.  On POWER you typically don't have singleton groups,
 while it's the norm on x86.  I don't know that either intel or amd iommu
 code have existing structures that they can simply tack the group
 pointer to.

Actually, I think they can probably just use the group pointer in the
struct device.  Each PCI function will typically allocate a new group
and put the pointer in the struct device and nowhere else.  Devices
hidden under bridges copy the pointer from the bridge parent instead.
I will have to check the unplug path to ensure we can manage the group
lifetime properly, of course.
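
Roughly, the allocation rule being described might look like this for a
PCI bus; a sketch only, ignoring multifunction and quirk handling, and
with dev_get_group()/dev_set_group()/alloc_device_group() as made-up
accessors for the group pointer in struct device:

    #include <linux/pci.h>

    static int assign_isolation_group(struct pci_dev *pdev)
    {
            struct pci_dev *bridge = pdev->bus->self;  /* upstream bridge, NULL on a root bus */

            if (!pci_is_pcie(pdev) && bridge && dev_get_group(&bridge->dev)) {
                    /* Conventional PCI behind a bridge is hidden from the
                     * iommu, so it shares the bridge's group. */
                    dev_set_group(&pdev->dev, dev_get_group(&bridge->dev));
            } else {
                    /* Isolated PCIe function: gets a group of its own. */
                    struct device_group *grp = alloc_device_group();

                    if (!grp)
                            return -ENOMEM;
                    dev_set_group(&pdev->dev, grp);
            }
            return 0;
    }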

  Again, this is one of the reasons that I think the current
 vfio implementation is the right starting point.  We keep groups within
 vfio, imposing zero overhead for systems not making use of it and only
 require iommu drivers to implement a trivial function to opt-in.  As we
 start to make groups more pervasive in the dma layer, independent of
 userspace driver exposure, we can offload pieces to the core.  Starting
 with it in the core and hand waving some future use that we don't plan
 to implement right now seems like the wrong direction.

Well, I think we must agree to disagree here; I think treating groups
as identifiable objects is worthwhile.  That said, I am looking for
ways to whittle down the overhead when they're not in use.

   * Use of sysfs attributes to control group permission is probably a
 mistake.  Although it seems a bit odd, registering a chardev for
 each group is probably better, because perms can be set from udev
 rules, just like everything else.
 
 I agree, this is a horrible mistake.  Reinventing file permissions via
 sysfs is bound to be broken and doesn't account for selinux permissions.
 Again, I know you don't like aspects of the vfio group management, but
 it gets this right imho.

Yeah.  I came up with this because I was trying to avoid registering a
device whose only purpose was to act as a permissioned handle on the
group.  But it is a better approach, despite that.  I just wanted to
send out the new patches for comment without waiting to do that
rework.
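
For completeness, that rework can be a fairly small amount of code; a
hedged sketch, where the "isolation" class, the group->id/cdev/chardev
fields and the fops forwarding are illustrative only:

    #include <linux/cdev.h>
    #include <linux/device.h>
    #include <linux/err.h>
    #include <linux/fs.h>
    #include <linux/module.h>

    static struct class *isolation_class;   /* class_create(THIS_MODULE, "isolation") */
    static dev_t isolation_devt;            /* base from alloc_chrdev_region() */

    static const struct file_operations group_fops = {
            .owner = THIS_MODULE,
            /* open/ioctl would be forwarded to the bound binder in the full design */
    };

    static int group_register_chardev(struct device_group *group)
    {
            dev_t devt = MKDEV(MAJOR(isolation_devt), group->id);

            cdev_init(&group->cdev, &group_fops);
            if (cdev_add(&group->cdev, devt, 1))
                    return -ENODEV;

            /* Creates /dev/isolation<id>; udev rules can chown/chmod it
             * like any other node. */
            group->chardev = device_create(isolation_class, NULL, devt, group,
                                           "isolation%u", group->id);
            if (IS_ERR(group->chardev)) {
                    cdev_del(&group->cdev);
                    return PTR_ERR(group->chardev);
            }
            return 0;
    }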

   * Need more details of what the binder structure will need to
 contain.
   * Handle complete removal of groups.
   * Clarify what will need to happen on the hot unplug path.
 
 We're still also removing devices from the driver model, this means
 drivers like vfio need to re-implement a lot of the driver core for
 driving each individual device in the group,

Ah, so, that's a bit I've yet to flesh out.  The subtle distinction is
that we prevent driver _matching_, not driver _binding_.  It's
intentionally still possible to explicitly bind drivers to devices in
the group, by bypassing the automatic match mechanism.

I'm intending that when a group is bound, the binder struct will
(optionally) specify a driver (or several, for different bus types)
which will be force bound to all the devices in the group.
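
To make the shape of that concrete, something along these lines,
entirely illustrative and not from the posted patches:

    #include <linux/device.h>

    /* A binder claims a whole group and says which driver(s) to force-bind. */
    struct isolation_binder {
            const char              *name;
            /* driver to force-bind per bus type handled by this binder */
            struct device_driver    *pci_driver;
            struct device_driver    *platform_driver;
            /* notified when devices join or leave an already-bound group */
            int  (*device_added)(struct device_group *group, struct device *dev);
            void (*device_removed)(struct device_group *group, struct device *dev);
    };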

 and as you indicate, it's
 unclear what happens in the hotplug path

It's clear enough in concept.  I just have to work out exactly where
we need to call d_i_dev_remove(), and whether we'll need some sort of
callback to the bridge / iommu code.

 and I wonder if things like
 suspend/resume will also require non-standard support.  I really prefer
 attaching individual bus drivers to devices using the standard
 bind/unbind mechanisms.  I have a hard time seeing how this is an
 improvement from vfio.  Thanks,
 
 Alex
 

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread Alex Williamson
On Fri, 2011-12-16 at 12:40 +1100, David Gibson wrote:
 On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
  On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
   Here's the second spin of my preferred approach to handling grouping
   of devices for safe assignment to guests.
   
   Changes since v1:
* Many name changes and file moves for improved consistency
* Bugfixes and cleanups
* The interface to the next layer up is considerably fleshed out,
  although it still needs work.
* Example initialization of groups for p5ioc2 and p7ioc.
   
   TODO:
* Need sample initialization of groups for intel and/or amd iommus
  
  I think this very well might impose significant bloat for those
  implementations.  On POWER you typically don't have singleton groups,
  while it's the norm on x86.  I don't know that either intel or amd iommu
  code have existing structures that they can simply tack the group
  pointer to.
 
 Actually, I think they can probably just use the group pointer in the
 struct device.  Each PCI function will typically allocate a new group
 and put the pointer in the struct device and no-where else.  Devices
 hidden under bridges copy the pointer from the bridge parent instead.
 I will have to check the unplug path to ensure we can manage the group
 lifetime properly, of course.
 
   Again, this is one of the reasons that I think the current
  vfio implementation is the right starting point.  We keep groups within
  vfio, imposing zero overhead for systems not making use of it and only
  require iommu drivers to implement a trivial function to opt-in.  As we
  start to make groups more pervasive in the dma layer, independent of
  userspace driver exposure, we can offload pieces to the core.  Starting
  with it in the core and hand waving some future use that we don't plan
  to implement right now seems like the wrong direction.
 
 Well, I think we must agree to disagree here; I think treating groups
 as identifiable objects is worthwhile.  That said, I am looking for
 ways to whittle down the overhead when they're not in use.
 
* Use of sysfs attributes to control group permission is probably a
  mistake.  Although it seems a bit odd, registering a chardev for
  each group is probably better, because perms can be set from udev
  rules, just like everything else.
  
  I agree, this is a horrible mistake.  Reinventing file permissions via
  sysfs is bound to be broken and doesn't account for selinux permissions.
  Again, I know you don't like aspects of the vfio group management, but
  it gets this right imho.
 
 Yeah.  I came up with this because I was trying to avoid registering a
 device whose only purpose was to act as a permissioned handle on the
 group.  But it is a better approach, despite that.  I just wanted to
 send out the new patches for comment without waiting to do that
 rework.

So we end up with a chardev created by the core, whose only purpose is
setting the group access permissions for userspace usage, which only
becomes useful with something like vfio?  It's an odd conflict that
isolation groups would get involved with userspace permissions to access
the group, but leave enforcement of isolation via iommu groups to the
binder driver (where it seems like vfio is still going to need some
kind of merge interface to share a domain between isolation groups).

Is this same chardev going to be a generic conduit for
read/write/mmap/ioctl to the binder driver or does vfio need to create
its own chardev for that?  In the former case, are we ok with a chardev
that has an entirely modular API behind it, or maybe you're planning to
define some basic API infrastructure, in which case this starts smelling
like implementing vfio in the core.  In the latter case (isolation
chardev + vfio chardev) coordinating permissions sounds like a mess.

Alex




Re: [Qemu-devel] [RFC] Device isolation infrastructure v2

2011-12-15 Thread David Gibson
On Thu, Dec 15, 2011 at 09:49:05PM -0700, Alex Williamson wrote:
 On Fri, 2011-12-16 at 12:40 +1100, David Gibson wrote:
  On Thu, Dec 15, 2011 at 11:05:07AM -0700, Alex Williamson wrote:
   On Thu, 2011-12-15 at 17:25 +1100, David Gibson wrote:
Here's the second spin of my preferred approach to handling grouping
of devices for safe assignment to guests.

Changes since v1:
 * Many name changes and file moves for improved consistency
 * Bugfixes and cleanups
 * The interface to the next layer up is considerably fleshed out,
   although it still needs work.
 * Example initialization of groups for p5ioc2 and p7ioc.

TODO:
 * Need sample initialization of groups for intel and/or amd iommus
   
   I think this very well might impose significant bloat for those
   implementations.  On POWER you typically don't have singleton groups,
   while it's the norm on x86.  I don't know that either intel or amd iommu
   code have existing structures that they can simply tack the group
   pointer to.
  
  Actually, I think they can probably just use the group pointer in the
  struct device.  Each PCI function will typically allocate a new group
  and put the pointer in the struct device and no-where else.  Devices
  hidden under bridges copy the pointer from the bridge parent instead.
  I will have to check the unplug path to ensure we can manage the group
  lifetime properly, of course.
  
Again, this is one of the reasons that I think the current
   vfio implementation is the right starting point.  We keep groups within
   vfio, imposing zero overhead for systems not making use of it and only
   require iommu drivers to implement a trivial function to opt-in.  As we
   start to make groups more pervasive in the dma layer, independent of
   userspace driver exposure, we can offload pieces to the core.  Starting
   with it in the core and hand waving some future use that we don't plan
   to implement right now seems like the wrong direction.
  
  Well, I think we must agree to disagree here; I think treating groups
  as identifiable objects is worthwhile.  That said, I am looking for
  ways to whittle down the overhead when they're not in use.
  
 * Use of sysfs attributes to control group permission is probably a
   mistake.  Although it seems a bit odd, registering a chardev for
   each group is probably better, because perms can be set from udev
   rules, just like everything else.
   
   I agree, this is a horrible mistake.  Reinventing file permissions via
   sysfs is bound to be broken and doesn't account for selinux permissions.
   Again, I know you don't like aspects of the vfio group management, but
   it gets this right imho.
  
  Yeah.  I came up with this because I was trying to avoid registering a
  device whose only purpose was to act as a permissioned handle on the
  group.  But it is a better approach, despite that.  I just wanted to
  send out the new patches for comment without waiting to do that
  rework.
 
 So we end up with a chardev created by the core, whose only purpose is
 setting the group access permissions for userspace usage, which only
 becomes useful with something like vfio?  It's an odd conflict that
 isolation groups would get involved with userspace permissions to access
 the group, but leave enforcement of isolation via iommu groups to the
 binder driver

Hm, perhaps.  I'll think about it.

 (where it seems like vfio is still going to need some
 kind of merge interface to share a domain between isolation groups).

That was always going to be the case, but I wish we could stop
thinking of it as the merge interface, since I think that term is
distorting thinking about how the interface works.

For example, I think opening a new domain, then adding / removing
groups provides a much cleaner model than 'merging' groups without a
separate handle on the iommu domain we're building.
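
To illustrate the model (the device node and ioctl names below are invented
for the example, not a proposed ABI), userspace would open a handle on the
domain first and then attach or detach groups against it:

#include <fcntl.h>
#include <sys/ioctl.h>

/* Invented ioctl numbers, purely for illustration. */
#define ISOLATION_DOMAIN_ADD_GROUP      _IOW('I', 0, int)
#define ISOLATION_DOMAIN_DEL_GROUP      _IOW('I', 1, int)

/* The domain is the long-lived object; groups attach to and detach from it. */
static int build_domain(const char *group_a, const char *group_b)
{
        int domain = open("/dev/isolation_domain", O_RDWR);
        int ga = open(group_a, O_RDWR);
        int gb = open(group_b, O_RDWR);

        if (domain < 0 || ga < 0 || gb < 0)
                return -1;

        ioctl(domain, ISOLATION_DOMAIN_ADD_GROUP, ga);
        ioctl(domain, ISOLATION_DOMAIN_ADD_GROUP, gb);

        /* ... set up mappings, hand the devices to the guest ... */

        ioctl(domain, ISOLATION_DOMAIN_DEL_GROUP, gb);
        return domain;
}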

 Is this same chardev going to be a generic conduit for
 read/write/mmap/ioctl to the binder driver or does vfio need to create
 its own chardev for that?

Right, I was thinking that the binder could supply its own fops or
something for the group chardev once the group is bound.
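
Roughly along these lines, where struct isolation_binder and its group_fops
member are assumed names used only to show the shape of it:

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/module.h>

/* Assumed structure, for illustration only. */
struct isolation_binder {
        const char                      *name;
        const struct file_operations    *group_fops;
};

static long example_group_ioctl(struct file *filp, unsigned int cmd,
                                unsigned long arg)
{
        return -ENOTTY;         /* binder-specific commands would live here */
}

static const struct file_operations example_group_fops = {
        .owner          = THIS_MODULE,
        .unlocked_ioctl = example_group_ioctl,
};

/*
 * When this binder claims a group, the isolation core would switch the
 * group chardev over to ->group_fops.
 */
static struct isolation_binder example_binder = {
        .name           = "example-binder",
        .group_fops     = &example_group_fops,
};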

  In the former case, are we ok with a chardev
 that has an entirely modular API behind it, or maybe you're planning to
 define some basic API infrastructure, in which case this starts smelling
 like implementing vfio in the core.

I think it can work, but I do need to look closer.

  In the latter case (isolation
 chardev + vfio chardev) coordinating permissions sounds like a mess.

Absolutely.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson