Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices

2016-09-19 Thread Kirti Wankhede


On 8/26/2016 7:43 PM, Kirti Wankhede wrote:
> On 8/25/2016 2:52 PM, Dong Jia wrote:
>> On Thu, 25 Aug 2016 09:23:53 +0530

>>> +
>>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>>> + size_t count, loff_t *ppos)
>>> +{
>>> +   struct vfio_mdev *vmdev = device_data;
>>> +   struct mdev_device *mdev = vmdev->mdev;
>>> +   struct parent_device *parent = mdev->parent;
>>> +   unsigned int done = 0;
>>> +   int ret;
>>> +
>>> +   if (!parent->ops->read)
>>> +   return -EINVAL;
>>> +
>>> +   while (count) {
>> Here, I have to say sorry to you guys for that I didn't notice the
>> bad impact of this change to my patches during the v6 discussion.
>>
>> For vfio-ccw, I introduced an I/O region to input/output I/O
>> instruction parameters and results for Qemu. The @count of these data
>> currently is 140. So supporting arbitrary lengths in one shot here, and
>> also in vfio_mdev_write, seems the better option for this case.
>>
>> I believe that if the pci drivers want to iterate in a 4 bytes step, you
>> can do that in the parent read/write callbacks instead.
>>
>> What do you think?
>>
> 
> I would like to know Alex's thought on this. He raised concern with this
> approach in v6 reviews:
> "But I think this is exploitable, it lets the user make the kernel
> allocate an arbitrarily sized buffer."
> 

Read/write callbacks are for the slow path, i.e. emulation of MMIO regions
that are mostly device registers. I do feel they shouldn't support arbitrary
lengths.
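
For context, a minimal sketch of what a bounded loop could look like (this is
only illustrative, not the patch code; it reuses the structures from the
quoted snippet and assumes a parent read callback signature, and the real
code would also handle 2-/1-byte accesses and alignment):

  static ssize_t vfio_mdev_read_bounded(void *device_data, char __user *buf,
                                        size_t count, loff_t *ppos)
  {
          struct vfio_mdev *vmdev = device_data;
          struct mdev_device *mdev = vmdev->mdev;
          struct parent_device *parent = mdev->parent;
          unsigned int done = 0;
          int ret;

          if (!parent->ops->read)
                  return -EINVAL;

          while (count) {
                  size_t filled = min_t(size_t, count, 4);  /* 4-byte cap */
                  u32 val = 0;

                  /* only a small on-stack buffer is ever needed, no matter
                   * how large the user-supplied count is */
                  ret = parent->ops->read(mdev, (char *)&val, filled, ppos);
                  if (ret <= 0)
                          return done ? done : ret;

                  if (copy_to_user(buf, &val, filled))
                          return -EFAULT;

                  count -= filled;
                  done  += filled;
                  *ppos += filled;
                  buf   += filled;
          }
          return done;
  }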
Alex, I would like to know your thoughts.

Thanks,
Kirti



Re: [Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver

2016-09-19 Thread Kirti Wankhede


On 9/12/2016 9:23 PM, Alex Williamson wrote:
> On Mon, 12 Sep 2016 13:19:11 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 9/12/2016 10:40 AM, Jike Song wrote:
>>> On 09/10/2016 03:55 AM, Kirti Wankhede wrote:  
>>>> On 9/10/2016 12:12 AM, Alex Williamson wrote:  
>>>>> On Fri, 9 Sep 2016 23:18:45 +0530
>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>  
>>>>>> On 9/8/2016 1:39 PM, Jike Song wrote:  
>>>>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:
>>>>>>  
>>>>>>>>  +---------------+
>>>>>>>>  |               |
>>>>>>>>  | +-----------+ |  mdev_register_driver() +--------------+
>>>>>>>>  | |           | +<------------------------+ __init()     |
>>>>>>>>  | |  mdev     | |                         |              |
>>>>>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>>>>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>>>>>>>  | |           | |                         |              |
>>>>>>>>  | +-----------+ |                         +--------------+
>>>>>>>>  |               |
>>>>>>>
>>>>>>> This aimed to have only one single vfio bus driver for all mediated 
>>>>>>> devices,
>>>>>>> right?
>>>>>>>
>>>>>>
>>>>>> Yes. That's correct.
>>>>>>
>>>>>>  
>>>>>>>> +
>>>>>>>> +static int mdev_add_attribute_group(struct device *dev,
>>>>>>>> +                                    const struct attribute_group **groups)
>>>>>>>> +{
>>>>>>>> +        return sysfs_create_groups(&dev->kobj, groups);
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void mdev_remove_attribute_group(struct device *dev,
>>>>>>>> +                                        const struct attribute_group **groups)
>>>>>>>> +{
>>>>>>>> +        sysfs_remove_groups(&dev->kobj, groups);
>>>>>>>> +}
>>>>>>>
>>>>>>> These functions are not necessary. You can always specify the attribute 
>>>>>>> groups
>>>>>>> to dev->groups before registering a new device.
>>>>>>> 
>>>>>>
>>>>>> At the time of mdev device create, I specifically didn't used
>>>>>> dev->groups because we callback in vendor driver before that, see below
>>>>>> code snippet, and those attributes should only be added if create()
>>>>>> callback returns success.
>>>>>>
>>>>>> ret = parent->ops->create(mdev, mdev_params);
>>>>>> if (ret)
>>>>>> return ret;
>>>>>>
>>>>>> ret = mdev_add_attribute_group(&mdev->dev,
>>>>>> parent->ops->mdev_attr_groups);
>>>>>> if (ret)
>>>>>> parent->ops->destroy(mdev);
>>>>>>
>>>>>>
>>>>>>  
>>>>>>>> +
>>>>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>>>>>>>> +{
>>>>>>>> +        struct parent_device *parent;
>>>>>>>> +
>>>>>>>> +        mutex_lock(&parent_list_lock);
>>>>>>>> +        parent = mdev_get_parent(__find_parent_device(dev));
>>>>>>>> +        mutex_unlock(&parent_list_lock);
>>>>>>>> +
>>>>>>>> +        return parent;
>>>>>>>> +}
>>>>>>>
>>>>>>> As we have demonstrated, all these refs and locks and release workqueue 
>>>>>>> are not necessary,
>>>>>>> as long as you have an independent device associated with the mdev host 
>>>>>>> device
>>>>>>> ("

Re: [Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver

2016-09-12 Thread Kirti Wankhede


On 9/12/2016 10:40 AM, Jike Song wrote:
> On 09/10/2016 03:55 AM, Kirti Wankhede wrote:
>> On 9/10/2016 12:12 AM, Alex Williamson wrote:
>>> On Fri, 9 Sep 2016 23:18:45 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>
>>>> On 9/8/2016 1:39 PM, Jike Song wrote:
>>>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:  
>>>>
>>>>>>  +---------------+
>>>>>>  |               |
>>>>>>  | +-----------+ |  mdev_register_driver() +--------------+
>>>>>>  | |           | +<------------------------+ __init()     |
>>>>>>  | |  mdev     | |                         |              |
>>>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>>>>>  | |           | |                         |              |
>>>>>>  | +-----------+ |                         +--------------+
>>>>>>  |               |
>>>>>
>>>>> This aimed to have only one single vfio bus driver for all mediated 
>>>>> devices,
>>>>> right?
>>>>>  
>>>>
>>>> Yes. That's correct.
>>>>
>>>>
>>>>>> +
>>>>>> +static int mdev_add_attribute_group(struct device *dev,
>>>>>> +                                    const struct attribute_group **groups)
>>>>>> +{
>>>>>> +        return sysfs_create_groups(&dev->kobj, groups);
>>>>>> +}
>>>>>> +
>>>>>> +static void mdev_remove_attribute_group(struct device *dev,
>>>>>> +                                        const struct attribute_group **groups)
>>>>>> +{
>>>>>> +        sysfs_remove_groups(&dev->kobj, groups);
>>>>>> +}
>>>>>
>>>>> These functions are not necessary. You can always specify the attribute 
>>>>> groups
>>>>> to dev->groups before registering a new device.
>>>>>   
>>>>
>>>> At the time of mdev device create, I specifically didn't used
>>>> dev->groups because we callback in vendor driver before that, see below
>>>> code snippet, and those attributes should only be added if create()
>>>> callback returns success.
>>>>
>>>> ret = parent->ops->create(mdev, mdev_params);
>>>> if (ret)
>>>> return ret;
>>>>
>>>> ret = mdev_add_attribute_group(&mdev->dev,
>>>> parent->ops->mdev_attr_groups);
>>>> if (ret)
>>>> parent->ops->destroy(mdev);
>>>>
>>>>
>>>>
>>>>>> +
>>>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>>>>>> +{
>>>>>> +        struct parent_device *parent;
>>>>>> +
>>>>>> +        mutex_lock(&parent_list_lock);
>>>>>> +        parent = mdev_get_parent(__find_parent_device(dev));
>>>>>> +        mutex_unlock(&parent_list_lock);
>>>>>> +
>>>>>> +        return parent;
>>>>>> +}
>>>>>
>>>>> As we have demonstrated, all these refs and locks and release workqueue 
>>>>> are not necessary,
>>>>> as long as you have an independent device associated with the mdev host 
>>>>> device
>>>>> ("parent" device here).
>>>>>  
>>>>
>>>> I don't think every lock will go away with that. This also changes how
>>>> mdev devices entries are created in sysfs. It adds an extra directory.
>>>
>>> Exposing the parent-child relationship through sysfs is a desirable
>>> feature, so I'm not sure how this is a negative.  This part of Jike's
>>> conversion was a big improvement, I thought.  Thanks,
>>>
>>
>> Jike's suggestion is to introduce a fake device above the parent device, i.e.
>> mdev-host, and then all mdev devices are children of 'mdev-host', not
>> children of the real parent.
>>
> 
> It really depends on how you define 'real parent' :)
> 
> With a physical-host-mdev hierarchy, the parent of mdev d

Re: [Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver

2016-09-09 Thread Kirti Wankhede


On 9/10/2016 12:12 AM, Alex Williamson wrote:
> On Fri, 9 Sep 2016 23:18:45 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 9/8/2016 1:39 PM, Jike Song wrote:
>>> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:  
>>
>>>>  +---------------+
>>>>  |               |
>>>>  | +-----------+ |  mdev_register_driver() +--------------+
>>>>  | |           | +<------------------------+ __init()     |
>>>>  | |  mdev     | |                         |              |
>>>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>>>  | |           | |                         |              |
>>>>  | +-----------+ |                         +--------------+
>>>>  |               |
>>>
>>> This aimed to have only one single vfio bus driver for all mediated devices,
>>> right?
>>>  
>>
>> Yes. That's correct.
>>
>>
>>>> +
>>>> +static int mdev_add_attribute_group(struct device *dev,
>>>> +  const struct attribute_group **groups)
>>>> +{
>>>> +  return sysfs_create_groups(&dev->kobj, groups);
>>>> +}
>>>> +
>>>> +static void mdev_remove_attribute_group(struct device *dev,
>>>> +  const struct attribute_group **groups)
>>>> +{
>>>> +  sysfs_remove_groups(&dev->kobj, groups);
>>>> +}  
>>>
>>> These functions are not necessary. You can always specify the attribute 
>>> groups
>>> to dev->groups before registering a new device.
>>>   
>>
>> At the time of mdev device create, I specifically didn't used
>> dev->groups because we callback in vendor driver before that, see below
>> code snippet, and those attributes should only be added if create()
>> callback returns success.
>>
>> ret = parent->ops->create(mdev, mdev_params);
>> if (ret)
>> return ret;
>>
>> ret = mdev_add_attribute_group(&mdev->dev,
>> parent->ops->mdev_attr_groups);
>> if (ret)
>> parent->ops->destroy(mdev);
>>
>>
>>
>>>> +
>>>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>>>> +{
>>>> +  struct parent_device *parent;
>>>> +
>>>> +  mutex_lock(&parent_list_lock);
>>>> +  parent = mdev_get_parent(__find_parent_device(dev));
>>>> +  mutex_unlock(&parent_list_lock);
>>>> +
>>>> +  return parent;
>>>> +}  
>>>
>>> As we have demonstrated, all these refs and locks and release workqueue are 
>>> not necessary,
>>> as long as you have an independent device associated with the mdev host 
>>> device
>>> ("parent" device here).
>>>  
>>
>> I don't think every lock will go away with that. This also changes how
>> mdev devices entries are created in sysfs. It adds an extra directory.
> 
> Exposing the parent-child relationship through sysfs is a desirable
> feature, so I'm not sure how this is a negative.  This part of Jike's
> conversion was a big improvement, I thought.  Thanks,
> 

Jike's suggestion is to introduce a fake device above the parent device, i.e.
mdev-host, and then all mdev devices are children of 'mdev-host', not
children of the real parent.

For example, the directory structure we have now is:
/sys/bus/pci/devices/0000\:85\:00.0/

mdev devices are in the real parent's directory.

By introducing a fake device it would be:
/sys/bus/pci/devices/0000\:85\:00.0/mdev-host/

mdev devices are in the fake device's directory.

A lock would still be required to handle race conditions, for example when
'mdev_create' is still in progress and the parent device is unregistered by
the vendor driver, or the parent device is unbound from the vendor driver.

With the new changes/discussion, we believe the locking can be simplified
without having a fake parent device.

With the fake device suggestion, the pointer to the parent device is removed
from the mdev_device structure. When the create(struct mdev_device *mdev)
callback comes to the vendor driver, how would the vendor driver know which
physical device this mdev create call is intended for? Because then 'parent'
would be the newly introduced fake device, not the real parent.
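
For reference, a rough sketch of the structure in question as it stands in
this series (fields abbreviated and only illustrative), showing the parent
pointer being discussed:

  struct mdev_device {
          struct device           dev;
          struct parent_device    *parent;   /* link to the physical parent;
                                              * this is the pointer the
                                              * fake-device proposal removes */
          uuid_le                 uuid;
          void                    *driver_data;
          /* ... */
  };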

Thanks,
Kirti



Re: [Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver

2016-09-09 Thread Kirti Wankhede


On 9/8/2016 1:39 PM, Jike Song wrote:
> On 08/25/2016 11:53 AM, Kirti Wankhede wrote:

>>  +---------------+
>>  |               |
>>  | +-----------+ |  mdev_register_driver() +--------------+
>>  | |           | +<------------------------+ __init()     |
>>  | |  mdev     | |                         |              |
>>  | |  bus      | +------------------------>+              |<-> VFIO user
>>  | |  driver   | |     probe()/remove()    | vfio_mdev.ko |    APIs
>>  | |           | |                         |              |
>>  | +-----------+ |                         +--------------+
>>  |               |
> 
> This aimed to have only one single vfio bus driver for all mediated devices,
> right?
>

Yes. That's correct.


>> +
>> +static int mdev_add_attribute_group(struct device *dev,
>> +const struct attribute_group **groups)
>> +{
>> +        return sysfs_create_groups(&dev->kobj, groups);
>> +}
>> +
>> +static void mdev_remove_attribute_group(struct device *dev,
>> +const struct attribute_group **groups)
>> +{
>> +        sysfs_remove_groups(&dev->kobj, groups);
>> +}
> 
> These functions are not necessary. You can always specify the attribute groups
> to dev->groups before registering a new device.
> 

At the time of mdev device create, I specifically didn't use dev->groups
because we call back into the vendor driver before that, see the code snippet
below, and those attributes should only be added if the create() callback
returns success.

ret = parent->ops->create(mdev, mdev_params);
if (ret)
return ret;

ret = mdev_add_attribute_group(&mdev->dev,
parent->ops->mdev_attr_groups);
if (ret)
parent->ops->destroy(mdev);
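
For comparison, a minimal sketch of the dev->groups approach Jike refers to
(illustrative only). The driver core would register the attributes itself at
device_register() time, i.e. before create() has had a chance to fail, which
is exactly the ordering I wanted to avoid:

  /* attributes registered unconditionally by the driver core */
  mdev->dev.groups = parent->ops->mdev_attr_groups;

  ret = device_register(&mdev->dev);
  if (ret)
          return ret;

  /* create() now runs with the attributes already visible in sysfs */
  ret = parent->ops->create(mdev, mdev_params);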



>> +
>> +static struct parent_device *mdev_get_parent_from_dev(struct device *dev)
>> +{
>> +struct parent_device *parent;
>> +
>> +        mutex_lock(&parent_list_lock);
>> +        parent = mdev_get_parent(__find_parent_device(dev));
>> +        mutex_unlock(&parent_list_lock);
>> +
>> +return parent;
>> +}
> 
> As we have demonstrated, all these refs and locks and release workqueue are 
> not necessary,
> as long as you have an independent device associated with the mdev host device
> ("parent" device here).
>

I don't think every lock will go away with that. This also changes how
mdev device entries are created in sysfs. It adds an extra directory.


> PS, "parent" is somehow a name too generic?
>

This is the term used in the Linux kernel for such cases. See 'struct
device' in include/linux/device.h. I would prefer 'parent'.
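
For reference, the relevant field in include/linux/device.h looks roughly
like this (heavily abbreviated):

  struct device {
          struct device           *parent;   /* the device's "parent" device */
          /* ... */
          struct kobject          kobj;
          const struct attribute_group **groups;   /* optional groups */
          /* ... */
  };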

>> +
>> +static int mdev_device_create_ops(struct mdev_device *mdev, char *mdev_params)
>> +{
>> +        struct parent_device *parent = mdev->parent;
>> +        int ret;
>> +
>> +        ret = parent->ops->create(mdev, mdev_params);
>> +        if (ret)
>> +                return ret;
>> +
>> +        ret = mdev_add_attribute_group(&mdev->dev,
>> +                                       parent->ops->mdev_attr_groups);
> 
> Ditto: dev->groups.
> 

See my above response for why this is intended to be so.


>> +ret = parent_create_sysfs_files(dev);
>> +if (ret)
>> +goto add_sysfs_error;
>> +
>> +ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
> 
> parent_create_sysfs_files and mdev_add_attribute_group are kind of doing
> the same thing, do you mind to merge them into one?
> 

Ok. I'll see if I can do that.


>> +int mdev_device_get_online_status(struct device *dev, bool *online)
>> +{
>> +int ret = 0;
>> +struct mdev_device *mdev;
>> +struct parent_device *parent;
>> +
>> +mdev = mdev_get_device(to_mdev_device(dev));
>> +if (!mdev)
>> +return -EINVAL;
>> +
>> +parent = mdev->parent;
>> +
>> +if (parent->ops->get_online_status)
>> +ret = parent->ops->get_online_status(mdev, online);
>> +
>> +mdev_put_device(mdev);
>> +
>> +return ret;
>> +}
> 
> The driver core has a perfect 'online' file for a device, with both
> 'show' and 'store' support, you don't need to write another one.
> 
> Please have a look at online_show and online_store in drivers/base/core.c.
> 

This is going to be removed as per the latest discussion.


> +
>> +extern struct class_attribute mdev_class_attrs[];
> 
> This is useless?
>

Oh, I missed removing that. Thanks for pointing it out.



Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-08 Thread Kirti Wankhede


On 9/8/2016 3:43 AM, Alex Williamson wrote:
> On Wed, 7 Sep 2016 23:36:28 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 9/7/2016 10:14 PM, Alex Williamson wrote:
>>> On Wed, 7 Sep 2016 21:45:31 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>   
>>>> On 9/7/2016 2:58 AM, Alex Williamson wrote:  
>>>>> On Wed, 7 Sep 2016 01:05:11 +0530
>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>> 
>>>>>> On 9/6/2016 11:10 PM, Alex Williamson wrote:
>>>>>>> On Sat, 3 Sep 2016 22:04:56 +0530
>>>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>>>>   
>>>>>>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:  
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:

...

> 
> Philosophically, mdev devices should be entirely independent of one
> another.  A user can set the same iommu context for multiple mdevs
> by placing them in the same container.  A user should be able to
> stop using an mdev in one place and start using it somewhere else.
> It should be a fungible $TYPE device.  It's an NVIDIA-only requirement
> that imposes this association of mdev devices into groups and I don't
> particularly see it as beneficial to the mdev architecture.  So why
> make it a standard part of the interface?
> 

Yes, I agree. This might not be a requirement for every vendor.


> We could do keying at the layer you suggest, assuming we can find
> something that doesn't restrict the user, but we could make that
> optional.  

We can key on the 'container'. Devices should be in the same VFIO 'container';
the open() call should fail if they are found to be in different containers.

> For instance, say we did key on pid, there could be an
> attribute in the supported types hierarchy to indicate this type
> supports(requires) pid-sets.  Each mdev device with this attribute
> would create a pid-group file in sysfs where libvirt could associate
> the device.  Only for those mdev devices requiring it.
> 

We are OK with this suggestion if it works for libvirt integration.
We can have a 'requires_group' file in each type's directory under the
supported types, as sketched below.
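
A rough sketch of how that could look under the supported types hierarchy
discussed earlier (file name illustrative, not final):

  └── mdev_supported_types
      ├── 11
      │   ├── create
      │   ├── description
      │   ├── max_instances
      │   └── requires_group
      └── 12
          ├── create
          ├── description
          ├── max_instances
          └── requires_group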

Thanks,
Kirti

> The alternative is that we need to find some mechanism for this
> association that doesn't impose arbitrary requirements, and potentially
> usage restrictions on vendors that don't have this need.  Thanks,
> 
> Alex
> 



Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-07 Thread Kirti Wankhede


On 9/7/2016 10:14 PM, Alex Williamson wrote:
> On Wed, 7 Sep 2016 21:45:31 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 9/7/2016 2:58 AM, Alex Williamson wrote:
>>> On Wed, 7 Sep 2016 01:05:11 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>   
>>>> On 9/6/2016 11:10 PM, Alex Williamson wrote:  
>>>>> On Sat, 3 Sep 2016 22:04:56 +0530
>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>> 
>>>>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:  
>>>>>>>>  We could even do:  
>>>>>>>>>>
>>>>>>>>>> echo $UUID1:$GROUPA > create
>>>>>>>>>>
>>>>>>>>>> where $GROUPA is the group ID of a previously created mdev device 
>>>>>>>>>> into
>>>>>>>>>> which $UUID1 is to be created and added to the same group.  
>>>>>>>>   
>>>>>>>
>>>>>>> From the point of view of libvirt, I think I prefer Alex's idea.
>>>>>>> <group> could be an additional element in the nodedev-create XML:
>>>>>>>
>>>>>>> <device>
>>>>>>>   <name>my-vgpu</name>
>>>>>>>   <parent>pci_0000_86_00_0</parent>
>>>>>>>   <capability type='mdev'>
>>>>>>>     <type id='11'/>
>>>>>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>>>     <group>group1</group>
>>>>>>>   </capability>
>>>>>>> </device>
>>>>>>>
>>>>>>> (should group also be a UUID?)
>>>>>>>   
>>>>>>
>>>>>> No, this should be a unique number in a system, similar to iommu_group.  
>>>>>>   
>>>>>
>>>>> Sorry, just trying to catch up on this thread after a long weekend.
>>>>>
>>>>> We're talking about iommu groups here, we're not creating any sort of
>>>>> parallel grouping specific to mdev devices.
>>>>
>>>> I thought we were talking about group of mdev devices and not iommu
>>>> group. IIRC, there were concerns about it (this would be similar to
>>>> UUID+instance) and that would (ab)use iommu groups.  
>>>
>>> What constraints does a group, which is not an iommu group, place on the
>>> usage of the mdev devices?  What happens if we put two mdev devices in
>>> the same "mdev group" and then assign them to separate VMs/users?  I
>>> believe that the answer is that this theoretical "mdev group" doesn't
>>> actually impose any constraints on the devices within the group or how
>>> they're used.
>>>   
>>
>> We feel its not a good idea to try to associate device's iommu groups
>> with mdev device groups. That adds more complications.
>>
>> As in above nodedev-create xml, 'group1' could be a unique number that
>> can be generated by libvirt. Then to create mdev device:
>>
>>   echo $UUID1:group1 > create
>>
>> If user want to add more mdev devices to same group, he/she should use
>> same group number in next nodedev-create devices. So create commands
>> would be:
>>   echo $UUID2:group1 > create
>>   echo $UUID3:group1 > create
> 
> So groups return to being static, libvirt would need to destroy and
> create mdev devices specifically for use within the predefined group?

Yes.

> This imposes limitations on how mdev devices can be used (ie. the mdev
> pool option is once again removed).  We're also back to imposing
> grouping semantics on mdev devices that may not need them.  Do all mdev
> devices for a given user need to be put into the same group?  

Yes.

> Do groups
> span parent devices?  Do they span different vendor drivers?
> 

Yes and yes. The group number would be associated with an mdev device
irrespective of its parent.


>> Each mdev device would store this group number in its mdev_device
>> structure.
>>
>> With this, we would add open() and close() callbacks from vfio_mdev
>> module for vendor driver to commit resources. Then we don't need
>> 'start'/'stop' or online/offline interface.
>>
>> To commit resources for all devices associated to that domain/user space
>> application, vendor driver can use 'first open()' and 'last close()' to
>> free those. Or

Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-07 Thread Kirti Wankhede


On 9/7/2016 2:58 AM, Alex Williamson wrote:
> On Wed, 7 Sep 2016 01:05:11 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 9/6/2016 11:10 PM, Alex Williamson wrote:
>>> On Sat, 3 Sep 2016 22:04:56 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>   
>>>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:  
>>>>>
>>>>>
>>>>> On 02/09/2016 20:33, Kirti Wankhede wrote:
>>>>>>  We could even do:
>>>>>>>>
>>>>>>>> echo $UUID1:$GROUPA > create
>>>>>>>>
>>>>>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>>>>>> which $UUID1 is to be created and added to the same group.
>>>>>> 
>>>>>
>>>>> From the point of view of libvirt, I think I prefer Alex's idea.
>>>>> <group> could be an additional element in the nodedev-create XML:
>>>>>
>>>>> <device>
>>>>>   <name>my-vgpu</name>
>>>>>   <parent>pci_0000_86_00_0</parent>
>>>>>   <capability type='mdev'>
>>>>>     <type id='11'/>
>>>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>     <group>group1</group>
>>>>>   </capability>
>>>>> </device>
>>>>>
>>>>> (should group also be a UUID?)
>>>>> 
>>>>
>>>> No, this should be a unique number in a system, similar to iommu_group.  
>>>
>>> Sorry, just trying to catch up on this thread after a long weekend.
>>>
>>> We're talking about iommu groups here, we're not creating any sort of
>>> parallel grouping specific to mdev devices.  
>>
>> I thought we were talking about group of mdev devices and not iommu
>> group. IIRC, there were concerns about it (this would be similar to
>> UUID+instance) and that would (ab)use iommu groups.
> 
> What constraints does a group, which is not an iommu group, place on the
> usage of the mdev devices?  What happens if we put two mdev devices in
> the same "mdev group" and then assign them to separate VMs/users?  I
> believe that the answer is that this theoretical "mdev group" doesn't
> actually impose any constraints on the devices within the group or how
> they're used.
> 

We feel it's not a good idea to try to associate a device's iommu group
with mdev device groups. That adds more complications.

As in the nodedev-create xml above, 'group1' could be a unique number that
can be generated by libvirt. Then, to create an mdev device:

  echo $UUID1:group1 > create

If the user wants to add more mdev devices to the same group, he/she should
use the same group number in the next nodedev-create calls. So the create
commands would be:
  echo $UUID2:group1 > create
  echo $UUID3:group1 > create

Each mdev device would store this group number in its mdev_device
structure.

With this, we would add open() and close() callbacks from the vfio_mdev
module for the vendor driver to commit resources. Then we don't need the
'start'/'stop' or online/offline interface.

To commit resources for all devices associated with that domain/user-space
application, the vendor driver can use 'first open()' to commit them and
'last close()' to free them. Or, if the vendor driver wants to commit
resources for each device separately, it can do so in each device's open()
call. It will depend on the vendor driver how it wants to implement this.

Libvirt doesn't have to do anything about the assigned group numbers while
managing mdev devices.

The QEMU command-line parameters would be the same as earlier (no need to
mention the group number here):

  -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
  -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2

In case two mdev devices from the same group are assigned to different
domains, we can fail the open() call of the second device. How would the
driver know that they are being used by different domains? By checking the
<group1, pid> recorded for the first device of 'group1'. The two devices in
the same group should see the same pid in their open() calls, as in the
sketch below.
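
A minimal sketch of how a vendor driver could implement that keying (all
names below are hypothetical and internal to the vendor driver, not part of
the proposed mdev interface): the first open() of a group records the owner
and commits resources, later open()s are only allowed from the same owner,
and the last close() frees everything.

  /* hypothetical per-group bookkeeping inside a vendor driver */
  struct vendor_group {
          struct mutex    lock;
          struct pid      *owner;         /* recorded on first open() */
          unsigned int    open_count;
  };

  static int vendor_open(struct mdev_device *mdev)
  {
          struct vendor_group *grp = vendor_group_of(mdev);  /* hypothetical */
          int ret = 0;

          mutex_lock(&grp->lock);
          if (!grp->open_count) {
                  grp->owner = get_task_pid(current, PIDTYPE_PID);
                  ret = vendor_commit_resources(grp);        /* hypothetical */
          } else if (grp->owner != task_pid(current)) {
                  ret = -EBUSY;   /* group already owned by another domain */
          }
          if (!ret)
                  grp->open_count++;
          mutex_unlock(&grp->lock);
          return ret;
  }

  static void vendor_close(struct mdev_device *mdev)
  {
          struct vendor_group *grp = vendor_group_of(mdev);

          mutex_lock(&grp->lock);
          if (!--grp->open_count) {                  /* last close() */
                  vendor_free_resources(grp);        /* hypothetical */
                  put_pid(grp->owner);
                  grp->owner = NULL;
          }
          mutex_unlock(&grp->lock);
  }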

To hot-plug an mdev device into a domain that already has an mdev device
assigned, the new mdev device should be created with the same group number
as the existing devices, and then hot-plugged. If there is no mdev device in
that domain, then the group number should be a new unique number.

This simplifies the mdev grouping and also provides flexibility for the
vendor driver implementation.

Thanks,
Kirti



Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-06 Thread Kirti Wankhede


On 9/6/2016 11:10 PM, Alex Williamson wrote:
> On Sat, 3 Sep 2016 22:04:56 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/09/2016 20:33, Kirti Wankhede wrote:  
>>>>  We could even do:  
>>>>>>
>>>>>> echo $UUID1:$GROUPA > create
>>>>>>
>>>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>>>> which $UUID1 is to be created and added to the same group.  
>>>>   
>>>
>>> From the point of view of libvirt, I think I prefer Alex's idea.
>>> <group> could be an additional element in the nodedev-create XML:
>>>
>>> <device>
>>>   <name>my-vgpu</name>
>>>   <parent>pci_0000_86_00_0</parent>
>>>   <capability type='mdev'>
>>>     <type id='11'/>
>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>     <group>group1</group>
>>>   </capability>
>>> </device>
>>>
>>> (should group also be a UUID?)
>>>   
>>
>> No, this should be a unique number in a system, similar to iommu_group.
> 
> Sorry, just trying to catch up on this thread after a long weekend.
> 
> We're talking about iommu groups here, we're not creating any sort of
> parallel grouping specific to mdev devices.

I thought we were talking about a group of mdev devices and not an iommu
group. IIRC, there were concerns that this (which would be similar to
UUID+instance) would (ab)use iommu groups.

I'm thinking about your suggestion, but I would also like to know your
thoughts on what the sysfs interface would look like; it's still not clear
to me. Or would it be better to have grouping at the mdev layer?

Kirti.

>  This is why my example
> created a device and then required the user to go find the group number
> given to that device in order to create another device within the same
> group.  iommu group numbering is not within the user's control and is
> not a uuid.  libvirt can refer to the group as anything it wants in the
> xml, but the host group number is allocated by the host, not under user
> control, is not persistent.  libvirt would just be giving it a name to
> know which devices are part of the same group.  Perhaps the runtime xml
> would fill in the group number once created.
> 
> There were also a lot of unanswered questions in my proposal, it's not
> clear that there's a standard algorithm for when mdev devices need to
> be grouped together.  Should we even allow groups to span multiple host
> devices?  Should they be allowed to span devices from different
> vendors?
>
> If we imagine a scenario of a group composed of a mix of Intel and
> NVIDIA vGPUs, what happens when an Intel device is opened first?  The
> NVIDIA driver wouldn't know about this, but it would know when the
> first NVIDIA device is opened and be able to establish p2p for the
> NVIDIA devices at that point.  Can we do what we need with that model?
> What if libvirt is asked to hot-add an NVIDIA vGPU?  It would need to
> do a create on the NVIDIA parent device with the existing group id, at
> which point the NVIDIA vendor driver could fail the device create if
> the p2p setup has already been done.  The Intel vendor driver might
> allow it.  Similar to open, the last close of the mdev device for a
> given vendor (which might not be the last close of mdev devices within
> the group) would need to trigger the offline process for that vendor.
> 
> That all sounds well and good... here's the kicker: iommu groups
> necessarily need to be part of the same iommu context, ie.
> vfio container.  How do we deal with vIOMMUs within the guest when we
> are intentionally forcing a set of devices within the same context?
> This is why it's _very_ beneficial on the host to create iommu groups
> with the smallest number of devices we can reasonably trust to be
> isolated.  We're backing ourselves into a corner if we tell libvirt
> that the standard process is to put all mdev devices into a single
> group.  The grouping/startup issue is still unresolved in my head.
> Thanks,
> 
> Alex
> 



Re: [Qemu-devel] [libvirt] [PATCH v7 0/4] Add Mediated device support

2016-09-03 Thread Kirti Wankhede


On 9/3/2016 6:37 PM, Paolo Bonzini wrote:
> 
> 
> On 03/09/2016 13:56, John Ferlan wrote:
>> On 09/02/2016 05:48 PM, Paolo Bonzini wrote:
>>> On 02/09/2016 20:33, Kirti Wankhede wrote:
>>>>  We could even do:
>>>>>>
>>>>>> echo $UUID1:$GROUPA > create
>>>>>>
>>>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>>>> which $UUID1 is to be created and added to the same group.
>>>> 
>>>
>>> >From the point of view of libvirt, I think I prefer Alex's idea.
>>> <group> could be an additional element in the nodedev-create XML:
>>>
>>> <device>
>>>   <name>my-vgpu</name>
>>>   <parent>pci_0000_86_00_0</parent>
>>>   <capability type='mdev'>
>>>     <type id='11'/>
>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>     <group>group1</group>
>>>   </capability>
>>> </device>
>>>
>>> (should group also be a UUID?)
>>

As I replied to an earlier mail too, the group number doesn't need to be a
UUID. It should be a unique number. I think in the discussion at the BoF
someone mentioned using the domain's unique number that libvirt generates.
That should also work.

>> As long as create_group handles all the work and all libvirt does is
>> call it, get the return status/error, and handle deleting the vGPU on
>> error, then I guess it's doable.
>>

Yes, that is the idea. Libvirt doesn't have to care about the groups.
With Alex's proposal, as you mentioned above, libvirt has to provide the
group number to mdev_create, check the return status and handle the error
case.

  echo $UUID1:$GROUP1 > mdev_create
  echo $UUID2:$GROUP1 > mdev_create

would create two mdev devices assigned to the same domain.


>> Alternatively having multiple  in the XML and performing a
>> single *mdev/create_group is an option.
> 
> I don't really like the idea of a single nodedev-create creating
> multiple devices, but that would work too.
> 
>> That is, what is the "output" from create_group that gets added to the
>> domain XML?  How is that found?
> 
> A new sysfs path is created, whose name depends on the UUID.  The UUID
> is used in a  element in the domain XML and the sysfs path
> appears in the QEMU command line.  Kirti and Neo had examples in their
> presentation at KVM Forum.
> 
> If you create multiple devices in the same group, they are added to the
> same IOMMU group so they must be used by the same VM.  However they
> don't have to be available from the beginning; they could be
> hotplugged/hot-unplugged later, since from the point of view of the VM
> those are just another PCI device.
> 
>> Also, once the domain is running can a
>> vGPU be added to the group?  Removed?  What allows/prevents?
> 
> Kirti?... :)

Yes, a vGPU can be hot-plugged or hot-unplugged. This also depends on
whether the vendor driver wants to support it. For example, if a domain is
running with two vGPUs $UUID1 and $UUID2 and the user tries to hot-unplug
vGPU $UUID2, the vendor driver knows that the domain is running and the vGPU
is being used in the guest, so the vendor driver can fail the offline/close()
call if it doesn't support hot-unplug. Similarly, for hot-plug, the vendor
driver can fail the create call if it doesn't support hot-plug.

> 
> In principle I don't think anything should block vGPUs from different
> groups being added to the same VM, but I have to defer to Alex and Kirti
> again on this.
> 

No, there should be one group per VM.

>>> Since John brought up the topic of minimal XML, in this case it will be
>>> like this:
>>>
>>> <device>
>>>   <name>my-vgpu</name>
>>>   <parent>pci_0000_86_00_0</parent>
>>>   <capability type='mdev'>
>>>     <type id='11'/>
>>>   </capability>
>>> </device>
>>>
>>> The uuid will be autogenerated by libvirt and if there's no <group> (as
>>> is common for VMs with only 1 vGPU) it will be a single-device group.
>>
>> The <name> could be ignored as it seems existing libvirt code wants to
>> generate a name via udevGenerateDeviceName for other devices. I haven't
>> studied it long enough, but I believe that's how those pci_* names
>> created.
> 
> Yeah that makes sense.  So we get down to a minimal XML that has just
> parent, and capability with type in it; additional elements could be
> name (ignored anyway), and within capability uuid and group.
>

Yes, this seems good.
I would like to have one more capability here, pulling in a suggestion from
my previous mail:
In the directory structure, a 'params' file can take optional parameters.
Libvirt can then set 'params' and then create the mdev device. For example,
if a param, say 'disable_console_vnc=1', is set for type 11, then devices
created of type 11 will have that param set unless it is cleared.

 └── mdev_supp

Re: [Qemu-devel] [libvirt] [PATCH v7 0/4] Add Mediated device support

2016-09-03 Thread Kirti Wankhede


On 9/3/2016 5:27 AM, Laine Stump wrote:
> On 09/02/2016 05:44 PM, Paolo Bonzini wrote:
>>
>>
>> On 02/09/2016 22:19, John Ferlan wrote:
>>> We don't have such a pool for GPU's (yet) - although I suppose they
>>> could just become a class of storage pools.
>>>
>>> The issue being nodedev device objects are not saved between reboots.
>>> They are generated on the fly. Hence the "create-nodedev' API - notice
>>> there's no "define-nodedev' API, although I suppose one could be
>>> created. It's just more work to get this all to work properly.
>>
>> It can all be made transient to begin with.  The VM can be defined but
>> won't start unless the mdev(s) exist with the right UUIDs.
>>
 After creating the vGPU, if required by the host driver, all the other
 type ids would disappear from "virsh nodedev-dumpxml
 pci__86_00_0" too.
>>>
>>> Not wanting to make assumptions, but this reads as if I create one type
>>> 11 vGPU, then I can create no others on the host.  Maybe I'm reading it
>>> wrong - it's been a long week.
>>
>> Correct, at least for NVIDIA.
>>
>>> PCI devices have the "managed='yes|no'" attribute as well. That's what
>>> determines whether the device is to be detached from the host or not.
>>> That's been something very painful to manage for vfio and well libvirt!
>>
>> mdevs do not exist on the host (they do not have a driver on the host
>> because they are not PCI devices) so they do not need any management.  At
>> least I hope that's good news. :)
> 
> What's your definition of "management"? They don't need the same type of
> management as a traditional hostdev, but they certainly don't just
> appear by magic! :-)
> 
> For standard PCI devices, the managed attribute says whether or not the
> device needs to be detached from the host driver and attached to
> vfio-pci. For other kinds of hostdev devices, we could decide that it
> meant something different. In this case, perhaps managed='yes' could
> mean that the vGPU will be created as needed, and destroyed when the
> guest is finished with it, and managed='no' could mean that we expect a
> vGPU to already exist, and just need starting.
> 
> Or not. Maybe that's a pointless distinction in this case. Just pointing
> out the option...
> 

Mediated devices are like virtual devices; there may be no direct physical
device associated with them. All mdev devices are owned by the vfio_mdev
module, which is similar to the vfio_pci module. I don't think we need to
interpret the 'managed' attribute for mdev devices the same way as for
standard PCI devices.
Once an mdev device is created, you would find its device directory under
/sys/bus/mdev/devices/.
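
For example (UUIDs illustrative), with two mdev devices created:

  # ls /sys/bus/mdev/devices/
  0695d332-7831-493f-9e71-1c85c8911a08  83b8f4f2-509f-382f-3c1e-e6bfe0fa1001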

Kirti.



Re: [Qemu-devel] [PATCH v7 4/4] docs: Add Documentation for Mediated devices

2016-09-03 Thread Kirti Wankhede
Adding Eric.

Eric,
This is the v7 version of the patch. I'll incorporate the changes that you
suggested here.

Kirti.

On 8/25/2016 9:23 AM, Kirti Wankhede wrote:
> Add file Documentation/vfio-mediated-device.txt that include details of
> mediated device framework.
> 
> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
> Signed-off-by: Neo Jia <c...@nvidia.com>
> Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
> Reviewed-on: http://git-master/r/1182512
> Reviewed-by: Automatic_Commit_Validation_User
> ---
>  Documentation/vfio-mediated-device.txt | 203 
> +
>  1 file changed, 203 insertions(+)
>  create mode 100644 Documentation/vfio-mediated-device.txt
> 
> diff --git a/Documentation/vfio-mediated-device.txt 
> b/Documentation/vfio-mediated-device.txt
> new file mode 100644
> index ..237d8eb630b7
> --- /dev/null
> +++ b/Documentation/vfio-mediated-device.txt
> @@ -0,0 +1,203 @@
> +VFIO Mediated devices [1]
> +---
> +
> +There are more and more use cases/demands to virtualize the DMA devices which
> +doesn't have SR_IOV capability built-in. To do this, drivers of different
> +devices had to develop their own management interface and set of APIs and 
> then
> +integrate it to user space software. We've identified common requirements and
> +unified management interface for such devices to make user space software
> +integration easier.
> +
> +The VFIO driver framework provides unified APIs for direct device access. It 
> is
> +an IOMMU/device agnostic framework for exposing direct device access to
> +user space, in a secure, IOMMU protected environment. This framework is
> +used for multiple devices like GPUs, network adapters and compute 
> accelerators.
> +With direct device access, virtual machines or user space applications have
> +direct access of physical device. This framework is reused for mediated 
> devices.
> +
> +Mediated core driver provides a common interface for mediated device 
> management
> +that can be used by drivers of different devices. This module provides a 
> generic
> +interface to create/destroy mediated device, add/remove it to mediated bus
> +driver, add/remove device to IOMMU group. It also provides an interface to
> +register bus driver, for example, Mediated VFIO mdev driver is designed for
> +mediated devices and supports VFIO APIs. Mediated bus driver add/delete 
> mediated
> +device to VFIO Group.
> +
> +Below is the high Level block diagram, with NVIDIA, Intel and IBM devices
> +as example, since these are the devices which are going to actively use
> +this module as of now.
> +
> + +---------------+
> + |               |
> + | +-----------+ |  mdev_register_driver() +--------------+
> + | |           | +<------------------------+              |
> + | |  mdev     | |                         |              |
> + | |  bus      | +------------------------>+ vfio_mdev.ko |<-> VFIO user
> + | |  driver   | |     probe()/remove()    |              |    APIs
> + | |           | |                         +--------------+
> + | +-----------+ |
> + |               |
> + |  MDEV CORE    |
> + |   MODULE      |
> + |   mdev.ko     |
> + | +-----------+ |  mdev_register_device() +--------------+
> + | |           | +<------------------------+              |
> + | |           | |                         |  nvidia.ko   |<-> physical
> + | |           | +------------------------>+              |    device
> + | |           | |        callbacks        +--------------+
> + | | Physical  | |
> + | |  device   | |  mdev_register_device() +--------------+
> + | | interface | |<------------------------+              |
> + | |           | |                         |  i915.ko     |<-> physical
> + | |           | +------------------------>+              |    device
> + | |           | |        callbacks        +--------------+
> + | |           | |
> + | |           | |  mdev_register_device() +--------------+
> + | |           | +<------------------------+              |
> + | |           | |                         | ccw_device.ko|<-> physical
> + | |           | +------------------------>+              |    device
> + | |           | |        callbacks        +--------------+
> + | +-----------+ |
> + +---------------+
> +
> +
> +Registration Interfaces
> +---
> +
> +Mediated core driver provides two types of registration interfaces

Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-03 Thread Kirti Wankhede


On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> 
> 
> On 02/09/2016 20:33, Kirti Wankhede wrote:
>>  We could even do:
>>>>
>>>> echo $UUID1:$GROUPA > create
>>>>
>>>> where $GROUPA is the group ID of a previously created mdev device into
>>>> which $UUID1 is to be created and added to the same group.
>> 
> 
> From the point of view of libvirt, I think I prefer Alex's idea.
> <group> could be an additional element in the nodedev-create XML:
> 
> <device>
>   <name>my-vgpu</name>
>   <parent>pci_0000_86_00_0</parent>
>   <capability type='mdev'>
>     <type id='11'/>
>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>     <group>group1</group>
>   </capability>
> </device>
> 
> (should group also be a UUID?)
> 

No, this should be a unique number in a system, similar to iommu_group.

> Since John brought up the topic of minimal XML, in this case it will be
> like this:
> 
> <device>
>   <name>my-vgpu</name>
>   <parent>pci_0000_86_00_0</parent>
>   <capability type='mdev'>
>     <type id='11'/>
>   </capability>
> </device>
> 
> The uuid will be autogenerated by libvirt and if there's no <group> (as
> is common for VMs with only 1 vGPU) it will be a single-device group.
> 

Right.

Kirti.

> Thanks,
> 
> Paolo
> 



Re: [Qemu-devel] [libvirt] [PATCH v7 0/4] Add Mediated device support

2016-09-03 Thread Kirti Wankhede


On 9/3/2016 1:59 AM, John Ferlan wrote:
> 
> 
> On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
>>
>> On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 02/09/2016 19:15, Kirti Wankhede wrote:
>>>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>>>>
>>>>> <device>
>>>>>   <name>my-vgpu</name>
>>>>>   <parent>pci_0000_86_00_0</parent>
>>>>>   <capability type='mdev'>
>>>>>     <type id='11'/>
>>>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>>>   </capability>
>>>>> </device>
>>>>>
>>>>> After creating the vGPU, if required by the host driver, all the other
>>>>> type ids would disappear from "virsh nodedev-dumpxml pci__86_00_0" 
>>>>> too.
>>>>
>>>> Thanks Paolo for details.
>>>> 'nodedev-create' parse the xml file and accordingly write to 'create'
>>>> file in sysfs to create mdev device. Right?
>>>> At this moment, does libvirt know which VM this device would be
>>>> associated with?
>>>
>>> No, the VM will associate to the nodedev through the UUID.  The nodedev
>>> is created separately from the VM.
>>>
>>>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>>>>> info, again taken from sysfs:
>>>>>
>>>>>
>>>>>  my-vgpu
>>>>>  pci__86_00_0
>>>>>  
>>>>>0695d332-7831-493f-9e71-1c85c8911a08
>>>>>
>>>>>
>>>>>  
>>>>>
>>>>>
>>>>>  
>>>>>  
>>>>>  ...
>>>>>  NVIDIA
>>>>>
>>>>>  
>>>>>
>>>>>
>>>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>>>>> pci at all, would have it inside mdev.  This represents the difference
>>>>> between the mdev provider and the mdev device.
>>>>
>>>> Parent of mdev device might not always be a PCI device. I think we
>>>> shouldn't consider it as PCI capability.
>>>
>>> The  in the vGPU means that it _will_ be exposed
>>> as a PCI device by VFIO.
>>>
>>> The  in the physical GPU means that the GPU is a
>>> PCI device.
>>>
>>
>> Ok. Got that.
>>
>>>>> Random proposal for the domain XML too:
>>>>>
>>>>>   
>>>>> 
>>>>>   
>>>>>   0695d332-7831-493f-9e71-1c85c8911a08
>>>>> 
>>>>> 
>>>>>   
>>>>>
>>>>
>>>> When user wants to assign two mdev devices to one VM, user have to add
>>>> such two entries or group the two devices in one entry?
>>>
>>> Two entries, one per UUID, each with its own PCI address in the guest.
>>>
>>>> On other mail thread with same subject we are thinking of creating group
>>>> of mdev devices to assign multiple mdev devices to one VM.
>>>
>>> What is the advantage in managing mdev groups?  (Sorry didn't follow the
>>> other thread).
>>>
>>
>> When mdev device is created, resources from physical device is assigned
>> to this device. But resources are committed only when device goes
>> 'online' ('start' in v6 patch)
>> In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources
>> for all vGPU devices in a VM are committed at one place. So we need to
>> know the vGPUs assigned to a VM before QEMU starts.
>>
>> Grouping would help here as Alex suggested in that mail. Pulling only
>> that part of discussion here:
>>
>>  It seems then that the grouping needs to affect the iommu group
>> so that
>>> you know that there's only a single owner for all the mdev devices
>>> within the group.  IIRC, the bus drivers don't have any visibility
>>> to opening and releasing of the group itself to trigger the
>>> online/offline, but they can track opening of the device file
>>> descriptors within the group.  Within the VFIO API the user cannot
>>> access the device without the device file descriptor, so a "first
>>> device opened" and "last device closed" trigger would provide the
>>> trigger points you need.  Some sort of new sysfs interface would need
>>> to be invented to allow this sort of man

Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-02 Thread Kirti Wankhede

On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
> 
> 
> On 02/09/2016 19:15, Kirti Wankhede wrote:
>> On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
>>>
>>> <device>
>>>   <name>my-vgpu</name>
>>>   <parent>pci_0000_86_00_0</parent>
>>>   <capability type='mdev'>
>>>     <type id='11'/>
>>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>>>   </capability>
>>> </device>
>>>
>>> After creating the vGPU, if required by the host driver, all the other
>>> type ids would disappear from "virsh nodedev-dumpxml pci__86_00_0" too.
>>
>> Thanks Paolo for details.
>> 'nodedev-create' parse the xml file and accordingly write to 'create'
>> file in sysfs to create mdev device. Right?
>> At this moment, does libvirt know which VM this device would be
>> associated with?
> 
> No, the VM will associate to the nodedev through the UUID.  The nodedev
> is created separately from the VM.
> 
>>> When dumping the mdev with nodedev-dumpxml, it could show more complete
>>> info, again taken from sysfs:
>>>
>>>
>>>  my-vgpu
>>>  pci__86_00_0
>>>  
>>>0695d332-7831-493f-9e71-1c85c8911a08
>>>
>>>
>>>  
>>>
>>>
>>>  
>>>  
>>>  ...
>>>  NVIDIA
>>>
>>>  
>>>
>>>
>>> Notice how the parent has mdev inside pci; the vGPU, if it has to have
>>> pci at all, would have it inside mdev.  This represents the difference
>>> between the mdev provider and the mdev device.
>>
>> Parent of mdev device might not always be a PCI device. I think we
>> shouldn't consider it as PCI capability.
> 
> The  in the vGPU means that it _will_ be exposed
> as a PCI device by VFIO.
> 
> The  in the physical GPU means that the GPU is a
> PCI device.
> 

Ok. Got that.

>>> Random proposal for the domain XML too:
>>>
>>>   
>>> 
>>>   
>>>   0695d332-7831-493f-9e71-1c85c8911a08
>>> 
>>> 
>>>   
>>>
>>
>> When user wants to assign two mdev devices to one VM, user have to add
>> such two entries or group the two devices in one entry?
> 
> Two entries, one per UUID, each with its own PCI address in the guest.
> 
>> On other mail thread with same subject we are thinking of creating group
>> of mdev devices to assign multiple mdev devices to one VM.
> 
> What is the advantage in managing mdev groups?  (Sorry didn't follow the
> other thread).
> 

When an mdev device is created, resources from the physical device are
assigned to this device. But resources are committed only when the device
goes 'online' ('start' in the v6 patch).
In the case of multiple vGPUs in a VM for the NVIDIA vGPU solution, resources
for all vGPU devices in a VM are committed in one place. So we need to
know the vGPUs assigned to a VM before QEMU starts.

Grouping would help here as Alex suggested in that mail. Pulling only
that part of discussion here:

 It seems then that the grouping needs to affect the iommu group
so that
> you know that there's only a single owner for all the mdev devices
> within the group.  IIRC, the bus drivers don't have any visibility
> to opening and releasing of the group itself to trigger the
> online/offline, but they can track opening of the device file
> descriptors within the group.  Within the VFIO API the user cannot
> access the device without the device file descriptor, so a "first
> device opened" and "last device closed" trigger would provide the
> trigger points you need.  Some sort of new sysfs interface would need
> to be invented to allow this sort of manipulation.
> Also we should probably keep sight of whether we feel this is
> sufficiently necessary for the complexity.  If we can get by with only
> doing this grouping at creation time then we could define the "create"
> interface in various ways.  For example:
>
> echo $UUID0 > create
>
> would create a single mdev named $UUID0 in it's own group.
>
> echo {$UUID0,$UUID1} > create
>
> could create mdev devices $UUID0 and $UUID1 grouped together.
>



I think this would create mdev devices of the same type on the same parent
device. We need to consider the case where multiple mdev devices of
different types and with different parents are grouped together.


 We could even do:
>
> echo $UUID1:$GROUPA > create
>
> where $GROUPA is the group ID of a previously created mdev device into
> which $UUID1 is to be created and added to the same group.



I was thinking about:

  echo $UUID0 > create

would create mdev device

Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-02 Thread Kirti Wankhede
On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
> 
> 
> On 02/09/2016 07:21, Kirti Wankhede wrote:
>> On 9/2/2016 10:18 AM, Michal Privoznik wrote:
>>> Okay, maybe I'm misunderstanding something. I just thought that users
>>> will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
>>> nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
>>> to construct domain XML.
>>
>> I'm not familiar with libvirt code, curious how libvirt's nodedev driver
>> enumerates devices in the system?
> 
> It looks at sysfs and/or the udev database and transforms what it finds
> there to XML.
> 
> I think people would consult the nodedev driver to fetch vGPU
> capabilities, use "virsh nodedev-create" to create the vGPU device on
> the host, and then somehow refer to the nodedev in the domain XML.
> 
> There isn't very much documentation on nodedev-create, but it's used
> mostly for NPIV (virtual fibre channel adapter) and the XML looks like this:
> 
> <device>
>   <name>scsi_host6</name>
>   <parent>scsi_host5</parent>
>   <capability type='scsi_host'>
>     <capability type='fc_host'>
>       <wwnn>2001001b32a9da5e</wwnn>
>       <wwpn>2101001b32a9da5e</wwpn>
>     </capability>
>   </capability>
> </device>
> 
> so I suppose for vGPU it would look like this:
> 
> <device>
>   <name>my-vgpu</name>
>   <parent>pci_0000_86_00_0</parent>
>   <capability type='mdev'>
>     <type id='11'/>
>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
>   </capability>
> </device>
> 
> while the parent would have:
> 
>
>  pci__86_00_0
>  
>0
>134
>0
>0
>
>  
>  
>
>GRID M60-0B
>2
>45
>524288
>2560
>1600
>  
>
>GRID M60
>NVIDIA
>  
>
> 
> After creating the vGPU, if required by the host driver, all the other
> type ids would disappear from "virsh nodedev-dumpxml pci__86_00_0" too.
>

Thanks Paolo for the details.
'nodedev-create' parses the xml file and accordingly writes to the 'create'
file in sysfs to create the mdev device. Right?
At this moment, does libvirt know which VM this device would be
associated with?
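
In other words, a sketch of the sysfs write nodedev-create would boil down to
(the exact file name and argument format are still under discussion in this
thread; values illustrative):

  # <parent> and <uuid> taken from the nodedev XML
  echo "0695d332-7831-493f-9e71-1c85c8911a08" > \
        /sys/bus/pci/devices/0000:86:00.0/mdev_create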

> When dumping the mdev with nodedev-dumpxml, it could show more complete
> info, again taken from sysfs:
> 
>
>  my-vgpu
>  pci__86_00_0
>  
>0695d332-7831-493f-9e71-1c85c8911a08
>
>
>  GRID M60-0B
>  2
>  45
>  524288
>  2560
>  1600
>
>
>  
>  
>  ...
>  NVIDIA
>
>  
>
> 
> Notice how the parent has mdev inside pci; the vGPU, if it has to have
> pci at all, would have it inside mdev.  This represents the difference
> between the mdev provider and the mdev device.
>

The parent of an mdev device might not always be a PCI device. I think we
shouldn't consider it as a PCI capability.

> Random proposal for the domain XML too:
> 
>   
> 
>   
>   0695d332-7831-493f-9e71-1c85c8911a08
> 
> 
>   
>

When a user wants to assign two mdev devices to one VM, does the user have to
add two such entries, or group the two devices in one entry?
On the other mail thread with the same subject we are thinking of creating a
group of mdev devices to assign multiple mdev devices to one VM. Libvirt
doesn't have to know about the group number, but libvirt should add all mdev
devices in a group. Is that possible to do before starting the QEMU process?

Thanks,
Kirti


> Paolo
> 



Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-02 Thread Kirti Wankhede


On 9/2/2016 1:31 AM, Alex Williamson wrote:
> On Thu, 1 Sep 2016 23:52:02 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> Alex,
>> Thanks for summarizing the discussion.
>>
>> On 8/31/2016 9:18 PM, Alex Williamson wrote:
>>> On Wed, 31 Aug 2016 15:04:13 +0800
>>> Jike Song <jike.s...@intel.com> wrote:
>>>   
>>>> On 08/31/2016 02:12 PM, Tian, Kevin wrote:  
>>>>>> From: Alex Williamson [mailto:alex.william...@redhat.com]
>>>>>> Sent: Wednesday, August 31, 2016 12:17 AM
>>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> At KVM Forum we had a BoF session primarily around the mediated device
>>>>>> sysfs interface.  I'd like to share what I think we agreed on and the
>>>>>> "problem areas" that still need some work so we can get the thoughts
>>>>>> and ideas from those who weren't able to attend.
>>>>>>
>>>>>> DanPB expressed some concern about the mdev_supported_types sysfs
>>>>>> interface, which exposes a flat csv file with fields like "type",
>>>>>> "number of instance", "vendor string", and then a bunch of type
>>>>>> specific fields like "framebuffer size", "resolution", "frame rate
>>>>>> limit", etc.  This is not entirely machine parsing friendly and sort of
>>>>>> abuses the sysfs concept of one value per file.  Example output taken
>>>>>> from Neo's libvirt RFC:
>>>>>>
>>>>>> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
>>>>>> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
>>>>>> 11  ,"GRID M60-0B",  16,   2,  45, 512M,2560x1600
>>>>>> 12  ,"GRID M60-0Q",  16,   2,  60, 512M,2560x1600
>>>>>> 13  ,"GRID M60-1B",   8,   2,  45,1024M,2560x1600
>>>>>> 14  ,"GRID M60-1Q",   8,   2,  60,1024M,2560x1600
>>>>>> 15  ,"GRID M60-2B",   4,   2,  45,2048M,2560x1600
>>>>>> 16  ,"GRID M60-2Q",   4,   4,  60,2048M,2560x1600
>>>>>> 17  ,"GRID M60-4Q",   2,   4,  60,4096M,3840x2160
>>>>>> 18  ,"GRID M60-8Q",   1,   4,  60,8192M,3840x2160
>>>>>>
>>>>>> The create/destroy then looks like this:
>>>>>>
>>>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>>>>  /sys/bus/pci/devices/.../mdev_create
>>>>>>
>>>>>> echo "$mdev_UUID:vendor_specific_argument_list" >
>>>>>>  /sys/bus/pci/devices/.../mdev_destroy
>>>>>>
>>>>>> "vendor_specific_argument_list" is nebulous.
>>>>>>
>>>>>> So the idea to fix this is to explode this into a directory structure,
>>>>>> something like:
>>>>>>
>>>>>> ├── mdev_destroy
>>>>>> └── mdev_supported_types
>>>>>> ├── 11
>>>>>> │   ├── create
>>>>>> │   ├── description
>>>>>> │   └── max_instances
>>>>>> ├── 12
>>>>>> │   ├── create
>>>>>> │   ├── description
>>>>>> │   └── max_instances
>>>>>> └── 13
>>>>>> ├── create
>>>>>> ├── description
>>>>>> └── max_instances
>>>>>>
>>>>>> Note that I'm only exposing the minimal attributes here for simplicity,
>>>>>> the other attributes would be included in separate files and we would
>>>>>> require vendors to create standard attributes for common device classes. 
>>>>>>
>>>>>
>>>>> I like this idea. All standard attributes are reflected into this 
>>>>> hierarchy.
>>>>> In the meantime, can we still allow optional vendor string in create 
>>>>> interface? libvirt doesn't need to know the meaning, but allows upper
>>>>> layer to do some vendor specific tweak 

Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-01 Thread Kirti Wankhede


On 9/2/2016 10:18 AM, Michal Privoznik wrote:
> On 01.09.2016 18:59, Alex Williamson wrote:
>> On Thu, 1 Sep 2016 18:47:06 +0200
>> Michal Privoznik  wrote:
>>
>>> On 31.08.2016 08:12, Tian, Kevin wrote:
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, August 31, 2016 12:17 AM
>
> Hi folks,
>
> At KVM Forum we had a BoF session primarily around the mediated device
> sysfs interface.  I'd like to share what I think we agreed on and the
> "problem areas" that still need some work so we can get the thoughts
> and ideas from those who weren't able to attend.
>
> DanPB expressed some concern about the mdev_supported_types sysfs
> interface, which exposes a flat csv file with fields like "type",
> "number of instance", "vendor string", and then a bunch of type
> specific fields like "framebuffer size", "resolution", "frame rate
> limit", etc.  This is not entirely machine parsing friendly and sort of
> abuses the sysfs concept of one value per file.  Example output taken
> from Neo's libvirt RFC:
>
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
> 11  ,"GRID M60-0B",  16,   2,  45, 512M,2560x1600
> 12  ,"GRID M60-0Q",  16,   2,  60, 512M,2560x1600
> 13  ,"GRID M60-1B",   8,   2,  45,1024M,2560x1600
> 14  ,"GRID M60-1Q",   8,   2,  60,1024M,2560x1600
> 15  ,"GRID M60-2B",   4,   2,  45,2048M,2560x1600
> 16  ,"GRID M60-2Q",   4,   4,  60,2048M,2560x1600
> 17  ,"GRID M60-4Q",   2,   4,  60,4096M,3840x2160
> 18  ,"GRID M60-8Q",   1,   4,  60,8192M,3840x2160
>
> The create/destroy then looks like this:
>
> echo "$mdev_UUID:vendor_specific_argument_list" >
>   /sys/bus/pci/devices/.../mdev_create
>
> echo "$mdev_UUID:vendor_specific_argument_list" >
>   /sys/bus/pci/devices/.../mdev_destroy
>
> "vendor_specific_argument_list" is nebulous.
>
> So the idea to fix this is to explode this into a directory structure,
> something like:
>
> ├── mdev_destroy
> └── mdev_supported_types
> ├── 11
> │   ├── create
> │   ├── description
> │   └── max_instances
> ├── 12
> │   ├── create
> │   ├── description
> │   └── max_instances
> └── 13
> ├── create
> ├── description
> └── max_instances
>
> Note that I'm only exposing the minimal attributes here for simplicity,
> the other attributes would be included in separate files and we would
> require vendors to create standard attributes for common device classes.  

 I like this idea. All standard attributes are reflected into this 
 hierarchy.
 In the meantime, can we still allow optional vendor string in create 
 interface? libvirt doesn't need to know the meaning, but allows upper
 layer to do some vendor specific tweak if necessary.  
>>>
>>> This is not the best idea IMO. Libvirt is there to shadow differences
>>> between hypervisors. While doing that, we often hide differences between
>>> various types of HW too. Therefore in order to provide good abstraction
>>> we should make vendor specific string as small as possible (ideally an
>>> empty string). I mean I see it as bad idea to expose "vgpu_type_id" from
>>> example above in domain XML. What I think the better idea is if we let
>>> users choose resolution and frame buffer size, e.g.: <... resolution="1024x768" framebuffer="16"/> (just the first idea that came
>>> to my mind while writing this e-mail). The point is, XML part is
>>> completely free of any vendor-specific knobs.
>>
>> That's not really what you want though, a user actually cares whether
>> they get an Intel or NVIDIA vGPU, we can't specify it as just a
>> resolution and framebuffer size.  The user also doesn't want the model
>> changing each time the VM is started, so not only do you *need* to know
>> the vendor, you need to know the vendor model.  This is the only way to
>> provide a consistent VM.  So as we discussed at the BoF, the libvirt
>> xml will likely reference the vendor string, which will be a unique
>> identifier that encompasses all the additional attributes we expose.
>> Really the goal of the attributes is simply so you don't need a per
>> vendor magic decoder ring to figure out the basic features of a given
>> vendor string.  Thanks,
> 
> Okay, maybe I'm misunderstanding something. I just thought that users
> will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
> nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
> to construct the domain XML.

Re: [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-01 Thread Kirti Wankhede

Alex,
Thanks for summarizing the discussion.

On 8/31/2016 9:18 PM, Alex Williamson wrote:
> On Wed, 31 Aug 2016 15:04:13 +0800
> Jike Song  wrote:
> 
>> On 08/31/2016 02:12 PM, Tian, Kevin wrote:
 From: Alex Williamson [mailto:alex.william...@redhat.com]
 Sent: Wednesday, August 31, 2016 12:17 AM

 Hi folks,

 At KVM Forum we had a BoF session primarily around the mediated device
 sysfs interface.  I'd like to share what I think we agreed on and the
 "problem areas" that still need some work so we can get the thoughts
 and ideas from those who weren't able to attend.

 DanPB expressed some concern about the mdev_supported_types sysfs
 interface, which exposes a flat csv file with fields like "type",
 "number of instance", "vendor string", and then a bunch of type
 specific fields like "framebuffer size", "resolution", "frame rate
 limit", etc.  This is not entirely machine parsing friendly and sort of
 abuses the sysfs concept of one value per file.  Example output taken
 from Neo's libvirt RFC:

 cat /sys/bus/pci/devices/:86:00.0/mdev_supported_types
 # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, 
 framebuffer,
 max_resolution
 11  ,"GRID M60-0B",  16,   2,  45, 512M,2560x1600
 12  ,"GRID M60-0Q",  16,   2,  60, 512M,2560x1600
 13  ,"GRID M60-1B",   8,   2,  45,1024M,2560x1600
 14  ,"GRID M60-1Q",   8,   2,  60,1024M,2560x1600
 15  ,"GRID M60-2B",   4,   2,  45,2048M,2560x1600
 16  ,"GRID M60-2Q",   4,   4,  60,2048M,2560x1600
 17  ,"GRID M60-4Q",   2,   4,  60,4096M,3840x2160
 18  ,"GRID M60-8Q",   1,   4,  60,8192M,3840x2160

 The create/destroy then looks like this:

 echo "$mdev_UUID:vendor_specific_argument_list" >
/sys/bus/pci/devices/.../mdev_create

 echo "$mdev_UUID:vendor_specific_argument_list" >
/sys/bus/pci/devices/.../mdev_destroy

 "vendor_specific_argument_list" is nebulous.

 So the idea to fix this is to explode this into a directory structure,
 something like:

 ├── mdev_destroy
 └── mdev_supported_types
 ├── 11
 │   ├── create
 │   ├── description
 │   └── max_instances
 ├── 12
 │   ├── create
 │   ├── description
 │   └── max_instances
 └── 13
 ├── create
 ├── description
 └── max_instances

 Note that I'm only exposing the minimal attributes here for simplicity,
 the other attributes would be included in separate files and we would
 require vendors to create standard attributes for common device classes.  
>>>
>>> I like this idea. All standard attributes are reflected into this hierarchy.
>>> In the meantime, can we still allow optional vendor string in create 
>>> interface? libvirt doesn't need to know the meaning, but allows upper
>>> layer to do some vendor specific tweak if necessary.
>>>   
>>
>> Not sure whether this can done within MDEV framework (attrs provided by
>> vendor driver of course), or must be within the vendor driver.
> 
> The purpose of the sub-directories is that libvirt doesn't need to pass
> arbitrary, vendor strings to the create function, the attributes of the
> mdev device created are defined by the attributes in the sysfs
> directory where the create is done.  The user only provides a uuid for
> the device.  Arbitrary vendor parameters are a barrier, libvirt may not
> need to know the meaning, but would need to know when to apply them,
> which is just as bad.  Ultimately we want libvirt to be able to
> interact with sysfs without having an vendor specific knowledge.
> 

The above directory hierarchy looks fine to me. Along with the fixed set of
parameters, an optional extra-parameter field is also required. Such
parameters are needed for some specific testing or for running benchmarks,
for example to disable FRL (frame rate limiter) or to disable console VNC
when not required. Libvirt doesn't need to know the details; it is just a
string that the user can provide, and libvirt needs to pass the string as-is
to the vendor driver, which would act accordingly.
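
For illustration, with the directory layout proposed above, a management tool
would only need to write a UUID into the per-type create node. A minimal
userspace sketch (the sysfs path and type id "11" below are hypothetical):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path =
            "/sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create";
        const char *uuid = "83b8f4f2-509f-382f-3c1e-e6bfe0fa1001"; /* example */
        int fd = open(path, O_WRONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, uuid, strlen(uuid)) < 0)
                perror("write");
        close(fd);
        return 0;
}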


 For vGPUs like NVIDIA where we don't support multiple types
 concurrently, this directory structure would update as mdev devices are
 created, removing no longer available types.  I carried forward  
>>>
>>> or keep the type with max_instances cleared to ZERO.
>>>  
>>
>> +1 :)
> 
> Possible yes, but why would the vendor driver report types that the
> user cannot create?  It just seems like superfluous information (well,
> except for the use I discover below).
> 

The directory structure for a physical GPU will be defined when the device
is registered to the mdev core module.

Re: [Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices

2016-08-26 Thread Kirti Wankhede


On 8/25/2016 2:52 PM, Dong Jia wrote:
> On Thu, 25 Aug 2016 09:23:53 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
> [...]
> 
> Dear Kirti,
> 
> I just rebased my vfio-ccw patches to this series.
> With a little fix, which was pointed it out in my reply to the #3
> patch, it works fine.
> 

Thanks for the update. Glad to know this works for you.


>> +static long vfio_mdev_unlocked_ioctl(void *device_data,
>> + unsigned int cmd, unsigned long arg)
>> +{
>> +int ret = 0;
>> +struct vfio_mdev *vmdev = device_data;
>> +struct parent_device *parent = vmdev->mdev->parent;
>> +unsigned long minsz;
>> +
>> +switch (cmd) {
>> +case VFIO_DEVICE_GET_INFO:
>> +{
>> +struct vfio_device_info info;
>> +
>> +minsz = offsetofend(struct vfio_device_info, num_irqs);
>> +
>> +if (copy_from_user(&info, (void __user *)arg, minsz))
>> +return -EFAULT;
>> +
>> +if (info.argsz < minsz)
>> +return -EINVAL;
>> +
>> +if (parent->ops->get_device_info)
>> +ret = parent->ops->get_device_info(vmdev->mdev, &info);
>> +else
>> +return -EINVAL;
>> +
>> +if (ret)
>> +return ret;
>> +
>> +if (parent->ops->reset)
>> +info.flags |= VFIO_DEVICE_FLAGS_RESET;
> Shouldn't this be done inside the get_device_info callback?
> 

I would like the vendor driver to set the device type only. The reset flag
should be set based on whether the reset() callback is provided.
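
As a minimal sketch of the vendor side of that (the callback signature is
inferred from the call quoted above; the names are placeholders):

static int my_get_device_info(struct mdev_device *mdev,
                              struct vfio_device_info *info)
{
        /* Vendor driver reports the device type and counts only. */
        info->flags = VFIO_DEVICE_FLAGS_PCI;
        info->num_regions = VFIO_PCI_NUM_REGIONS;
        info->num_irqs = VFIO_PCI_NUM_IRQS;
        /* VFIO_DEVICE_FLAGS_RESET is added by vfio_mdev when ->reset exists */
        return 0;
}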

>> +
>> +memcpy(&vmdev->dev_info, &info, sizeof(info));
>> +
>> +return copy_to_user((void __user *)arg, &info, minsz);
>> +}
> [...]
> 
>> +
>> +static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
>> +  size_t count, loff_t *ppos)
>> +{
>> +struct vfio_mdev *vmdev = device_data;
>> +struct mdev_device *mdev = vmdev->mdev;
>> +struct parent_device *parent = mdev->parent;
>> +unsigned int done = 0;
>> +int ret;
>> +
>> +if (!parent->ops->read)
>> +return -EINVAL;
>> +
>> +while (count) {
> Here, I have to say sorry to you guys for that I didn't notice the
> bad impact of this change to my patches during the v6 discussion.
> 
> For vfio-ccw, I introduced an I/O region to input/output I/O
> instruction parameters and results for Qemu. The @count of these data
> currently is 140. So supporting arbitrary lengths in one shot here, and
> also in vfio_mdev_write, seems the better option for this case.
> 
> I believe that if the pci drivers want to iterate in a 4 bytes step, you
> can do that in the parent read/write callbacks instead.
> 
> What do you think?
> 

I would like to know Alex's thoughts on this. He raised a concern with this
approach in the v6 review:
"But I think this is exploitable, it lets the user make the kernel
allocate an arbitrarily sized buffer."

Thanks,
Kirti

>> +size_t filled;
>> +
>> +if (count >= 4 && !(*ppos % 4)) {
>> +u32 val;
>> +
>> +ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>> +*ppos);
>> +if (ret <= 0)
>> +goto read_err;
>> +
>> +if (copy_to_user(buf, &val, sizeof(val)))
>> +goto read_err;
>> +
>> +filled = 4;
>> +} else if (count >= 2 && !(*ppos % 2)) {
>> +u16 val;
>> +
>> +ret = parent->ops->read(mdev, (char *)&val, sizeof(val),
>> +*ppos);
>> +if (ret <= 0)
>> +goto read_err;
>> +
>> +if (copy_to_user(buf, &val, sizeof(val)))
>> +goto read_err;
>> +
>> +filled = 2;
>> +} else {
>> +u8 val;
>> +
>> +ret = parent->ops->read(mdev, &val, sizeof(val), *ppos);
>> +if (ret <= 0)
>> +goto read_err;
>> +
>> +if (copy_to_user(buf, &val, sizeof(val)))
>> +goto read_err;
>> +
>> +filled = 1;
>> +}
>> +
>> +count -= filled;
>> +done += filled;
>> +*ppos += filled;
>> +buf += filled;
>> +}
>> +
>> +return done;
>> +
>> +read_err:
>> +return -EFAULT;
>> +}
> [...]
> 
> 
> Dong Jia
> 
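
As a sketch of Dong Jia's suggestion above -- let vfio_mdev pass the whole
buffer through and have a PCI parent driver chunk accesses itself -- a vendor
read callback could iterate internally. The signature is inferred from the
code quoted above; my_hw_read_reg() is a stand-in for real register access and
alignment handling is omitted:

static ssize_t my_parent_read(struct mdev_device *mdev, char *buf,
                              size_t count, loff_t pos)
{
        ssize_t done = 0;

        while (count) {
                size_t step = min_t(size_t, count, 4);
                u32 val;

                /* hypothetical helper that reads device registers */
                val = my_hw_read_reg(mdev, pos, step);
                memcpy(buf, &val, step);

                buf += step;
                pos += step;
                count -= step;
                done += step;
        }
        return done;
}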



Re: [Qemu-devel] [PATCH v7 3/4] vfio iommu: Add support for mediated devices

2016-08-26 Thread Kirti Wankhede

Oh, that was a last-minute change after running checkpatch.pl :(
Thanks for catching that. I'll correct it.

Thanks,
Kirti

On 8/25/2016 12:59 PM, Dong Jia wrote:
> On Thu, 25 Aug 2016 09:23:54 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> @@ -769,6 +1090,33 @@ static int vfio_iommu_type1_attach_group(void 
>> *iommu_data,
>>  if (ret)
>>  goto out_free;
>>
>> +if (IS_ENABLED(CONFIF_VFIO_MDEV) && !iommu_present(bus) &&
> s/CONFIF_VFIO_MDEV/CONFIG_VFIO_MDEV/
> 
>> +(bus == &mdev_bus_type)) {
>> +if (iommu->local_domain) {
>> +list_add(&group->next,
>> + &iommu->local_domain->group_list);
>> +kfree(domain);
>> +mutex_unlock(&iommu->lock);
>> +return 0;
>> +}
>> +
> 
> 
> 
> Dong Jia
> 



[Qemu-devel] [PATCH v7 3/4] vfio iommu: Add support for mediated devices

2016-08-24 Thread Kirti Wankhede
VFIO IOMMU drivers are designed for devices that are IOMMU capable. A
mediated device only uses the IOMMU APIs; the underlying hardware can be
managed by an IOMMU domain.

Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module

Added two new callback functions to struct vfio_iommu_driver_ops. A backend
IOMMU module that supports pinning and unpinning pages for mdev devices
should provide these functions.
Added APIs for pinning and unpinning pages to the VFIO module. These call
back into the backend iommu module to actually pin and unpin pages.

This change adds pin and unpin support for mediated devices to the TYPE1 IOMMU
backend module. More details:
- When the iommu_group of a mediated device is attached, the task structure is
  cached; it is used later for pinning pages and for page accounting.
- It keeps track of pinned pages for the mediated domain. This data is used to
  verify unpinning requests and to unpin any remaining pages while detaching.
- Used the existing mechanism for page accounting. If an iommu capable domain
  exists in the container then all pages are already pinned and accounted.
  Accounting for mdev devices is only done if there is no iommu capable
  domain in the container.

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- two GPU pass through

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
Reviewed-on: http://git-master/r/1175707
Reviewed-by: Automatic_Commit_Validation_User
---
 drivers/vfio/vfio.c | 117 ++
 drivers/vfio/vfio_iommu_type1.c | 498 
 include/linux/vfio.h|  13 +-
 3 files changed, 580 insertions(+), 48 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6fd6fa5469de..e3e342861e04 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1782,6 +1782,123 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, 
size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+static struct vfio_group *vfio_group_from_dev(struct device *dev)
+{
+   struct vfio_device *device;
+   struct vfio_group *group;
+   int ret;
+
+   device = vfio_device_get_from_dev(dev);
+   if (!device)
+   return ERR_PTR(-EINVAL);
+
+   group = device->group;
+   if (!atomic_inc_not_zero(&group->container_users)) {
+   ret = -EINVAL;
+   goto err_ret;
+   }
+
+   if (group->noiommu) {
+   atomic_dec(&group->container_users);
+   ret = -EPERM;
+   goto err_ret;
+   }
+
+   if (!group->container->iommu_driver ||
+   !vfio_group_viable(group)) {
+   atomic_dec(&group->container_users);
+   ret = -EINVAL;
+   goto err_ret;
+   }
+
+   vfio_device_put(device);
+   return group;
+
+err_ret:
+   vfio_device_put(device);
+   return ERR_PTR(ret);
+}
+
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for local
+ * domain only.
+ * @dev [in] : device
+ * @user_pfn [in]: array of user/guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @phys_pfn[out] : array of host PFNs
+ */
+long vfio_pin_pages(struct device *dev, unsigned long *user_pfn,
+   long npage, int prot, unsigned long *phys_pfn)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   ssize_t ret = -EINVAL;
+
+   if (!dev || !user_pfn || !phys_pfn)
+   return -EINVAL;
+
+   group = vfio_group_from_dev(dev);
+   if (IS_ERR(group))
+   return PTR_ERR(group);
+
+   container = group->container;
+   if (IS_ERR(container))
+   return PTR_ERR(container);
+
+   down_read(&container->group_lock);
+
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->pin_pages))
+   ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
+npage, prot, phys_pfn);
+
+   up_read(&container->group_lock);
+   vfio_group_try_dissolve_container(group);
+
+   return ret;
+
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+/*
+ * Unpin set of host PFNs for local domain only.
+ * @dev [in] : device
+ * @pfn [in] : array of host PFNs to be unpinned.
+ * @npage [in] :count of elements in array, that is number of pages.
+ */
+long vfio_unpin_pages(struct device *dev, unsigned long *pfn, long npage)
+{
+   struct vfio_container *container;
+   struct vfio_group *group;
+   struct vfio_iommu_driver *driver;
+   ssize_t ret = -EINVAL;
+
+   if (!dev || !pfn)
+   return -EINVAL;
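
For reference, a vendor driver would use the two exports above roughly as
follows. This is a sketch only; the guest PFN array and the DMA programming
step are placeholders:

static int my_pin_guest_range(struct device *dev, unsigned long *gfns,
                              long npage)
{
        unsigned long *hpfns;
        long ret;

        hpfns = kcalloc(npage, sizeof(*hpfns), GFP_KERNEL);
        if (!hpfns)
                return -ENOMEM;

        /* translate guest PFNs to host PFNs and account them */
        ret = vfio_pin_pages(dev, gfns, npage, IOMMU_READ | IOMMU_WRITE,
                             hpfns);
        if (ret < 0)
                goto out;

        /* ... program device DMA using hpfns ... */

        vfio_unpin_pages(dev, hpfns, npage);
        ret = 0;
out:
        kfree(hpfns);
        return ret;
}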

[Qemu-devel] [PATCH v7 4/4] docs: Add Documentation for Mediated devices

2016-08-24 Thread Kirti Wankhede
Add file Documentation/vfio-mediated-device.txt that include details of
mediated device framework.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
Reviewed-on: http://git-master/r/1182512
Reviewed-by: Automatic_Commit_Validation_User
---
 Documentation/vfio-mediated-device.txt | 203 +
 1 file changed, 203 insertions(+)
 create mode 100644 Documentation/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mediated-device.txt 
b/Documentation/vfio-mediated-device.txt
new file mode 100644
index ..237d8eb630b7
--- /dev/null
+++ b/Documentation/vfio-mediated-device.txt
@@ -0,0 +1,203 @@
+VFIO Mediated devices [1]
+---
+
+There are more and more use cases/demands to virtualize DMA devices that
+don't have SR-IOV capability built in. To do this, drivers of different
+devices had to develop their own management interface and set of APIs and then
+integrate them into user space software. We've identified common requirements
+and a unified management interface for such devices to make user space
+software integration easier.
+
+The VFIO driver framework provides unified APIs for direct device access. It is
+an IOMMU/device agnostic framework for exposing direct device access to
+user space, in a secure, IOMMU protected environment. This framework is
+used for multiple devices like GPUs, network adapters and compute accelerators.
+With direct device access, virtual machines or user space applications have
+direct access to the physical device. This framework is reused for mediated
+devices.
+
+The mediated core driver provides a common interface for mediated device
+management that can be used by drivers of different devices. This module
+provides a generic interface to create/destroy a mediated device, add/remove
+it to/from the mediated bus driver, and add/remove the device to/from an IOMMU
+group. It also provides an interface to register a bus driver; for example,
+the mediated VFIO mdev driver is designed for mediated devices and supports
+VFIO APIs. The mediated bus driver adds/deletes mediated devices to/from a
+VFIO group.
+
+Below is the high-level block diagram, with NVIDIA, Intel and IBM devices
+as examples, since these are the devices which are going to actively use
+this module as of now.
+
+ +---+
+ |   |
+ | +---+ |  mdev_register_driver() +--+
+ | |   | +<+  |
+ | |  mdev | | |  |
+ | |  bus  | +>+ vfio_mdev.ko |<-> VFIO user
+ | |  driver   | | probe()/remove()|  |APIs
+ | |   | | +--+
+ | +---+ |
+ |   |
+ |  MDEV CORE|
+ |   MODULE  |
+ |   mdev.ko |
+ | +---+ |  mdev_register_device() +--+
+ | |   | +<+  |
+ | |   | | |  nvidia.ko   |<-> physical
+ | |   | +>+  |device
+ | |   | |callbacks+--+
+ | | Physical  | |
+ | |  device   | |  mdev_register_device() +--+
+ | | interface | |<+  |
+ | |   | | |  i915.ko |<-> physical
+ | |   | +>+  |device
+ | |   | |callbacks+--+
+ | |   | |
+ | |   | |  mdev_register_device() +--+
+ | |   | +<+  |
+ | |   | | | ccw_device.ko|<-> physical
+ | |   | +>+  |device
+ | |   | |callbacks+--+
+ | +---+ |
+ +---+
+
+
+Registration Interfaces
+---
+
+Mediated core driver provides two types of registration interfaces:
+
+1. Registration interface for mediated bus driver:
+-
+ /*
+  * struct mdev_driver [2] - Mediated device's driver
+  * @name: driver name
+  * @probe: called when new device created
+  * @remove: called when device removed
+  * @driver: device driver structure
+  */
+ struct mdev_driver {
+const char *name;
+int  (*probe)  (struct device *dev);
+void (*remove) (struct device *dev);
+struct device_driverdriver;
+ };
+
+The mediated bus driver for mdev should use this interface to register with
+the mdev core driver.
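
A minimal registration sketch for such a bus driver, assuming the struct
mdev_driver shown above and the mdev_register_driver()/mdev_unregister_driver()
prototypes from patch 1/4 (all my_* names are placeholders):

#include <linux/module.h>
#include <linux/mdev.h>

static int my_mdev_probe(struct device *dev)
{
        /* bind the newly created mediated device */
        return 0;
}

static void my_mdev_remove(struct device *dev)
{
        /* release per-device state */
}

static struct mdev_driver my_mdev_driver = {
        .name   = "my_vfio_mdev",
        .probe  = my_mdev_probe,
        .remove = my_mdev_remove,
};

static int __init my_mdev_init(void)
{
        return mdev_register_driver(&my_mdev_driver, THIS_MODULE);
}

static void __exit my_mdev_exit(void)
{
        mdev_unregister_driver(&my_mdev_driver);
}

module_init(my_mdev_init);
module_exit(my_mdev_exit);
MODULE_LICENSE("GPL v2");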

[Qemu-devel] [PATCH v7 2/4] vfio: VFIO driver for mediated devices

2016-08-24 Thread Kirti Wankhede
The VFIO MDEV driver registers with the MDEV core driver. The MDEV core
driver creates a mediated device and calls the probe routine of the VFIO MDEV
driver, which adds the mediated device to the VFIO core module.
The main aim of this module is to manage all VFIO APIs for each mediated
device. Those are:
- get VFIO device information about type of device, maximum number of
  regions and maximum number of interrupts supported.
- get region information from vendor driver.
- Get interrupt information and send interrupt configuration information to
  vendor driver.
- Device reset
- Trap and forward read/write for emulated regions.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
Reviewed-on: http://git-master/r/1175706
Reviewed-by: Automatic_Commit_Validation_User
---
 drivers/vfio/mdev/Kconfig   |   6 +
 drivers/vfio/mdev/Makefile  |   1 +
 drivers/vfio/mdev/vfio_mdev.c   | 467 
 drivers/vfio/pci/vfio_pci_private.h |   6 +-
 4 files changed, 477 insertions(+), 3 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index a34fbc66f92f..703abd0a9bff 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,4 +9,10 @@ config VFIO_MDEV
 
 If you don't know what do here, say N.
 
+config VFIO_MDEV_DEVICE
+tristate "VFIO support for Mediated devices"
+depends on VFIO && VFIO_MDEV
+default n
+help
+VFIO based driver for mediated devices.
 
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 56a75e689582..e5087ed83a34 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_VFIO_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MDEV_DEVICE) += vfio_mdev.o
 
diff --git a/drivers/vfio/mdev/vfio_mdev.c b/drivers/vfio/mdev/vfio_mdev.c
new file mode 100644
index ..28f13aeaa46b
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mdev.c
@@ -0,0 +1,467 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia <c...@nvidia.com>
+ *Kirti Wankhede <kwankh...@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC "VFIO based Mediated PCI device driver"
+
+struct vfio_mdev {
+   struct iommu_group *group;
+   struct mdev_device *mdev;
+   struct vfio_device_info dev_info;
+};
+
+static int vfio_mdev_open(void *device_data)
+{
+   int ret = 0;
+
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   return ret;
+}
+
+static void vfio_mdev_close(void *device_data)
+{
+   module_put(THIS_MODULE);
+}
+
+static int sparse_mmap_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+   struct vfio_info_cap_header *header;
+   struct vfio_region_info_cap_sparse_mmap *sparse_cap, *sparse = cap_type;
+   size_t size;
+
+   size = sizeof(*sparse) + sparse->nr_areas *  sizeof(*sparse->areas);
+   header = vfio_info_cap_add(caps, size,
+  VFIO_REGION_INFO_CAP_SPARSE_MMAP, 1);
+   if (IS_ERR(header))
+   return PTR_ERR(header);
+
+   sparse_cap = container_of(header,
+   struct vfio_region_info_cap_sparse_mmap, header);
+   sparse_cap->nr_areas = sparse->nr_areas;
+   memcpy(sparse_cap->areas, sparse->areas,
+  sparse->nr_areas * sizeof(*sparse->areas));
+   return 0;
+}
+
+static int region_type_cap(struct vfio_info_cap *caps, void *cap_type)
+{
+   struct vfio_info_cap_header *header;
+   struct vfio_region_info_cap_type *type_cap, *cap = cap_type;
+
+   header = vfio_info_cap_add(caps, sizeof(*cap),
+  VFIO_REGION_INFO_CAP_TYPE, 1);
+   if (IS_ERR(header))
+   return PTR_ERR(header);
+
+   type_cap = container_of(header, struct vfio_region_info_cap_type,
+   header);
+   type_cap->type = cap->type;
+   type_cap->subtype = cap->subtype;
+   return 0;
+}
+
+static long vfio_mdev_unlocked_ioctl(void *device_data,
+unsigned int cmd, unsigned long arg)
+{
+   int ret = 0;
+   struct vfio_mdev *vmdev = device_data;
+   struct parent_device *parent = vmdev->mdev->parent;
+   unsign

[Qemu-devel] [PATCH v7 1/4] vfio: Mediated device Core driver

2016-08-24 Thread Kirti Wankhede
Design for Mediated Device Driver:
Main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.

Below is the high Level block diagram, with Nvidia, Intel and IBM devices
as example, since these are the devices which are going to actively use
this module as of now.

 +---+
 |   |
 | +---+ |  mdev_register_driver() +--+
 | |   | +<+ __init() |
 | |  mdev | | |  |
 | |  bus  | +>+  |<-> VFIO user
 | |  driver   | | probe()/remove()| vfio_mdev.ko |APIs
 | |   | | |  |
 | +---+ | +--+
 |   |
 |  MDEV CORE|
 |   MODULE  |
 |   mdev.ko |
 | +---+ |  mdev_register_device() +--+
 | |   | +<+  |
 | |   | | |  nvidia.ko   |<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--+
 | | interface | |<+  |
 | |   | | |  i915.ko |<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | |   | |
 | |   | |  mdev_register_device() +--+
 | |   | +<+  |
 | |   | | | ccw_device.ko|<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | +---+ |
 +---+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove:called when device removed
  * @driver:device driver structure
  *
  **/
struct mdev_driver {
 const char *name;
 int  (*probe)  (struct device *dev);
 void (*remove) (struct device *dev);
 struct device_driverdriver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

Mediated device's driver for mdev, vfio_mdev, uses this interface to
register with Core driver. vfio_mdev module adds mediated device to VFIO
group.

2. Physical device driver interface
This interface provides vendor driver the set APIs to manage physical
device related work in their own driver. APIs are :
- supported_config: provide supported configuration list by the vendor
driver
- create: to allocate basic resources in vendor driver for a mediated
  device.
- destroy: to free resources in vendor driver when mediated device is
   destroyed.
- reset: to free and reallocate resources in vendor driver during device
 reset.
- set_online_status: to change online status of mediated device.
- get_online_status: to get current (online/offline) status of mediated
 device.
- read : read emulation callback.
- write: write emulation callback.
- mmap: mmap emulation callback.
- get_irq_info: to retrieve information about mediated device's IRQ.
- set_irqs: send interrupt configuration information that VMM sets.
- get_device_info: to retrieve VFIO device related flags, number of regions
   and number of IRQs supported.
- get_region_info: to provide region size and its flags for the mediated
   device.

This registration interface should be used by vendor drivers to register
each physical device with the mdev core driver.
Locks to serialize the above callbacks have been removed. If required, the
vendor driver can have locks to serialize the above APIs in its own driver.
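
A hedged sketch of the vendor-driver side, assuming a parent_ops structure
whose members follow the callback list above (the field names, the my_*
callbacks and the exact mdev_register_device() signature should be checked
against include/linux/mdev.h in this series):

static const struct parent_ops my_parent_ops = {
        .supported_config = my_supported_config,
        .create           = my_create,
        .destroy          = my_destroy,
        .read             = my_read,
        .write            = my_write,
        .mmap             = my_mmap,
        .get_device_info  = my_get_device_info,
        .get_region_info  = my_get_region_info,
        .get_irq_info     = my_get_irq_info,
        .set_irqs         = my_set_irqs,
        .reset            = my_reset,
};

static int my_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        /* register the physical device with the mdev core */
        return mdev_register_device(&pdev->dev, &my_parent_ops);
}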

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
Reviewed-on: http://git-master/r/1175705
Reviewed-by: Automatic_Commit_Validation_User
---
 drivers/vfio/Kconfig |   1 +
 drivers/vfio/Makefile|   1 +
 drivers/vfio/mdev/Kconfig|  12 +
 drivers/vfio/mdev/Makefile   |   5 +
 drivers/vfio/mdev/mdev_core.c| 509 +++
 drivers/vfio/mdev/mdev_driver.c  | 131 ++
 drivers/vfio/mdev/mdev_private.h |  36 +++
 drivers/vfio/mdev/mdev_sysfs.c   | 240 ++
 include/linu

[Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-08-24 Thread Kirti Wankhede
This series adds mediated device support to the Linux host kernel. The purpose
of this series is to provide a common interface for mediated device
management that can be used by different devices. This series introduces the
mdev core module that creates and manages mediated devices, a VFIO based driver
for the mediated devices created by the mdev core module, and an update to the
VFIO type1 IOMMU module to support pinning and unpinning for mediated devices.

This change uses uuid_le_to_bin() to parse UUID string and convert to bin.
This requires following commits from linux master branch:
* commit bc9dc9d5eec908806f1b15c9ec2253d44dcf7835 :
lib/uuid.c: use correct offset in uuid parser
* commit 2b1b0d66704a8cafe83be7114ec4c15ab3a314ad :
lib/uuid.c: introduce a few more generic helpers

Requires below commits from linux master branch for mmap region fault handler
that uses remap_pfn_range() to setup EPT properly.
* commit add6a0cd1c5ba51b201e1361b05a5df817083618
KVM: MMU: try to fix up page faults before giving up
* commit 92176a8ede577d0ff78ab3298e06701f67ad5f51 :
KVM: MMU: prepare to support mapping of VM_IO and VM_PFNMAP frames

What's new in v7?
- Removed 'instance' field from mdev_device structure.
- Replaced 'start' and 'stop' with 'online' interface which is per mdev device
  and takes 1 or 0 as argument.
- Removed validate_mmap_request() callback and added mmap() callback to
  parent_ops.
- With above change, removed mapping tracking logic and invalidation function
  from mdev core module. Vendor driver should have this in their module.
- Added get_device_info() callback so that vendor driver can define the device
  type, number of regions and number of IRQs supported.
- Added get_irq_info() callback for vendor driver to define the flags for irqs.
- Updated get_region_info() callback so that vendor driver can specify the
  capabilities.
- With all the above changes, VFIO driver is no more PCI driver. It can be used
  for any type of device. Hence, renamed vfio_mpci module to vfio_mdev and
  removed match() from driver interface structure.

Yet TODO:
  Need to handle the case in the vfio_iommu_type1 module that Alex pointed out
in the v6 review, that is, if the devices attached to the normal IOMMU API
domain go away, we need to re-establish accounting for the local domain.


Kirti Wankhede (4):
  vfio: Mediated device Core driver
  vfio: VFIO driver for mediated devices
  vfio iommu: Add support for mediated devices
  docs: Add Documentation for Mediated devices

 Documentation/vfio-mediated-device.txt | 203 +
 drivers/vfio/Kconfig   |   1 +
 drivers/vfio/Makefile  |   1 +
 drivers/vfio/mdev/Kconfig  |  18 ++
 drivers/vfio/mdev/Makefile |   6 +
 drivers/vfio/mdev/mdev_core.c  | 509 +
 drivers/vfio/mdev/mdev_driver.c| 131 +
 drivers/vfio/mdev/mdev_private.h   |  36 +++
 drivers/vfio/mdev/mdev_sysfs.c | 240 
 drivers/vfio/mdev/vfio_mdev.c  | 467 ++
 drivers/vfio/pci/vfio_pci_private.h|   6 +-
 drivers/vfio/vfio.c| 117 
 drivers/vfio/vfio_iommu_type1.c| 499 +---
 include/linux/mdev.h   | 212 ++
 include/linux/vfio.h   |  13 +-
 15 files changed, 2408 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/vfio-mediated-device.txt
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 drivers/vfio/mdev/vfio_mdev.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0




Re: [Qemu-devel] [RFC v6-based v1 0/5] refine mdev framework

2016-08-19 Thread Kirti Wankhede


On 8/18/2016 11:55 PM, Alex Williamson wrote:
> On Thu, 18 Aug 2016 16:42:14 +0800
> Dong Jia  wrote:
> 
>> On Wed, 17 Aug 2016 03:09:10 -0700
>> Neo Jia  wrote:
>>
>>> On Wed, Aug 17, 2016 at 04:58:14PM +0800, Dong Jia wrote:  
 On Tue, 16 Aug 2016 16:14:12 +0800
 Jike Song  wrote:
   
>
> This patchset is based on NVidia's "Add Mediated device support" series, 
> version 6:
>
>   http://www.spinics.net/lists/kvm/msg136472.html
>
>
> Background:
>
>   The patchset from NVidia introduced the Mediated Device support to
>   Linux/VFIO. With that series, one can create virtual devices (supporting
>   by underlying physical device and vendor driver), and assign them to
>   userspace like QEMU/KVM, in the same way as device assignment via VFIO.
>
>   Based on that, NVidia and Intel implemented their vGPU solutions, IBM
>   implemented its CCW pass-through.  However, there are limitations
>   imposed by current (v6 in particular) mdev framework: the mdev must be
>   represented as a PCI device, several vfio capabilities such as
>   sparse mmap are not possible, and so forth.
>
>   This series aims to address above limitations and simplify the 
> implementation.
>
>
> Key Changes:
>
>   - An independent "struct device" was introduced to parent_device, thus
> a hierarchy in driver core is formed with physical device, parent 
> device
> and mdev device;
>
>   - Leveraging the mechanism and APIs provided by Linux driver core, it
> is now safe to remove all refcnts and locks;
>
>   - vfio_mpci (later renamed to vfio_mdev) was made BUS-agnostic: all
> PCI-specific logic was removed, accesses from userspace are now
> passed to vendor driver directly, thus guaranteed that full VFIO
> capabilities provided: e.g. dynamic regions, sparse mmap, etc.;
>
> With vfio_mdev being BUS-agnostic, it is enough to have only one
> driver for all mdev devices;  

 Hi Jike:

 I don't know what happened, but finding out which direction this will
 likely go seems my first priority now...  
>>>
>>> Hi Dong,
>>>
>>> Just want to let you know that we are preparing the v7 patches to 
>>> incorporate
>>> the latest review comments from Intel folks and Alex, for some changes in 
>>> this
>>> patch set also mentioned in the recent review are already queued up in the 
>>> new
>>> version.  
>> Hi Neo,
>>
>> Good to know this. :>
>>
>>>   

 I'd say, either with only the original mdev v6, or patched this series,
 vfio-ccw could live. But this series saves my work of mimicing the
 vfio-mpci code in my vfio-mccw driver. I like this incremental patches.  
>>>
>>> Thanks for sharing your progress and good to know our current v6 solution 
>>> works 
>>> for you. We are still evaluating the vfio_mdev changes here as I still 
>>> prefer to
>>> share general VFIO pci handling inside a common VFIO PCI driver, and the
>>> modularization will reduce the impact of future changes and potential 
>>> regressions
>>> cross architectures - between PCI and CCW.  
>> If this is something that Alex and the Intel folks are fine with, I have
>> no problem with this too. Thanks,
> 
> Overall, I like this a lot.  Creating a proper device hierarchy and
> letting the driver core manage the references makes a lot of sense and
> the reduction in code volume and complexity speaks for itself.  

We are evaluating this proposed solution, but the proposed patches
are not tested and have bugs.

+#define dev_to_parent_dev(_dev) container_of((_dev),   \
+struct parent_device, dev)

This macro itself is not correct and causes a kernel crash. It doesn't
really resolve to the 'struct parent_device' it is aimed at.
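
For context, container_of() is only valid when the struct device passed in is
the one embedded in struct parent_device; a sketch of the pattern the macro
assumes (member names are illustrative):

struct parent_device {
        struct device dev;              /* embedded, so container_of() works */
        const struct parent_ops *ops;
        /* ... */
};

#define dev_to_parent_dev(_dev) \
        container_of((_dev), struct parent_device, dev)

/*
 * Passing any other struct device here (the physical device or an mdev
 * child) yields a bogus pointer, which would explain a crash.
 */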

We are evaluating how this will change the sysfs entries, what the impact of
fixing all these bugs would be, and whether that is really going to help.

> I like
> how the PCI mdev layer goes away, we're not imposing arbitrary
> restrictions on the vendor driver in an attempt to insert a common
> layer.

We were trying to make it more and more configurable for the vendor driver
while keeping common code in a common place so that code is not replicated
in each vendor driver. We believe that with the new v7 version it will
become a common module instead of a PCI-specific module.


>  We can add helpers for things that do end up being common as we
> go.  Using devices rather than uuids for functions is a big
> improvement.


This was agreed in the reviews of the v6 version of my patches. Now we are
introducing 'online' instead of start()/stop(), and everyone is in agreement
with that, right?

Thanks,
Kirti

> I hope that Neo and Kirti will incorporate many of these
> changes in their next revision.  Thanks for stepping in with this,
> 
> Alex
> 



Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device

2016-08-12 Thread Kirti Wankhede


On 8/13/2016 2:55 AM, Alex Williamson wrote:
> On Fri, 12 Aug 2016 23:27:01 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 8/12/2016 12:13 AM, Alex Williamson wrote:
>>
>>>
>>> TBH, I don't see how providing a default implementation of
>>> validate_map_request() is useful.  How many mediated devices are going
>>> to want to identity map resources from the parent?  Even if they do, it
>>> seems we can only support a single mediated device per parent device
>>> since each will map the same parent resource offset. Let's not even try
>>> to define a default.  If we get a fault and the vendor driver hasn't
>>> provided a handler, send a SIGBUS.  I expect we should also allow
>>> vendor drivers to fill the mapping at mmap() time rather than expecting
>>> this map on fault scheme.  Maybe the mid-level driver should not even be
>>> interacting with mmap() and should let the vendor driver entirely
>>> determine the handling.
>>>  
>>
>> Should we go ahead with pass through mmap() call to vendor driver and
>> let vendor driver decide what to do in mmap() call, either
>> remap_pfn_range in mmap() or do fault on access and handle the fault in
>> their driver. In that case we don't need to track mappings in mdev core.
>> Let vendor driver do that on their own, right?
> 
> This sounds right to me, I don't think we want to impose either model
> on the vendor driver.  The vendor driver owns the vfio device file
> descriptor and is responsible for managing it should they expose mmap
> support for regions on the file descriptor.  They either need to insert
> mappings at the point where mmap() is called or setup fault handlers to
> insert them on demand.  If we can provide helper functions so that each
> vendor driver doesn't need to re-invent either of those, that would be
> a bonus.  Thanks,
> 

Since mmap() is going to be handled in the vendor driver, let the vendor
driver do its own tracking of mappings based on which way it decides to go.
There is no need to keep it in the mdev core module and try to handle all the
cases in one function.

Thanks,
Kirti


> Alex
> 
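
For reference, a vendor mmap callback that takes the up-front mapping route
might look roughly like this. The callback signature and the
my_region_base_pfn() helper are assumptions, not part of the posted series:

static int my_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
{
        unsigned long size = vma->vm_end - vma->vm_start;
        /* hypothetical helper translating vm_pgoff to a device PFN */
        unsigned long pfn = my_region_base_pfn(mdev, vma->vm_pgoff);

        vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

        /* map everything now rather than faulting pages in on demand */
        return remap_pfn_range(vma, vma->vm_start, pfn, size,
                               vma->vm_page_prot);
}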



Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-12 Thread Kirti Wankhede


On 8/13/2016 2:46 AM, Alex Williamson wrote:
> On Sat, 13 Aug 2016 00:14:39 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 8/10/2016 12:30 AM, Alex Williamson wrote:
>>> On Thu, 4 Aug 2016 00:33:51 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>
>>> This is used later by mdev_device_start() and mdev_device_stop() to get
>>> the parent_device so it can call the start and stop ops callbacks
>>> respectively.  That seems to imply that all of instances for a given
>>> uuid come from the same parent_device.  Where is that enforced?  I'm
>>> still having a hard time buying into the uuid+instance plan when it
>>> seems like each mdev_device should have an actual unique uuid.
>>> Userspace tools can figure out which uuids to start for a given user, I
>>> don't see much value in collecting them to instances within a uuid.
>>>   
>>
>> Initially we started discussion with VM_UUID+instance suggestion, where
>> instance was introduced to support multiple devices in a VM.
> 
> The instance number was never required in order to support multiple
> devices in a VM, IIRC this UUID+instance scheme was to appease NVIDIA
> management tools which wanted to re-use the VM UUID by creating vGPU
> devices with that same UUID and therefore associate udev events to a
> given VM.  Only then does an instance number become necessary since the
> UUID needs to be static for a vGPUs within a VM.  This has always felt
> like a very dodgy solution when we should probably just be querying
> libvirt to give us a device to VM association.
> 
>> 'mdev_create' creates device and 'mdev_start' is to commit resources of
>> all instances of similar devices assigned to VM.
>>
>> For example, to create 2 devices:
>> # echo "$UUID:0:params" > /sys/devices/../mdev_create
>> # echo "$UUID:1:params" > /sys/devices/../mdev_create
>>
>> "$UUID-0" and "$UUID-1" devices are created.
>>
>> Commit resources for above devices with single 'mdev_start':
>> # echo "$UUID" > /sys/class/mdev/mdev_start
>>
>> Considering $UUID to be a unique UUID of a device, we don't need
>> 'instance', so 'mdev_create' would look like:
>>
>> # echo "$UUID1:params" > /sys/devices/../mdev_create
>> # echo "$UUID2:params" > /sys/devices/../mdev_create
>>
>> where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
>> would be vendor specific parameters.
>>
>> Device nodes would be created as "$UUID1" and "$UUID2"
>>
>> Then 'mdev_start' would be:
>> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start
>>
>> Similarly 'mdev_stop' and 'mdev_destroy' would be:
>>
>> # echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop
> 
> I'm not sure a comma separated list makes sense here, for both
> simplicity in the kernel and more fine grained error reporting, we
> probably want to start/stop them individually.  Actually, why is it
> that we can't use the mediated device being opened and released to
> automatically signal to the backend vendor driver to commit and release
> resources? I don't fully understand why userspace needs this interface.
> 

For the NVIDIA vGPU solution we need to know all the devices assigned to a VM
in one shot, so that the resources of all vGPUs assigned to that VM can be
committed along with some common resources.

For the start callback, I can pass the list of UUIDs as-is to the vendor
driver and let it decide whether to iterate over each device and commit
resources, or do it in one shot.

Thanks,
Kirti

>> and
>>
>> # echo "$UUID1" > /sys/devices/../mdev_destroy
>> # echo "$UUID2" > /sys/devices/../mdev_destroy
>>
>> Does this seems reasonable?
> 
> I've been hoping we could drop the instance numbers and create actual
> unique UUIDs per mediated device for a while ;)  Thanks,
> 
> Alex
> 



Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-12 Thread Kirti Wankhede


On 8/10/2016 12:30 AM, Alex Williamson wrote:
> On Thu, 4 Aug 2016 00:33:51 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
> This is used later by mdev_device_start() and mdev_device_stop() to get
> the parent_device so it can call the start and stop ops callbacks
> respectively.  That seems to imply that all of instances for a given
> uuid come from the same parent_device.  Where is that enforced?  I'm
> still having a hard time buying into the uuid+instance plan when it
> seems like each mdev_device should have an actual unique uuid.
> Userspace tools can figure out which uuids to start for a given user, I
> don't see much value in collecting them to instances within a uuid.
> 

Initially we started discussion with VM_UUID+instance suggestion, where
instance was introduced to support multiple devices in a VM.
'mdev_create' creates device and 'mdev_start' is to commit resources of
all instances of similar devices assigned to VM.

For example, to create 2 devices:
# echo "$UUID:0:params" > /sys/devices/../mdev_create
# echo "$UUID:1:params" > /sys/devices/../mdev_create

"$UUID-0" and "$UUID-1" devices are created.

Commit resources for above devices with single 'mdev_start':
# echo "$UUID" > /sys/class/mdev/mdev_start

Considering $UUID to be a unique UUID of a device, we don't need
'instance', so 'mdev_create' would look like:

# echo "$UUID1:params" > /sys/devices/../mdev_create
# echo "$UUID2:params" > /sys/devices/../mdev_create

where $UUID1 and $UUID2 would be mdev device's unique UUID and 'params'
would be vendor specific parameters.

Device nodes would be created as "$UUID1" and "$UUID2"

Then 'mdev_start' would be:
# echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_start

Similarly 'mdev_stop' and 'mdev_destroy' would be:

# echo "$UUID1, $UUID2" > /sys/class/mdev/mdev_stop

and

# echo "$UUID1" > /sys/devices/../mdev_destroy
# echo "$UUID2" > /sys/devices/../mdev_destroy

Does this seems reasonable?

Thanks,
Kirti



Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device

2016-08-12 Thread Kirti Wankhede


On 8/12/2016 12:13 AM, Alex Williamson wrote:

> 
> TBH, I don't see how providing a default implementation of
> validate_map_request() is useful.  How many mediated devices are going
> to want to identity map resources from the parent?  Even if they do, it
> seems we can only support a single mediated device per parent device
> since each will map the same parent resource offset. Let's not even try
> to define a default.  If we get a fault and the vendor driver hasn't
> provided a handler, send a SIGBUS.  I expect we should also allow
> vendor drivers to fill the mapping at mmap() time rather than expecting
> this map on fault scheme.  Maybe the mid-level driver should not even be
> interacting with mmap() and should let the vendor driver entirely
> determine the handling.
>

Should we go ahead with pass through mmap() call to vendor driver and
let vendor driver decide what to do in mmap() call, either
remap_pfn_range in mmap() or do fault on access and handle the fault in
their driver. In that case we don't need to track mappings in mdev core.
Let vendor driver do that on their own, right?



> For the most part these mid-level drivers, like mediated pci, should be
> as thin as possible, and to some extent I wonder if we need them at
> all.  We mostly want user interaction with the vfio device file
> descriptor to pass directly to the vendor driver and we should only be
> adding logic to the mid-level driver when it actually provides some
> useful and generic simplification to the vendor driver.  Things like
> this default fault handling scheme don't appear to be generic at all,
> it's actually a very unique use case I think.  For the most part
> I think the mediated interface is just a shim to standardize the
> lifecycle of a mediated device for management purposes,
> integrate "fake/virtual" devices into the vfio infrastructure,
> provide common page tracking, pinning and mapping services, but
> the device interface itself should mostly just pass through the
> vfio device API straight through to the vendor driver.  Thanks,
> 
> Alex
> 



Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device

2016-08-11 Thread Kirti Wankhede


On 8/11/2016 9:54 PM, Alex Williamson wrote:
> On Thu, 11 Aug 2016 21:29:35 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 8/11/2016 4:30 AM, Alex Williamson wrote:
>>> On Thu, 11 Aug 2016 02:53:10 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>   
>>>> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
>>>>> On Thu, 4 Aug 2016 00:33:52 +0530
>>>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>>> 
>>>>
>>>> ...
>>>>>>  #include "vfio_pci_private.h"
>>>>>>  
>>>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>>>> index 0ecae0b1cd34..431b824b0d3e 100644
>>>>>> --- a/include/linux/vfio.h
>>>>>> +++ b/include/linux/vfio.h
>>>>>> @@ -18,6 +18,13 @@
>>>>>>  #include 
>>>>>>  #include 
>>>>>>  
>>>>>> +#define VFIO_PCI_OFFSET_SHIFT   40
>>>>>> +
>>>>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
>>>>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << 
>>>>>> VFIO_PCI_OFFSET_SHIFT)
>>>>>> +#define VFIO_PCI_OFFSET_MASK(((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 
>>>>>> 1)
>>>>>> +
>>>>>> +
>>>>>
>>>>> Nak this, I'm not interested in making this any sort of ABI.
>>>>> 
>>>>
>>>> These macros are used by drivers/vfio/pci/vfio_pci.c and
>>>> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
>>>> they should be moved to common place as you suggested in earlier
>>>> reviews. I think this is better common place. Are there any other
>>>> suggestion?  
>>>
>>> They're only used in ways that I objected to above and you've agreed
>>> to.  These define implementation details that must not become part of
>>> the mediated vendor driver ABI.  A vendor driver is free to redefine
>>> this the same if they want, but as we can see with how easily they slip
>>> into code where they don't belong, the only way to make sure they don't
>>> become ABI is to keep them in private headers.
>>>
>>
>> Then I think, I can't use these macros in mdev modules, they are defined
>> in drivers/vfio/pci/vfio_pci_private.h
>> I have to define similar macros in drivers/vfio/mdev/mdev_private.h?
>>
>> parent->ops->get_region_info() is called from vfio_mpci_open() that is
>> before PCI config space is setup. Main expectation from
>> get_region_info() was to get flags and size. At this point of time
>> vendor driver also don't know about the base addresses of regions.
>>
>> case VFIO_DEVICE_GET_REGION_INFO:
>> ...
>>
>> info.offset = vmdev->vfio_region_info[info.index].offset;
>>
>> In that case, as suggested in previous reply, above is not going to work.
>> I'll define such macros in drivers/vfio/mdev/mdev_private.h, set above
>> offset according to these macros. Then on first access to any BAR
>> region, i.e. after PCI config space is populated, call
>> parent->ops->get_region_info() again so that
>> vfio_region_info[index].offset for all regions are set by vendor driver.
>> Then use these offsets to calculate 'pos' for
>> read/write/validate_map_request(). Does this seems reasonable?
> 
> This doesn't make any sense to me, there should be absolutely no reason
> for the mid-layer mediated device infrastructure to impose region
> offsets.  vfio-pci is a leaf driver, like the mediated vendor driver.
> Only the leaf drivers can define how they layout the offsets within the
> device file descriptor.  Being a VFIO_PCI device only defines region
> indexes to resources, not offsets (ie. region 0 is BAR0, region 1 is
> BAR1,... region 7 is PCI config space).  If this mid-layer even needs
> to know region offsets, then caching them on opening the vendor device
> is certainly sufficient.  Remember we're talking about the offset into
> the vfio device file descriptor, how that potentially maps onto a
> physical MMIO space later doesn't matter here.  It seems like maybe
> we're confusing those points.  Anyway, the more I hear about needing to
> reproduce these INDEX/OFFSET translation macros in places they
> shouldn't be used, the more confident I am in keeping them private.

If the vendor driver defines the offsets into the vfio device file
descriptor, it will be the vendor driver's responsibility to ensure that the
defined ranges (offset to offset + size) do not overlap with other regions'
ranges. There will be no validation in vfio-mpci, right?

In the current implementation there is a provision that if the
validate_map_request() callback is not provided, the access is mapped to the
physical device's region, and the start of the physical device's BAR address
is queried using pci_resource_start(). With the change you are proposing, the
index could not be extracted from the offset, so if the vendor driver doesn't
provide validate_map_request(), we would return SIGBUS from the fault handler.
That imposes an indirect requirement that if the vendor driver sets
VFIO_REGION_INFO_FLAG_MMAP for any region, it should provide
validate_map_request().

Thanks,
Kirti.
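
For illustration, a vendor driver that wants a vfio-pci-like layout can simply
keep a private copy of the offset convention; it is an internal detail of that
driver, not ABI (a sketch with made-up names):

/* Private to the vendor driver; mirrors vfio-pci's convention, not ABI. */
#define MY_OFFSET_SHIFT         40
#define MY_INDEX_TO_OFFSET(i)   ((u64)(i) << MY_OFFSET_SHIFT)
#define MY_OFFSET_TO_INDEX(o)   ((o) >> MY_OFFSET_SHIFT)
#define MY_OFFSET_MASK          (MY_INDEX_TO_OFFSET(1) - 1)

/* e.g. in get_region_info(): info->offset = MY_INDEX_TO_OFFSET(info->index); */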

> Thanks,
> 
> Alex
> 



Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device

2016-08-11 Thread Kirti Wankhede


On 8/11/2016 4:30 AM, Alex Williamson wrote:
> On Thu, 11 Aug 2016 02:53:10 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 8/10/2016 12:30 AM, Alex Williamson wrote:
>>> On Thu, 4 Aug 2016 00:33:52 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>   
>>
>> ...
>>
>>>> +
>>>> +  switch (info.index) {
>>>> +  case VFIO_PCI_CONFIG_REGION_INDEX:
>>>> +  case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>>>> +  info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);  
>>>
>>> No, vmdev->vfio_region_info[info.index].offset
>>>  
>>
>> Ok.
>>
>>>> +  info.size = vmdev->vfio_region_info[info.index].size;
>>>> +  if (!info.size) {
>>>> +  info.flags = 0;
>>>> +  break;
>>>> +  }
>>>> +
>>>> +  info.flags = vmdev->vfio_region_info[info.index].flags;
>>>> +  break;
>>>> +  case VFIO_PCI_VGA_REGION_INDEX:
>>>> +  case VFIO_PCI_ROM_REGION_INDEX:  
>>>
>>> Why?  Let the vendor driver decide.
>>>   
>>
>> Ok.
>>
>>>> +  switch (info.index) {
>>>> +  case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
>>>> +  case VFIO_PCI_REQ_IRQ_INDEX:
>>>> +  break;
>>>> +  /* pass thru to return error */
>>>> +  case VFIO_PCI_MSIX_IRQ_INDEX:  
>>>
>>> ???  
>>
>> Sorry, I missed to update this. Updating it.
>>
>>>> +  case VFIO_DEVICE_SET_IRQS:
>>>> +  {  
>> ...
>>>> +
>>>> +  if (parent->ops->set_irqs)
>>>> +  ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
>>>> +  hdr.start, hdr.count, data);
>>>> +
>>>> +  kfree(ptr);
>>>> +  return ret;  
>>>
>>> Return success if no set_irqs callback?
>>>  
>>
>> Ideally, vendor driver should provide this function. If vendor driver
>> doesn't provide it, do we really need to fail here?
> 
> Wouldn't you as a user expect to get an error if you try to call an
> ioctl that has no backing rather than assume success and never receive
> and interrupt?
>  

If we really don't want to proceed when set_irqs() is not provided, then
it's better to add it to the mandatory list in mdev_register_device() in
mdev_core.c and fail earlier, i.e. fail to register the device.
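
A sketch of that early check in mdev_register_device() (which callbacks are
actually mandatory is exactly what is being discussed here):

/* Fail registration early if a mandatory callback is missing. */
if (!ops || !ops->create || !ops->destroy || !ops->set_irqs)
        return -EINVAL;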


>>>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
>>>> +size_t count, loff_t *ppos)
>>>> +{
>>>> +  struct vfio_mdev *vmdev = device_data;
>>>> +  struct mdev_device *mdev = vmdev->mdev;
>>>> +  struct parent_device *parent = mdev->parent;
>>>> +  int ret = 0;
>>>> +
>>>> +  if (!count)
>>>> +  return 0;
>>>> +
>>>> +  if (parent->ops->read) {
>>>> +  char *ret_data, *ptr;
>>>> +
>>>> +  ptr = ret_data = kzalloc(count, GFP_KERNEL);  
>>>
>>> Do we really need to support arbitrary lengths in one shot?  Seems like
>>> we could just use a 4 or 8 byte variable on the stack and iterate until
>>> done.
>>>   
>>
>> We just want to pass the arguments to vendor driver as is here. Vendor
>> driver could take care of that.
> 
> But I think this is exploitable, it lets the user make the kernel
> allocate an arbitrarily sized buffer.
>  
>>>> +
>>>> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
>>>> + size_t count, loff_t *ppos)
>>>> +{
>>>> +  struct vfio_mdev *vmdev = device_data;
>>>> +  struct mdev_device *mdev = vmdev->mdev;
>>>> +  struct parent_device *parent = mdev->parent;
>>>> +  int ret = 0;
>>>> +
>>>> +  if (!count)
>>>> +  return 0;
>>>> +
>>>> +  if (parent->ops->write) {
>>>> +  char *usr_data, *ptr;
>>>> +
>>>> +  ptr = usr_data = memdup_user(buf, count);  
>>>
>>> Same here, how much do we care to let the user

Re: [Qemu-devel] [PATCH v6 3/4] vfio iommu: Add support for mediated devices

2016-08-11 Thread Kirti Wankhede

Thanks Alex. I'll take care of the suggested nits and rename the structures
and functions.

On 8/10/2016 12:30 AM, Alex Williamson wrote:
> On Thu, 4 Aug 2016 00:33:53 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>
...

>>
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for
mediated
>> + * domain only.
>
> Why only mediated domain?  What assumption is specific to a mediated
> domain other than unnecessarily passing an mdev_device?
>
>> + * @user_pfn [in]: array of user/guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @phys_pfn[out] : array of host PFNs
>> + */
>> +long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
>
> Why use and mdev_device here?  We only reference the struct device to
> get the drvdata.  (dev also not listed above in param description)
>

Ok.

>> +long npage, int prot, unsigned long *phys_pfn)
>> +{
>> +struct vfio_device *device;
>> +struct vfio_container *container;
>> +struct vfio_iommu_driver *driver;
>> +ssize_t ret = -EINVAL;
>> +
>> +if (!mdev || !user_pfn || !phys_pfn)
>> +return -EINVAL;
>> +
>> +device = dev_get_drvdata(&mdev->dev);
>> +
>> +if (!device || !device->group)
>> +return -EINVAL;
>> +
>> +container = device->group->container;
>
> This doesn't seem like a valid way to get a reference to the container
> and in fact there is no reference at all.  I think you need to use
> vfio_device_get_from_dev(), check and increment container_users around
> the callback, abort on noiommu groups, and check for viability.
>

Thanks for pointing that out. I'll change it as suggested.

>
>
> I see how you're trying to only do accounting when there is only an
> mdev (local) domain, but the devices attached to the normal iommu API
> domain can go away at any point.  Where do we re-establish accounting
> should the pinning from those devices be removed?  I don't see that as
> being an optional support case since userspace can already do this.
>

I missed this case. So in that case, when
vfio_iommu_type1_detach_group() is called for the iommu group of that device
and it is the last entry in the iommu-capable domain_list, it should
re-iterate through the pfn_list of the mediated_domain and redo the
accounting, right? Then we also have to update the accounting when an
iommu-capable device is hotplugged while the mediated_domain already exists.
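Roughly something like this in the detach path (untested sketch; names as in
this patch, and it assumes the existing vfio_lock_acct() helper):

if (list_empty(&iommu->domain_list) && iommu->mediated_domain) {
        struct vfio_domain *d = iommu->mediated_domain;
        struct rb_node *n;

        mutex_lock(&d->pfn_list_lock);
        /* no iommu capable domain left: account pinned pages again */
        for (n = rb_first(&d->pfn_list); n; n = rb_next(n)) {
                struct vfio_pfn *vpfn = rb_entry(n, struct vfio_pfn, node);

                vfio_lock_acct(vpfn->npage);
        }
        mutex_unlock(&d->pfn_list_lock);
}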

Thanks,
Kirti




Re: [Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device

2016-08-10 Thread Kirti Wankhede


On 8/10/2016 12:30 AM, Alex Williamson wrote:
> On Thu, 4 Aug 2016 00:33:52 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 

...

>> +
>> +switch (info.index) {
>> +case VFIO_PCI_CONFIG_REGION_INDEX:
>> +case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>> +info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
> 
> No, vmdev->vfio_region_info[info.index].offset
>

Ok.

>> +info.size = vmdev->vfio_region_info[info.index].size;
>> +if (!info.size) {
>> +info.flags = 0;
>> +break;
>> +}
>> +
>> +info.flags = vmdev->vfio_region_info[info.index].flags;
>> +break;
>> +case VFIO_PCI_VGA_REGION_INDEX:
>> +case VFIO_PCI_ROM_REGION_INDEX:
> 
> Why?  Let the vendor driver decide.
> 

Ok.

>> +switch (info.index) {
>> +case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
>> +case VFIO_PCI_REQ_IRQ_INDEX:
>> +break;
>> +/* pass thru to return error */
>> +case VFIO_PCI_MSIX_IRQ_INDEX:
> 
> ???

Sorry, I missed updating this. Updating it.

>> +case VFIO_DEVICE_SET_IRQS:
>> +{
...
>> +
>> +if (parent->ops->set_irqs)
>> +ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
>> +hdr.start, hdr.count, data);
>> +
>> +kfree(ptr);
>> +return ret;
> 
> Return success if no set_irqs callback?
>

Ideally, the vendor driver should provide this function. If the vendor
driver doesn't provide it, do we really need to fail here?
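If we do decide it should fail, the minimal change is to initialise ret before
the check (sketch, same code as above otherwise):

        ret = -EINVAL;  /* don't report success for an unhandled ioctl */

        if (parent->ops->set_irqs)
                ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
                                            hdr.start, hdr.count, data);

        kfree(ptr);
        return ret;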


>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
>> +  size_t count, loff_t *ppos)
>> +{
>> +struct vfio_mdev *vmdev = device_data;
>> +struct mdev_device *mdev = vmdev->mdev;
>> +struct parent_device *parent = mdev->parent;
>> +int ret = 0;
>> +
>> +if (!count)
>> +return 0;
>> +
>> +if (parent->ops->read) {
>> +char *ret_data, *ptr;
>> +
>> +ptr = ret_data = kzalloc(count, GFP_KERNEL);
> 
> Do we really need to support arbitrary lengths in one shot?  Seems like
> we could just use a 4 or 8 byte variable on the stack and iterate until
> done.
> 

We just want to pass the arguments to the vendor driver as-is here. The
vendor driver could take care of that.
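For reference, the bounded-copy loop Alex describes could look roughly like
this (illustrative only; the exact read callback signature differs between
versions of this series):

        unsigned int done = 0;

        while (count) {
                size_t filled;
                u64 val;

                filled = min_t(size_t, count, sizeof(val));

                ret = parent->ops->read(mdev, (char *)&val, filled, *ppos);
                if (ret <= 0)
                        return ret;

                if (copy_to_user(buf, &val, filled))
                        return -EFAULT;

                count -= filled;
                done += filled;
                *ppos += filled;
                buf += filled;
        }
        return done;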

>> +
>> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
>> +   size_t count, loff_t *ppos)
>> +{
>> +struct vfio_mdev *vmdev = device_data;
>> +struct mdev_device *mdev = vmdev->mdev;
>> +struct parent_device *parent = mdev->parent;
>> +int ret = 0;
>> +
>> +if (!count)
>> +return 0;
>> +
>> +if (parent->ops->write) {
>> +char *usr_data, *ptr;
>> +
>> +ptr = usr_data = memdup_user(buf, count);
> 
> Same here, how much do we care to let the user write in one pass and is
> there any advantage to it?  When QEMU is our userspace we're only
> likely to see 4-byte accesses anyway.

Same as above.

>> +
>> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault 
>> *vmf)
>> +{
...
>> +} else {
>> +struct pci_dev *pdev;
>> +
>> +virtaddr = vma->vm_start;
>> +req_size = vma->vm_end - vma->vm_start;
>> +
>> +pdev = to_pci_dev(parent->dev);
>> +index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);
> 
> Iterate through region_info[*].offset/size provided by vendor driver.
> 

Yes, makes sense.

>> +
>> +int vfio_mpci_match(struct device *dev)
>> +{
>> +if (dev_is_pci(dev->parent))
> 
> This is the wrong test, there's really no requirement that a pci mdev
> device is hosted by a real pci device.  

Ideally this module is for mediated devices whose parent is a PCI device.
We are also relying on kernel functions like pci_resource_start() and
to_pci_dev() in this module, so it is better to check this while loading.


> Can't we check that the device
> is on an mdev_pci_bus_type?
> 

I didn't get this part.

Each mediated device is of mdev_bus_type. But VFIO module could be
different based on parent device t

Re: [Qemu-devel] [PATCH v6 4/4] docs: Add Documentation for Mediated devices

2016-08-05 Thread Kirti Wankhede


On 8/4/2016 1:01 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
>> Sent: Thursday, August 04, 2016 3:04 AM
>>
>> +
>> +* mdev_supported_types: (read only)
>> +List the current supported mediated device types and its details.
>> +
>> +* mdev_create: (write only)
>> +Create a mediated device on target physical device.
>> +Input syntax: <UUID:idx:params>
>> +where,
>> +UUID: mediated device's UUID
>> +idx: mediated device index inside a VM
> 
> Is above description too specific to VM usage? mediated device can
> be used by other user components too, e.g. an user space driver.
> Better to make the description general (you can list above as one
> example).
>
Ok. I'll change it to VM or user space component.

> Also I think calling it idx a bit limited, which means only numbers
> possible. Is it more flexible to call it 'handle' and then any string
> can be used here?
> 

The index is an integer; it is used to keep track of the mediated device
instance number created for a user space component or VM.

>> +params: extra parameters required by driver
>> +Example:
>> +# echo "12345678-1234-1234-1234-123456789abc:0:0" >
>> + /sys/bus/pci/devices/\:05\:00.0/mdev_create
>> +
>> +* mdev_destroy: (write only)
>> +Destroy a mediated device on a target physical device.
>> +Input syntax: <UUID:idx>
>> +where,
>> +UUID: mediated device's UUID
>> +idx: mediated device index inside a VM
>> +Example:
>> +# echo "12345678-1234-1234-1234-123456789abc:0" >
>> +   /sys/bus/pci/devices/\:05\:00.0/mdev_destroy
>> +
>> +Under mdev class sysfs /sys/class/mdev/:
>> +
>> +
>> +* mdev_start: (write only)
>> +This trigger the registration interface to notify the driver to
>> +commit mediated device resource for target VM.
>> +The mdev_start function is a synchronized call, successful return of
>> +this call will indicate all the requested mdev resource has been fully
>> +committed, the VMM should continue.
>> +Input syntax: <UUID>
>> +Example:
>> +# echo "12345678-1234-1234-1234-123456789abc" >
>> +/sys/class/mdev/mdev_start
>> +
>> +* mdev_stop: (write only)
>> +This trigger the registration interface to notify the driver to
>> +release resources of mediated device of target VM.
>> +Input syntax: <UUID>
>> +Example:
>> +# echo "12345678-1234-1234-1234-123456789abc" >
>> + /sys/class/mdev/mdev_stop
> 
> I think it's clearer to create a node per mdev under /sys/class/mdev,
> and then move start/stop as attributes under each mdev node, e.g:
> 
> echo "0/1" > /sys/class/mdev/12345678-1234-1234-1234-123456789abc/start
> 

To support multiple mdev devices in one VM or user space driver, the process
is to create and configure all mdev devices for that VM or user space driver
and then issue a single 'start', which means all requested mdev resources
are committed.

> Doing this way is more extensible to add more capabilities under
> each mdev node, and different capability set may be implemented
> for them.
> 

You can add extra capabilities for each mdev device node using
'mdev_attr_groups' of 'struct parent_ops' from the vendor driver.


>> +
>> +Mediated device Hotplug:
>> +---
>> +
>> +To support mediated device hotplug, <mdev_create> and <mdev_destroy> can be
>> +accessed during VM runtime, and the corresponding registration callback is
>> +invoked to allow driver to support hotplug.
> 
> 'hotplug' is an action on the mdev user (e.g. the VM), not on mdev itself.
> You can always create a mdev as long as physical device has enough
> available resource to support requested config. Destroying a mdev 
> may fail if there is still user on target mdev.
>

The point here is: the user needs to pass a UUID to mdev_create, and the
device will be created even if the VM or user space driver is running.

Thanks,
Kirti

> Thanks
> Kevin
> 



Re: [Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-05 Thread Kirti Wankhede


On 8/4/2016 12:51 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
>> Sent: Thursday, August 04, 2016 3:04 AM
>>
>>
>> 2. Physical device driver interface
>> This interface provides vendor driver the set APIs to manage physical
>> device related work in their own driver. APIs are :
>> - supported_config: provide supported configuration list by the vendor
>>  driver
>> - create: to allocate basic resources in vendor driver for a mediated
>>device.
>> - destroy: to free resources in vendor driver when mediated device is
>> destroyed.
>> - reset: to free and reallocate resources in vendor driver during reboot
> 
> Currently I saw 'reset' callback only invoked from VFIO ioctl path. Do 
> you think whether it makes sense to expose a sysfs 'reset' node too,
> similar to what people see under a PCI device node?
> 

Not all vendor drivers might support resetting the mdev from sysfs. But
those that want to support it can expose a 'reset' node using
'mdev_attr_groups' of 'struct parent_ops'.
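For example, a vendor driver could wire up such a node roughly like this
(hypothetical snippet, not part of this series):

static ssize_t reset_store(struct device *dev, struct device_attribute *attr,
                           const char *buf, size_t count)
{
        /* vendor specific reset of the mdev backing this device */
        return count;
}
static DEVICE_ATTR_WO(reset);

static struct attribute *vendor_mdev_attrs[] = {
        &dev_attr_reset.attr,
        NULL,
};

static const struct attribute_group vendor_mdev_group = {
        .attrs = vendor_mdev_attrs,
};

const struct attribute_group *vendor_mdev_groups[] = {
        &vendor_mdev_group,
        NULL,
};

/* and in the vendor's struct parent_ops:
 *      .mdev_attr_groups = vendor_mdev_groups,
 */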


>> - start: to initiate mediated device initialization process from vendor
>>   driver
>> - shutdown: to teardown mediated device resources during teardown.
> 
> I think 'shutdown' should be 'stop' based on actual code.
>

Thanks for catching that, yes, I missed updating it here.

Thanks,
Kirti

> Thanks
> Kevin
> 



[Qemu-devel] [PATCH v6 3/4] vfio iommu: Add support for mediated devices

2016-08-03 Thread Kirti Wankhede
VFIO IOMMU drivers are designed for devices which are IOMMU capable.
A mediated device only uses IOMMU APIs; the underlying hardware can be
managed by an IOMMU domain.

Aim of this change is:
- To use most of the code of TYPE1 IOMMU driver for mediated devices
- To support direct assigned device and mediated device in single module

Added two new callback functions to struct vfio_iommu_driver_ops. A backend
IOMMU module that supports pinning and unpinning pages for mdev devices
should provide these functions.
Added APIs for pinning and unpinning pages to the VFIO module. These call
back into the backend iommu module to actually pin and unpin pages.

This change adds pin and unpin support for mediated device to TYPE1 IOMMU
backend module. More details:
- When iommu_group of mediated devices is attached, task structure is
  cached which is used later to pin pages and page accounting.
- It keeps track of pinned pages for mediated domain. This data is used to
  verify unpinning request and to unpin remaining pages while detaching, if
  there are any.
- Used existing mechanism for page accounting. If iommu capable domain
  exist in the container then all pages are already pinned and accounted.
  Accounting for the mdev device is only done if there is no iommu capable
  domain in the container.

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- two GPU pass through

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
---
 drivers/vfio/vfio.c |  82 +++
 drivers/vfio/vfio_iommu_type1.c | 499 
 include/linux/vfio.h|  13 +-
 3 files changed, 546 insertions(+), 48 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 6fd6fa5469de..1f87e3a30d24 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1782,6 +1782,88 @@ void vfio_info_cap_shift(struct vfio_info_cap *caps, 
size_t offset)
 }
 EXPORT_SYMBOL_GPL(vfio_info_cap_shift);
 
+/*
+ * Pin a set of guest PFNs and return their associated host PFNs for mediated
+ * domain only.
+ * @user_pfn [in]: array of user/guest PFNs
+ * @npage [in]: count of array elements
+ * @prot [in] : protection flags
+ * @phys_pfn[out] : array of host PFNs
+ */
+long vfio_pin_pages(struct mdev_device *mdev, unsigned long *user_pfn,
+   long npage, int prot, unsigned long *phys_pfn)
+{
+   struct vfio_device *device;
+   struct vfio_container *container;
+   struct vfio_iommu_driver *driver;
+   ssize_t ret = -EINVAL;
+
+   if (!mdev || !user_pfn || !phys_pfn)
+   return -EINVAL;
+
+   device = dev_get_drvdata(&mdev->dev);
+
+   if (!device || !device->group)
+   return -EINVAL;
+
+   container = device->group->container;
+
+   if (!container)
+   return -EINVAL;
+
+   down_read(&container->group_lock);
+
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->pin_pages))
+   ret = driver->ops->pin_pages(container->iommu_data, user_pfn,
+npage, prot, phys_pfn);
+
+   up_read(&container->group_lock);
+
+   return ret;
+
+}
+EXPORT_SYMBOL(vfio_pin_pages);
+
+/*
+ * Unpin set of host PFNs for mediated domain only.
+ * @pfn [in] : array of host PFNs to be unpinned.
+ * @npage [in] :count of elements in array, that is number of pages.
+ */
+long vfio_unpin_pages(struct mdev_device *mdev, unsigned long *pfn, long npage)
+{
+   struct vfio_device *device;
+   struct vfio_container *container;
+   struct vfio_iommu_driver *driver;
+   ssize_t ret = -EINVAL;
+
+   if (!mdev || !pfn)
+   return -EINVAL;
+
+   device = dev_get_drvdata(&mdev->dev);
+
+   if (!device || !device->group)
+   return -EINVAL;
+
+   container = device->group->container;
+
+   if (!container)
+   return -EINVAL;
+
+   down_read(&container->group_lock);
+
+   driver = container->iommu_driver;
+   if (likely(driver && driver->ops->unpin_pages))
+   ret = driver->ops->unpin_pages(container->iommu_data, pfn,
+  npage);
+
+   up_read(&container->group_lock);
+
+   return ret;
+
+}
+EXPORT_SYMBOL(vfio_unpin_pages);
+
 /**
  * Module/class support
  */
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..1f4e24e0debd 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -55,18 +55,26 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
struct list_headdomain_list;
+   struct vfio_domain  *mediated_domain;
struct mutexlock;
struct rb_root  

[Qemu-devel] [PATCH v6 4/4] docs: Add Documentation for Mediated devices

2016-08-03 Thread Kirti Wankhede
Add file Documentation/vfio-mediated-device.txt that include details of
mediated device framework.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I137dd646442936090d92008b115908b7b2c7bc5d
---
 Documentation/vfio-mediated-device.txt | 235 +
 1 file changed, 235 insertions(+)
 create mode 100644 Documentation/vfio-mediated-device.txt

diff --git a/Documentation/vfio-mediated-device.txt 
b/Documentation/vfio-mediated-device.txt
new file mode 100644
index ..029152670141
--- /dev/null
+++ b/Documentation/vfio-mediated-device.txt
@@ -0,0 +1,235 @@
+VFIO Mediated devices [1]
+---
+
+There are more and more use cases/demands to virtualize DMA devices which
+don't have SR-IOV capability built-in. To do this, drivers of different
+devices had to develop their own management interface and set of APIs and then
+integrate it to user space software. We've identified common requirements and
+unified management interface for such devices to make user space software
+integration easier.
+
+The VFIO driver framework provides unified APIs for direct device access. It is
+an IOMMU/device agnostic framework for exposing direct device access to
+user space, in a secure, IOMMU protected environment. This framework is
+used for multiple devices like GPUs, network adapters and compute accelerators.
+With direct device access, virtual machines or user space applications have
+direct access of physical device. This framework is reused for mediated 
devices.
+
+Mediated core driver provides a common interface for mediated device management
+that can be used by drivers of different devices. This module provides a 
generic
+interface to create/destroy mediated device, add/remove it to mediated bus
+driver, add/remove device to IOMMU group. It also provides an interface to
+register different types of bus drivers, for example, Mediated VFIO PCI driver
+is designed for mediated PCI devices and supports VFIO APIs. Similarly, driver
+can be designed to support any type of mediated device and added to this
+framework. The mediated bus driver adds/deletes mediated devices to/from the
+VFIO group.
+
+Below is the high level block diagram, with NVIDIA, Intel and IBM devices
+as examples, since these are the devices which are going to actively use
+this module as of now. NVIDIA and Intel use the vfio_mpci.ko module for their
+GPUs, which are PCI devices. There has to be a different bus driver for
+Channel I/O devices, vfio_mccw.ko.
+
+
+ +---+
+ |   |
+ | +---+ |  mdev_register_driver() +--+
+ | |   | +<+  |
+ | |   | | |  |
+ | |  mdev | +>+ vfio_mpci.ko |<-> VFIO user
+ | |  bus  | | probe()/remove()|  |APIs
+ | |  driver   | | |  |
+ | |   | | +--+
+ | |   | |  mdev_register_driver() +--+
+ | |   | +<+  |
+ | |   | | |  |
+ | |   | +>+ vfio_mccw.ko |<-> VFIO user
+ | +---+ | probe()/remove()|  |APIs
+ |   | |  |
+ |  MDEV CORE| +--+
+ |   MODULE  |
+ |   mdev.ko |
+ | +---+ |  mdev_register_device() +--+
+ | |   | +<+  |
+ | |   | | |  nvidia.ko   |<-> physical
+ | |   | +>+  |device
+ | |   | |callbacks+--+
+ | | Physical  | |
+ | |  device   | |  mdev_register_device() +--+
+ | | interface | |<+  |
+ | |   | | |  i915.ko |<-> physical
+ | |   | +>+  |device
+ | |   | |callbacks+--+
+ | |   | |
+ | |   | |  mdev_register_device() +--+
+ | |   | +<+  |
+ | |   | | | ccw_device.ko|<-> physical
+ | |   | +>+  |device
+ | |   | |callbacks+--+
+ | +---+ |
+ +---+
+
+
+Registration Interfaces
+---
+
+Mediated core driver pro

[Qemu-devel] [PATCH v6 1/4] vfio: Mediated device Core driver

2016-08-03 Thread Kirti Wankhede
g(struct mdev_device *mdev,
  struct address_space *mapping,
  unsigned long addr, unsigned long size)
void mdev_del_phys_mapping(struct mdev_device *mdev, unsigned long addr)

API to be used by vendor driver to invalidate mapping:
int mdev_device_invalidate_mapping(struct mdev_device *mdev,
   unsigned long addr, unsigned long size)

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig |   1 +
 drivers/vfio/Makefile|   1 +
 drivers/vfio/mdev/Kconfig|  12 +
 drivers/vfio/mdev/Makefile   |   5 +
 drivers/vfio/mdev/mdev_core.c| 676 +++
 drivers/vfio/mdev/mdev_driver.c  | 142 
 drivers/vfio/mdev/mdev_private.h |  33 ++
 drivers/vfio/mdev/mdev_sysfs.c   | 269 
 include/linux/mdev.h | 236 ++
 9 files changed, 1375 insertions(+)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 include/linux/mdev.h

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index da6e2ce77495..23eced02aaf6 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -48,4 +48,5 @@ menuconfig VFIO_NOIOMMU
 
 source "drivers/vfio/pci/Kconfig"
 source "drivers/vfio/platform/Kconfig"
+source "drivers/vfio/mdev/Kconfig"
 source "virt/lib/Kconfig"
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 7b8a31f63fea..4a23c13b6be4 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -7,3 +7,4 @@ obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
 obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o
 obj-$(CONFIG_VFIO_PCI) += pci/
 obj-$(CONFIG_VFIO_PLATFORM) += platform/
+obj-$(CONFIG_VFIO_MDEV) += mdev/
diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
new file mode 100644
index ..a34fbc66f92f
--- /dev/null
+++ b/drivers/vfio/mdev/Kconfig
@@ -0,0 +1,12 @@
+
+config VFIO_MDEV
+tristate "Mediated device driver framework"
+depends on VFIO
+default n
+help
+Provides a framework to virtualize devices.
+   See Documentation/vfio-mediated-device.txt for more details.
+
+If you don't know what to do here, say N.
+
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
new file mode 100644
index ..56a75e689582
--- /dev/null
+++ b/drivers/vfio/mdev/Makefile
@@ -0,0 +1,5 @@
+
+mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
+
+obj-$(CONFIG_VFIO_MDEV) += mdev.o
+
diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
new file mode 100644
index ..90ff073abfce
--- /dev/null
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -0,0 +1,676 @@
+/*
+ * Mediated device Core Driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia <c...@nvidia.com>
+ *Kirti Wankhede <kwankh...@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR  "NVIDIA Corporation"
+#define DRIVER_DESC"Mediated device Core Driver"
+
+#define MDEV_CLASS_NAME"mdev"
+
+static LIST_HEAD(parent_list);
+static DEFINE_MUTEX(parent_list_lock);
+
+static int mdev_add_attribute_group(struct device *dev,
+   const struct attribute_group **groups)
+{
+   return sysfs_create_groups(&dev->kobj, groups);
+}
+
+static void mdev_remove_attribute_group(struct device *dev,
+   const struct attribute_group **groups)
+{
+   sysfs_remove_groups(&dev->kobj, groups);
+}
+
+/* Should be called holding parent->mdev_list_lock */
+static struct mdev_device *find_mdev_device(struct parent_device *parent,
+   uuid_le uuid, int instance)
+{
+   struct mdev_device *mdev;
+
+   list_for_each_entry(mdev, &parent->mdev_list, next) {
+   if ((uuid_le_cmp(mdev->uuid, uuid) == 0) &&
+   (mdev->instance == instance))
+   return mdev;
+   }
+   return NULL;
+}
+
+/* Should be called holding parent_list_lock */
+static struct parent_device *find_parent_device(struct 

[Qemu-devel] [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device

2016-08-03 Thread Kirti Wankhede
MPCI VFIO driver registers with MDEV core driver. MDEV core driver creates
mediated device and calls probe routine of MPCI VFIO driver. This driver
adds mediated device to VFIO core module.
Main aim of this module is to manage all VFIO APIs for each mediated PCI
device. Those are:
- get region information from vendor driver.
- trap and emulate PCI config space and BAR region.
- Send interrupt configuration information to vendor driver.
- Device reset
- mmap mappable region with invalidate mapping and fault on access to
  remap pfns. If validate_map_request() is not provided by vendor driver,
  fault handler maps physical devices region.
- Add and delete mappable region's physical mappings to mdev's mapping
  tracking logic.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
---
 drivers/vfio/mdev/Kconfig   |   6 +
 drivers/vfio/mdev/Makefile  |   1 +
 drivers/vfio/mdev/vfio_mpci.c   | 536 
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c|   1 +
 include/linux/vfio.h|   7 +
 6 files changed, 551 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index a34fbc66f92f..431ed595c8da 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,4 +9,10 @@ config VFIO_MDEV
 
 If you don't know what to do here, say N.
 
+config VFIO_MPCI
+tristate "VFIO support for Mediated PCI devices"
+depends on VFIO && PCI && VFIO_MDEV
+default n
+help
+VFIO based driver for mediated PCI devices.
 
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 56a75e689582..264fb03dd0e3 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_VFIO_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
 
diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
new file mode 100644
index ..9da94b76ae3e
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mpci.c
@@ -0,0 +1,536 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia <c...@nvidia.com>
+ *Kirti Wankhede <kwankh...@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC "VFIO based Mediated PCI device driver"
+
+struct vfio_mdev {
+   struct iommu_group *group;
+   struct mdev_device *mdev;
+   int refcnt;
+   struct vfio_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
+   struct mutexvfio_mdev_lock;
+};
+
+static int vfio_mpci_open(void *device_data)
+{
+   int ret = 0;
+   struct vfio_mdev *vmdev = device_data;
+   struct parent_device *parent = vmdev->mdev->parent;
+
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   mutex_lock(&vmdev->vfio_mdev_lock);
+   if (!vmdev->refcnt && parent->ops->get_region_info) {
+   int index;
+
+   for (index = VFIO_PCI_BAR0_REGION_INDEX;
+index < VFIO_PCI_NUM_REGIONS; index++) {
+   ret = parent->ops->get_region_info(vmdev->mdev, index,
+ &vmdev->vfio_region_info[index]);
+   if (ret)
+   goto open_error;
+   }
+   }
+
+   vmdev->refcnt++;
+
+open_error:
+   mutex_unlock(&vmdev->vfio_mdev_lock);
+   if (ret)
+   module_put(THIS_MODULE);
+
+   return ret;
+}
+
+static void vfio_mpci_close(void *device_data)
+{
+   struct vfio_mdev *vmdev = device_data;
+
+   mutex_lock(&vmdev->vfio_mdev_lock);
+   vmdev->refcnt--;
+   if (!vmdev->refcnt) {
+   memset(&vmdev->vfio_region_info, 0,
+   sizeof(vmdev->vfio_region_info));
+   }
+   mutex_unlock(&vmdev->vfio_mdev_lock);
+   module_put(THIS_MODULE);
+}
+
+static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
+{
+   loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);
+   struct parent_device *parent = mdev->parent;
+   u16 status;
+   u8  cap_ptr, cap_id = 0xff;
+
+   parent->ops->read(mdev, (char *)&status, sizeof(status)

[Qemu-devel] [PATCH v6 0/4] Add Mediated device support

2016-08-03 Thread Kirti Wankhede
This series adds Mediated device support to Linux host kernel. Purpose
of this series is to provide a common interface for mediated device
management that can be used by different devices. This series introduces
Mdev core module that create and manage mediated devices, VFIO based driver
for mediated PCI devices that are created by Mdev core module and update
VFIO type1 IOMMU module to support mediated devices.

What's new in v6?
- Removed per mdev_device lock for registration callbacks. Vendor driver should
  implement locking if they need to serialize registration callbacks.
- Added mapped region tracking logic and invalidation function to be used by
  vendor driver.
- Moved vfio_pin_pages and vfio_unpin_pages API from IOMMU type1 driver to vfio
  driver. Added callbacks to vfio ops structure to support pin and unpin APIs in
  backend iommu module.
- Used uuid_le_to_bin() to parse UUID string and convert to bin. This requires
  following commits from linux master branch:
* commit bc9dc9d5eec908806f1b15c9ec2253d44dcf7835 :
lib/uuid.c: use correct offset in uuid parser
* commit 2b1b0d66704a8cafe83be7114ec4c15ab3a314ad :
lib/uuid.c: introduce a few more generic helpers
- Requires below commits from linux master branch for mmap region fault handler
  that uses remap_pfn_range() to setup EPT properly.
* commit add6a0cd1c5ba51b201e1361b05a5df817083618
KVM: MMU: try to fix up page faults before giving up
* commit 92176a8ede577d0ff78ab3298e06701f67ad5f51 :
KVM: MMU: prepare to support mapping of VM_IO and VM_PFNMAP frames

Tested:
- Single vGPU VM
- Multiple vGPU VMs on same GPU


Thanks,
Kirti


Kirti Wankhede (4):
  vfio: Mediated device Core driver
  vfio: VFIO driver for mediated PCI device
  vfio iommu: Add support for mediated devices
  docs: Add Documentation for Mediated devices

 Documentation/vfio-mediated-device.txt | 235 
 drivers/vfio/Kconfig   |   1 +
 drivers/vfio/Makefile  |   1 +
 drivers/vfio/mdev/Kconfig  |  18 +
 drivers/vfio/mdev/Makefile |   6 +
 drivers/vfio/mdev/mdev_core.c  | 676 +
 drivers/vfio/mdev/mdev_driver.c| 142 +++
 drivers/vfio/mdev/mdev_private.h   |  33 ++
 drivers/vfio/mdev/mdev_sysfs.c | 269 +
 drivers/vfio/mdev/vfio_mpci.c  | 536 ++
 drivers/vfio/pci/vfio_pci_private.h|   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c   |   1 +
 drivers/vfio/vfio.c|  82 
 drivers/vfio/vfio_iommu_type1.c| 499 +---
 include/linux/mdev.h   | 236 
 include/linux/vfio.h   |  20 +-
 16 files changed, 2707 insertions(+), 54 deletions(-)
 create mode 100644 Documentation/vfio-mediated-device.txt
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0




Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver

2016-06-30 Thread Kirti Wankhede


On 6/30/2016 12:42 PM, Jike Song wrote:
> On 06/29/2016 09:51 PM, Xiao Guangrong wrote:
>> On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
>>> +   mutex_unlock(&parent_devices.list_lock);
>>> +   return parent;
>>> +}
>>> +
>>> +static int mdev_device_create_ops(struct mdev_device *mdev, char 
>>> *mdev_params)
>>> +{
>>> +   struct parent_device *parent = mdev->parent;
>>> +   int ret;
>>> +
>>> +   mutex_lock(&parent->ops_lock);
>>> +   if (parent->ops->create) {
>>> +   ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
>>> +   mdev->instance, mdev_params);
>>
>> I think it is better if we pass @mdev to this callback, then the parent 
>> driver
>> can do its specified operations and associate it with the instance,
>> e.g, via mdev->private.
>>
> 
> Just noticed that mdev->driver_data is missing in v5, I'd like to have it 
> back :)
>

Actually, I added mdev_get_drvdata() and mdev_set_drvdata() but I missed
earlier that mdev->dev->driver_data is used by vfio module to keep
reference of vfio_device. So adding driver_data to struct mdev_device
again and updating mdev_get_drvdata() and mdev_set_drvdata() as below.

 static inline void *mdev_get_drvdata(struct mdev_device *mdev)
 {
-   return dev_get_drvdata(&mdev->dev);
+   return mdev->driver_data;
 }

 static inline void mdev_set_drvdata(struct mdev_device *mdev, void *data)
 {
-   dev_set_drvdata(&mdev->dev, data);
+   mdev->driver_data = data;
 }


> Yes either mdev need to be passed to parent driver (preferred), or 
> find_mdev_device to
> be exported for parent driver (less preferred, but at least functional).
> 

Updating the create() argument to take the mdev.
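With mdev passed to create(), a vendor driver could then keep its state via
the drvdata helpers, e.g. (hypothetical example; struct vendor_state and the
destroy variant taking an mdev are made up for illustration):

struct vendor_state {
        void *hw_ctx;   /* vendor specific per-mdev bookkeeping */
};

static int vendor_create(struct mdev_device *mdev, char *mdev_params)
{
        struct vendor_state *state;

        state = kzalloc(sizeof(*state), GFP_KERNEL);
        if (!state)
                return -ENOMEM;

        mdev_set_drvdata(mdev, state);
        return 0;
}

static void vendor_destroy(struct mdev_device *mdev)
{
        struct vendor_state *state = mdev_get_drvdata(mdev);

        mdev_set_drvdata(mdev, NULL);
        kfree(state);
}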

Thanks,
Kirti.





Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver

2016-06-30 Thread Kirti Wankhede


On 6/29/2016 7:21 PM, Xiao Guangrong wrote:
> 
> 
> On 06/21/2016 12:31 AM, Kirti Wankhede wrote:
>> Design for Mediated Device Driver:
...
>> +static int mdev_add_attribute_group(struct device *dev,
>> +const struct attribute_group **groups)
>> +{
>> +return sysfs_create_groups(&dev->kobj, groups);
>> +}
>> +
>> +static void mdev_remove_attribute_group(struct device *dev,
>> +const struct attribute_group **groups)
>> +{
>> +sysfs_remove_groups(&dev->kobj, groups);
>> +}
>> +
> 
> better use device_add_groups() / device_remove_groups() instead?
> 

These are not exported from base module. They can't be used here.


>> +}
>> +
>> +static int mdev_device_create_ops(struct mdev_device *mdev, char
>> *mdev_params)
>> +{
>> +struct parent_device *parent = mdev->parent;
>> +int ret;
>> +
>> +mutex_lock(&parent->ops_lock);
>> +if (parent->ops->create) {
>> +ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
>> +mdev->instance, mdev_params);
> 
> I think it is better if we pass @mdev to this callback, then the parent
> driver
> can do its specified operations and associate it with the instance,
> e.g, via mdev->private.
> 

Yes, actually I was also thinking of changing it to

-   ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
- mdev->instance, mdev_params);
+   ret = parent->ops->create(mdev, mdev_params);


>> +int mdev_register_device(struct device *dev, const struct parent_ops
>> *ops)
>> +{
>> +int ret = 0;
>> +struct parent_device *parent;
>> +
>> +if (!dev || !ops)
>> +return -EINVAL;
>> +
>> +mutex_lock(&parent_devices.list_lock);
>> +
>> +/* Check for duplicate */
>> +parent = find_parent_device(dev);
>> +if (parent) {
>> +ret = -EEXIST;
>> +goto add_dev_err;
>> +}
>> +
>> +parent = kzalloc(sizeof(*parent), GFP_KERNEL);
>> +if (!parent) {
>> +ret = -ENOMEM;
>> +goto add_dev_err;
>> +}
>> +
>> +kref_init(&parent->ref);
>> +list_add(&parent->next, &parent_devices.dev_list);
>> +mutex_unlock(&parent_devices.list_lock);
> 
> It is not safe as Alex's already pointed it out.
> 
>> +
>> +parent->dev = dev;
>> +parent->ops = ops;
>> +mutex_init(&parent->ops_lock);
>> +mutex_init(&parent->mdev_list_lock);
>> +INIT_LIST_HEAD(&parent->mdev_list);
>> +init_waitqueue_head(&parent->release_done);
> 
> And no lock to protect these operations.
> 

As I replied to Alex also, yes I'm fixing it.
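Presumably along these lines, i.e. fully initialise the parent before
publishing it on the list and keep list_lock held until then (sketch):

        parent = kzalloc(sizeof(*parent), GFP_KERNEL);
        if (!parent) {
                ret = -ENOMEM;
                goto add_dev_err;
        }

        kref_init(&parent->ref);
        parent->dev = dev;
        parent->ops = ops;
        mutex_init(&parent->ops_lock);
        mutex_init(&parent->mdev_list_lock);
        INIT_LIST_HEAD(&parent->mdev_list);
        init_waitqueue_head(&parent->release_done);

        list_add(&parent->next, &parent_devices.dev_list);
        mutex_unlock(&parent_devices.list_lock);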

>> +void mdev_unregister_device(struct device *dev)
>> +{
>> +struct parent_device *parent;
>> +struct mdev_device *mdev, *n;
>> +int ret;
>> +
>> +mutex_lock(&parent_devices.list_lock);
>> +parent = find_parent_device(dev);
>> +
>> +if (!parent) {
>> +mutex_unlock(&parent_devices.list_lock);
>> +return;
>> +}
>> +dev_info(dev, "MDEV: Unregistering\n");
>> +
>> +/*
>> + * Remove parent from the list and remove create and destroy sysfs
>> + * files so that no new mediated device could be created for this
>> parent
>> + */
>> +list_del(&parent->next);
>> +mdev_remove_sysfs_files(dev);
>> +mutex_unlock(&parent_devices.list_lock);
>> +
> 
> find_parent_device() does not increase the refcount of the parent-device,
> after releasing the lock, is it still safe to use the device?
> 

Yes. In mdev_register_device(), kref_init() initialises the refcount to 1;
the refcount is then incremented when an mdev child is created and
decremented when a child mdev is destroyed. So when all child mdevs are
destroyed, the refcount will still be 1 until mdev_unregister_device() is
called. So even when no mdev device is created, mdev_register_device() holds
the parent's refcount, and it is released from mdev_unregister_device().
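For reference, the get/put pairing can be wrapped in small helpers like these
(illustrative; mdev_release_parent() would be the kref release callback):

static struct parent_device *mdev_get_parent(struct parent_device *parent)
{
        if (parent)
                kref_get(&parent->ref);

        return parent;
}

static void mdev_put_parent(struct parent_device *parent)
{
        if (parent)
                kref_put(&parent->ref, mdev_release_parent);
}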


>> +mutex_lock(&parent->ops_lock);
>> +mdev_remove_attribute_group(dev,
>> +parent->ops->dev_attr_groups);
> 
> Why mdev_remove_sysfs_files() and mdev_remove_attribute_group()
> are protected by different locks?
>

As mentioned in reply to Alex on another thread, removing these locks.

>> +
>> +int mdev_device_destroy(struct device *dev, uuid_le uuid, uint32_t
>> instance)
>> +{
>> +struct mdev_device *mdev;
>> +struct parent_device *parent;
>> +int ret;
>> +
>>

Re: [Qemu-devel] [PATCH 2/3] VFIO driver for mediated PCI device

2016-06-30 Thread Kirti Wankhede


On 6/29/2016 8:24 AM, Alex Williamson wrote:
> On Wed, 29 Jun 2016 00:15:23 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> On 6/25/2016 1:15 AM, Alex Williamson wrote:
>>> On Sat, 25 Jun 2016 00:04:27 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>   
>>
>>>>>> +
>>>>>> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
>>>>>> +{
>>>>>> +/* Don't support MSIX for now */
>>>>>> +if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
>>>>>> +return -1;
>>>>>> +
>>>>>> +return 1;
>>>>>
>>>>> Too much hard coding here, the mediated driver should define this.
>>>>> 
>>>>
>>>> I'm testing INTX and MSI, I don't have a way to test MSIX for now. So we
>>>> thought we can add supported for MSIX later. Till then hard code it to 1.  
>>>
>>> To me it screams that there needs to be an interface to the mediated
>>> device here.  How do you even know that the mediated device intends to
>>> support MSI?  What if it wants to emulated a VF and not support INTx?
>>> This is basically just a big "TODO" flag that needs to be addressed
>>> before a non-RFC.
>>>   
>>
>> VFIO user space app reads emulated PCI config space of mediated device.
>> In PCI capability list when MSI capability (PCI_CAP_ID_MSI) is present,
>> it calls VFIO_DEVICE_SET_IRQS ioctl with irq_set->index set to
>> VFIO_PCI_MSI_IRQ_INDEX.
>> Similarly, MSIX is identified from emulated config space of mediated
>> device that checks if MSI capability is present and number of vectors
>> extracted from PCI_MSI_FLAGS_QSIZE flag.
>> vfio_mpci modules don't need to query it from vendor driver of mediated
>> device. Depending on which interrupt to support, mediated driver should
>> emulate PCI config space.
> 
> Are you suggesting that if the user can determine which interrupts are
> supported and the various counts for each by querying the PCI config
> space of the mediated device then this interface should do the same,
> much like vfio_pci_get_irq_count(), such that it can provide results
> consistent with config space?  That I'm ok with.  Having the user find
> one IRQ count as they read PCI config space and another via the vfio
> API, I'm not ok with.  Thanks,
> 

Yes, it will be more like vfio_pci_get_irq_count(). I will have
mdev_get_irq_count() updated with that change in the next version of the patch.
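A rough sketch of what that could look like; mpci_config_read16() is a
hypothetical helper that reads 16 bits from the emulated config space via
parent->ops->read():

static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
{
        struct mdev_device *mdev = vmdev->mdev;

        if (irq_type == VFIO_PCI_INTX_IRQ_INDEX)
                return 1;       /* could also check the interrupt pin */

        if (irq_type == VFIO_PCI_MSI_IRQ_INDEX) {
                u8 pos = mpci_find_pci_capability(mdev, PCI_CAP_ID_MSI);
                u16 flags;

                if (!pos)
                        return 0;

                flags = mpci_config_read16(mdev, pos + PCI_MSI_FLAGS);
                return 1 << ((flags & PCI_MSI_FLAGS_QMASK) >> 1);
        }

        return 0;
}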

Thanks,
Kirti.





Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver

2016-06-30 Thread Kirti Wankhede


On 6/25/2016 1:10 AM, Alex Williamson wrote:
> On Fri, 24 Jun 2016 23:24:58 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> Alex,
>>
>> Thanks for taking closer look. I'll incorporate all the nits you suggested.
>>
>> On 6/22/2016 3:00 AM, Alex Williamson wrote:
>>> On Mon, 20 Jun 2016 22:01:46 +0530
>>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>>>  
...
>>>> +create_ops_err:
>>>> +  mutex_unlock(&parent->ops_lock);
>>>
>>> It seems like ops_lock isn't used so much as a lock as a serialization
>>> mechanism.  Why?  Where is this serialization per parent device
>>> documented?
>>>  
>>
>> parent->ops_lock is to serialize parent device callbacks to vendor
>> driver, i.e supported_config(), create() and destroy().
>> mdev->ops_lock is to serialize mediated device related callbacks to
>> vendor driver, i.e. start(), stop(), read(), write(), set_irqs(),
>> get_region_info(), validate_map_request().
>> Its not documented, I'll add comments to mdev.h about these locks.
> 
> Should it be the mediated driver core's responsibility to do this?  If
> a given mediated driver wants to serialize on their own, they can do
> that, but I don't see why we would impose that on every mediated driver.
> 

Ok. Removing these locks from here, so it would be the mediated driver's
responsibility to serialize if it needs to.

>>
>>>> +
>>>> +struct pci_region_info {
>>>> +  uint64_t start;
>>>> +  uint64_t size;
>>>> +  uint32_t flags; /* VFIO region info flags */
>>>> +};
>>>> +
>>>> +enum mdev_emul_space {
>>>> +  EMUL_CONFIG_SPACE,  /* PCI configuration space */
>>>> +  EMUL_IO,/* I/O register space */
>>>> +  EMUL_MMIO   /* Memory-mapped I/O space */
>>>> +};  
>>>
>>>
>>> I'm still confused why this is needed, perhaps a description here would
>>> be useful so I can stop asking.  Clearly config space is PCI only, so
>>> it's strange to have it in the common code.  Everyone not on x86 will
>>> say I/O space is also strange.  I can't keep it in my head why the
>>> read/write offsets aren't sufficient for the driver to figure out what
>>> type it is.
>>>
>>>  
>>
>> Now that VFIO_PCI_OFFSET_* macros are moved to vfio.h which vendor
>> driver can also use, above enum could be removed from read/write. But
>> again these macros are useful when parent device is PCI device. How
>> would non-pci parent device differentiate IO ports and MMIO?
> 
> Moving VFIO_PCI_OFFSET_* to vfio.h already worries me, the vfio api
> does not impose fixed offsets, it's simply an implementation detail of
> vfio-pci.  We should be free to change that whenever we want and not
> break userspace.  By moving it to vfio.h and potentially having
> external mediated drivers depend on those offset macros, they now become
> part of the kABI.  So more and more, I'd prefer that reads/writes/mmaps
> get passed directly to the mediated driver, let them define which
> offset is which, the core is just a passthrough.  For non-PCI devices,
> like platform devices, the indexes are implementation specific, the
> user really needs to know how to work with the specific device and how
> it defines device mmio to region indexes.
>  

Ok. With this vfio_mpci looks simple.

Thanks,
Kirti.

>>>> +  int (*get_region_info)(struct mdev_device *vdev, int region_index,
>>>> +   struct pci_region_info *region_info);  
>>>
>>> This can't be //pci_//region_info.  How do you intend to support things
>>> like sparse mmap capabilities in the user REGION_INFO ioctl when such
>>> things are not part of the mediated device API?  Seems like the driver
>>> should just return a buffer.
>>>  
>>
>> If not pci_region_info, can use vfio_region_info here, even to fetch
>> sparce mmap capabilities from vendor driver?
> 
> Sure, you can use vfio_region_info, then it's just a pointer to a
> buffer allocated by the callee and the mediated core is just a
> passthrough, which is probably how it should be.  Thanks,
> 
> Alex
> 



Re: [Qemu-devel] [PATCH 2/3] VFIO driver for mediated PCI device

2016-06-28 Thread Kirti Wankhede


On 6/25/2016 1:15 AM, Alex Williamson wrote:
> On Sat, 25 Jun 2016 00:04:27 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 

>>>> +
>>>> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
>>>> +{
>>>> +  /* Don't support MSIX for now */
>>>> +  if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
>>>> +  return -1;
>>>> +
>>>> +  return 1;  
>>>
>>> Too much hard coding here, the mediated driver should define this.
>>>   
>>
>> I'm testing INTX and MSI, I don't have a way to test MSIX for now. So we
>> thought we can add supported for MSIX later. Till then hard code it to 1.
> 
> To me it screams that there needs to be an interface to the mediated
> device here.  How do you even know that the mediated device intends to
> support MSI?  What if it wants to emulated a VF and not support INTx?
> This is basically just a big "TODO" flag that needs to be addressed
> before a non-RFC.
> 

The VFIO user space app reads the emulated PCI config space of the mediated
device. When the MSI capability (PCI_CAP_ID_MSI) is present in the PCI
capability list, it calls the VFIO_DEVICE_SET_IRQS ioctl with irq_set->index
set to VFIO_PCI_MSI_IRQ_INDEX.
Similarly, MSIX is identified from the emulated config space of the mediated
device by checking whether the capability is present, and the number of
vectors is extracted from the PCI_MSI_FLAGS_QSIZE field.
The vfio_mpci module doesn't need to query this from the vendor driver of the
mediated device. Depending on which interrupts it supports, the mediated
driver should emulate the PCI config space accordingly.

Thanks,
Kirti.






Re: [Qemu-devel] [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices

2016-06-28 Thread Kirti Wankhede


On 6/22/2016 9:16 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:48 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>>  
>>  struct vfio_iommu {
>>  struct list_headdomain_list;
>> +struct vfio_domain  *mediated_domain;
> 
> I'm not really a fan of how this is so often used to special case the
> code...
> 
>>  struct mutexlock;
>>  struct rb_root  dma_list;
>>  boolv2;
>> @@ -67,6 +69,13 @@ struct vfio_domain {
>>  struct list_headgroup_list;
>>  int prot;   /* IOMMU_CACHE */
>>  boolfgsp;   /* Fine-grained super pages */
>> +
>> +/* Domain for mediated device which is without physical IOMMU */
>> +boolmediated_device;
> 
> But sometimes we use this to special case the code and other times we
> use domain_list being empty.  I thought the argument against pulling
> code out to a shared file was that this approach could be made
> maintainable.
> 

Functions that take a struct vfio_domain *domain argument and are intended
to operate on that domain only check if (domain->mediated_device), e.g.
map_try_harder(), vfio_iommu_replay(), vfio_test_domain_fgsp(). The checks
in these functions could be removed, but then it would be the caller's
responsibility to make sure these functions are not called for the
mediated_domain.
Whereas functions that take a struct vfio_iommu *iommu argument and traverse
domain_list to find a domain, or operate on each domain in domain_list,
check if (list_empty(&iommu->domain_list)), e.g. vfio_unmap_unpin(),
vfio_iommu_map(), vfio_dma_do_map().
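To illustrate the two styles (simplified sketches of the functions mentioned
above):

/* per-domain helpers bail out for the mediated domain: */
static void vfio_test_domain_fgsp(struct vfio_domain *domain)
{
        if (domain->mediated_device)
                return;
        /* ... probe fine-grained super page support on the hw IOMMU ... */
}

/* per-iommu helpers key off domain_list being empty: */
static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
                          unsigned long pfn, long npage, int prot)
{
        if (list_empty(&iommu->domain_list))
                return 0;       /* only the mediated domain, nothing to map */
        /* ... map into every domain on domain_list ... */
        return 0;
}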


>> +
>> +struct mm_struct*mm;
>> +struct rb_root  pfn_list;   /* pinned Host pfn list */
>> +struct mutexpfn_list_lock;  /* mutex for pfn_list */
> 
> Seems like we could reduce overhead for the existing use cases by just
> adding a pointer here and making these last 3 entries part of the
> structure that gets pointed to.  Existence of the pointer would replace
> @mediated_device.
>

Ok.

>>  };
>>  
>>  struct vfio_dma {
>> @@ -79,10 +88,26 @@ struct vfio_dma {
>>  
>>  struct vfio_group {
>>  struct iommu_group  *iommu_group;
>> +#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
> 
> Where does CONFIG_MDEV_MODULE come from?
> 
> Plus, all the #ifdefs... 
> 

The MDEV config option is tristate, and when it is selected as a module,
CONFIG_MDEV_MODULE is set in include/generated/autoconf.h.
The symbols mdev_bus_type, mdev_get_device_by_group() and mdev_put_device()
are only available when the MDEV option is selected as built-in or modular.
If the MDEV option is not selected, the vfio_iommu_type1 module should still
work for direct device assignment. Without these #ifdefs, the
vfio_iommu_type1 module fails to load with undefined symbols when MDEV is
not selected.
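(For what it's worth, the same check can be written with IS_ENABLED(), which
covers both built-in and modular, and stub inlines in a header would keep the
#ifdefs out of vfio_iommu_type1.c; a sketch only, not what this patch does:)

#if IS_ENABLED(CONFIG_MDEV)
struct mdev_device *mdev_get_device_by_group(struct iommu_group *group);
void mdev_put_device(struct mdev_device *mdev);
#else
static inline struct mdev_device *
mdev_get_device_by_group(struct iommu_group *group)
{
        return NULL;
}

static inline void mdev_put_device(struct mdev_device *mdev)
{
}
#endif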

>> +struct mdev_device  *mdev;
> 
> This gets set on attach_group where we use the iommu_group to lookup
> the mdev, so why can't we do that on the other paths that make use of
> this?  I think this is just holding a reference.
> 

The mdev is retrieved from attach_group for 2 reasons:
1. To increase the refcount of the mdev, via mdev_get_device_by_group(), when
its iommu_group is attached. That should be decremented, by
mdev_put_device(), while detaching its iommu_group. This makes sure that the
mdev is not freed until its iommu_group is detached from the container.

2. To save a reference to iommu_data that the vendor driver would use to call
vfio_pin_pages() and vfio_unpin_pages(). More details below.



>> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
>> +static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
>> + int prot, unsigned long *pfn)
>>  {
>>  struct page *page[1];
>>  struct vm_area_struct *vma;
>> +struct mm_struct *local_mm = mm;
>>  int ret = -EFAULT;
>>  
>> -if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
>> +if (!local_mm && !current->mm)
>> +return -ENODEV;
>> +
>> +if (!local_mm)
>> +local_mm = current->mm;
> 
> The above would be much more concise if we just initialized local_mm
> as: mm ? mm : current->mm
> 
>> +
>> +down_read(&local_mm->mmap_sem);
>> +if (get_user_pages_remote(NULL, local_mm, vaddr, 1,
>> +!!(prot & IOMMU_WRITE), 0, page, NULL) == 1) {
> 
> Um, the comment for get_user_pages_remote says:
> 
> "See also get_user_pages_fast, for performan

Re: [Qemu-devel] [PATCH 2/3] VFIO driver for mediated PCI device

2016-06-24 Thread Kirti Wankhede
Thanks Alex.


On 6/22/2016 4:18 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:47 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
>> +
>> +static int get_mdev_region_info(struct mdev_device *mdev,
>> +struct pci_region_info *vfio_region_info,
>> +int index)
>> +{
>> +int ret = -EINVAL;
>> +struct parent_device *parent = mdev->parent;
>> +
>> +if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) {
>> +mutex_lock(&parent->ops_lock);
>> +ret = parent->ops->get_region_info(mdev, index,
>> +vfio_region_info);
>> +mutex_unlock(&parent->ops_lock);
> 
> Why do we have two ops_lock, one on the parent_device and one on the
> mdev_device?!  Is this one actually locking anything or also just
> providing serialization?  Why do some things get serialized at the
> parent level and some things at the device level?  Very confused by
> ops_lock.
>

There are two sets of callbacks:
* parent device callbacks: supported_config, create, destroy, start, stop
* mdev device callbacks: read, write, set_irqs, get_region_info,
validate_map_request

parent->ops_lock is to serialize per parent device callbacks.
mdev->ops_lock is to serialize per mdev device callbacks.

I'll add the above comment to mdev.h.


>> +
>> +static int mdev_get_irq_count(struct vfio_mdev *vmdev, int irq_type)
>> +{
>> +/* Don't support MSIX for now */
>> +if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
>> +return -1;
>> +
>> +return 1;
> 
> Too much hard coding here, the mediated driver should define this.
> 

I'm testing INTX and MSI; I don't have a way to test MSIX for now. So we
thought we could add support for MSIX later. Till then it is hard coded to 1.

>> +
>> +if (parent && parent->ops->set_irqs) {
>> +mutex_lock(&parent->ops_lock);
>> +ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
>> +hdr.start, hdr.count, data);
>> +mutex_unlock(&parent->ops_lock);
> 
> Device level serialization on set_irqs... interesting.
> 

Hope answer above helps to clarify this.


>> +}
>> +
>> +kfree(ptr);
>> +return ret;
>> +}
>> +}
>> +return -ENOTTY;
>> +}
>> +
>> +ssize_t mdev_dev_config_rw(struct vfio_mdev *vmdev, char __user *buf,
>> +   size_t count, loff_t *ppos, bool iswrite)
>> +{
>> +struct mdev_device *mdev = vmdev->mdev;
>> +struct parent_device *parent = mdev->parent;
>> +int size = vmdev->vfio_region_info[VFIO_PCI_CONFIG_REGION_INDEX].size;
>> +int ret = 0;
>> +uint64_t pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> +
>> +if (pos < 0 || pos >= size ||
>> +pos + count > size) {
>> +pr_err("%s pos 0x%llx out of range\n", __func__, pos);
>> +ret = -EFAULT;
>> +goto config_rw_exit;
>> +}
>> +
>> +if (iswrite) {
>> +char *usr_data, *ptr;
>> +
>> +ptr = usr_data = memdup_user(buf, count);
>> +if (IS_ERR(usr_data)) {
>> +ret = PTR_ERR(usr_data);
>> +goto config_rw_exit;
>> +}
>> +
>> +ret = parent->ops->write(mdev, usr_data, count,
>> +  EMUL_CONFIG_SPACE, pos);
> 
> No serialization on this ops, thank goodness, but why?
>

It's there at the caller of mdev_dev_rw().


> This read/write interface still seems strange to me...
> 

Replied on this in 1st Patch.

>> +
>> +memcpy((void *)(vmdev->vconfig + pos), (void *)usr_data, count);
>> +kfree(ptr);
>> +} else {
>> +char *ret_data, *ptr;
>> +
>> +ptr = ret_data = kzalloc(count, GFP_KERNEL);
>> +
>> +if (IS_ERR(ret_data)) {
>> +ret = PTR_ERR(ret_data);
>> +goto config_rw_exit;
>> +}
>> +
>> +ret = parent->ops->read(mdev, ret_data, count,
>> +EMUL_CONFIG_SPACE, pos);
>> +
>> +if (ret > 0) {
>> +if (copy_to_user(buf, ret_data, ret))
>> +ret = -EFAULT;
>> +  

Re: [Qemu-devel] [PATCH 1/3] Mediated device Core driver

2016-06-24 Thread Kirti Wankhede
Alex,

Thanks for taking a closer look. I'll incorporate all the nits you suggested.

On 6/22/2016 3:00 AM, Alex Williamson wrote:
> On Mon, 20 Jun 2016 22:01:46 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>
...
>> +
>> +config MDEV
>> +tristate "Mediated device driver framework"
>> +depends on VFIO
>> +default n
>> +help
>> +MDEV provides a framework to virtualize device without SR-IOV cap
>> +See Documentation/mdev.txt for more details.
> 
> Documentation pointer still doesn't exist.  Perhaps this file would be
> a more appropriate place than the commit log for some of the
> information above.
> 

Sure, I'll add these details to documentation.

> Every time I review this I'm struggling to figure out why this isn't
> VFIO_MDEV since it's really tied to vfio and difficult to evaluate it
> as some sort of standalone mediated device interface.  I don't know
> the answer, but it always strikes me as a discontinuity.
> 

Ok. I'll change to VFIO_MDEV

>> +
>> +static struct mdev_device *find_mdev_device(struct parent_device *parent,
>> +uuid_le uuid, int instance)
>> +{
>> +struct mdev_device *mdev = NULL, *p;
>> +
>> +list_for_each_entry(p, &parent->mdev_list, next) {
>> +if ((uuid_le_cmp(p->uuid, uuid) == 0) &&
>> +(p->instance == instance)) {
>> +mdev = p;
> 
> Locking here is still broken, the callers are create and destroy, which
> can still race each other and themselves.
>

Fixed it.

>> +
>> +static int mdev_device_create_ops(struct mdev_device *mdev, char 
>> *mdev_params)
>> +{
>> +struct parent_device *parent = mdev->parent;
>> +int ret;
>> +
>> +mutex_lock(&parent->ops_lock);
>> +if (parent->ops->create) {
> 
> How would a parent_device without ops->create or ops->destroy useful?
> Perhaps mdev_register_driver() should enforce required ops.  mdev.h
> should at least document which ops are optional if they really are
> optional.

Makes sense, adding a check in mdev_register_driver() to mandate create
and destroy in ops. I'll also update the comments in mdev.h for
mandatory and optional ops.

> 
>> +ret = parent->ops->create(mdev->dev.parent, mdev->uuid,
>> +mdev->instance, mdev_params);
>> +if (ret)
>> +goto create_ops_err;
>> +}
>> +
>> +ret = mdev_add_attribute_group(&mdev->dev,
>> +parent->ops->mdev_attr_groups);
> 
> An error here seems to put us in a bad place, the device is created but
> the attributes are broken, is it the caller's responsibility to
> destroy?  Seems like we need a cleanup if this fails.
> 

Right, adding cleanup here.

>> +create_ops_err:
>> +mutex_unlock(&parent->ops_lock);
> 
> It seems like ops_lock isn't used so much as a lock as a serialization
> mechanism.  Why?  Where is this serialization per parent device
> documented?
>

parent->ops_lock is to serialize parent device callbacks to the vendor
driver, i.e. supported_config(), create() and destroy().
mdev->ops_lock is to serialize mediated device related callbacks to the
vendor driver, i.e. start(), stop(), read(), write(), set_irqs(),
get_region_info(), validate_map_request().
It's not documented; I'll add comments to mdev.h about these locks.


>> +return ret;
>> +}
>> +
>> +static int mdev_device_destroy_ops(struct mdev_device *mdev, bool force)
>> +{
>> +struct parent_device *parent = mdev->parent;
>> +int ret = 0;
>> +
>> +/*
>> + * If vendor driver doesn't return success that means vendor
>> + * driver doesn't support hot-unplug
>> + */
>> +mutex_lock(&parent->ops_lock);
>> +if (parent->ops->destroy) {
>> +ret = parent->ops->destroy(parent->dev, mdev->uuid,
>> +   mdev->instance);
>> +if (ret && !force) {
> 
> It seems this is not so much a 'force' but an ignore errors, we never
> actually force the mdev driver to destroy the device... which makes me
> wonder if there are leaks there.
> 

Consider a case where the VM is running or is in the teardown path and the
parent device is unbound from the vendor driver; then the vendor driver would
call mdev_unregister_device() from its remove() call. Even if
parent->ops->destroy() returns an error, that could also mean that
hot-unplug is not supported, but we still have to destroy the mdev device.
The remove() call doesn't honor error retu

[Qemu-devel] [PATCH 3/3] VFIO Type1 IOMMU: Add support for mediated devices

2016-06-20 Thread Kirti Wankhede
The VFIO Type1 IOMMU driver is designed for devices which are IOMMU
capable. A mediated device only uses the IOMMU TYPE1 API; the underlying
hardware can be managed by an IOMMU domain.

This change exports functions to pin and unpin pages for mediated devices.
It maintains data of pinned pages for mediated domain. This data is used to
verify unpinning request and to unpin remaining pages from detach_group()
if there are any.

Aim of this change is:
- To use most of the code of IOMMU driver for mediated devices
- To support direct assigned device and mediated device by single module

Updated the change to keep mediated domain structure out of domain_list.

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- two GPU pass through

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I295d6f0f2e0579b8d9882bfd8fd5a4194b97bd9a
---
 drivers/vfio/vfio_iommu_type1.c | 444 +---
 include/linux/vfio.h|   6 +
 2 files changed, 418 insertions(+), 32 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..f17dd104fe27 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.william...@redhat.com>"
@@ -55,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
struct list_head domain_list;
+   struct vfio_domain  *mediated_domain;
struct mutex lock;
struct rb_root  dma_list;
bool v2;
@@ -67,6 +69,13 @@ struct vfio_domain {
struct list_head group_list;
int prot;   /* IOMMU_CACHE */
bool fgsp;   /* Fine-grained super pages */
+
+   /* Domain for mediated device which is without physical IOMMU */
+   bool mediated_device;
+
+   struct mm_struct *mm;
+   struct rb_root  pfn_list;   /* pinned Host pfn list */
+   struct mutex pfn_list_lock;  /* mutex for pfn_list */
 };
 
 struct vfio_dma {
@@ -79,10 +88,26 @@ struct vfio_dma {
 
 struct vfio_group {
struct iommu_group  *iommu_group;
+#if defined(CONFIG_MDEV) || defined(CONFIG_MDEV_MODULE)
+   struct mdev_device  *mdev;
+#endif
struct list_head next;
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+   struct rb_node  node;
+   unsigned long   vaddr;  /* virtual addr */
+   dma_addr_t  iova;   /* IOVA */
+   unsigned long   npage;  /* number of pages */
+   unsigned long   pfn;/* Host pfn */
+   size_t  prot;
+   atomic_t ref_count;
+};
+
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +155,64 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, 
struct vfio_dma *old)
rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+ unsigned long pfn)
+{
+   struct rb_node *node;
+   struct vfio_pfn *vpfn, *ret = NULL;
+
+   mutex_lock(&domain->pfn_list_lock);
+   node = domain->pfn_list.rb_node;
+
+   while (node) {
+   vpfn = rb_entry(node, struct vfio_pfn, node);
+
+   if (pfn < vpfn->pfn)
+   node = node->rb_left;
+   else if (pfn > vpfn->pfn)
+   node = node->rb_right;
+   else {
+   ret = vpfn;
+   break;
+   }
+   }
+
+   mutex_unlock(&domain->pfn_list_lock);
+   return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+   struct rb_node **link, *parent = NULL;
+   struct vfio_pfn *vpfn;
+
+   mutex_lock(&domain->pfn_list_lock);
+   link = &domain->pfn_list.rb_node;
+   while (*link) {
+   parent = *link;
+   vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+   if (new->pfn < vpfn->pfn)
+   link = &(*link)->rb_left;
+   else
+   link = &(*link)->rb_right;
+   }
+
+   rb_link_node(&new->node, parent, link);
+   rb_insert_color(&new->node, &domain->pfn_list);
+   mutex_unlock(&domain->pfn_list_lock);
+}
+
+/* call by holding domain->pfn_list_lock */
+static void vfio_unlink_pfn(struct vfio_domain *domain, st

[Qemu-devel] [PATCH 1/3] Mediated device Core driver

2016-06-20 Thread Kirti Wankhede
Design for Mediated Device Driver:
Main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.

Below is the high Level block diagram, with Nvidia, Intel and IBM devices
as example, since these are the devices which are going to actively use
this module as of now.

 +---+
 |   |
 | +---+ |  mdev_register_driver() +--+
 | |   | +<+ __init() |
 | |   | | |  |
 | |  mdev | +>+  |<-> VFIO user
 | |  bus  | | probe()/remove()| vfio_mpci.ko |APIs
 | |  driver   | | |  |
 | |   | | +--+
 | |   | |  mdev_register_driver() +--+
 | |   | +<+ __init() |
 | |   | | |  |
 | |   | +>+  |<-> VFIO user
 | +---+ | probe()/remove()| vfio_mccw.ko |APIs
 |   | |  |
 |  MDEV CORE| +--+
 |   MODULE  |
 |   mdev.ko |
 | +---+ |  mdev_register_device() +--+
 | |   | +<+  |
 | |   | | |  nvidia.ko   |<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--+
 | | interface | |<+  |
 | |   | | |  i915.ko |<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | |   | |
 | |   | |  mdev_register_device() +--+
 | |   | +<+  |
 | |   | | | ccw_device.ko|<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | +---+ |
 +---+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove:called when device removed
  * @match: called when new device or driver is added for this bus.
Return 1 if given device can be handled by given driver and
zero otherwise.
  * @driver:device driver structure
  *
  **/
struct mdev_driver {
 const char *name;
 int  (*probe)  (struct device *dev);
 void (*remove) (struct device *dev);
 int  (*match)(struct device *dev);
 struct device_driver driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

The driver for mediated devices should use this interface to register
with the core driver. With this, the mediated device driver is
responsible for adding the mediated device to the VFIO group.
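For illustration, a minimal mediated bus driver registration could look like
the following (driver name and callback bodies are placeholders):

static int sample_mdev_probe(struct device *dev)
{
	/* bind driver state (e.g. a VFIO device) to the mediated device */
	return 0;
}

static void sample_mdev_remove(struct device *dev)
{
	/* tear down whatever probe() set up */
}

static struct mdev_driver sample_mdev_driver = {
	.name   = "sample_mdev",
	.probe  = sample_mdev_probe,
	.remove = sample_mdev_remove,
};

static int __init sample_mdev_init(void)
{
	return mdev_register_driver(&sample_mdev_driver, THIS_MODULE);
}
module_init(sample_mdev_init);

static void __exit sample_mdev_exit(void)
{
	mdev_unregister_driver(&sample_mdev_driver);
}
module_exit(sample_mdev_exit);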

2. Physical device driver interface
This interface provides vendor driver the set APIs to manage physical
device related work in their own driver. APIs are :
- supported_config: provide supported configuration list by the vendor
driver
- create: to allocate basic resources in vendor driver for a mediated
  device.
- destroy: to free resources in vendor driver when mediated device is
   destroyed.
- start: to initiate mediated device initialization process from vendor
 driver when VM boots and before QEMU starts.
- shutdown: to teardown mediated device resources during VM teardown.
- read : read emulation callback.
- write: write emulation callback.
- set_irqs: send interrupt configuration information that QEMU sets.
- get_region_info: to provide region size and its flags for the mediated
   device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by vendor drivers to register
each physical device to mdev core driver.
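For illustration, a vendor driver would fill in these callbacks and register
its physical device roughly as below (the ops structure/field names and the
mdev_register_device() signature follow this series but are assumptions
here; the sample_* callbacks are placeholders):

static const struct parent_ops sample_parent_ops = {
	.owner            = THIS_MODULE,
	.supported_config = sample_supported_config,
	.create           = sample_create,
	.destroy          = sample_destroy,
	.start            = sample_start,
	.shutdown         = sample_shutdown,
	.read             = sample_read,
	.write            = sample_write,
	.set_irqs         = sample_set_irqs,
	.get_region_info  = sample_get_region_info,
};

/* called from the vendor driver's probe() for the physical device */
static int sample_physical_probe(struct pci_dev *pdev)
{
	return mdev_register_device(&pdev->dev, &sample_parent_ops);
}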

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I73a5084574270b14541c529461ea2f03c292d510
---
 drivers/vfio/Kconfig |   1 +
 drivers/vfio/Makefile|   1 +
 drivers/vfio/mdev/Kconfig|  11 +
 drivers/vfio/mdev/Makefile   |   5 +
 drivers/vfio/mdev/md

[Qemu-devel] [RFC PATCH v5 0/3] Add Mediated device support

2016-06-20 Thread Kirti Wankhede
This series adds Mediated device support to v4.6 Linux host kernel. Purpose
of this series is to provide a common interface for mediated device
management that can be used by different devices. This series introduces
Mdev core module that create and manage mediated devices, VFIO based driver
for mediated PCI devices that are created by Mdev core module and update
VFIO type1 IOMMU module to support mediated devices.

What's new in v5?
- Improved mdev_put_device() and mdev_get_device() for mediated devices and
  locking for per mdev_device registration callbacks.

What's left to do?
- Issues with mmap region fault handler, EPT is not correctly populated with the
  information provided by remap_pfn_range() inside fault handler.

- mmap invalidation mechanism will be added once above issue gets resolved.

Tested:
- Single vGPU VM
- Multiple vGPU VMs on same GPU


Thanks,
Kirti


Kirti Wankhede (3):
  Mediated device Core driver
  VFIO driver for mediated PCI device
  VFIO Type1 IOMMU: Add support for mediated devices

 drivers/vfio/Kconfig|   1 +
 drivers/vfio/Makefile   |   1 +
 drivers/vfio/mdev/Kconfig   |  18 +
 drivers/vfio/mdev/Makefile  |   6 +
 drivers/vfio/mdev/mdev_core.c   | 595 
 drivers/vfio/mdev/mdev_driver.c | 138 
 drivers/vfio/mdev/mdev_private.h|  33 ++
 drivers/vfio/mdev/mdev_sysfs.c  | 300 +
 drivers/vfio/mdev/vfio_mpci.c   | 654 
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c|   1 +
 drivers/vfio/vfio_iommu_type1.c | 444 ++--
 include/linux/mdev.h| 232 +
 include/linux/vfio.h|  13 +
 14 files changed, 2404 insertions(+), 38 deletions(-)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev_core.c
 create mode 100644 drivers/vfio/mdev/mdev_driver.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/mdev_sysfs.c
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0




[Qemu-devel] [PATCH 2/3] VFIO driver for mediated PCI device

2016-06-20 Thread Kirti Wankhede
VFIO driver registers with MDEV core driver. MDEV core driver creates
mediated device and calls probe routine of MPCI VFIO driver. This MPCI
VFIO driver adds mediated device to VFIO core module.
Main aim of this module is to manage all VFIO APIs for each mediated PCI
device.
Those are:
- get region information from vendor driver.
- trap and emulate PCI config space and BAR region.
- Send interrupt configuration information to vendor driver.
- mmap mappable region with invalidate mapping and fault on access to
  remap pfn.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I583f4734752971d3d112324d69e2508c88f359ec
---
 drivers/vfio/mdev/Kconfig   |   7 +
 drivers/vfio/mdev/Makefile  |   1 +
 drivers/vfio/mdev/vfio_mpci.c   | 654 
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c|   1 +
 include/linux/vfio.h|   7 +
 6 files changed, 670 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 951e2bb06a3f..8d9e78aaa80f 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,3 +9,10 @@ config MDEV
 
 If you don't know what do here, say N.
 
+config VFIO_MPCI
+tristate "VFIO support for Mediated PCI devices"
+depends on VFIO && PCI && MDEV
+default n
+help
+VFIO based driver for mediated PCI devices.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 2c6d11f7bc24..cd5e7625e1ec 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev_core.o mdev_sysfs.o mdev_driver.o
 
 obj-$(CONFIG_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
 
diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
new file mode 100644
index ..267879a05c39
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mpci.c
@@ -0,0 +1,654 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia <c...@nvidia.com>
+ *Kirti Wankhede <kwankh...@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC "VFIO based Mediated PCI device driver"
+
+struct vfio_mdev {
+   struct iommu_group *group;
+   struct mdev_device *mdev;
+   int refcnt;
+   struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
+   u8  *vconfig;
+   struct mutex vfio_mdev_lock;
+};
+
+static int get_mdev_region_info(struct mdev_device *mdev,
+   struct pci_region_info *vfio_region_info,
+   int index)
+{
+   int ret = -EINVAL;
+   struct parent_device *parent = mdev->parent;
+
+   if (parent && dev_is_pci(parent->dev) && parent->ops->get_region_info) {
+   mutex_lock(>ops_lock);
+   ret = parent->ops->get_region_info(mdev, index,
+   vfio_region_info);
+   mutex_unlock(>ops_lock);
+   }
+   return ret;
+}
+
+static void mdev_read_base(struct vfio_mdev *vmdev)
+{
+   int index, pos;
+   u32 start_lo, start_hi;
+   u32 mem_type;
+
+   pos = PCI_BASE_ADDRESS_0;
+
+   for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+   if (!vmdev->vfio_region_info[index].size)
+   continue;
+
+   start_lo = (*(u32 *)(vmdev->vconfig + pos)) &
+   PCI_BASE_ADDRESS_MEM_MASK;
+   mem_type = (*(u32 *)(vmdev->vconfig + pos)) &
+   PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+   switch (mem_type) {
+   case PCI_BASE_ADDRESS_MEM_TYPE_64:
+   start_hi = (*(u32 *)(vmdev->vconfig + pos + 4));
+   pos += 4;
+   break;
+   case PCI_BASE_ADDRESS_MEM_TYPE_32:
+   case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+   /* 1M mem BAR treated as 32-bit BAR */
+   default:
+   /* mem unknown type treated as 32-bit BAR */
+   start_hi = 0;
+   break;
+   }
+   pos += 4;
+   vmdev->vfio_region_

Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver

2016-06-05 Thread Kirti Wankhede


On 6/3/2016 2:27 PM, Dong Jia wrote:
> On Wed, 25 May 2016 01:28:15 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> 
> 
> ...snip...
> 
>> +struct phy_device_ops {
>> +struct module   *owner;
>> +const struct attribute_group **dev_attr_groups;
>> +const struct attribute_group **mdev_attr_groups;
>> +
>> +int (*supported_config)(struct device *dev, char *config);
>> +int (*create)(struct device *dev, uuid_le uuid,
>> +  uint32_t instance, char *mdev_params);
>> +int (*destroy)(struct device *dev, uuid_le uuid,
>> +   uint32_t instance);
>> +int (*start)(uuid_le uuid);
>> +int (*shutdown)(uuid_le uuid);
>> +ssize_t (*read)(struct mdev_device *vdev, char *buf, size_t count,
>> +enum mdev_emul_space address_space, loff_t pos);
>> +ssize_t (*write)(struct mdev_device *vdev, char *buf, size_t count,
>> + enum mdev_emul_space address_space, loff_t pos);
>> +int (*set_irqs)(struct mdev_device *vdev, uint32_t flags,
>> +unsigned int index, unsigned int start,
>> +unsigned int count, void *data);
>> +int (*get_region_info)(struct mdev_device *vdev, int region_index,
>> + struct pci_region_info *region_info);
>> +int (*validate_map_request)(struct mdev_device *vdev,
>> +unsigned long virtaddr,
>> +unsigned long *pfn, unsigned long *size,
>> +pgprot_t *prot);
>> +};
> 
> Dear Kirti:
> 
> When I rebased my vfio-ccw patches on this series, I found I need an
> extra 'ioctl' callback in phy_device_ops.
> 

Thanks for taking a closer look. As far as I know, ccw is not a PCI
device, right? Correct me if I'm wrong; I'm curious to know. Are you
planning to write a driver (vfio-mccw) for the mediated ccw device?

Thanks,
Kirti

> The ccw physical device only supports one ccw mediated device. And I
> have two new ioctl commands for the ccw mediated device. One is 
> to hot-reset the resource in the physical device that allocated for
> the mediated device, the other is to do an I/O instruction translation
> and perform an I/O operation on the physical device. I found the
> existing callbacks could not meet my requirements.
> 
> Something like the following would be fine for my case:
>   int (*ioctl)(struct mdev_device *vdev,
>unsigned int cmd,
>unsigned long arg);
> 
> What do you think about this?
> 
> 
> Dong Jia
> 



Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver

2016-05-26 Thread Kirti Wankhede
Thanks Alex.

I'll consider all the nits and fix those in next version of patch.

More below:

On 5/26/2016 4:09 AM, Alex Williamson wrote:
> On Wed, 25 May 2016 01:28:15 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>

...

>> +
>> +config MDEV
>> +tristate "Mediated device driver framework"
>> +depends on VFIO
>> +default n
>> +help
>> +MDEV provides a framework to virtualize device without
SR-IOV cap
>> +See Documentation/mdev.txt for more details.
>
> I don't see that file anywhere in this series.

Yes, missed this file in this patch. I'll add it in next version of patch.
Since mdev module is moved in vfio directory, should I place this file
in vfio directory, Documentation/vfio/mdev.txt? or keep documentation of
mdev module within vfio.txt itself?


>> +if (phy_dev) {
>> +mutex_lock(&phy_devices.list_lock);
>> +
>> +/*
>> +* If vendor driver doesn't return success that means vendor
>> +* driver doesn't support hot-unplug
>> +*/
>> +if (phy_dev->ops->destroy) {
>> +if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
>> +  mdevice->instance)) {
>> +mutex_unlock(&phy_devices.list_lock);
>> +return;
>> +}
>> +}
>> +
>> +mdev_remove_attribute_group(&mdevice->dev,
>> +phy_dev->ops->mdev_attr_groups);
>> +mdevice->phy_dev = NULL;
>> +mutex_unlock(&phy_devices.list_lock);
>
> Locking here appears arbitrary, how does the above code interact with
> phy_devices.dev_list?
>

Sorry for not being clear about phy_devices.list_lock; I probably
shouldn't have named it 'list_lock'. This lock also synchronizes
register_device & unregister_device and the physical-device-specific
callbacks: supported_config, create, destroy, start and shutdown.
supported_config, create and destroy are per-phy_device callbacks, while
start and shutdown can indirectly involve multiple phy_devices when there
are multiple mdev devices of the same type on different physical devices.
There could be a race between the start callback and destroy &
unregister_device. I'm revisiting this lock again and will look into
using a per-phy_device lock for the phy_device-specific callbacks.


>> +struct mdev_device {
>> +struct kref kref;
>> +struct device   dev;
>> +struct phy_device   *phy_dev;
>> +struct iommu_group  *group;
>> +void *iommu_data;
>> +uuid_le uuid;
>> +uint32_t instance;
>> +void *driver_data;
>> +struct mutex ops_lock;
>> +struct list_head next;
>> +};
>
> Could this be in the private header?  Seems like this should be opaque
> outside of mdev core.
>

No, this structure is used in the mediated-device callbacks into the
vendor driver so that the vendor driver can identify the mdev device,
similar to the pci_dev structure in the PCI bus subsystem. (I'll remove
kref, which is not being used at all.)


>> + * @read:   Read emulation callback
>> + *  @mdev: mediated device structure
>> + *  @buf: read buffer
>> + *  @count: number bytes to read
>> + *  @address_space: specifies for which address
>> + *  space the request is: pci_config_space, IO
>> + *  register space or MMIO space.
>
> Seems like I asked before and it's no more clear in the code, how do we
> handle multiple spaces for various types?  ie. a device might have
> multiple MMIO spaces.
>
>> + *  @pos: offset from base address.

Sorry, I updated the code but missed updating the comment here.
pos = base_address + offset
(it's not 'pos' anymore, I will rename it to addr)

so the vendor driver is aware of the base addresses of the multiple MMIO
spaces and their sizes; it can identify the MMIO space based on addr.
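A rough sketch of what I mean on the vendor driver side (all names here are
placeholders, not from the patch):

#define SAMPLE_NUM_BARS	6

/* virtual BAR layout the vendor driver emulated for this mdev */
static u64    sample_bar_base[SAMPLE_NUM_BARS];
static size_t sample_bar_size[SAMPLE_NUM_BARS];

/* per-BAR emulation, provided elsewhere by the vendor driver (placeholder) */
ssize_t sample_bar_read(struct mdev_device *mdev, int bar, loff_t off,
			char *buf, size_t count);

static ssize_t sample_read(struct mdev_device *mdev, char *buf, size_t count,
			   enum mdev_emul_space space, loff_t addr)
{
	int i;

	/* addr = BAR base + offset, so compare against the emulated bases */
	for (i = 0; i < SAMPLE_NUM_BARS; i++) {
		if (addr >= sample_bar_base[i] &&
		    addr <  sample_bar_base[i] + sample_bar_size[i])
			return sample_bar_read(mdev, i,
					       addr - sample_bar_base[i],
					       buf, count);
	}

	return -EINVAL;
}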

>> +/*
>> + * Physical Device
>> + */
>> +struct phy_device {
>> +struct device   *dev;
>> +const struct phy_device_ops *ops;
>> +struct list_head next;
>> +};
>
> I would really like to be able to use the mediated device interface to
> create a purely virtual device, is the expectation that my physical
> device interface would create a virtual struct device which would
> become the parent and control point in sysfs for creating all the mdev
> devices? Should we be calling this a host_device or mdev_parent_dev in
> that case since there's really no requirement that it be a physical
> device?

Makes sense. I'll rename it to parent_device.

Thanks,
Kirti.




Re: [Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver

2016-05-25 Thread Kirti Wankhede

On 5/25/2016 1:25 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
>> Sent: Wednesday, May 25, 2016 3:58 AM
>>

...

>> +
>> +config MDEV
>> +tristate "Mediated device driver framework"
>
> Sorry not a native speaker. Is it cleaner to say "Driver framework for
Mediated
> Devices" or "Mediated Device Framework"? Should we focus on driver or
device
> here?
>

Both device and driver. This framework provides a way to register
physical *devices* and also to register a *driver* for mediated devices.


>> +depends on VFIO
>> +default n
>> +help
>> +MDEV provides a framework to virtualize device without
SR-IOV cap
>> +See Documentation/mdev.txt for more details.
>
> Looks Documentation/mdev.txt is not included in this version.
>

Yes, will have Documentation/mdev.txt in next version of patch.


>> +static struct devices_list {
>> +struct list_headdev_list;
>> +struct mutexlist_lock;
>> +} mdevices, phy_devices;
>
> phy_devices -> pdevices? and similarly we can use pdev/mdev
> pair in other places...
>

'pdevices' sometimes also reads as 'pointer to devices'; that's the
reason I prefer to use phy_devices to represent 'physical devices'.


>> +static struct mdev_device *find_mdev_device(uuid_le uuid, int instance)
>
> can we just call it "struct mdev* or "mdevice"? "dev_device" looks
redundant.
>

'struct mdev_device' represents 'the device structure for a device created
by the mdev module'. Still, if that doesn't satisfy most folks, I'm open to
changing it.


> Sorry I may have to ask same question since I didn't get an answer yet.
> what exactly does 'instance' mean here? since uuid is unique, why do
> we need match instance too?
>

'uuid' could be the UUID of the VM for which it is created. To support
multiple mediated devices for the same VM, the name should be unique. Hence
we need an instance number to identify each mediated device uniquely within
one VM.



>> +if (phy_dev->ops->destroy) {
>> +if (phy_dev->ops->destroy(phy_dev->dev, mdevice->uuid,
>> +  mdevice->instance)) {
>> +mutex_unlock(&phy_devices.list_lock);
>
> a warning message is preferred. Also better to return -EBUSY here.
>

mdev_destroy_device() is called from 2 paths: one is the sysfs mdev_destroy
and the other is mdev_unregister_device(). For the latter case, a return
from here would be ignored anyway. mdev_unregister_device() is called from
the remove function of the physical device, which doesn't care about an
error return; it just removes the device from the subsystem.

>> +return;
>> +}
>> +}
>> +
>> +mdev_remove_attribute_group(&mdevice->dev,
>> +phy_dev->ops->mdev_attr_groups);
>> +mdevice->phy_dev = NULL;
>
> Am I missing something here? You didn't remove this mdev node from
> the list, and below...
>

device_unregister() calls put_device(dev), and when the refcount drops to
zero its release function, mdev_device_release(), is called; it is hooked
up during device_register(). The node is removed from the list in
mdev_device_release().
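i.e. roughly (a sketch; the to_mdev_device() container_of helper is an
assumption, not from the patch):

static void mdev_device_release(struct device *dev)
{
	struct mdev_device *mdevice = to_mdev_device(dev);

	/* drop the node from the mdevices list before freeing it */
	mutex_lock(&mdevices.list_lock);
	list_del(&mdevice->next);
	mutex_unlock(&mdevices.list_lock);

	kfree(mdevice);
}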


>> +mutex_unlock(&phy_devices.list_lock);
>
> you should use mutex of mdevices list
>

No, this lock is for phy_dev.


>> +phy_dev->dev = dev;
>> +phy_dev->ops = ops;
>> +
>> +mutex_lock(&phy_devices.list_lock);
>> +ret = mdev_create_sysfs_files(dev);
>> +if (ret)
>> +goto add_sysfs_error;
>> +
>> +ret = mdev_add_attribute_group(dev, ops->dev_attr_groups);
>> +if (ret)
>> +goto add_group_error;
>
> any reason to include sysfs operations inside the mutex which is
> purely about phy_devices list?
>

The dev_attr_groups attribute is for the physical device, hence it is
inside phy_devices.list_lock.

* @dev_attr_groups:Default attributes of the physical device.


>> +void mdev_unregister_device(struct device *dev)
>> +{
>> +struct phy_device *phy_dev;
>> +struct mdev_device *vdev = NULL;
>> +
>> +phy_dev = find_physical_device(dev);
>> +
>> +if (!phy_dev)
>> +return;
>> +
>> +dev_info(dev, "MDEV: Unregistering\n");
>> +
>> +while ((vdev = find_next_mdev_device(phy_dev)))
>> +mdev_destroy_device(vdev);
>
> Need check return value here since ops->destroy may fail.
>

See my comment above.


>> +static void mdev_device_release(struct device *dev)
>
> 

Re: [Qemu-devel] [RFC PATCH v4 2/3] VFIO driver for mediated PCI device

2016-05-25 Thread Kirti Wankhede


On 5/25/2016 1:45 PM, Tian, Kevin wrote:
>> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
>> Sent: Wednesday, May 25, 2016 3:58 AM
>>
>> VFIO driver registers with MDEV core driver. MDEV core driver creates
>> mediated device and calls probe routine of MPCI VFIO driver. This MPCI
>> VFIO driver adds mediated device to VFIO core module.
>> Main aim of this module is to manage all VFIO APIs for each mediated PCI
>> device.
>> Those are:
>> - get region information from vendor driver.
>> - trap and emulate PCI config space and BAR region.
>> - Send interrupt configuration information to vendor driver.
>> - mmap mappable region with invalidate mapping and fault on access to
>>   remap pfn.
>>
>> Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
>> Signed-off-by: Neo Jia <c...@nvidia.com>
>> Change-Id: I48a34af88a9a905ec1f0f7528383c5db76c2e14d
>> ---
>>  drivers/vfio/mdev/Kconfig   |   7 +
>>  drivers/vfio/mdev/Makefile  |   1 +
>>  drivers/vfio/mdev/vfio_mpci.c   | 648
>> 
>>  drivers/vfio/pci/vfio_pci_private.h |   6 -
>>  drivers/vfio/pci/vfio_pci_rdwr.c|   1 +
>>  include/linux/vfio.h|   7 +
>>  6 files changed, 664 insertions(+), 6 deletions(-)
>>  create mode 100644 drivers/vfio/mdev/vfio_mpci.c
>>
>> diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
>> index 951e2bb06a3f..8d9e78aaa80f 100644
>> --- a/drivers/vfio/mdev/Kconfig
>> +++ b/drivers/vfio/mdev/Kconfig
>> @@ -9,3 +9,10 @@ config MDEV
>>
>>  If you don't know what do here, say N.
>>
>> +config VFIO_MPCI
>> +tristate "VFIO support for Mediated PCI devices"
>> +depends on VFIO && PCI && MDEV
>> +default n
>> +help
>> +VFIO based driver for mediated PCI devices.
>> +
>> diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
>> index 4adb069febce..8ab38c57df21 100644
>> --- a/drivers/vfio/mdev/Makefile
>> +++ b/drivers/vfio/mdev/Makefile
>> @@ -2,4 +2,5 @@
>>  mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
>>
>>  obj-$(CONFIG_MDEV) += mdev.o
>> +obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
>>
>> diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
>> new file mode 100644
>> index ..ef9d757ec511
>> --- /dev/null
>> +++ b/drivers/vfio/mdev/vfio_mpci.c
>> @@ -0,0 +1,648 @@
>> +/*
>> + * VFIO based Mediated PCI device driver
>> + *
>> + * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
>> + * Author: Neo Jia <c...@nvidia.com>
>> + * Kirti Wankhede <kwankh...@nvidia.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2 as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +#include "mdev_private.h"
>> +
>> +#define DRIVER_VERSION  "0.1"
>> +#define DRIVER_AUTHOR   "NVIDIA Corporation"
>> +#define DRIVER_DESC "VFIO based Mediated PCI device driver"
>> +
>> +struct vfio_mdevice {
>> +struct iommu_group *group;
>> +struct mdev_device *mdevice;
>> +int refcnt;
>> +struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
>> +u8  *vconfig;
>> +struct mutex vfio_mdev_lock;
>> +};
>> +
>> +static int get_virtual_bar_info(struct mdev_device *mdevice,
>> +struct pci_region_info *vfio_region_info,
>> +int index)
> 
> 'virtual' or 'physical'? My feeling is to get physical region resource 
> allocated
> for a mdev.
> 

It's mediated device's region information, changing it to
get_mdev_region_info.


>> +{
>> +int ret = -EINVAL;
>> +struct phy_device *phy_dev = mdevice->phy_dev;
>> +
>> +if (dev_is_pci(phy_dev->dev) && phy_dev->ops->get_region_info) {
>> +mutex_lock(>ops_lock);
>> +ret = phy_dev->ops->get_region_info(mdevice, index,
>> +vfio_region_info);
>> +mutex_un

[Qemu-devel] [RFC PATCH v4 3/3] VFIO Type1 IOMMU: Add support for mediated devices

2016-05-24 Thread Kirti Wankhede
VFIO Type1 IOMMU driver is designed for the devices which are IOMMU
capable. Mediated device only uses IOMMU TYPE1 API, the underlying
hardware can be managed by an IOMMU domain.

This change exports functions to pin and unpin pages for mediated devices.
It maintains data of pinned pages for mediated domain. This data is used to
verify unpinning request and to unpin remaining pages from detach_group()
if there are any.

Aim of this change is:
- To use most of the code of IOMMU driver for mediated devices
- To support direct assigned device and mediated device by single module

Updated the change to keep mediated domain structure out of domain_list.

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- two GPU pass through

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I9c262abc9c68fd6abf52d91a636bf0cc631593a0
---
 drivers/vfio/vfio_iommu_type1.c | 433 +---
 include/linux/vfio.h|   6 +
 2 files changed, 407 insertions(+), 32 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e93cedb..5cc7dc0288a3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.william...@redhat.com>"
@@ -55,6 +56,7 @@ MODULE_PARM_DESC(disable_hugepages,
 
 struct vfio_iommu {
struct list_head domain_list;
+   struct vfio_domain  *mediated_domain;
struct mutex lock;
struct rb_root  dma_list;
bool v2;
@@ -67,6 +69,13 @@ struct vfio_domain {
struct list_head group_list;
int prot;   /* IOMMU_CACHE */
bool fgsp;   /* Fine-grained super pages */
+
+   /* Domain for mediated device which is without physical IOMMU */
+   bool mediated_device;
+
+   struct mm_struct *mm;
+   struct rb_root  pfn_list;   /* pinned Host pfn list */
+   struct mutex pfn_list_lock;  /* mutex for pfn_list */
 };
 
 struct vfio_dma {
@@ -79,10 +88,23 @@ struct vfio_dma {
 
 struct vfio_group {
struct iommu_group  *iommu_group;
+   struct mdev_device  *mdevice;
struct list_head next;
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_pfn {
+   struct rb_node  node;
+   unsigned long   vaddr;  /* virtual addr */
+   dma_addr_t  iova;   /* IOVA */
+   unsigned long   npage;  /* number of pages */
+   unsigned long   pfn;/* Host pfn */
+   size_t  prot;
+   atomic_tref_count;
+};
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +152,64 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, 
struct vfio_dma *old)
rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_pfn *vfio_find_pfn(struct vfio_domain *domain,
+ unsigned long pfn)
+{
+   struct rb_node *node;
+   struct vfio_pfn *vpfn, *ret = NULL;
+
+   mutex_lock(&domain->pfn_list_lock);
+   node = domain->pfn_list.rb_node;
+
+   while (node) {
+   vpfn = rb_entry(node, struct vfio_pfn, node);
+
+   if (pfn < vpfn->pfn)
+   node = node->rb_left;
+   else if (pfn > vpfn->pfn)
+   node = node->rb_right;
+   else {
+   ret = vpfn;
+   break;
+   }
+   }
+
+   mutex_unlock(&domain->pfn_list_lock);
+   return ret;
+}
+
+static void vfio_link_pfn(struct vfio_domain *domain, struct vfio_pfn *new)
+{
+   struct rb_node **link, *parent = NULL;
+   struct vfio_pfn *vpfn;
+
+   mutex_lock(&domain->pfn_list_lock);
+   link = &domain->pfn_list.rb_node;
+   while (*link) {
+   parent = *link;
+   vpfn = rb_entry(parent, struct vfio_pfn, node);
+
+   if (new->pfn < vpfn->pfn)
+   link = &(*link)->rb_left;
+   else
+   link = &(*link)->rb_right;
+   }
+
+   rb_link_node(&new->node, parent, link);
+   rb_insert_color(&new->node, &domain->pfn_list);
+   mutex_unlock(&domain->pfn_list_lock);
+}
+
+/* call by holding domain->pfn_list_lock */
+static void vfio_unlink_pfn(struct vfio_domain *domain, struct vfio_pfn *old)
+{
+   rb_erase(&old->node, &domain->pfn_list);
+}
+

[Qemu-devel] [RFC PATCH v4 0/3] Add Mediated device support[was: Add vGPU support]

2016-05-24 Thread Kirti Wankhede
This series adds Mediated device support to v4.6 Linux host kernel. Purpose
of this series is to provide a common interface for mediated device
management that can be used by different devices. This series introduces
Mdev core module that create and manage mediated devices, VFIO based driver
for mediated PCI devices that are created by Mdev core module and update
VFIO type1 IOMMU module to support mediated devices.

What's new in v4?
- Renamed 'vgpu' module to 'mdev' module that represent generic term
  'Mediated device'.
- Moved mdev directory to drivers/vfio directory as this is the extension
  of VFIO APIs for mediated devices.
- Updated mdev driver to be flexible to register multiple types of drivers
  to mdev_bus_type bus.
- Updated mdev core driver with mdev_put_device() and mdev_get_device() for
  mediated devices.


What's left to do?
VFIO driver for vGPU device doesn't support devices with MSI-X enabled.

Please review.

Kirti Wankhede (3):
  Mediated device Core driver
  VFIO driver for mediated PCI device
  VFIO Type1 IOMMU: Add support for mediated devices

 drivers/vfio/Kconfig|   1 +
 drivers/vfio/Makefile   |   1 +
 drivers/vfio/mdev/Kconfig   |  18 +
 drivers/vfio/mdev/Makefile  |   6 +
 drivers/vfio/mdev/mdev-core.c   | 462 +
 drivers/vfio/mdev/mdev-driver.c | 139 
 drivers/vfio/mdev/mdev-sysfs.c  | 312 +
 drivers/vfio/mdev/mdev_private.h|  33 ++
 drivers/vfio/mdev/vfio_mpci.c   | 648 
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c|   1 +
 drivers/vfio/vfio_iommu_type1.c | 433 ++--
 include/linux/mdev.h| 224 +
 include/linux/vfio.h|  13 +
 14 files changed, 2259 insertions(+), 38 deletions(-)
 create mode 100644 drivers/vfio/mdev/Kconfig
 create mode 100644 drivers/vfio/mdev/Makefile
 create mode 100644 drivers/vfio/mdev/mdev-core.c
 create mode 100644 drivers/vfio/mdev/mdev-driver.c
 create mode 100644 drivers/vfio/mdev/mdev-sysfs.c
 create mode 100644 drivers/vfio/mdev/mdev_private.h
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c
 create mode 100644 include/linux/mdev.h

-- 
2.7.0




[Qemu-devel] [RFC PATCH v4 2/3] VFIO driver for mediated PCI device

2016-05-24 Thread Kirti Wankhede
VFIO driver registers with MDEV core driver. MDEV core driver creates
mediated device and calls probe routine of MPCI VFIO driver. This MPCI
VFIO driver adds mediated device to VFIO core module.
Main aim of this module is to manage all VFIO APIs for each mediated PCI
device.
Those are:
- get region information from vendor driver.
- trap and emulate PCI config space and BAR region.
- Send interrupt configuration information to vendor driver.
- mmap mappable region with invalidate mapping and fault on access to
  remap pfn.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I48a34af88a9a905ec1f0f7528383c5db76c2e14d
---
 drivers/vfio/mdev/Kconfig   |   7 +
 drivers/vfio/mdev/Makefile  |   1 +
 drivers/vfio/mdev/vfio_mpci.c   | 648 
 drivers/vfio/pci/vfio_pci_private.h |   6 -
 drivers/vfio/pci/vfio_pci_rdwr.c|   1 +
 include/linux/vfio.h|   7 +
 6 files changed, 664 insertions(+), 6 deletions(-)
 create mode 100644 drivers/vfio/mdev/vfio_mpci.c

diff --git a/drivers/vfio/mdev/Kconfig b/drivers/vfio/mdev/Kconfig
index 951e2bb06a3f..8d9e78aaa80f 100644
--- a/drivers/vfio/mdev/Kconfig
+++ b/drivers/vfio/mdev/Kconfig
@@ -9,3 +9,10 @@ config MDEV
 
 If you don't know what do here, say N.
 
+config VFIO_MPCI
+tristate "VFIO support for Mediated PCI devices"
+depends on VFIO && PCI && MDEV
+default n
+help
+VFIO based driver for mediated PCI devices.
+
diff --git a/drivers/vfio/mdev/Makefile b/drivers/vfio/mdev/Makefile
index 4adb069febce..8ab38c57df21 100644
--- a/drivers/vfio/mdev/Makefile
+++ b/drivers/vfio/mdev/Makefile
@@ -2,4 +2,5 @@
 mdev-y := mdev-core.o mdev-sysfs.o mdev-driver.o
 
 obj-$(CONFIG_MDEV) += mdev.o
+obj-$(CONFIG_VFIO_MPCI) += vfio_mpci.o
 
diff --git a/drivers/vfio/mdev/vfio_mpci.c b/drivers/vfio/mdev/vfio_mpci.c
new file mode 100644
index ..ef9d757ec511
--- /dev/null
+++ b/drivers/vfio/mdev/vfio_mpci.c
@@ -0,0 +1,648 @@
+/*
+ * VFIO based Mediated PCI device driver
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia <c...@nvidia.com>
+ *Kirti Wankhede <kwankh...@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "mdev_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC "VFIO based Mediated PCI device driver"
+
+struct vfio_mdevice {
+   struct iommu_group *group;
+   struct mdev_device *mdevice;
+   int refcnt;
+   struct pci_region_info vfio_region_info[VFIO_PCI_NUM_REGIONS];
+   u8  *vconfig;
+   struct mutex vfio_mdev_lock;
+};
+
+static int get_virtual_bar_info(struct mdev_device *mdevice,
+   struct pci_region_info *vfio_region_info,
+   int index)
+{
+   int ret = -EINVAL;
+   struct phy_device *phy_dev = mdevice->phy_dev;
+
+   if (dev_is_pci(phy_dev->dev) && phy_dev->ops->get_region_info) {
+   mutex_lock(>ops_lock);
+   ret = phy_dev->ops->get_region_info(mdevice, index,
+   vfio_region_info);
+   mutex_unlock(>ops_lock);
+   }
+   return ret;
+}
+
+static int mdev_read_base(struct vfio_mdevice *vdev)
+{
+   int index, pos;
+   u32 start_lo, start_hi;
+   u32 mem_type;
+
+   pos = PCI_BASE_ADDRESS_0;
+
+   for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+   if (!vdev->vfio_region_info[index].size)
+   continue;
+
+   start_lo = (*(u32 *)(vdev->vconfig + pos)) &
+   PCI_BASE_ADDRESS_MEM_MASK;
+   mem_type = (*(u32 *)(vdev->vconfig + pos)) &
+   PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+   switch (mem_type) {
+   case PCI_BASE_ADDRESS_MEM_TYPE_64:
+   start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
+   pos += 4;
+   break;
+   case PCI_BASE_ADDRESS_MEM_TYPE_32:
+   case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+   /* 1M mem BAR treated as 32-bit BAR */
+   default:
+   /* mem unknown type treated as 32-bit BAR */
+   start_hi = 0;
+   break;
+   }
+   pos += 4;
+   vdev->vfio_region_info[i

[Qemu-devel] [RFC PATCH v4 1/3] Mediated device Core driver

2016-05-24 Thread Kirti Wankhede
Design for Mediated Device Driver:
Main purpose of this driver is to provide a common interface for mediated
device management that can be used by different drivers of different
devices.

This module provides a generic interface to create the device, add it to
mediated bus, add device to IOMMU group and then add it to vfio group.

Below is the high Level block diagram, with Nvidia, Intel and IBM devices
as example, since these are the devices which are going to actively use
this module as of now.

 +---+
 |   |
 | +---+ |  mdev_register_driver() +--+
 | |   | +<+ __init() |
 | |   | | |  |
 | |  mdev | +>+  |<-> VFIO user
 | |  bus  | | probe()/remove()| vfio_mpci.ko |APIs
 | |  driver   | | |  |
 | |   | | +--+
 | |   | |  mdev_register_driver() +--+
 | |   | +<+ __init() |
 | |   | | |  |
 | |   | +>+  |<-> VFIO user
 | +---+ | probe()/remove()| vfio_mccw.ko |APIs
 |   | |  |
 |  MDEV CORE| +--+
 |   MODULE  |
 |   mdev.ko |
 | +---+ |  mdev_register_device() +--+
 | |   | +<+  |
 | |   | | |  nvidia.ko   |<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | | Physical  | |
 | |  device   | |  mdev_register_device() +--+
 | | interface | |<+  |
 | |   | | |  i915.ko |<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | |   | |
 | |   | |  mdev_register_device() +--+
 | |   | +<+  |
 | |   | | | ccw_device.ko|<-> physical
 | |   | +>+  |device
 | |   | |callback +--+
 | +---+ |
 +---+

Core driver provides two types of registration interfaces:
1. Registration interface for mediated bus driver:

/**
  * struct mdev_driver - Mediated device's driver
  * @name: driver name
  * @probe: called when new device created
  * @remove:called when device removed
  * @match: called when new device or driver is added for this bus.
Return 1 if given device can be handled by given driver and
zero otherwise.
  * @driver:device driver structure
  *
  **/
struct mdev_driver {
 const char *name;
 int  (*probe)  (struct device *dev);
 void (*remove) (struct device *dev);
 int  (*match)(struct device *dev);
 struct device_driver driver;
};

int  mdev_register_driver(struct mdev_driver *drv, struct module *owner);
void mdev_unregister_driver(struct mdev_driver *drv);

Mediated device's driver for mdev should use this interface to register
with Core driver. With this, mediated devices driver for such devices is
responsible to add mediated device to VFIO group.

2. Physical device driver interface
This interface provides vendor driver the set APIs to manage physical
device related work in their own driver. APIs are :
- supported_config: provide supported configuration list by the vendor
driver
- create: to allocate basic resources in vendor driver for a mediated
  device.
- destroy: to free resources in vendor driver when mediated device is
   destroyed.
- start: to initiate mediated device initialization process from vendor
 driver when VM boots and before QEMU starts.
- shutdown: to teardown mediated device resources during VM teardown.
- read : read emulation callback.
- write: write emulation callback.
- set_irqs: send interrupt configuration information that QEMU sets.
- get_region_info: to provide region size and its flags for the mediated
   device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by vendor drivers to register
each physical device to mdev core driver.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I88f4482f7608f40550a152c5f882b64271287c62
---
 drivers/vfio/Kconfig |   1 +
 drivers/vfio/Makefile|   1 +
 drivers/vfio/mdev/Kconfig|  11 +
 drivers/vfio/mdev/Makefile   |   5 +
 drivers/vfio/mdev/md

Re: [Qemu-devel] [RFC PATCH v3 1/3] vGPU Core driver

2016-05-06 Thread Kirti Wankhede


On 5/6/2016 5:44 PM, Jike Song wrote:
> On 05/05/2016 05:06 PM, Tian, Kevin wrote:
>>> From: Kirti Wankhede
>>>
>>>  >> + * @validate_map_request:  Validate remap pfn request
>>>  >> + * @vdev: vgpu device structure
>>>  >> + * @virtaddr: target user address to start 
>>> at
>>>  >> + * @pfn: physical address of kernel 
>>> memory, GPU
>>>  >> + * driver can change if required.
>>>  >> + * @size: size of map area, GPU driver can 
>>> change
>>>  >> + * the size of map area if desired.
>>>  >> + * @prot: page protection flags for this 
>>> mapping,
>>>  >> + * GPU driver can change, if required.
>>>  >> + * Returns integer: success (0) or error 
>>> (< 0)
>>>  >
>>>  > Was not at all clear to me what this did until I got to patch 2, this
>>>  > is actually providing the fault handling for mmap'ing a vGPU mmio BAR.
>>>  > Needs a better name or better description.
>>>  >
>>>
>>> If say VMM mmap whole BAR1 of GPU, say 128MB, so fault would occur when
>>> BAR1 is tried to access then the size is calculated as:
>>> req_size = vma->vm_end - virtaddr
> Hi Kirti,
> 
> virtaddr is the faulted one, vma->vm_end the vaddr of the mmap-ed 128MB BAR1?
> 
> Would you elaborate why (vm_end - fault_addr) results the requested size? 
> 
> 

If the first access is at the start of the mmapped range, fault_addr is
vma->vm_start. Then (vm_end - vm_start) is the size of the mmapped region.

req_size should not exceed (vm_end - vm_start).


>>> Since GPU is being shared by multiple vGPUs, GPU driver might not remap
>>> whole BAR1 for only one vGPU device, so would prefer, say map one page
>>> at a time. GPU driver returns PAGE_SIZE. This is used by
>>> remap_pfn_range(). Now on next access to BAR1 other than that page, we
>>> will again get a fault().
>>> As the name says this call is to validate from GPU driver for the size
>>> and prot of map area. GPU driver can change size and prot for this map area.
> 
> If I understand correctly, you are trying to share a physical BAR among
> multiple vGPUs, by mapping a single pfn each time, when fault happens?
> 

Yes.

>>
>> Currently we don't require such interface for Intel vGPU. Need to think about
>> its rationale carefully (still not clear to me). Jike, do you have any 
>> thought on
>> this?
> 
> We need the mmap method of vgpu_device to be implemented, but I was
> expecting something else, like calling remap_pfn_range() directly from
> the mmap.
>

Calling remap_pfn_range directly from mmap means you would like to remap
the pfns for the whole BAR1 during mmap, right?

In that case, don't set validate_map_request() and access the start of the
mmapped range, so that on the first access it will do remap_pfn_range() for
(vm_end - vm_start).
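To make that concrete, the fault path reduces to roughly the following
(a sketch based on the vgpu_dev_mmio_fault() flow in this series; the
pfn/offset setup and the structure lookups are simplified assumptions):

static int sample_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct vfio_vgpu_device *vdev = vma->vm_private_data;
	struct vgpu_device *vgpu_dev = vdev->vgpu_dev;
	struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
	u64 virtaddr = (u64)vmf->virtual_address;
	/* assumed: base pfn of the physical BAR was stashed in vm_pgoff */
	unsigned long pfn = vma->vm_pgoff +
			    ((virtaddr - vma->vm_start) >> PAGE_SHIFT);
	unsigned long req_size = vma->vm_end - virtaddr;
	pgprot_t pg_prot = vma->vm_page_prot;

	/* the GPU driver may shrink the mapping (e.g. to one page) and
	 * adjust pfn/prot for the share of the BAR this vGPU may touch */
	if (gpu_dev->ops->validate_map_request) {
		if (gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr,
						       &pfn, &req_size,
						       &pg_prot))
			return VM_FAULT_SIGBUS;
	}

	if (remap_pfn_range(vma, virtaddr, pfn, req_size, pg_prot))
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}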

Thanks,
Kirti


>>
>> Thanks
>> Kevin
>>
> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v3 1/3] vGPU Core driver

2016-05-05 Thread Kirti Wankhede



On 5/5/2016 5:37 PM, Tian, Kevin wrote:

From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
Sent: Thursday, May 05, 2016 6:45 PM


On 5/5/2016 2:36 PM, Tian, Kevin wrote:

From: Kirti Wankhede
Sent: Wednesday, May 04, 2016 9:32 PM

Thanks Alex.

 >> +config VGPU_VFIO
 >> +tristate
 >> +depends on VGPU
 >> +default n
 >> +
 >
 > This is a little bit convoluted, it seems like everything added in this
 > patch is vfio agnostic, it doesn't necessarily care what the consumer
 > is.  That makes me think we should only be adding CONFIG_VGPU here and
 > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
 > The middle config entry is also redundant to the first, just move the
 > default line up to the first and remove the rest.

CONFIG_VGPU doesn't directly depend on VFIO. CONFIG_VGPU_VFIO is
directly dependent on VFIO. But devices created by VGPU core module need
a driver to manage those devices. CONFIG_VGPU_VFIO is the driver which
will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled
by CONFIG_VGPU.

This would look like:
menuconfig VGPU
 tristate "VGPU driver framework"
 select VGPU_VFIO
 default n
 help
 VGPU provides a framework to virtualize GPU without SR-IOV cap
 See Documentation/vgpu.txt for more details.

 If you don't know what do here, say N.

config VGPU_VFIO
 tristate
 depends on VGPU
 depends on VFIO
 default n



There could be multiple drivers operating VGPU. Why do we restrict
it to VFIO here?



VGPU_VFIO uses VFIO APIs, so it depends on VFIO.
Since there is no driver other than VGPU_VFIO for VGPU devices, I think
we should keep the default selection of VGPU_VFIO on VGPU. Maybe in the
future, if another driver is added to operate vGPU devices, the default
selection can be removed.


What's your plan to support Xen here?



No plans to support Xen.

Thanks,
Kirti




Re: [Qemu-devel] [RFC PATCH v3 1/3] vGPU Core driver

2016-05-05 Thread Kirti Wankhede



On 5/5/2016 2:36 PM, Tian, Kevin wrote:

From: Kirti Wankhede
Sent: Wednesday, May 04, 2016 9:32 PM

Thanks Alex.

 >> +config VGPU_VFIO
 >> +tristate
 >> +depends on VGPU
 >> +default n
 >> +
 >
 > This is a little bit convoluted, it seems like everything added in this
 > patch is vfio agnostic, it doesn't necessarily care what the consumer
 > is.  That makes me think we should only be adding CONFIG_VGPU here and
 > it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
 > The middle config entry is also redundant to the first, just move the
 > default line up to the first and remove the rest.

CONFIG_VGPU doesn't directly depend on VFIO. CONFIG_VGPU_VFIO is
directly dependent on VFIO. But devices created by VGPU core module need
a driver to manage those devices. CONFIG_VGPU_VFIO is the driver which
will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled
by CONFIG_VGPU.

This would look like:
menuconfig VGPU
 tristate "VGPU driver framework"
 select VGPU_VFIO
 default n
 help
 VGPU provides a framework to virtualize GPU without SR-IOV cap
 See Documentation/vgpu.txt for more details.

 If you don't know what do here, say N.

config VGPU_VFIO
 tristate
 depends on VGPU
 depends on VFIO
 default n



There could be multiple drivers operating VGPU. Why do we restrict
it to VFIO here?



VGPU_VFIO uses VFIO APIs, so it depends on VFIO.
Since there is no driver other than VGPU_VFIO for VGPU devices, I think
we should keep the default selection of VGPU_VFIO on VGPU. Maybe in the
future, if another driver is added to operate vGPU devices, the default
selection can be removed.



 >> +create_attr_error:
 >> + if (gpu_dev->ops->vgpu_destroy) {
 >> + int ret = 0;
 >> + ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
 >> +  vgpu_dev->uuid,
 >> +  vgpu_dev->vgpu_instance);
 >
 > Unnecessary initialization and we don't do anything with the result.
 > Below indicates lack of vgpu_destroy indicates the vendor doesn't
 > support unplug, but doesn't that break our error cleanup path here?
 >

Comment about vgpu_destroy:
If VM is running and vgpu_destroy is called that
means the vGPU is being hotunpluged. Return
error if VM is running and graphics driver
doesn't support vgpu hotplug.

It's the GPU driver's responsibility to check whether the VM is running and
return accordingly. This is the vGPU creation path; the vGPU device would
be hotplugged to the VM on vgpu_start.


How does GPU driver know whether VM is running? VM is managed
by KVM here.

Maybe it's clearer to say whether vGPU is busy which means some work
has been loaded to vGPU. That's something GPU driver can tell.



The GPU driver can detect that based on the resources allocated for the VM
from vgpu_create/vgpu_start.




 >> + * @vgpu_bar_info:   Called to get BAR size and flags of vGPU 
device.
 >> + *   @vdev: vgpu device structure
 >> + *   @bar_index: BAR index
 >> + *   @bar_info: output, returns size and flags of
 >> + *   requested BAR
 >> + *   Returns integer: success (0) or error (< 0)
 >
 > This is called bar_info, but the bar_index is actually the vfio region
 > index and things like the config region info is being overloaded
 > through it.  We already have a structure defined for getting a generic
 > region index, why not use it?  Maybe this should just be
 > vgpu_vfio_get_region_info.
 >

Ok. Will do.


As you commented earlier that GPU driver is required to provide config
space (which I agree), then what's the point of introducing another
bar specific structure? VFIO can use @write to get bar information
from vgpu config space, just like how it's done on physical device today.



It is required not only for the size, but also to fetch the flags. Region
flags could be a combination of:


#define VFIO_REGION_INFO_FLAG_READ  (1 << 0) /* Region supports read */
#define VFIO_REGION_INFO_FLAG_WRITE (1 << 1) /* Region supports write */
#define VFIO_REGION_INFO_FLAG_MMAP  (1 << 2) /* Region supports mmap */
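For example, a vendor get_region_info() for a mmap-able BAR1 might fill
these in like this (a sketch; the pci_region_info field names and the sizes
used are assumptions, not from the patch):

static int sample_get_region_info(struct vgpu_device *vdev, int index,
				  struct pci_region_info *info)
{
	switch (index) {
	case VFIO_PCI_BAR1_REGION_INDEX:
		info->size  = 128 * 1024 * 1024;	/* e.g. a 128MB BAR1 */
		info->flags = VFIO_REGION_INFO_FLAG_READ |
			      VFIO_REGION_INFO_FLAG_WRITE |
			      VFIO_REGION_INFO_FLAG_MMAP;
		return 0;
	case VFIO_PCI_CONFIG_REGION_INDEX:
		info->size  = 256;		/* standard PCI config space */
		info->flags = VFIO_REGION_INFO_FLAG_READ |
			      VFIO_REGION_INFO_FLAG_WRITE;
		return 0;
	default:
		info->size  = 0;
		info->flags = 0;
		return 0;
	}
}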

Thanks,
Kirti.




 >> + * @validate_map_request:Validate remap pfn request
 >> + *   @vdev: vgpu device structure
 >> + *   @virtaddr: target user address to start at
 >> + *   @pfn: physical address of kernel memory, GPU
 >> + *   driver can change if required.
 >> + *   @size: size of map area, GPU driver can change
 >> + *   the size of map area if desired.
 &

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-05 Thread Kirti Wankhede


On 5/4/2016 4:13 AM, Alex Williamson wrote:
> On Tue, 3 May 2016 00:10:41 +0530
> Kirti Wankhede <kwankh...@nvidia.com> wrote:
>
[..]


>> +  if (domain->vfio_iommu_api_only)
>> +  mm = domain->vmm_mm;
>> +  else
>> +  mm = current->mm;
>> +
>> +  if (!mm)
>> +  return -ENODEV;
>> +
>> +  ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
>
> We could pass domain->mm unconditionally to vaddr_get_pfn(), let it be
> NULL in the !api_only case and use it as a cue to vaddr_get_pfn() which
> gup variant to use.  Of course we need to deal with mmap_sem somewhere
> too without turning the code into swiss cheese.
>

Yes, I missed that. Thanks for pointing out. I'll fix it.

> Correct me if I'm wrong, but I assume the main benefit of interweaving
> this into type1 vs pulling out common code and making a new vfio iommu
> backend is the page accounting, ie. not over accounting locked pages.
> TBH, I don't know if it's worth it.  Any idea what the high water mark
> of pinned pages for a vgpu might be?
>

It depends on which guest OS (Linux/Windows) is running and what workload
is running. On Windows, DMA pages are managed by the WDDM model. On Linux,
each user-space application can DMA to pages and there are no restrictions.



> The only reason I can come up with for why we'd want to integrate an
> api-only domain into the existing type1 code would be to avoid page
> accounting issues where we count locked pages once for a normal
> assigned device and again for a vgpu, but that's not what we're doing
> here.  We're not only locking the pages again regardless of them
> already being locked, we're counting every time we lock them through
> this new interface.  So there's really no point at all to making type1
> become this unsupportable.  In that case we should be pulling out the
> common code that we want to share from type1 and making a new type1
> compatible vfio iommu backend rather than conditionalizing everything
> here.
>

I tried to add pfn tracking logic and reuse already-locked pages, but that
didn't work somehow; I'll revisit it again.
With this there will be additional pfn tracking logic for the case where a
device is directly assigned and no vGPU device is present.




>> +  // verify if pfn exist in pfn_list
>> +  if (!(p = vfio_find_vgpu_pfn(domain_vgpu, *(pfn + i)))) {
>> +  continue;
>
> How does the caller deal with this, the function returns number of
> pages unpinned which will not match the requested number of pages to
> unpin if there are any missing.  Also, no setting variables within a
> test when easily avoidable please, separate to a set then test.
>

Here we are following the existing code logic. Do you have any suggestion
on how to deal with that?



>> +  get_first_domains(iommu, &domain, &domain_vgpu);
>> +
>> +  if (!domain)
>> +  return;
>> +
>> +  d = domain;
>>list_for_each_entry_continue(d, &iommu->domain_list, next) {
>> -  iommu_unmap(d->domain, dma->iova, dma->size);
>> -  cond_resched();
>> +  if (!d->vfio_iommu_api_only) {
>> +  iommu_unmap(d->domain, dma->iova, dma->size);
>> +  cond_resched();
>> +  }
>>}
>>
>>while (iova < end) {
>
> How do api-only domain not blowup on the iommu API code in this next
> code block?  Are you just getting lucky that the api-only domain is
> first in the list and the real domain is last?
>

Control will not come here if there is no domain with an IOMMU, due to the
change below:


>> +  if (!domain)
>> +  return;

get_first_domains() returns the first domain with IOMMU and first domain 
with api_only.



>> +  if (d->vfio_iommu_api_only)
>> +  continue;
>> +
>
> Really disliking all these switches everywhere, too many different code
> paths.
>

I'll move such APIs into inline functions so that this check is done within
the inline functions and the code looks much cleaner.



>> +  // Skip pin and map only if domain without IOMMU is present
>> +  if (!domain_with_iommu_present) {
>> +  dma->size = size;
>> +  goto map_done;
>> +  }
>> +
>
> Yet more special cases, the code is getting unsupportable.

In vfio_dma_do_map(), if there are no devices passed through then we don't
want to pin all pages upfront, and that is the reason for this check.


>
> I'm really not convinced that pushing this into the type1 code is the
> right approach vs pulling out shareable code chunks where it makes
> sense and creating a separate iommu backend.  We're not getting
> anything but code complexity out of this approach it seems.

I find that pulling out the shared code is also not simple. I would like to
revisit this again and sort out the concerns you raised rather than making
a separate module.


Thanks,
Kirti.





Re: [Qemu-devel] [RFC PATCH v3 2/3] VFIO driver for vGPU device

2016-05-04 Thread Kirti Wankhede



On 5/5/2016 2:44 AM, Neo Jia wrote:

On Wed, May 04, 2016 at 11:06:19AM -0600, Alex Williamson wrote:

On Wed, 4 May 2016 03:23:13 +
"Tian, Kevin"  wrote:


From: Alex Williamson [mailto:alex.william...@redhat.com]
Sent: Wednesday, May 04, 2016 6:43 AM

+
+   if (gpu_dev->ops->write) {
+   ret = gpu_dev->ops->write(vgpu_dev,
+ user_data,
+ count,
+ vgpu_emul_space_config,
+ pos);
+   }
+
+   memcpy((void *)(vdev->vconfig + pos), (void *)user_data, count);


So write is expected to user_data to allow only the writable bits to be
changed?  What's really being saved in the vconfig here vs the vendor
vgpu driver?  It seems like we're only using it to cache the BAR
values, but we're not providing the BAR emulation here, which seems
like one of the few things we could provide so it's not duplicated in
every vendor driver.  But then we only need a few u32s to do that, not
all of config space.


We can borrow same vconfig emulation from existing vfio-pci driver.
But doing so doesn't mean that vendor vgpu driver cannot have its
own vconfig emulation further. vGPU is not like a real device, since
there may be no physical config space implemented for each vGPU.
So anyway vendor vGPU driver needs to create/emulate the virtualized
config space while the way how is created might be vendor specific.
So better to keep the interface to access raw vconfig space from
vendor vGPU driver.


I'm hoping config space will be very simple for a vgpu, so I don't know
that it makes sense to add that complexity early on.  Neo/Kirti, what
capabilities do you expect to provide?  Who provides the MSI
capability?  Is a PCIe capability provided?  Others?




From the VGPU_VFIO point of view, VGPU_VFIO would not provide or modify any 
capabilities. The vendor vGPU driver should provide the config space. The 
vendor driver can then provide PCI or PCIe capabilities, and it might also 
expose vendor-specific information. The VGPU_VFIO driver would not 
intercept that information.



Currently only standard PCI caps.

MSI cap is emulated by the vendor drivers via the above interface.

No PCIe caps so far.



The NVIDIA vGPU device is a standard PCI device. We tested standard PCI caps.

Thanks,
Kirti.




+static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
+   size_t count, loff_t *ppos, bool iswrite)
+{
+   unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+   struct vfio_vgpu_device *vdev = device_data;
+
+   if (index >= VFIO_PCI_NUM_REGIONS)
+   return -EINVAL;
+
+   switch (index) {
+   case VFIO_PCI_CONFIG_REGION_INDEX:
+   return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);
+
+   case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+   return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
+
+   case VFIO_PCI_ROM_REGION_INDEX:
+   case VFIO_PCI_VGA_REGION_INDEX:


Wait a sec, who's doing the VGA emulation?  We can't be claiming to
support a VGA region and then fail to provide read/write access to it
like we said it has.


For Intel side we plan to not support VGA region when upstreaming our
KVMGT work, which means Intel vGPU will be exposed only as a
secondary graphics card then so legacy VGA is not required. Also no
VBIOS/ROM requirement. Guess we can remove above two regions.


So this needs to be optional based on what the mediation driver
provides.  It seems like we're just making passthroughs for the vendor
mediation driver to speak vfio.


+
+static int vgpu_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault 
*vmf)
+{
+   int ret = 0;
+   struct vfio_vgpu_device *vdev = vma->vm_private_data;
+   struct vgpu_device *vgpu_dev;
+   struct gpu_device *gpu_dev;
+   u64 virtaddr = (u64)vmf->virtual_address;
+   u64 offset, phyaddr;
+   unsigned long req_size, pgoff;
+   pgprot_t pg_prot;
+
+   if (!vdev && !vdev->vgpu_dev)
+   return -EINVAL;
+
+   vgpu_dev = vdev->vgpu_dev;
+   gpu_dev  = vgpu_dev->gpu_dev;
+
+   offset   = vma->vm_pgoff << PAGE_SHIFT;
+   phyaddr  = virtaddr - vma->vm_start + offset;
+   pgoff= phyaddr >> PAGE_SHIFT;
+   req_size = vma->vm_end - virtaddr;
+   pg_prot  = vma->vm_page_prot;
+
+   if (gpu_dev->ops->validate_map_request) {
+   ret = gpu_dev->ops->validate_map_request(vgpu_dev, virtaddr, 
&pgoff,
+ &req_size, &pg_prot);
+   if (ret)
+   return ret;
+
+   if (!req_size)
+   return -EINVAL;
+   }
+
+   ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);


So not supporting validate_map_request() means that the user 

Re: [Qemu-devel] [RFC PATCH v3 2/3] VFIO driver for vGPU device

2016-05-04 Thread Kirti Wankhede


On 5/4/2016 4:13 AM, Alex Williamson wrote:
> On Tue, 3 May 2016 00:10:40 +0530

>>  obj-$(CONFIG_VGPU)+= vgpu.o
>> +obj-$(CONFIG_VGPU_VFIO) += vgpu_vfio.o
>
> This is where we should add a new Kconfig entry for VGPU_VFIO, nothing
> in patch 1 has any vfio dependency.  Perhaps it should also depend on
> VFIO_PCI rather than VFIO since you are getting very PCI specific below.

VGPU_VFIO depends on VFIO but is independent of VFIO_PCI. VGPU_VFIO uses 
VFIO APIs defined for PCI devices and shares common #defines, but that 
doesn't mean it depends on VFIO_PCI.

I'll move Kconfig entry for VGPU_VFIO here in next version of patch.

>> +#define VFIO_PCI_OFFSET_SHIFT   40
>> +
>> +#define VFIO_PCI_OFFSET_TO_INDEX(off) (off >> VFIO_PCI_OFFSET_SHIFT)
>> +#define VFIO_PCI_INDEX_TO_OFFSET(index)	((u64)(index) << 
VFIO_PCI_OFFSET_SHIFT)

>> +#define VFIO_PCI_OFFSET_MASK  (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
>
> Change the name of these from vfio-pci please or shift code around to
> use them directly.  You're certainly free to redefine these, but using
> the same name is confusing.
>

I'll move these defines to a common location.


>> +  if (gpu_dev->ops->vgpu_bar_info)
>> +  ret = gpu_dev->ops->vgpu_bar_info(vgpu_dev, index, bar_info);
>
> vgpu_bar_info is already optional, further validating that the vgpu
> core is not PCI specific.

It is not optional if the vgpu_vfio module is to work on the device. If 
vgpu_bar_info is not provided by the vendor driver, open() would fail. 
vgpu_vfio expects a PCI device, so PCI device validation is also needed.



>
> Let's not neglect ioport BARs here, IO_MASK is different.
>

vgpu_device is a virtual device; it is not going to drive VGA signals. 
NVIDIA vGPU would not support an IO BAR.



>> +  vdev->refcnt--;
>> +  if (!vdev->refcnt) {
>> +  memset(>bar_info, 0, sizeof(vdev->bar_info));
>
> Why?

vfio_vgpu_device is allocated when the vgpu device is created by the vgpu 
core. QEMU/VMM then calls open() on that device, where vdev->bar_info is 
populated and vconfig is allocated.
In the teardown path, QEMU/VMM calls close() on the device, and 
vfio_vgpu_device is destroyed when the vgpu device is destroyed by the vgpu core.


If QEMU/VMM restarts and the vgpu device is not destroyed in that case, 
vdev->bar_info should be cleared so it is fetched again from the vendor 
driver. It should not keep any stale addresses.
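
In code, the close path would look roughly like this (sketch based on the 
refcnt and bar_info fields in this patch; the function name is assumed):

/* Sketch: on the last close(), drop cached BAR info and vconfig so a
 * QEMU/VMM restart re-fetches everything from the vendor driver. */
static void vgpu_dev_close(void *device_data)
{
        struct vfio_vgpu_device *vdev = device_data;

        mutex_lock(&vfio_vgpu_lock);
        vdev->refcnt--;
        if (!vdev->refcnt) {
                memset(&vdev->bar_info, 0, sizeof(vdev->bar_info));
                kfree(vdev->vconfig);
                vdev->vconfig = NULL;
        }
        mutex_unlock(&vfio_vgpu_lock);
        module_put(THIS_MODULE);
}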


>> +  if (irq_type == VFIO_PCI_MSIX_IRQ_INDEX)
>> +  return -1;
>
> How are we going to expand the API later for it?  Shouldn't this just
> be a passthrough to a gpu_devices_ops.vgpu_vfio_get_irq_info callback?

The vendor driver conveys the interrupt type by defining capabilities in the 
config space. I don't think we should add a new callback for it.
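
That is, vgpu_vfio can derive the interrupt type by walking the capability 
list in the virtual config space the vendor driver built; a rough sketch 
(the helper itself is hypothetical):

/* Sketch: detect MSI support from the vendor-built vconfig by walking
 * the standard PCI capability list. */
static bool vgpu_has_msi_cap(struct vfio_vgpu_device *vdev)
{
        u8 pos = vdev->vconfig[PCI_CAPABILITY_LIST];
        int loops = 48;                 /* guard against malformed lists */

        while (pos && loops--) {
                u8 id = vdev->vconfig[pos + PCI_CAP_LIST_ID];

                if (id == PCI_CAP_ID_MSI)
                        return true;
                pos = vdev->vconfig[pos + PCI_CAP_LIST_NEXT];
        }
        return false;
}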



>> +  memcpy((void *)(vdev->vconfig + pos), (void *)user_data, 
count);
>
> So write is expected to user_data to allow only the writable bits to be
> changed?  What's really being saved in the vconfig here vs the vendor
> vgpu driver?  It seems like we're only using it to cache the BAR
> values, but we're not providing the BAR emulation here, which seems
> like one of the few things we could provide so it's not duplicated in
> every vendor driver.  But then we only need a few u32s to do that, not
> all of config space.
>

The vendor driver should emulate the config space. It is not just BAR 
addresses; the vendor driver should also add the capabilities supported by 
its vGPU device.



>> +
>> +  if (gpu_dev->ops->write) {
>> +  ret = gpu_dev->ops->write(vgpu_dev,
>> +user_data,
>> +count,
>> +vgpu_emul_space_mmio,
>> +pos);
>> +  }
>
> What's the usefulness in a vendor driver that doesn't provide
> read/write?

The checks are to avoid a NULL pointer dereference if these callbacks are not 
provided. Whether it will work or not depends entirely on the vendor 
driver stack in the host and guest.


>> +  case VFIO_PCI_ROM_REGION_INDEX:
>> +  case VFIO_PCI_VGA_REGION_INDEX:
>
> Wait a sec, who's doing the VGA emulation?  We can't be claiming to
> support a VGA region and then fail to provide read/write access to it
> like we said it has.
>

NVIDIA vGPU doesn't support the IO BAR and ROM BAR, but I can move these 
cases to

case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:

so that if a vendor driver supports IO BAR or ROM BAR emulation, it is 
handled the same as the other BARs, as sketched below.
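
A sketch of the adjusted dispatch (behaviour of the existing helpers is 
unchanged; vendor drivers that don't emulate ROM/VGA can simply fail the 
access from their callbacks):

static ssize_t vgpu_dev_rw(void *device_data, char __user *buf,
                           size_t count, loff_t *ppos, bool iswrite)
{
        unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
        struct vfio_vgpu_device *vdev = device_data;

        if (index >= VFIO_PCI_NUM_REGIONS)
                return -EINVAL;

        switch (index) {
        case VFIO_PCI_CONFIG_REGION_INDEX:
                return vgpu_dev_config_rw(vdev, buf, count, ppos, iswrite);

        /* ROM and VGA take the same vendor-backed path as the memory BARs */
        case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
        case VFIO_PCI_ROM_REGION_INDEX:
        case VFIO_PCI_VGA_REGION_INDEX:
                return vgpu_dev_bar_rw(vdev, buf, count, ppos, iswrite);
        }

        return -EINVAL;
}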



>> +  ret = remap_pfn_range(vma, virtaddr, pgoff, req_size, pg_prot);
>
> So not supporting validate_map_request() means that the user can
> directly mmap BARs of the host GPU and as shown below, we assume a 1:1
> mapping of vGPU BAR to host GPU BAR.  Is that ever valid in a vGPU
> scenario or should this callback be required?

Yes, if restrictions are imposed such that only one vGPU device can be 

Re: [Qemu-devel] [RFC PATCH v3 1/3] vGPU Core driver

2016-05-04 Thread Kirti Wankhede

Thanks Alex.

>> +config VGPU_VFIO
>> +tristate
>> +depends on VGPU
>> +default n
>> +
>
> This is a little bit convoluted, it seems like everything added in this
> patch is vfio agnostic, it doesn't necessarily care what the consumer
> is.  That makes me think we should only be adding CONFIG_VGPU here and
> it should not depend on CONFIG_VFIO or be enabling CONFIG_VGPU_VFIO.
> The middle config entry is also redundant to the first, just move the
> default line up to the first and remove the rest.

CONFIG_VGPU doesn't directly depend on VFIO; CONFIG_VGPU_VFIO is 
directly dependent on VFIO. But devices created by the vGPU core module need 
a driver to manage them, and CONFIG_VGPU_VFIO is the driver which 
will manage vgpu devices. So I think CONFIG_VGPU_VFIO should be enabled 
by CONFIG_VGPU.


This would look like:
menuconfig VGPU
tristate "VGPU driver framework"
select VGPU_VFIO
default n
help
VGPU provides a framework to virtualize GPU without SR-IOV cap
See Documentation/vgpu.txt for more details.

If you don't know what do here, say N.

config VGPU_VFIO
tristate
depends on VGPU
depends on VFIO
default n



>> +int vgpu_register_device(struct pci_dev *dev, const struct 
gpu_device_ops *ops)

>
> Why do we care that it's a pci_dev?  It seems like there's only a very
> small portion of the API that cares about pci_devs in order to describe
> BARs, which could be switched based on the device type.  Otherwise we
> could operate on a struct device here.
>

GPUs are PCI devices, hence I used pci_dev. I agree with you; I'll change 
it to operate on struct device and add checks in vgpu_vfio.c where the 
config space and BARs are populated.
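
For example (sketch only; vgpu_add_gpu() and the gpu_dev->dev field stand in 
for whatever the next version ends up using):

/* Sketch: the core registration takes a plain struct device; the
 * PCI-specific validation moves into vgpu_vfio, which is the part that
 * actually emulates a PCI device. */
int vgpu_register_device(struct device *dev, const struct gpu_device_ops *ops)
{
        return vgpu_add_gpu(dev, ops);          /* hypothetical core helper */
}

static int vgpu_vfio_validate(struct vgpu_device *vgpu_dev)
{
        /* config space/BAR handling assumes a PCI parent */
        if (!dev_is_pci(vgpu_dev->gpu_dev->dev))
                return -EINVAL;
        return 0;
}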


>> +static void vgpu_device_free(struct vgpu_device *vgpu_dev)
>> +{
>> +  if (vgpu_dev) {
>> +  mutex_lock(&vgpu_devices_lock);
>> +  list_del(&vgpu_dev->list);
>> +  mutex_unlock(&vgpu_devices_lock);
>> +  kfree(vgpu_dev);
>> +  }
>
> Why aren't we using the kref to remove and free the vgpu when the last
> reference is released?

vgpu_device_free() is called from two places:
1. from create_vgpu_device(), when device_register() fails.
2. from vgpu_device_release(), which is set as the release function for the 
device registered by device_register():

vgpu_dev->dev.release = vgpu_device_release;
This release function is called from the device_unregister() kernel function, 
which uses the kref to call the release function, so I don't think I need 
to do it again.



>> +struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int 
instance)

>> +{
>> +  struct vgpu_device *vdev = NULL;
>> +
>> +  mutex_lock(&vgpu_devices_lock);
>> +  list_for_each_entry(vdev, &vgpu_devices_list, list) {
>> +  if ((uuid_le_cmp(vdev->uuid, uuid) == 0) &&
>> +  (vdev->vgpu_instance == instance)) {
>> +  mutex_unlock(&vgpu_devices_lock);
>> +  return vdev;
>
> We're not taking any sort of reference to the vgpu, what prevents races
> with it being removed?  A common exit path would be easy to achieve
> here too.
>

I'll add a reference count.
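
For example (a sketch: take the reference under vgpu_devices_lock and use one 
common exit path; the kref member and vgpu_put_device() are assumptions for 
the next version):

struct vgpu_device *vgpu_drv_get_vgpu_device(uuid_le uuid, int instance)
{
        struct vgpu_device *vdev, *found = NULL;

        mutex_lock(&vgpu_devices_lock);
        list_for_each_entry(vdev, &vgpu_devices_list, list) {
                if (uuid_le_cmp(vdev->uuid, uuid) == 0 &&
                    vdev->vgpu_instance == instance) {
                        kref_get(&vdev->kref);  /* hold it across the return */
                        found = vdev;
                        break;
                }
        }
        mutex_unlock(&vgpu_devices_lock);

        return found;   /* caller drops the reference with vgpu_put_device() */
}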


>> +create_attr_error:
>> +  if (gpu_dev->ops->vgpu_destroy) {
>> +  int ret = 0;
>> +  ret = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
>> +   vgpu_dev->uuid,
>> +   vgpu_dev->vgpu_instance);
>
> Unnecessary initialization and we don't do anything with the result.
> Below indicates lack of vgpu_destroy indicates the vendor doesn't
> support unplug, but doesn't that break our error cleanup path here?
>

The comment about vgpu_destroy says: if the VM is running and vgpu_destroy is 
called, that means the vGPU is being hot-unplugged; return an 
error if the VM is running and the graphics driver 
doesn't support vGPU hotplug.

It is the GPU driver's responsibility to check whether the VM is running and 
return accordingly. This is the vGPU creation path; the vGPU device would be 
hotplugged to the VM on vgpu_start.


>> +  retval = gpu_dev->ops->vgpu_destroy(gpu_dev->dev,
>> +  vgpu_dev->uuid,
>> +  vgpu_dev->vgpu_instance);
>> +	/* if vendor driver doesn't return success that means vendor 
driver doesn't

>> +   * support hot-unplug */
>> +  if (retval)
>> +  return;
>
> Should we return an error code then?  Inconsistent comment style.
>

destroy_vgpu_device() is called from:
- release_vgpubus_dev(), which is the release function of vgpu_class and 
has return type void.

- vgpu_unregister_device(), which also returns void.

Even if an error code is returned from here, it would not be used.

I'll change the comment style in the next patch update.



>> + * @write:Write emulation callback
>> + *@vdev: vgpu device structure
>> + *@buf: write buffer

[Qemu-devel] [RFC PATCH v3 0/3] Add vGPU support

2016-05-02 Thread Kirti Wankhede
This series adds vGPU support to the v4.6 Linux host kernel. The purpose of this series
is to provide a common interface for vGPU management that can be used
by different GPU drivers. This series introduces a vGPU core module that creates
and manages vGPU devices, a VFIO-based driver for the vGPU devices created by
the vGPU core module, and an update to the VFIO type1 IOMMU module to support vGPU devices.

What's new in v3?
The VFIO type1 IOMMU module supports devices which are IOMMU capable. This version
of the patch set adds support for vGPU devices, which are not IOMMU capable, so they
can use the existing VFIO IOMMU module. The VFIO type1 IOMMU patch provides a new set
of APIs for guest page translation.

What's left to do?
The VFIO driver for the vGPU device doesn't support devices with MSI-X enabled.

Please review.

Thanks,
Kirti

Kirti Wankhede (3):
  vGPU Core driver
  VFIO driver for vGPU device
  VFIO Type1 IOMMU change: to support with iommu and without iommu

 drivers/Kconfig |2 +
 drivers/Makefile|1 +
 drivers/vfio/vfio_iommu_type1.c |  427 +++--
 drivers/vgpu/Kconfig|   21 ++
 drivers/vgpu/Makefile   |5 +
 drivers/vgpu/vgpu-core.c|  424 
 drivers/vgpu/vgpu-driver.c  |  136 
 drivers/vgpu/vgpu-sysfs.c   |  365 +
 drivers/vgpu/vgpu_private.h |   36 ++
 drivers/vgpu/vgpu_vfio.c|  671 +++
 include/linux/vfio.h|6 +
 include/linux/vgpu.h|  216 +
 12 files changed, 2278 insertions(+), 32 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vgpu-core.c
 create mode 100644 drivers/vgpu/vgpu-driver.c
 create mode 100644 drivers/vgpu/vgpu-sysfs.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 drivers/vgpu/vgpu_vfio.c
 create mode 100644 include/linux/vgpu.h




[Qemu-devel] [RFC PATCH v3 2/3] VFIO driver for vGPU device

2016-05-02 Thread Kirti Wankhede
The VFIO driver registers with the vGPU core driver. The vGPU core driver creates the vGPU
device and calls the probe routine of the vGPU VFIO driver. This vGPU VFIO driver adds
the vGPU device to the VFIO core module.
The main aim of this module is to manage all VFIO APIs for each vGPU device.
Those are:
- get region information from the GPU driver.
- trap and emulate the PCI config space and BAR regions.
- send interrupt configuration information to the GPU driver.
- mmap the mappable region, invalidating the mapping and faulting on access to remap the pfn.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I949a6b499d2e98d9c3352ae579535a608729b223
---
 drivers/vgpu/Makefile|1 +
 drivers/vgpu/vgpu_vfio.c |  671 ++
 2 files changed, 672 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/vgpu_vfio.c

diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
index f5be980..a0a2655 100644
--- a/drivers/vgpu/Makefile
+++ b/drivers/vgpu/Makefile
@@ -2,3 +2,4 @@
 vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
 
 obj-$(CONFIG_VGPU) += vgpu.o
+obj-$(CONFIG_VGPU_VFIO) += vgpu_vfio.o
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 000..460a4dc
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,671 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia <c...@nvidia.com>
+ *Kirti Wankhede <kwankh...@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC "VGPU VFIO Driver"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)  (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)((u64)(index) << 
VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK   (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_vgpu_device {
+   struct iommu_group *group;
+   struct vgpu_device *vgpu_dev;
+   int refcnt;
+   struct pci_bar_info bar_info[VFIO_PCI_NUM_REGIONS];
+   u8  *vconfig;
+};
+
+static DEFINE_MUTEX(vfio_vgpu_lock);
+
+static int get_virtual_bar_info(struct vgpu_device *vgpu_dev,
+   struct pci_bar_info *bar_info,
+   int index)
+{
+   int ret = -1;
+   struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+   if (gpu_dev->ops->vgpu_bar_info)
+   ret = gpu_dev->ops->vgpu_bar_info(vgpu_dev, index, bar_info);
+   return ret;
+}
+
+static int vdev_read_base(struct vfio_vgpu_device *vdev)
+{
+   int index, pos;
+   u32 start_lo, start_hi;
+   u32 mem_type;
+
+   pos = PCI_BASE_ADDRESS_0;
+
+   for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+   if (!vdev->bar_info[index].size)
+   continue;
+
+   start_lo = (*(u32 *)(vdev->vconfig + pos)) &
+   PCI_BASE_ADDRESS_MEM_MASK;
+   mem_type = (*(u32 *)(vdev->vconfig + pos)) &
+   PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+   switch (mem_type) {
+   case PCI_BASE_ADDRESS_MEM_TYPE_64:
+   start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
+   pos += 4;
+   break;
+   case PCI_BASE_ADDRESS_MEM_TYPE_32:
+   case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+   /* 1M mem BAR treated as 32-bit BAR */
+   default:
+   /* mem unknown type treated as 32-bit BAR */
+   start_hi = 0;
+   break;
+   }
+   pos += 4;
+   vdev->bar_info[index].start = ((u64)start_hi << 32) | start_lo;
+   }
+   return 0;
+}
+
+static int vgpu_dev_open(void *device_data)
+{
+   int ret = 0;
+   struct vfio_vgpu_device *vdev = device_data;
+
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   mutex_lock(&vfio_vgpu_lock);
+
+   if (!vdev->refcnt) {
+   u8 *vconfig;
+   int vconfig_size, index;
+
+   for (index = 0; index < VFIO_PCI_NUM_REGIONS; index++) {
+   ret = get_virtual_bar_info(vdev->vgpu_dev,
+

[Qemu-devel] [RFC PATCH v3 1/3] vGPU Core driver

2016-05-02 Thread Kirti Wankhede
Design for vGPU Driver:
The main purpose of the vGPU driver is to provide a common interface for vGPU
management that can be used by different GPU drivers.

This module would provide a generic interface to create the device, add
it to vGPU bus, add device to IOMMU group and then add it to vfio group.

High Level block diagram:

+--+vgpu_register_driver()+---+
| __init() +->+   |
|  |  |   |
|  +<-+vgpu.ko|
| vgpu_vfio.ko |   probe()/remove()   |   |
|  |+-+   +-+
+--+| +---+---+ |
| ^ |
| callback| |
| +---++|
| |vgpu_register_device()   |
| |||
+---^-+-++-+--+-+
| nvidia.ko ||  i915.ko   |
|   |||
+---+++

vGPU driver provides two types of registration interfaces:
1. Registration interface for vGPU bus driver:

/**
  * struct vgpu_driver - vGPU device driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct vgpu_driver {
 const char *name;
 int  (*probe)  (struct device *dev);
 void (*remove) (struct device *dev);
 struct device_driverdriver;
};

int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
void vgpu_unregister_driver(struct vgpu_driver *drv);

VFIO bus driver for vgpu, should use this interface to register with
vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
to add vGPU device to VFIO group.

2. GPU driver interface
GPU driver interface provides GPU driver the set APIs to manage GPU driver
related work in their own driver. APIs are to:
- vgpu_supported_config: provide supported configuration list by the GPU.
- vgpu_create: to allocate basic resources in the GPU driver for a vGPU device.
- vgpu_destroy: to free resources in GPU driver during vGPU device destroy.
- vgpu_start: to initiate vGPU initialization process from GPU driver when VM
  boots and before QEMU starts.
- vgpu_shutdown: to teardown vGPU resources during VM teardown.
- read : read emulation callback.
- write: write emulation callback.
- vgpu_set_irqs: send interrupt configuration information that QEMU sets.
- vgpu_bar_info: to provide the BAR size and its flags for the vGPU device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by GPU drivers to register
each physical device to vGPU driver.

Updated this patch with a couple more functions in the GPU driver interface
which were discussed during the v1 version of this RFC.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I1c13c411f61b7b2e750e85adfe1b097f9fd218b9
---
 drivers/Kconfig |2 +
 drivers/Makefile|1 +
 drivers/vgpu/Kconfig|   21 ++
 drivers/vgpu/Makefile   |4 +
 drivers/vgpu/vgpu-core.c|  424 +++
 drivers/vgpu/vgpu-driver.c  |  136 ++
 drivers/vgpu/vgpu-sysfs.c   |  365 +
 drivers/vgpu/vgpu_private.h |   36 
 include/linux/vgpu.h|  216 ++
 9 files changed, 1205 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vgpu-core.c
 create mode 100644 drivers/vgpu/vgpu-driver.c
 create mode 100644 drivers/vgpu/vgpu-sysfs.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339..5fd9eae 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 8f5d076..36f1110 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)  += message/
 obj-y  += firewire/
 obj-$(CONFIG_UIO)  += uio/
 obj-$(CONFIG_VFIO) += vfio/
+obj-$(CONFIG_VFIO) += vgpu/
 obj-y  += cdrom/

[Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-02 Thread Kirti Wankhede
The VFIO Type1 IOMMU driver is designed for devices which are IOMMU capable.
vGPU devices only use the IOMMU TYPE1 API; the underlying hardware can be
managed by an IOMMU domain. To reuse most of the code of the IOMMU driver for vGPU
devices, the type1 IOMMU driver is modified to support vGPU devices. This change
exports functions to pin and unpin pages for vGPU devices.
It maintains data about pinned pages for the vGPU domain. This data is used to verify
unpinning requests and also to unpin pages from detach_group().

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- two GPU pass through

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
Change-Id: I6e35e9fc7f14049226365e9ecef3814dc4ca1738
---
 drivers/vfio/vfio_iommu_type1.c |  427 ---
 include/linux/vfio.h|6 +
 include/linux/vgpu.h|4 +-
 3 files changed, 403 insertions(+), 34 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e9..a970854 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.william...@redhat.com>"
@@ -67,6 +68,11 @@ struct vfio_domain {
struct list_headgroup_list;
int prot;   /* IOMMU_CACHE */
boolfgsp;   /* Fine-grained super pages */
+   boolvfio_iommu_api_only;/* Domain for device 
which
+  is without physical 
IOMMU */
+   struct mm_struct*vmm_mm;/* VMM's mm */
+   struct rb_root  pfn_list;   /* Host pfn list for requested 
gfns */
+   struct mutexlock;   /* mutex for pfn_list */
 };
 
 struct vfio_dma {
@@ -83,6 +89,19 @@ struct vfio_group {
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_vgpu_pfn {
+   struct rb_node  node;
+   unsigned long   vmm_va; /* VMM virtual addr */
+   dma_addr_t  iova;   /* IOVA */
+   unsigned long   npage;  /* number of pages */
+   unsigned long   pfn;/* Host pfn */
+   int prot;
+   atomic_tref_count;
+   struct list_headnext;
+};
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +149,53 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, 
struct vfio_dma *old)
 rb_erase(&old->node, &iommu->dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_vgpu_pfn *vfio_find_vgpu_pfn(struct vfio_domain *domain,
+   unsigned long pfn)
+{
+   struct rb_node *node = domain->pfn_list.rb_node;
+
+   while (node) {
+   struct vfio_vgpu_pfn *vgpu_pfn = rb_entry(node, struct 
vfio_vgpu_pfn, node);
+
+   if (pfn <= vgpu_pfn->pfn)
+   node = node->rb_left;
+   else if (pfn >= vgpu_pfn->pfn)
+   node = node->rb_right;
+   else
+   return vgpu_pfn;
+   }
+
+   return NULL;
+}
+
+static void vfio_link_vgpu_pfn(struct vfio_domain *domain, struct 
vfio_vgpu_pfn *new)
+{
+   struct rb_node **link = &domain->pfn_list.rb_node, *parent = NULL;
+   struct vfio_vgpu_pfn *vgpu_pfn;
+
+   while (*link) {
+   parent = *link;
+   vgpu_pfn = rb_entry(parent, struct vfio_vgpu_pfn, node);
+
+   if (new->pfn <= vgpu_pfn->pfn)
+   link = &(*link)->rb_left;
+   else
+   link = &(*link)->rb_right;
+   }
+
+   rb_link_node(&new->node, parent, link);
+   rb_insert_color(&new->node, &domain->pfn_list);
+}
+
+static void vfio_unlink_vgpu_pfn(struct vfio_domain *domain, struct 
vfio_vgpu_pfn *old)
+{
+   rb_erase(&old->node, &domain->pfn_list);
+}
+
 struct vwork {
struct mm_struct*mm;
longnpage;
@@ -228,20 +294,22 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+int prot, unsigned long *pfn)
 {
struct page *page[1];
struct vm_area_struct *vma;
int ret = -EFAULT;
 
-   if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+   if (get_user_pages_remote(NULL, mm, vaddr, 1, !!(prot &

[Qemu-devel] [RFC PATCH v2 2/3] VFIO driver for vGPU device

2016-02-23 Thread Kirti Wankhede
The VFIO driver registers with the vGPU core driver. The vGPU core driver creates the vGPU
device and calls the probe routine of the vGPU VFIO driver. This vGPU VFIO driver adds
the vGPU device to the VFIO core module.
The main aim of this module is to manage all VFIO APIs for each vGPU device.
Those are:
- get region information from the GPU driver.
- trap and emulate the PCI config space and BAR regions.
- send interrupt configuration information to the GPU driver.
- mmap the mappable region, invalidating the mapping and faulting on access to remap the pfn.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
---
 drivers/vgpu/Makefile|1 +
 drivers/vgpu/vgpu_vfio.c |  664 ++
 2 files changed, 665 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/vgpu_vfio.c

diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
index f5be980..a0a2655 100644
--- a/drivers/vgpu/Makefile
+++ b/drivers/vgpu/Makefile
@@ -2,3 +2,4 @@
 vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
 
 obj-$(CONFIG_VGPU) += vgpu.o
+obj-$(CONFIG_VGPU_VFIO) += vgpu_vfio.o
diff --git a/drivers/vgpu/vgpu_vfio.c b/drivers/vgpu/vgpu_vfio.c
new file mode 100644
index 000..dc19630
--- /dev/null
+++ b/drivers/vgpu/vgpu_vfio.c
@@ -0,0 +1,664 @@
+/*
+ * VGPU VFIO device
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia <c...@nvidia.com>
+ *Kirti Wankhede <kwankh...@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "NVIDIA Corporation"
+#define DRIVER_DESC "VGPU VFIO Driver"
+
+#define VFIO_PCI_OFFSET_SHIFT   40
+
+#define VFIO_PCI_OFFSET_TO_INDEX(off)  (off >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_INDEX_TO_OFFSET(index)((u64)(index) << 
VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK   (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+struct vfio_vgpu_device {
+   struct iommu_group *group;
+   struct vgpu_device *vgpu_dev;
+   int refcnt;
+   struct pci_bar_info bar_info[VFIO_PCI_NUM_REGIONS];
+   u8  *vconfig;
+};
+
+static DEFINE_MUTEX(vfio_vgpu_lock);
+
+static int get_virtual_bar_info(struct vgpu_device *vgpu_dev,
+   struct pci_bar_info *bar_info,
+   int index)
+{
+   int ret = -1;
+   struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
+
+   if (gpu_dev->ops->vgpu_bar_info)
+   ret = gpu_dev->ops->vgpu_bar_info(vgpu_dev, index, bar_info);
+   return ret;
+}
+
+static int vdev_read_base(struct vfio_vgpu_device *vdev)
+{
+   int index, pos;
+   u32 start_lo, start_hi;
+   u32 mem_type;
+
+   pos = PCI_BASE_ADDRESS_0;
+
+   for (index = 0; index <= VFIO_PCI_BAR5_REGION_INDEX; index++) {
+
+   if (!vdev->bar_info[index].size)
+   continue;
+
+   start_lo = (*(u32 *)(vdev->vconfig + pos)) &
+   PCI_BASE_ADDRESS_MEM_MASK;
+   mem_type = (*(u32 *)(vdev->vconfig + pos)) &
+   PCI_BASE_ADDRESS_MEM_TYPE_MASK;
+
+   switch (mem_type) {
+   case PCI_BASE_ADDRESS_MEM_TYPE_64:
+   start_hi = (*(u32 *)(vdev->vconfig + pos + 4));
+   pos += 4;
+   break;
+   case PCI_BASE_ADDRESS_MEM_TYPE_32:
+   case PCI_BASE_ADDRESS_MEM_TYPE_1M:
+   /* 1M mem BAR treated as 32-bit BAR */
+   default:
+   /* mem unknown type treated as 32-bit BAR */
+   start_hi = 0;
+   break;
+   }
+   pos += 4;
+   vdev->bar_info[index].start = ((u64)start_hi << 32) | start_lo;
+   }
+   return 0;
+}
+
+static int vgpu_dev_open(void *device_data)
+{
+   int ret = 0;
+   struct vfio_vgpu_device *vdev = device_data;
+
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   mutex_lock(&vfio_vgpu_lock);
+
+   if (!vdev->refcnt) {
+   u8 *vconfig;
+   int vconfig_size, index;
+
+   for (index = 0; index < VFIO_PCI_NUM_REGIONS; index++) {
+   ret = get_virtual_bar_info(vdev->vgpu_dev,
+  &vdev->bar_info[index],

[Qemu-devel] [RFC PATCH v2 3/3] VFIO: Type1 IOMMU mapping support for vGPU

2016-02-23 Thread Kirti Wankhede
The aim of this module is to pin and unpin guest memory.
This module provides an interface to the GPU driver that can be used to map guest
physical memory into its kernel-space driver.
Currently this module duplicates code from vfio_iommu_type1.c.
We are working on refining the functions to reuse the existing code in
vfio_iommu_type1.c, and with that we will add an API to unpin pages.
This is for reference, to review the overall design of vGPU.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
---
 drivers/vgpu/Makefile|1 +
 drivers/vgpu/vfio_iommu_type1_vgpu.c |  423 ++
 2 files changed, 424 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/vfio_iommu_type1_vgpu.c

diff --git a/drivers/vgpu/Makefile b/drivers/vgpu/Makefile
index a0a2655..8ace18d 100644
--- a/drivers/vgpu/Makefile
+++ b/drivers/vgpu/Makefile
@@ -3,3 +3,4 @@ vgpu-y := vgpu-core.o vgpu-sysfs.o vgpu-driver.o
 
 obj-$(CONFIG_VGPU) += vgpu.o
 obj-$(CONFIG_VGPU_VFIO) += vgpu_vfio.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1_VGPU) += vfio_iommu_type1_vgpu.o
diff --git a/drivers/vgpu/vfio_iommu_type1_vgpu.c 
b/drivers/vgpu/vfio_iommu_type1_vgpu.c
new file mode 100644
index 000..0b36ae5
--- /dev/null
+++ b/drivers/vgpu/vfio_iommu_type1_vgpu.c
@@ -0,0 +1,423 @@
+/*
+ * VGPU : IOMMU DMA mapping support for VGPU
+ *
+ * Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.
+ * Author: Neo Jia <c...@nvidia.com>
+ *Kirti Wankhede <kwankh...@nvidia.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vgpu_private.h"
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR  "NVIDIA Corporation"
+#define DRIVER_DESC "VGPU Type1 IOMMU driver for VFIO"
+
+// VFIO structures
+
+struct vfio_iommu_vgpu {
+   struct mutex lock;
+   struct iommu_group *group;
+   struct vgpu_device *vgpu_dev;
+   struct rb_root dma_list;
+   struct mm_struct * vm_mm;
+};
+
+struct vgpu_vfio_dma {
+   struct rb_node node;
+   dma_addr_t iova;
+   unsigned long vaddr;
+   size_t size;
+   int prot;
+};
+
+/*
+ * VGPU VFIO FOPs definition
+ *
+ */
+
+/*
+ * Duplicated from vfio_link_dma, just quick hack ... should
+ * reuse code later
+ */
+
+static void vgpu_link_dma(struct vfio_iommu_vgpu *iommu,
+ struct vgpu_vfio_dma *new)
+{
+   struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+   struct vgpu_vfio_dma *dma;
+
+   while (*link) {
+   parent = *link;
+   dma = rb_entry(parent, struct vgpu_vfio_dma, node);
+
+   if (new->iova + new->size <= dma->iova)
+   link = &(*link)->rb_left;
+   else
+   link = &(*link)->rb_right;
+   }
+
+   rb_link_node(&new->node, parent, link);
+   rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static struct vgpu_vfio_dma *vgpu_find_dma(struct vfio_iommu_vgpu *iommu,
+  dma_addr_t start, size_t size)
+{
+   struct rb_node *node = iommu->dma_list.rb_node;
+
+   while (node) {
+   struct vgpu_vfio_dma *dma = rb_entry(node, struct 
vgpu_vfio_dma, node);
+
+   if (start + size <= dma->iova)
+   node = node->rb_left;
+   else if (start >= dma->iova + dma->size)
+   node = node->rb_right;
+   else
+   return dma;
+   }
+
+   return NULL;
+}
+
+static void vgpu_unlink_dma(struct vfio_iommu_vgpu *iommu, struct 
vgpu_vfio_dma *old)
+{
+   rb_erase(&old->node, &iommu->dma_list);
+}
+
+static void vgpu_dump_dma(struct vfio_iommu_vgpu *iommu)
+{
+   struct vgpu_vfio_dma *c, *n;
+   uint32_t i = 0;
+
+   rbtree_postorder_for_each_entry_safe(c, n, &iommu->dma_list, node)
+   printk(KERN_INFO "%s: dma[%d] iova:0x%llx, vaddr:0x%lx, 
size:0x%lx\n",
+  __FUNCTION__, i++, c->iova, c->vaddr, c->size);
+}
+
+static int vgpu_dma_do_track(struct vfio_iommu_vgpu * vgpu_iommu,
+   struct vfio_iommu_type1_dma_map *map)
+{
+   dma_addr_t iova = map->iova;
+   unsigned long vaddr = map->vaddr;
+   int ret = 0, prot = 0;
+   struct vgpu_vfio_dma *vgpu_dma;
+
+   mutex_lock(&vgpu_iommu->lock);
+
+   if (vgpu_find_dma(vgpu_iommu, map->iova, map->size)) {
+   mutex_unlock(&vgpu_iommu->lock);
+   return -EEXIST;
+   }
+
+   vgpu_dma = kzalloc(size

[Qemu-devel] [RFC PATCH v2 1/3] vGPU Core driver

2016-02-23 Thread Kirti Wankhede
Design for vGPU Driver:
The main purpose of the vGPU driver is to provide a common interface for vGPU
management that can be used by different GPU drivers.

This module would provide a generic interface to create the device, add
it to vGPU bus, add device to IOMMU group and then add it to vfio group.

High Level block diagram:

+--+vgpu_register_driver()+---+
| __init() +->+   |
|  |  |   |
|  +<-+vgpu.ko|
| vgpu_vfio.ko |   probe()/remove()   |   |
|  |+-+   +-+
+--+| +---+---+ |
| ^ |
| callback| |
| +---++|
| |vgpu_register_device()   |
| |||
+---^-+-++-+--+-+
| nvidia.ko ||  i915.ko   |
|   |||
+---+++

vGPU driver provides two types of registration interfaces:
1. Registration interface for vGPU bus driver:

/**
  * struct vgpu_driver - vGPU device driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct vgpu_driver {
 const char *name;
 int  (*probe)  (struct device *dev);
 void (*remove) (struct device *dev);
 struct device_driverdriver;
};

int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
void vgpu_unregister_driver(struct vgpu_driver *drv);

VFIO bus driver for vgpu, should use this interface to register with
vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
to add vGPU device to VFIO group.

2. GPU driver interface
GPU driver interface provides GPU driver the set APIs to manage GPU driver
related work in their own driver. APIs are to:
- vgpu_supported_config: provide supported configuration list by the GPU.
- vgpu_create: to allocate basic resources in the GPU driver for a vGPU device.
- vgpu_destroy: to free resources in GPU driver during vGPU device destroy.
- vgpu_start: to initiate vGPU initialization process from GPU driver when VM
  boots and before QEMU starts.
- vgpu_shutdown: to teardown vGPU resources during VM teardown.
- read : read emulation callback.
- write: write emulation callback.
- vgpu_set_irqs: send interrupt configuration information that QEMU sets.
- vgpu_bar_info: to provide the BAR size and its flags for the vGPU device.
- validate_map_request: to validate remap pfn request.

This registration interface should be used by GPU drivers to register
each physical device to vGPU driver.

Updated this patch with a couple more functions in the GPU driver interface
which were discussed during the v1 version of this RFC.

Thanks,
Kirti.

Signed-off-by: Kirti Wankhede <kwankh...@nvidia.com>
Signed-off-by: Neo Jia <c...@nvidia.com>
---
 drivers/Kconfig |2 +
 drivers/Makefile|1 +
 drivers/vgpu/Kconfig|   26 +++
 drivers/vgpu/Makefile   |4 +
 drivers/vgpu/vgpu-core.c|  422 +++
 drivers/vgpu/vgpu-driver.c  |  137 ++
 drivers/vgpu/vgpu-sysfs.c   |  366 +
 drivers/vgpu/vgpu_private.h |   36 
 include/linux/vgpu.h|  217 ++
 9 files changed, 1211 insertions(+), 0 deletions(-)
 create mode 100644 drivers/vgpu/Kconfig
 create mode 100644 drivers/vgpu/Makefile
 create mode 100644 drivers/vgpu/vgpu-core.c
 create mode 100644 drivers/vgpu/vgpu-driver.c
 create mode 100644 drivers/vgpu/vgpu-sysfs.c
 create mode 100644 drivers/vgpu/vgpu_private.h
 create mode 100644 include/linux/vgpu.h

diff --git a/drivers/Kconfig b/drivers/Kconfig
index d2ac339..5fd9eae 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -122,6 +122,8 @@ source "drivers/uio/Kconfig"
 
 source "drivers/vfio/Kconfig"
 
+source "drivers/vgpu/Kconfig"
+
 source "drivers/vlynq/Kconfig"
 
 source "drivers/virt/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index 795d0ca..1c43250 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -84,6 +84,7 @@ obj-$(CONFIG_FUSION)  += message/
 obj-y  += firewire/
 obj-$(CONFIG_UIO)  += uio/
 obj-$(CONFIG_VFIO) += vfio/
+obj-$(CONFIG_VFIO) += vgpu/
 obj-y  += cdrom/
 obj-y  += auxdisplay/
 obj-$(CON

Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-03 Thread Kirti Wankhede



On 2/3/2016 11:26 AM, Tian, Kevin wrote:
[...]

* @vgpu_create: Called to allocate basic resources in graphics
*  driver for a particular vgpu.
*  @dev: physical pci device structure on which vgpu
*should be created
*  @uuid: uuid for which VM it is intended to
*  @instance: vgpu instance in that VM
*  @vgpu_id: This represents the type of vgpu to be
*created
*  Returns integer: success (0) or error (< 0)


Specifically for Intel GVT-g we didn't hard partition resource among vGPUs.
Instead we allow user to accurately control how many physical resources
are allocated to a vGPU. So this interface should be extensible to allow
vendor specific resource control.



This interface forwards the create request to the vendor/GPU driver,
informing it about which physical GPU this request is intended for and the
type of vGPU. It is then the vendor/GPU driver's responsibility to do
resource allocation and manage those resources in its own driver.


However the current parameter definition disallows resource configuration
information passed from user. As I said, Intel GVT-g doesn't do static
allocation based on type. We provide flexibility to user for fine-grained
resource management.



int (*vgpu_create)(struct pci_dev *dev, uuid_le uuid,
-  uint32_t instance, uint32_t vgpu_id);
+  uint32_t instance, char *vgpu_params);

If we change the integer vgpu_id parameter to char *vgpu_params, then the GPU 
driver can accept multiple parameters.


Suppose there is a GPU at :85:00.0; then to create a vgpu:
# echo "<uuid>:<instance>:<vgpu_params>" > 
/sys/bus/pci/devices/\:85\:00.0/vgpu_create


The common vgpu module will not parse the vgpu_params string; it will be 
forwarded to the GPU driver, and it is then the GPU driver's responsibility 
to parse the string and act accordingly. This gives the flexibility to have 
multiple parameters for the GPU driver, as sketched below.
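
A sketch of that split (to_gpu_device(), vgpu_parse_uuid() and the 
gpu_dev->dev field are assumptions, not part of the posted patch):

/*
 * Sketch of the vgpu_create sysfs store.  The common vgpu module splits
 * off only "<uuid>:<instance>"; everything after the second ':' is
 * handed to the GPU driver's vgpu_create callback as an opaque string.
 */
static ssize_t vgpu_create_store(struct device *dev,
                                 struct device_attribute *attr,
                                 const char *buf, size_t count)
{
        struct gpu_device *gpu_dev = to_gpu_device(dev);
        char *orig, *str, *uuid_str, *instance_str, *vgpu_params;
        uuid_le uuid;
        uint32_t instance;
        int ret = -EINVAL;

        orig = str = kstrndup(buf, count, GFP_KERNEL);
        if (!str)
                return -ENOMEM;

        uuid_str = strsep(&str, ":");
        instance_str = strsep(&str, ":");
        vgpu_params = str;                      /* opaque to the vgpu core */

        if (!uuid_str || !instance_str || !vgpu_params)
                goto out;
        if (vgpu_parse_uuid(uuid_str, &uuid))   /* hypothetical parser */
                goto out;
        if (kstrtou32(instance_str, 0, &instance))
                goto out;

        ret = gpu_dev->ops->vgpu_create(gpu_dev->dev, uuid, instance,
                                        vgpu_params);
out:
        kfree(orig);
        return ret ? ret : count;
}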



*
* Physical GPU that support vGPU should be register with vgpu module with
* gpu_device_ops structure.
*/



Also it'd be good design to allow extensible usages, such as statistics, and
other vendor specific control knobs (e.g. foreground/background VM switch
in Intel GVT-g, etc.)



Can you elaborate on what other control knobs would be needed?



Some examples:

- foreground/background VM switch
- resource query
- various statistics info
- virtual monitor configuration
- ...

Since this vgpu core driver will become the central point for all vgpu
management, we need provide an easy way for vendor specific extension,
e.g. exposing above callbacks as sysfs nodes and then vendor can create
its own extensions under subdirectory (/intel, /nvidia, ...).




OK, makes sense.

Are these parameters per physical device or per vgpu device?

Adding attribute groups to the gpu_device_ops structure would provide a way 
to add vendor-specific extensions. I think we need two types of 
attributes here, per physical device and per vgpu device. Right?


 const struct attribute_group **dev_groups;
 const struct attribute_group **vgpu_groups;
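
As a sketch, registration could then create both sets via sysfs_create_groups() 
(field names here are assumptions; dev_groups would be created on the physical 
GPU's device, vgpu_groups on each vgpu device):

static int vgpu_create_dev_attrs(struct gpu_device *gpu_dev)
{
        if (!gpu_dev->ops->dev_groups)
                return 0;
        /* assumes gpu_dev->dev is the physical (PCI) device */
        return sysfs_create_groups(&gpu_dev->dev->dev.kobj,
                                   gpu_dev->ops->dev_groups);
}

static int vgpu_create_vgpu_attrs(struct vgpu_device *vgpu_dev)
{
        const struct gpu_device_ops *ops = vgpu_dev->gpu_dev->ops;

        if (!ops->vgpu_groups)
                return 0;
        return sysfs_create_groups(&vgpu_dev->dev.kobj, ops->vgpu_groups);
}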



Thanks,
Kirti.


Thanks
Kevin





Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-02 Thread Kirti Wankhede



On 2/2/2016 1:12 PM, Tian, Kevin wrote:

From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
Sent: Tuesday, February 02, 2016 9:48 AM

Resending this mail again, somehow my previous mail didn't reached every
to everyone's inbox.

On 2/2/2016 3:16 AM, Kirti Wankhede wrote:

Design for vGPU Driver:
Main purpose of vGPU driver is to provide a common interface for vGPU
management that can be used by different GPU drivers.


Thanks for composing this design which is a good start.



This module would provide a generic interface to create the device, add
it to vGPU bus, add device to IOMMU group and then add it to vfio group.

High Level block diagram:


+--+vgpu_register_driver()+---+
| __init() +->+   |
|  |  |   |
|  +<-+vgpu.ko|
| vfio_vgpu.ko |   probe()/remove()   |   |
|  |+-+   +-+
+--+| +---+---+ |
  | ^ |
  | callback| |
  | +---++|
  | |vgpu_register_device()   |
  | |||
  +---^-+-++-+--+-+
  | nvidia.ko ||  i915.ko   |
  |   |||
  +---+++

vGPU driver provides two types of registration interfaces:


Looks you missed callbacks which vgpu.ko provides to vfio_vgpu.ko,
e.g. to retrieve basic region/resource info, etc...



Basic region info or resource info would come from the GPU driver, so 
retrieving such info should be part of the GPU driver interface. Like I 
mentioned, we need to enhance these interfaces during development as and 
when we find it useful.
vfio_vgpu.ko gets the dev pointer, from which it can reach the vgpu_device 
structure and then use the GPU driver interface directly to retrieve 
such information from the GPU driver.


This RFC focuses more on the different modules and their structures, how 
those modules are inter-linked with each other, and on having a flexible 
design that keeps scope for enhancements.


We have identified three modules:

* vgpu.ko - vGPU core driver that provide registration interface for GPU 
driver and vGPU VFIO  driver, responsible for creating vGPU devices and 
providing management interface for vGPU devices.
* vfio_vgpu.ko - vGPU VFIO driver for vGPU device, provides VFIO 
interface that is used by QEMU.
* vfio_iommu_type1_vgpu.ko - IOMMU TYPE1 driver supporting the IOMMU 
TYPE1 v1 and v2 interface.


The above block diagram gives an overview how vgpu.ko, vfio_vgpu.ko and 
GPU drivers would be inter-linked with each other.




Also for GPU driver interfaces, better to identify the caller. E.g. it's
easy to understand life-cycle management would come from sysfs
by mgmt. stack like libvirt. What about @read and @write? what's
the connection between this vgpu core driver and specific hypervisor?
etc. Better to connect all necessary dots so we can refine all
necessary requirements on this proposal.



The read and write calls are for PCI config space and MMIO space read/write. 
A read/write access request from QEMU is passed to the GPU driver through 
the GPU driver interface, as sketched below.
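
A rough sketch of that path (the vgpu_emul_space_* selector names follow the 
later patches in this thread; the helper itself is illustrative and the size 
cap assumes slow-path register accesses of at most 8 bytes):

static ssize_t vgpu_forward_read(struct vgpu_device *vgpu_dev,
                                 char __user *buf, size_t count,
                                 loff_t pos, bool is_cfg)
{
        struct gpu_device *gpu_dev = vgpu_dev->gpu_dev;
        char data[8];
        int ret;

        if (count > sizeof(data) || !gpu_dev->ops->read)
                return -EINVAL;

        /* vendor driver emulates the access in its own driver */
        ret = gpu_dev->ops->read(vgpu_dev, data, count,
                                 is_cfg ? vgpu_emul_space_config :
                                          vgpu_emul_space_mmio,
                                 pos);
        if (ret <= 0)
                return ret;

        if (copy_to_user(buf, data, count))
                return -EFAULT;

        return count;
}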



[...]


2. GPU driver interface

/**
   * struct gpu_device_ops - Structure to be registered for each physical
GPU to
   * register the device to vgpu module.
   *
   * @owner:  The module owner.
   * @vgpu_supported_config: Called to get information about supported
   *   vgpu types.
   *  @dev : pci device structure of physical GPU.
   *  @config: should return string listing supported
   *  config
   *  Returns integer: success (0) or error (< 0)
   * @vgpu_create: Called to allocate basic resources in graphics
   *  driver for a particular vgpu.
   *  @dev: physical pci device structure on which vgpu
   *should be created
   *  @uuid: uuid for which VM it is intended to
   *  @instance: vgpu instance in that VM
   *  @vgpu_id: This represents the type of vgpu to be
   *created
   *  Returns integer: success (0) or error (< 0)


Specifically for Intel GVT-g we didn't hard partition resource among vGPUs.
Instead we allow user to accurately control how many physical resources
are allocated to a vGPU. So this interface should be extensible to 

Re: [Qemu-devel] [RFC PATCH v1 1/1] vGPU core driver : to provide common interface for vGPU.

2016-02-01 Thread Kirti Wankhede
Resending this mail again, somehow my previous mail didn't reached every 
to everyone's inbox.


On 2/2/2016 3:16 AM, Kirti Wankhede wrote:

Design for vGPU Driver:
Main purpose of vGPU driver is to provide a common interface for vGPU
management that can be used by different GPU drivers.

This module would provide a generic interface to create the device, add
it to vGPU bus, add device to IOMMU group and then add it to vfio group.

High Level block diagram:


+--+vgpu_register_driver()+---+
| __init() +->+   |
|  |  |   |
|  +<-+vgpu.ko|
| vfio_vgpu.ko |   probe()/remove()   |   |
|  |+-+   +-+
+--+| +---+---+ |
 | ^ |
 | callback| |
 | +---++|
 | |vgpu_register_device()   |
 | |||
 +---^-+-++-+--+-+
 | nvidia.ko ||  i915.ko   |
 |   |||
 +---+++

vGPU driver provides two types of registration interfaces:
1. Registration interface for vGPU bus driver:

/**
  * struct vgpu_driver - vGPU device driver
  * @name: driver name
  * @probe: called when new device created
  * @remove: called when device removed
  * @driver: device driver structure
  *
  **/
struct vgpu_driver {
 const char *name;
 int  (*probe)  (struct device *dev);
 void (*remove) (struct device *dev);
 struct device_driverdriver;
};

int  vgpu_register_driver(struct vgpu_driver *drv, struct module *owner);
void vgpu_unregister_driver(struct vgpu_driver *drv);

VFIO bus driver for vgpu, should use this interface to register with
vGPU driver. With this, VFIO bus driver for vGPU devices is responsible
to add vGPU device to VFIO group.

2. GPU driver interface

/**
  * struct gpu_device_ops - Structure to be registered for each physical
GPU to
  * register the device to vgpu module.
  *
  * @owner:  The module owner.
  * @vgpu_supported_config: Called to get information about supported
  *   vgpu types.
  *  @dev : pci device structure of physical GPU.
  *  @config: should return string listing supported
  *  config
  *  Returns integer: success (0) or error (< 0)
  * @vgpu_create: Called to allocate basic resources in graphics
  *  driver for a particular vgpu.
  *  @dev: physical pci device structure on which vgpu
  *should be created
  *  @uuid: uuid for which VM it is intended to
  *  @instance: vgpu instance in that VM
  *  @vgpu_id: This represents the type of vgpu to be
  *created
  *  Returns integer: success (0) or error (< 0)
  * @vgpu_destroy:   Called to free resources in graphics driver for
  *  a vgpu instance of that VM.
  *  @dev: physical pci device structure to which
  *  this vgpu points to.
  *  @uuid: uuid for which the vgpu belongs to.
  *  @instance: vgpu instance in that VM
  *  Returns integer: success (0) or error (< 0)
  *  If VM is running and vgpu_destroy is called that
  *  means the vGPU is being hot-unplugged. Return an error
  *  if VM is running and graphics driver doesn't
  *  support vgpu hotplug.
  * @vgpu_start: Called to initiate the vGPU initialization
  *  process in graphics driver when VM boots before
  *  qemu starts.
  *  @uuid: UUID which is booting.
  *  Returns integer: success (0) or error (< 0)
  * @vgpu_shutdown:  Called to teardown vGPU related resources for
  *  the VM
  *  @uuid: UUID which is shutting down .
  *  Returns integer: success (0) or error (< 0)
  * @read:   Read emulation callback
  *  @vdev: vgpu device structure
  *  @buf: read buffer
  *  @count: number bytes to read
  *  @address_space: specifies fo

Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-27 Thread Kirti Wankhede



On 1/28/2016 3:28 AM, Alex Williamson wrote:

On Thu, 2016-01-28 at 02:25 +0530, Kirti Wankhede wrote:


On 1/27/2016 9:30 PM, Alex Williamson wrote:

On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:


On 1/27/2016 1:36 AM, Alex Williamson wrote:

On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:

On Mon, Jan 25, 2016 at 09:45:14PM +, Tian, Kevin wrote:

From: Alex Williamson [mailto:alex.william...@redhat.com]


Hi Alex, Kevin and Jike,

(Seems I shouldn't use attachment, resend it again to the list, patches are
inline at the end)

Thanks for adding me to this technical discussion, a great opportunity
for us to design together which can bring both Intel and NVIDIA vGPU solution to
KVM platform.

Instead of directly jumping to the proposal that we have been working on
recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
quick comments / thoughts regarding the existing discussions on this thread as
fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.

Then we can look at what we have, hopefully we can reach some consensus soon.


Yes, and since you're creating and destroying the vgpu here, this is
where I'd expect a struct device to be created and added to an IOMMU
group.  The lifecycle management should really include links between
the vGPU and physical GPU, which would be much, much easier to do with
struct devices create here rather than at the point where we start
doing vfio "stuff".


In fact, to keep vfio-vgpu more generic, vgpu device creation and management
can be centralized and done in vfio-vgpu. That also includes adding to the IOMMU
group and VFIO group.

Is this really a good idea?  The concept of a vgpu is not unique to
vfio, we want vfio to be a driver for a vgpu, not an integral part of
the lifecycle of a vgpu.  That certainly doesn't exclude adding
infrastructure to make lifecycle management of a vgpu more consistent
between drivers, but it should be done independently of vfio.  I'll go
back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
does not create the VF, that's done in coordination with the PF making
use of some PCI infrastructure for consistency between drivers.

It seems like we need to take more advantage of the class and driver
core support to perhaps setup a vgpu bus and class with vfio-vgpu just
being a driver for those devices.


For device passthrough or SR-IOV model, PCI devices are created by PCI
bus driver and from the probe routine each device is added in vfio group.


An SR-IOV VF is created by the PF driver using standard interfaces
provided by the PCI core.  The IOMMU group for a VF is added by the
IOMMU driver when the device is created on the pci_bus_type.  The probe
routine of the vfio bus driver (vfio-pci) is what adds the device into
the vfio group.


For vgpu, there should be a common module that creates the vgpu device, say the
vgpu module, adds the vgpu device to an IOMMU group and then adds it to the vfio
group.  This module can handle management of vgpus. The advantage of keeping
this module separate, rather than doing device creation in vendor
modules, is to have a generic interface for vgpu management, for example,
files /sys/class/vgpu/vgpu_start and /sys/class/vgpu/vgpu_shutdown and
a vgpu driver registration interface.


But you're suggesting something very different from the SR-IOV model.
If we wanted to mimic that model, the GPU specific driver should create
the vgpu using services provided by a common interface.  For instance
i915 could call a new vgpu_device_create() which creates the device,
adds it to the vgpu class, etc.  That vgpu device should not be assumed
to be used with vfio though, that should happen via a separate probe
using a vfio-vgpu driver.  It's that vfio bus driver that will add the
device to a vfio group.



In that case vgpu driver should provide a driver registration interface
to register vfio-vgpu driver.

struct vgpu_driver {
const char *name;
int (*probe) (struct vgpu_device *vdev);
void (*remove) (struct vgpu_device *vdev);
}

int vgpu_register_driver(struct vgpu_driver *driver)
{
...
}
EXPORT_SYMBOL(vgpu_register_driver);

int vgpu_unregister_driver(struct vgpu_driver *driver)
{
...
}
EXPORT_SYMBOL(vgpu_unregister_driver);

The vfio-vgpu driver registers with the vgpu driver. Then, from
vgpu_device_create(), after creating the device the core calls
vgpu_driver->probe(vgpu_device), and the vfio-vgpu driver adds the device to
the vfio group, as sketched below.
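
A sketch of that flow (error handling trimmed; vgpu_alloc_and_register() and
vgpu_unregister_and_free() are placeholders for the core's internal steps):

static struct vgpu_driver *vgpu_bus_driver; /* set by vgpu_register_driver() */

int vgpu_device_create(struct pci_dev *gpu_dev, uuid_le uuid, uint32_t instance)
{
        struct vgpu_device *vgpu_dev;
        int ret;

        vgpu_dev = vgpu_alloc_and_register(gpu_dev, uuid, instance);
        if (IS_ERR(vgpu_dev))
                return PTR_ERR(vgpu_dev);

        if (vgpu_bus_driver && vgpu_bus_driver->probe) {
                /* vfio-vgpu probe() adds the device to a VFIO group */
                ret = vgpu_bus_driver->probe(vgpu_dev);
                if (ret) {
                        vgpu_unregister_and_free(vgpu_dev);
                        return ret;
                }
        }
        return 0;
}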

+--+vgpu_register_driver()+---+

 __init() +->+   |
  |  |   |
  +<-+vgpu.ko|
vfio_vgpu.ko

Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-27 Thread Kirti Wankhede



On 1/27/2016 9:30 PM, Alex Williamson wrote:

On Wed, 2016-01-27 at 13:36 +0530, Kirti Wankhede wrote:


On 1/27/2016 1:36 AM, Alex Williamson wrote:

On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:

On Mon, Jan 25, 2016 at 09:45:14PM +, Tian, Kevin wrote:

From: Alex Williamson [mailto:alex.william...@redhat.com]


Hi Alex, Kevin and Jike,

(Seems I shouldn't use attachment, resend it again to the list, patches are
inline at the end)

Thanks for adding me to this technical discussion, a great opportunity
for us to design together which can bring both Intel and NVIDIA vGPU solution to
KVM platform.

Instead of directly jumping to the proposal that we have been working on
recently for NVIDIA vGPU on KVM, I think it is better for me to put out couple
quick comments / thoughts regarding the existing discussions on this thread as
fundamentally I think we are solving the same problem, DMA, interrupt and MMIO.

Then we can look at what we have, hopefully we can reach some consensus soon.


Yes, and since you're creating and destroying the vgpu here, this is
where I'd expect a struct device to be created and added to an IOMMU
group.  The lifecycle management should really include links between
the vGPU and physical GPU, which would be much, much easier to do with
struct devices create here rather than at the point where we start
doing vfio "stuff".


In fact, to keep vfio-vgpu more generic, vgpu device creation and management
can be centralized and done in vfio-vgpu. That also includes adding to the IOMMU
group and VFIO group.

Is this really a good idea?  The concept of a vgpu is not unique to
vfio, we want vfio to be a driver for a vgpu, not an integral part of
the lifecycle of a vgpu.  That certainly doesn't exclude adding
infrastructure to make lifecycle management of a vgpu more consistent
between drivers, but it should be done independently of vfio.  I'll go
back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
does not create the VF, that's done in coordination with the PF making
use of some PCI infrastructure for consistency between drivers.

It seems like we need to take more advantage of the class and driver
core support to perhaps set up a vgpu bus and class, with vfio-vgpu
just being a driver for those devices.


For the device passthrough or SR-IOV model, PCI devices are created by
the PCI bus driver, and from the probe routine each device is added to
the vfio group.


An SR-IOV VF is created by the PF driver using standard interfaces
provided by the PCI core.  The IOMMU group for a VF is added by the
IOMMU driver when the device is created on the pci_bus_type.  The probe
routine of the vfio bus driver (vfio-pci) is what adds the device into
the vfio group.
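For reference, a minimal sketch of the PF-driver side of that flow; the
my_pf_* names are placeholders, not from any driver discussed here, and
the callback is the standard sriov_configure hook of struct pci_driver:

#include <linux/module.h>
#include <linux/pci.h>

/*
 * Sketch: the PCI core calls this when user space writes to
 * /sys/bus/pci/devices/<PF>/sriov_numvfs.  The VF pci_devs are created
 * by the PCI core, the IOMMU driver then places each VF in its own
 * IOMMU group, and binding vfio-pci to a VF is what finally adds it to
 * a VFIO group.
 */
static int my_pf_sriov_configure(struct pci_dev *pdev, int num_vfs)
{
        int ret;

        if (num_vfs == 0) {
                pci_disable_sriov(pdev);
                return 0;
        }

        ret = pci_enable_sriov(pdev, num_vfs);
        return ret ? ret : num_vfs;     /* VFs enabled on success */
}

static struct pci_driver my_pf_driver = {
        .name            = "my_pf",
        /* .id_table, .probe and .remove omitted from this sketch */
        .sriov_configure = my_pf_sriov_configure,
};
module_pci_driver(my_pf_driver);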


For vgpu, there should be a common module, say a vgpu module, that
creates the vgpu device, adds it to an IOMMU group and then adds it to
a vfio group.  This module can handle management of vgpus.  The
advantage of keeping this a separate module, rather than doing device
creation in the vendor modules, is to have a generic interface for vgpu
management, for example the files /sys/class/vgpu/vgpu_start and
/sys/class/vgpu/vgpu_shutdown and a vgpu driver registration interface.
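Just to illustrate what such a central module's sysfs interface could
look like, a minimal sketch is below; only the vgpu_start/vgpu_shutdown
file names come from the description above, the store callbacks and
their (empty) parsing are assumptions:

#include <linux/device.h>
#include <linux/err.h>
#include <linux/module.h>

static struct class *vgpu_class;

/* /sys/class/vgpu/vgpu_start: user space writes a vgpu identifier here;
 * the central module would forward the request to the vendor driver. */
static ssize_t vgpu_start_store(struct class *class,
                                struct class_attribute *attr,
                                const char *buf, size_t count)
{
        /* parse the vgpu id from buf and start that vgpu ... */
        return count;
}
static CLASS_ATTR_WO(vgpu_start);

static ssize_t vgpu_shutdown_store(struct class *class,
                                   struct class_attribute *attr,
                                   const char *buf, size_t count)
{
        /* parse the vgpu id from buf and shut that vgpu down ... */
        return count;
}
static CLASS_ATTR_WO(vgpu_shutdown);

static int __init vgpu_init(void)
{
        int ret;

        vgpu_class = class_create(THIS_MODULE, "vgpu");
        if (IS_ERR(vgpu_class))
                return PTR_ERR(vgpu_class);

        ret = class_create_file(vgpu_class, &class_attr_vgpu_start);
        if (!ret)
                ret = class_create_file(vgpu_class,
                                        &class_attr_vgpu_shutdown);
        if (ret)
                class_destroy(vgpu_class);
        return ret;
}
module_init(vgpu_init);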


But you're suggesting something very different from the SR-IOV model.
If we wanted to mimic that model, the GPU specific driver should create
the vgpu using services provided by a common interface.  For instance
i915 could call a new vgpu_device_create() which creates the device,
adds it to the vgpu class, etc.  That vgpu device should not be assumed
to be used with vfio, though; that should happen via a separate probe
using a vfio-vgpu driver.  It's that vfio bus driver that will add the
device to a vfio group.
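A minimal sketch of such a vgpu_device_create(), assuming the common
module exports a vgpu class pointer and a struct vgpu_device that
embeds a struct device (both of those are assumptions, not code from
any posted patch):

#include <linux/device.h>
#include <linux/err.h>
#include <linux/slab.h>

extern struct class *vgpu_class;        /* owned by the common vgpu module */

struct vgpu_device {
        struct device dev;
        /* vGPU type, link to the physical GPU, vendor private data, ... */
};

static void vgpu_device_release(struct device *dev)
{
        kfree(container_of(dev, struct vgpu_device, dev));
}

/*
 * Called by the GPU-specific driver (e.g. i915) to create a vgpu.  This
 * only creates the device and adds it to the vgpu class; binding to
 * vfio happens later through a separate vfio-vgpu probe.
 */
struct vgpu_device *vgpu_device_create(struct device *parent,
                                       const char *name)
{
        struct vgpu_device *vgpu;
        int ret;

        vgpu = kzalloc(sizeof(*vgpu), GFP_KERNEL);
        if (!vgpu)
                return ERR_PTR(-ENOMEM);

        vgpu->dev.parent  = parent;             /* the physical GPU */
        vgpu->dev.class   = vgpu_class;
        vgpu->dev.release = vgpu_device_release;
        dev_set_name(&vgpu->dev, "%s", name);

        ret = device_register(&vgpu->dev);
        if (ret) {
                put_device(&vgpu->dev);         /* release() frees vgpu */
                return ERR_PTR(ret);
        }

        return vgpu;
}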



In that case the vgpu driver should provide a driver registration
interface to register the vfio-vgpu driver.


struct vgpu_driver {
        const char *name;
        int  (*probe)(struct vgpu_device *vdev);
        void (*remove)(struct vgpu_device *vdev);
};

int vgpu_register_driver(struct vgpu_driver *driver)
{
        ...
}
EXPORT_SYMBOL(vgpu_register_driver);

int vgpu_unregister_driver(struct vgpu_driver *driver)
{
        ...
}
EXPORT_SYMBOL(vgpu_unregister_driver);
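Purely for illustration, one way vgpu.ko could back those stubs is to
track a single registered vfio bus driver; this is an assumption about
the implementation, not part of the posted patches:

#include <linux/errno.h>
#include <linux/module.h>
#include <linux/mutex.h>

/* struct vgpu_driver as sketched above */
static struct vgpu_driver *vgpu_vfio_driver;    /* the one registered driver */
static DEFINE_MUTEX(vgpu_driver_lock);

int vgpu_register_driver(struct vgpu_driver *driver)
{
        int ret = 0;

        if (!driver || !driver->probe || !driver->remove)
                return -EINVAL;

        mutex_lock(&vgpu_driver_lock);
        if (vgpu_vfio_driver)
                ret = -EBUSY;           /* only one vfio bus driver at a time */
        else
                vgpu_vfio_driver = driver;
        mutex_unlock(&vgpu_driver_lock);

        return ret;
}
EXPORT_SYMBOL(vgpu_register_driver);

int vgpu_unregister_driver(struct vgpu_driver *driver)
{
        int ret = 0;

        mutex_lock(&vgpu_driver_lock);
        if (vgpu_vfio_driver != driver)
                ret = -EINVAL;
        else
                vgpu_vfio_driver = NULL;
        mutex_unlock(&vgpu_driver_lock);

        return ret;
}
EXPORT_SYMBOL(vgpu_unregister_driver);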

The vfio-vgpu driver registers with the vgpu driver. Then from
vgpu_device_create(), after creating the device, it calls
vgpu_driver->probe(vgpu_device) and the vfio-vgpu driver adds the
device to the vfio group.
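A sketch of that vfio-vgpu side, under the same assumptions as above
(struct vgpu_device embedding a struct device); vfio_vgpu_dev_ops
stands in for the vfio_device_ops (read/write/ioctl/mmap)
implementation, which is omitted here:

#include <linux/module.h>
#include <linux/vfio.h>

extern const struct vfio_device_ops vfio_vgpu_dev_ops;

static int vfio_vgpu_probe(struct vgpu_device *vdev)
{
        /* Adds the vgpu device to a VFIO group; its IOMMU group must
         * already exist at this point. */
        return vfio_add_group_dev(&vdev->dev, &vfio_vgpu_dev_ops, vdev);
}

static void vfio_vgpu_remove(struct vgpu_device *vdev)
{
        vfio_del_group_dev(&vdev->dev);
}

static struct vgpu_driver vfio_vgpu_driver = {
        .name   = "vfio-vgpu",
        .probe  = vfio_vgpu_probe,
        .remove = vfio_vgpu_remove,
};

static int __init vfio_vgpu_init(void)
{
        return vgpu_register_driver(&vfio_vgpu_driver);
}

static void __exit vfio_vgpu_exit(void)
{
        vgpu_unregister_driver(&vfio_vgpu_driver);
}

module_init(vfio_vgpu_init);
module_exit(vfio_vgpu_exit);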


+---------------+   vgpu_register_driver()   +-----------+
|   __init()    +--------------------------->+           |
|               |                            |           |
|               +<---------------------------+  vgpu.ko  |
| vfio_vgpu.ko  |      probe()/remove()      |           |
+---------------+                            +-----------+

Re: [Qemu-devel] VFIO based vGPU(was Re: [Announcement] 2015-Q3 release of XenGT - a Mediated ...)

2016-01-27 Thread Kirti Wankhede



On 1/27/2016 1:36 AM, Alex Williamson wrote:

On Tue, 2016-01-26 at 02:20 -0800, Neo Jia wrote:

On Mon, Jan 25, 2016 at 09:45:14PM +, Tian, Kevin wrote:

From: Alex Williamson [mailto:alex.william...@redhat.com]
  
Hi Alex, Kevin and Jike,
  
(Seems I shouldn't use attachment, resend it again to the list, patches are
inline at the end)
  
Thanks for adding me to this technical discussion, a great opportunity
for us to design together which can bring both the Intel and NVIDIA
vGPU solutions to the KVM platform.
  
Instead of directly jumping to the proposal that we have been working
on recently for NVIDIA vGPU on KVM, I think it is better for me to put
out a couple of quick comments / thoughts regarding the existing
discussions on this thread, as fundamentally I think we are solving the
same problems: DMA, interrupts and MMIO.

Then we can look at what we have; hopefully we can reach some consensus
soon.
  

Yes, and since you're creating and destroying the vgpu here, this is
where I'd expect a struct device to be created and added to an IOMMU
group.  The lifecycle management should really include links between
the vGPU and physical GPU, which would be much, much easier to do with
struct devices created here rather than at the point where we start
doing vfio "stuff".
  
In fact, to keep vfio-vgpu more generic, vgpu device creation and
management can be centralized and done in vfio-vgpu. That also includes
adding to the IOMMU group and the VFIO group.

Is this really a good idea?  The concept of a vgpu is not unique to
vfio, we want vfio to be a driver for a vgpu, not an integral part of
the lifecycle of a vgpu.  That certainly doesn't exclude adding
infrastructure to make lifecycle management of a vgpu more consistent
between drivers, but it should be done independently of vfio.  I'll go
back to the SR-IOV model, vfio is often used with SR-IOV VFs, but vfio
does not create the VF, that's done in coordination with the PF making
use of some PCI infrastructure for consistency between drivers.

It seems like we need to take more advantage of the class and driver
core support to perhaps set up a vgpu bus and class, with vfio-vgpu
just being a driver for those devices.
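A bare-bones sketch of such a vgpu bus is below; everything here is an
illustration of the idea, not code from any posted patch:

#include <linux/device.h>
#include <linux/module.h>

/*
 * With a real bus, the driver core does the matching and calls the bus
 * probe()/remove() hooks, so vfio-vgpu becomes just another driver
 * that binds to vgpu devices.
 */
static int vgpu_bus_match(struct device *dev, struct device_driver *drv)
{
        return 1;       /* sketch: any vgpu driver can drive any vgpu device */
}

struct bus_type vgpu_bus_type = {
        .name  = "vgpu",
        .match = vgpu_bus_match,
};
EXPORT_SYMBOL_GPL(vgpu_bus_type);

static int __init vgpu_bus_init(void)
{
        return bus_register(&vgpu_bus_type);
}

static void __exit vgpu_bus_exit(void)
{
        bus_unregister(&vgpu_bus_type);
}

module_init(vgpu_bus_init);
module_exit(vgpu_bus_exit);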


For the device passthrough or SR-IOV model, PCI devices are created by
the PCI bus driver, and from the probe routine each device is added to
the vfio group.


For vgpu, there should be a common module, say a vgpu module, that
creates the vgpu device, adds it to an IOMMU group and then adds it to
a vfio group.  This module can handle management of vgpus.  The
advantage of keeping this a separate module, rather than doing device
creation in the vendor modules, is to have a generic interface for vgpu
management, for example the files /sys/class/vgpu/vgpu_start and
/sys/class/vgpu/vgpu_shutdown and a vgpu driver registration interface.


In the patch, vgpu_dev.c + vgpu_sysfs.c form such a vgpu module, and
vgpu_vfio.c is for the VFIO interface. Each vgpu device should be added
to a vfio group, so vgpu_group_init() from vgpu_vfio.c should be called
per device. In the vgpu module, vgpu devices are created on request, so
vgpu_group_init() should be called explicitly for each vgpu device.
That's why I had merged the two modules, vgpu + vgpu_vfio, to form one
vgpu module.  vgpu_vfio would remain a separate entity but be merged
into the vgpu module.



Thanks,
Kirti






