Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-06-07 Thread Dong Jia
On Tue, 7 Jun 2016 02:47:10 +
"Tian, Kevin"  wrote:

> > From: Dong Jia
> > Sent: Monday, June 06, 2016 2:59 PM
> > 
> 
> [...]
> 
> > Channel I/O is quite different to PCI, so I copied some more details
> > here. Hope these could help.
> > 
> > Channel subsystem:
> > The channel subsystem directs the flow of information between I/O devices
> > and main storage. It relieves CPUs of the task of communicating directly
> > with I/O devices and permits data processing to proceed concurrently with
> > I/O processing.
> > 
> > Channel path:
> > The channel subsystem uses one or more channel paths as the communication
> > link in managing the flow of information to or from I/O devices.
> > 
> > Subchannel:
> > Within the channel subsystem are subchannels. One subchannel of type I/O
> > is provided for and dedicated to each I/O device accessible to the channel
> > 
> > subsystem.
> > 
> > Control unit:
> > A control unit implements a standardized interface which translates between
> > the Channel Subsystem and the actual device. It does this by adapting the
> > characteristics of each device so that it can respond to the standard form
> > of control provided by the channel subsystem.
> > 
> > Channel Path:
> > The channel subsystem communicates with the I/O devices by means of
> > channel paths between the channel subsystem and control units.
> > 
> > +---+
> > | channel subsystem |
> > +---+
> > |   |
> > |   +--+|  +--+++
> > |   |subchannel|| channel path | control unit || I/O device |
> > |   +---+
> > |   | subchno  ||  |  ||devno   |
> > |   +--+|  +--+++
> > |   |
> > +---+
> > 
> > There is no concept of ccw-device registers by the subchannel. Control
> > unit will interact with the device, collect the I/O result, and inform
> > the result to the subchannel.
> > So it seems to me, there is no needs to provide region info for
> > isolation. As mentioned above, the isolation is quite natural.
> > 
> > Please correct me in case I misunderstood some of the concepts in your
> > questions and gave irrelevant answers. :>
> > 
> 
> Thanks for above background which is very useful. Several follow-up Qs:
> 
> 1) Does it mean that VFIO is managing resource in subchannel level, so Qemu
> can only operate subchannels assigned to itself
Dear Kevin,

This understanding is basically right, but not that exactly.

Linux creates a 'struct ccw_device' for each device it has detected, and
a 'struct subchannel' for the corresponding subchannel. When we issue
a command to a device instance, the device can find the subchannel and
pass the command to it.

The current vfio-ccw implementation targets a device passthru. So I'd say
that VFIO is managing resource in both the device and the
subchannel level.

However, there is a discussion inside our team that tries to figure out
if a subchannel passthru would be better. So, in the future, it's
possible to do the management in a subchannel level only.

> (then emulate as the complete channel io sub-system to guest)?
This is right.

> 
> 2) How are ccw commands associated with a subchannel?
The I/O instruction requires a subchannel id and an ORB
(Operation-Request Block, which contains the execution parameters,
including the address of the ccws). So when an I/O instruction was
intercepted, we know which is the target subchannel. And using this
target information, we can find the real physical device and the
subchannel to perform the instruction.

> Are they submitted
> through a dedicated subchannel interface (so VFIO can easily map that 
> interface)
We can understand it this way.

> or that subchannel is specified by a special ccw cmd (means VFIO-ccw needs
> to scan cmds to avoid malicious attempts on non-assigned subchannels)?
No. CCWs themselves don't contain subchannel information.

> 
> Thanks
> Kevin
> 


Dong Jia




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-06-06 Thread Tian, Kevin
> From: Dong Jia
> Sent: Monday, June 06, 2016 2:59 PM
> 

[...]

> Channel I/O is quite different to PCI, so I copied some more details
> here. Hope these could help.
> 
> Channel subsystem:
> The channel subsystem directs the flow of information between I/O devices
> and main storage. It relieves CPUs of the task of communicating directly
> with I/O devices and permits data processing to proceed concurrently with
> I/O processing.
> 
> Channel path:
> The channel subsystem uses one or more channel paths as the communication
> link in managing the flow of information to or from I/O devices.
> 
> Subchannel:
> Within the channel subsystem are subchannels. One subchannel of type I/O
> is provided for and dedicated to each I/O device accessible to the channel
> 
> subsystem.
> 
> Control unit:
> A control unit implements a standardized interface which translates between
> the Channel Subsystem and the actual device. It does this by adapting the
> characteristics of each device so that it can respond to the standard form
> of control provided by the channel subsystem.
> 
> Channel Path:
> The channel subsystem communicates with the I/O devices by means of
> channel paths between the channel subsystem and control units.
> 
> +---+
> | channel subsystem |
> +---+
> |   |
> |   +--+|  +--+++
> |   |subchannel|| channel path | control unit || I/O device |
> |   +---+
> |   | subchno  ||  |  ||devno   |
> |   +--+|  +--+++
> |   |
> +---+
> 
> There is no concept of ccw-device registers by the subchannel. Control
> unit will interact with the device, collect the I/O result, and inform
> the result to the subchannel.
> So it seems to me, there is no needs to provide region info for
> isolation. As mentioned above, the isolation is quite natural.
> 
> Please correct me in case I misunderstood some of the concepts in your
> questions and gave irrelevant answers. :>
> 

Thanks for above background which is very useful. Several follow-up Qs:

1) Does it mean that VFIO is managing resource in subchannel level, so Qemu
can only operate subchannels assigned to itself (then emulate as the complete
channel io sub-system to guest)?

2) How are ccw commands associated with a subchannel? Are they submitted
through a dedicated subchannel interface (so VFIO can easily map that interface)
or that subchannel is specified by a special ccw cmd (means VFIO-ccw needs
to scan cmds to avoid malicious attempts on non-assigned subchannels)?

Thanks
Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-06-06 Thread Dong Jia
On Fri, 20 May 2016 03:21:31 +
"Tian, Kevin"  wrote:

> > From: Dong Jia [mailto:bjsdj...@linux.vnet.ibm.com]
> > Sent: Thursday, May 19, 2016 3:28 PM
> > 
> > On Fri, 13 May 2016 02:05:01 -0700
> > Neo Jia  wrote:
> > 
> > ...snip...
> > 
> > >
> > > Hi Dong,
> > >
> > > We should definitely be mindful about the data structure performance 
> > > especially
> > > dealing with kernel. But for now, we haven't done any performance 
> > > analysis yet
> > > for the current rbtree implementation, later we will definitely run it 
> > > through
> > > large guest RAM configuration and multiple virtual devices cases, etc. to
> > > collect data.
> > >
> > > Regarding your use case, may I ask if there will be concurrent command 
> > > streams
> > > running for the same VM?
> > Hi Neo:
> > 
> > Sorry for the late response. Spent some time to make the confirmation.
> > 
> > For our case, one iommu group will add one (and only one) ccw-device.
> > For one ccw-device, there will be no concurrent command streams from it.
> > 
> 
> Hi, Dong,
> 
> Looks there can be multiple devices behind one channel, according to:
> https://en.wikipedia.org/wiki/Channel_I/O
Dear Kevin:

One subchannel (the co-processor to offload the I/O operations) could
be assigned to one device at a time. See below.

> 
> Do they need to be assigned together as one iommu group?
So, 'N/A' to this question.

> If not, how is
> the isolation being done in your implementation? Based on cmd scanning in 
> Qemu-side?
It's a 'one device'-'one subchannel'-'one iommu group' relation then.
The isolation looks quite natural.

> 
> Another curious question about channel io itself. I'm unclear whether the 
> channel here only fulfills the role of DMA controller (i.e. controlling how
> device access memory), or also offloads CPU accesses to the registers
> on the ccw-device. Are ccw-device registers directly addressable by CPU
> on s390, similar to MMIO concept on x86? If yes, I guess you also need
> provide region info in vfio-ccw to control which I/O resource can be accessed
> by user space (looks not there in your vfio-ccw patch series). If not, how 
> do you control the isolation in that aspect? :-)
Channel I/O is quite different to PCI, so I copied some more details
here. Hope these could help.

Channel subsystem:
The channel subsystem directs the flow of information between I/O devices
and main storage. It relieves CPUs of the task of communicating directly
with I/O devices and permits data processing to proceed concurrently with
I/O processing.

Channel path:
The channel subsystem uses one or more channel paths as the communication
link in managing the flow of information to or from I/O devices.

Subchannel:
Within the channel subsystem are subchannels. One subchannel of type I/O
is provided for and dedicated to each I/O device accessible to the channel

subsystem.

Control unit:
A control unit implements a standardized interface which translates between
the Channel Subsystem and the actual device. It does this by adapting the
characteristics of each device so that it can respond to the standard form
of control provided by the channel subsystem.

Channel Path:
The channel subsystem communicates with the I/O devices by means of
channel paths between the channel subsystem and control units.

+---+
| channel subsystem |
+---+
|   |
|   +--+|  +--+++
|   |subchannel|| channel path | control unit || I/O device |
|   +---+
|   | subchno  ||  |  ||devno   |
|   +--+|  +--+++
|   |
+---+

There is no concept of ccw-device registers by the subchannel. Control
unit will interact with the device, collect the I/O result, and inform
the result to the subchannel.
So it seems to me, there is no needs to provide region info for
isolation. As mentioned above, the isolation is quite natural.

Please correct me in case I misunderstood some of the concepts in your
questions and gave irrelevant answers. :>

> 
> Thanks
> Kevin
> 




Dong Jia




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-19 Thread Tian, Kevin
> From: Dong Jia [mailto:bjsdj...@linux.vnet.ibm.com]
> Sent: Thursday, May 19, 2016 3:28 PM
> 
> On Fri, 13 May 2016 02:05:01 -0700
> Neo Jia  wrote:
> 
> ...snip...
> 
> >
> > Hi Dong,
> >
> > We should definitely be mindful about the data structure performance 
> > especially
> > dealing with kernel. But for now, we haven't done any performance analysis 
> > yet
> > for the current rbtree implementation, later we will definitely run it 
> > through
> > large guest RAM configuration and multiple virtual devices cases, etc. to
> > collect data.
> >
> > Regarding your use case, may I ask if there will be concurrent command 
> > streams
> > running for the same VM?
> Hi Neo:
> 
> Sorry for the late response. Spent some time to make the confirmation.
> 
> For our case, one iommu group will add one (and only one) ccw-device.
> For one ccw-device, there will be no concurrent command streams from it.
> 

Hi, Dong,

Looks there can be multiple devices behind one channel, according to:
https://en.wikipedia.org/wiki/Channel_I/O

Do they need to be assigned together as one iommu group? If not, how is
the isolation being done in your implementation? Based on cmd scanning in 
Qemu-side?

Another curious question about channel io itself. I'm unclear whether the 
channel here only fulfills the role of DMA controller (i.e. controlling how
device access memory), or also offloads CPU accesses to the registers
on the ccw-device. Are ccw-device registers directly addressable by CPU
on s390, similar to MMIO concept on x86? If yes, I guess you also need
provide region info in vfio-ccw to control which I/O resource can be accessed
by user space (looks not there in your vfio-ccw patch series). If not, how 
do you control the isolation in that aspect? :-)

Thanks
Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-19 Thread Dong Jia
On Fri, 13 May 2016 02:05:01 -0700
Neo Jia  wrote:

...snip...

> 
> Hi Dong,
> 
> We should definitely be mindful about the data structure performance 
> especially
> dealing with kernel. But for now, we haven't done any performance analysis yet
> for the current rbtree implementation, later we will definitely run it through
> large guest RAM configuration and multiple virtual devices cases, etc. to
> collect data.
> 
> Regarding your use case, may I ask if there will be concurrent command streams
> running for the same VM?
Hi Neo:

Sorry for the late response. Spent some time to make the confirmation.

For our case, one iommu group will add one (and only one) ccw-device.
For one ccw-device, there will be no concurrent command streams from it.

> If yes, those two transaction requests (if we
> implement) will compete not only the rbtree lock but also the GUP locks.
Since the answer is 'no, I guess we needn't do this. :>

> 
> Also, what is the typical guest RAM we are talking about here for your usecase
> and any rough estimation of the active working set of those DMA pages? 
> 
I'm afraid there is no typical guest RAM for the I/O instructions
issued by the passed-through ccw-device drivers. They can use any
memory chunk allocated by a kmalloc.

The working set depends on how much memory used by the device drivers,
and of course the number of the available memory. Since there is no
restrictions of the memory usage for this case, it varies...

[...]


Dong Jia




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-16 Thread Jike Song
On 05/13/2016 11:50 PM, Neo Jia wrote:
> On Fri, May 13, 2016 at 05:23:44PM +0800, Jike Song wrote:
>> On 05/13/2016 04:31 PM, Neo Jia wrote:
>>> On Fri, May 13, 2016 at 07:45:14AM +, Tian, Kevin wrote:

 We use page tracking framework, which is newly added to KVM recently,
 to mark RAM pages as read-only so write accesses are intercepted to 
 device model.
>>>
>>> Yes, I am aware of that patchset from Guangrong. So far the interface are 
>>> all
>>> requiring struct *kvm, copied from https://lkml.org/lkml/2015/11/30/644
>>>
>>> - kvm_page_track_add_page(): add the page to the tracking pool after
>>>   that later specified access on that page will be tracked
>>>
>>> - kvm_page_track_remove_page(): remove the page from the tracking pool,
>>>   the specified access on the page is not tracked after the last user is
>>>   gone
>>>
>>> void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
>>> enum kvm_page_track_mode mode);
>>> void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
>>>enum kvm_page_track_mode mode);
>>>
>>> Really curious how you are going to have access to the struct kvm *kvm, or 
>>> you
>>> are relying on the userfaultfd to track the write faults only as part of the
>>> QEMU userfault thread?
>>>
>>
>> Hi Neo,
>>
>> For the vGPU used as a device for KVM guest, there will be interfaces
>> wrapped or implemented in KVM layer, as a rival thing diverted from
>> the interfaces for Xen. That is where the KVM related code supposed to be.
> 
> Hi Jike,
> 
> Is this discussed anywhere on the mailing list already? Sorry if I have missed
> such conversation.
>

Hi Neo,

Not exactly, but we can discuss it if necessary :)

Intel vGPU device-model, which is a part of i915 driver, has to be able to
emulate vGPU for *both* XenGT and KVMGT guests. That means there must be
a ridge somewhere, directing to Xen-specific and KVM-specific logic accordingly.


--
Thanks,
Jike



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-15 Thread Jike Song
On 05/13/2016 11:48 PM, Neo Jia wrote:
> On Fri, May 13, 2016 at 05:46:17PM +0800, Jike Song wrote:
>> On 05/13/2016 04:12 AM, Neo Jia wrote:
>>> On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:

 If you're trying to equate the scale of what we need to track vs what
 type1 currently tracks, they're significantly different.  Possible
 things we need to track include the pfn, the iova, and possibly a
 reference count or some sort of pinned page map.  In the pin-all model
 we can assume that every page is pinned on map and unpinned on unmap,
 so a reference count or map is unnecessary.  We can also assume that we
 can always regenerate the pfn with get_user_pages() from the vaddr, so
 we don't need to track that.  
>>>
>>> Hi Alex,
>>>
>>> Thanks for pointing this out, we will not track those in our next rev and
>>> get_user_pages will be used from the vaddr as you suggested to handle the
>>> single VM with both passthru + mediated device case.
>>>
>>
>> Just a gut feeling:
>>
>> Calling GUP every time for a particular vaddr, means locking mm->mmap_sem
>> every time for a particular process. If the VM has dozens of VCPU, which
>> is not rare, the semaphore is likely to be the bottleneck.
> 
> Hi Jike,
> 
> We do need to hold the lock of mm->mmap_sem for the VMM/QEMU process, but I
> don't quite follow the reasoning with "dozens of vcpus", one situation that I
> can think of is that we have other thread competing with the mmap_sem for the
> VMM/QEMU process within KVM kernel such as hva_to_pfn, after a quick search it
> seems only mostly gets used by iotcl "KVM_ASSIGN_PCI_DEVICE".
>

I meant, on guest's writing a gfn to GPU MMU, which could happen on any vcpu,
so vmexit happens and mmap_sem required.  But I'm now realized that it's
also the situation even we store the pfn in rbtree ..

> We will definitely conduct performance analysis with large configuration on
> servers with E5-2697 v4. :-)

My homage :)

--
Thanks,
Jike




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Alex Williamson
On Fri, 13 May 2016 03:55:09 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Friday, May 13, 2016 3:06 AM
> >   
> > > >  
> > >
> > > Based on above thought I'm thinking whether below would work:
> > > (let's use gpa to replace existing iova in type1 driver, while using iova
> > > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > > which matches existing vfio logic)
> > >
> > > - No change to existing vfio_dma structure. VFIO still maintains 
> > > gpa<->vaddr
> > > mapping, in coarse-grained regions;
> > >
> > > - Leverage same page accounting/pinning logic in type1 driver, which
> > > should be enough for 'pin-all' usage;
> > >
> > > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > > and vfio_iommu_map. I'm not sure whether it's easy to fake an
> > > iommu_domain for vGPU so same iommu_map/unmap can be reused.  
> > 
> > This seems troublesome.  Kirti's version used numerous api-only tests
> > to avoid these which made the code difficult to trace.  Clearly one
> > option is to split out the common code so that a new mediated-type1
> > backend skips this, but they thought they could clean it up without
> > this, so we'll see what happens in the next version.
> >   
> > > If not, we may introduce two new map/unmap callbacks provided
> > > specifically by vGPU core driver, as you suggested:
> > >
> > >   * vGPU core driver uses dma_map_page to map specified pfns:
> > >
> > >   o When IOMMU is enabled, we'll get an iova returned different
> > > from pfn;
> > >   o When IOMMU is disabled, returned iova is same as pfn;  
> > 
> > Either way each iova needs to be stored and we have a worst case of one
> > iova per page of guest memory.
> >   
> > >   * Then vGPU core driver just maintains its own gpa<->iova lookup
> > > table (e.g. called vgpu_dma)
> > >
> > >   * Because each vfio_iommu_map invocation is about a contiguous
> > > region, we can expect same number of vgpu_dma entries as maintained
> > > for vfio_dma list;
> > >
> > > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > > lookup for vendor specific GPU driver. And we don't need worry about
> > > tens of thousands of entries. Once we get this simple 'pin-all' model
> > > ready, then it can be further extended to support 'pin-sparse'
> > > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > > further link its own sparse mapping structure. In reality I don't expect
> > > we really need to maintain per-page translation even with sparse pinning. 
> > >  
> > 
> > If you're trying to equate the scale of what we need to track vs what
> > type1 currently tracks, they're significantly different.  Possible
> > things we need to track include the pfn, the iova, and possibly a
> > reference count or some sort of pinned page map.  In the pin-all model
> > we can assume that every page is pinned on map and unpinned on unmap,
> > so a reference count or map is unnecessary.  We can also assume that we
> > can always regenerate the pfn with get_user_pages() from the vaddr, so
> > we don't need to track that.  I don't see any way around tracking the
> > iova.  The iommu can't tell us this like it can with the normal type1
> > model because the pfn is the result of the translation, not the key for
> > the translation. So we're always going to have between 1 and
> > (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> > manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> > data structure tracking every iova.  
> 
> There is one option. We may use alloc_iova to reserve continuous iova
> range for each vgpu_dma range and then use iommu_map/unmap to
> write iommu ptes later upon map request (then could be same #entries
> as vfio_dma compared to unbounded entries when using dma_map_page). 
> Of course this needs to be done in vGPU core driver, since vfio type1 only 
> sees a faked iommu domain.

I'm not sure this is really how iova domains work.  There's only one
iova domain per iommu domain using the dma-iommu API, and
iommu_map/unmap are part of a different API.  iova domain may be an
interesting solution though.
 
> > Sparse mapping has the same issue but of course the tree of iovas is
> > potentially incomplete and we need a way to determine where it's
> > incomplete.  A page table rooted in the vgpu_dma and indexed by the
> > offset from the start vaddr seems like the way to go here.  It's also
> > possible that some mediated device models might store the iova in the
> > command sent to the device and therefore be able to parse those entries
> > back out to unmap them without storing them separately.  This might be
> > how the s390 channel-io model would prefer to work.  That seems like
> > further validation that such tracking is going to be dependent on the
> > mediated driver itself and probably not something to centralize in a
> > 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 05:23:44PM +0800, Jike Song wrote:
> On 05/13/2016 04:31 PM, Neo Jia wrote:
> > On Fri, May 13, 2016 at 07:45:14AM +, Tian, Kevin wrote:
> >>
> >> We use page tracking framework, which is newly added to KVM recently,
> >> to mark RAM pages as read-only so write accesses are intercepted to 
> >> device model.
> > 
> > Yes, I am aware of that patchset from Guangrong. So far the interface are 
> > all
> > requiring struct *kvm, copied from https://lkml.org/lkml/2015/11/30/644
> > 
> > - kvm_page_track_add_page(): add the page to the tracking pool after
> >   that later specified access on that page will be tracked
> > 
> > - kvm_page_track_remove_page(): remove the page from the tracking pool,
> >   the specified access on the page is not tracked after the last user is
> >   gone
> > 
> > void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
> > enum kvm_page_track_mode mode);
> > void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
> >enum kvm_page_track_mode mode);
> > 
> > Really curious how you are going to have access to the struct kvm *kvm, or 
> > you
> > are relying on the userfaultfd to track the write faults only as part of the
> > QEMU userfault thread?
> >
> 
> Hi Neo,
> 
> For the vGPU used as a device for KVM guest, there will be interfaces
> wrapped or implemented in KVM layer, as a rival thing diverted from
> the interfaces for Xen. That is where the KVM related code supposed to be.

Hi Jike,

Is this discussed anywhere on the mailing list already? Sorry if I have missed
such conversation.

Thanks,
Neo

> 
> --
> Thanks,
> Jike



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 05:46:17PM +0800, Jike Song wrote:
> On 05/13/2016 04:12 AM, Neo Jia wrote:
> > On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
> >>
> >> If you're trying to equate the scale of what we need to track vs what
> >> type1 currently tracks, they're significantly different.  Possible
> >> things we need to track include the pfn, the iova, and possibly a
> >> reference count or some sort of pinned page map.  In the pin-all model
> >> we can assume that every page is pinned on map and unpinned on unmap,
> >> so a reference count or map is unnecessary.  We can also assume that we
> >> can always regenerate the pfn with get_user_pages() from the vaddr, so
> >> we don't need to track that.  
> > 
> > Hi Alex,
> > 
> > Thanks for pointing this out, we will not track those in our next rev and
> > get_user_pages will be used from the vaddr as you suggested to handle the
> > single VM with both passthru + mediated device case.
> >
> 
> Just a gut feeling:
> 
> Calling GUP every time for a particular vaddr, means locking mm->mmap_sem
> every time for a particular process. If the VM has dozens of VCPU, which
> is not rare, the semaphore is likely to be the bottleneck.

Hi Jike,

We do need to hold the lock of mm->mmap_sem for the VMM/QEMU process, but I
don't quite follow the reasoning with "dozens of vcpus", one situation that I
can think of is that we have other thread competing with the mmap_sem for the
VMM/QEMU process within KVM kernel such as hva_to_pfn, after a quick search it
seems only mostly gets used by iotcl "KVM_ASSIGN_PCI_DEVICE".

We will definitely conduct performance analysis with large configuration on
servers with E5-2697 v4. :-)

Thanks,
Neo

> 
> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Jike Song
On 05/13/2016 04:12 AM, Neo Jia wrote:
> On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
>>
>> If you're trying to equate the scale of what we need to track vs what
>> type1 currently tracks, they're significantly different.  Possible
>> things we need to track include the pfn, the iova, and possibly a
>> reference count or some sort of pinned page map.  In the pin-all model
>> we can assume that every page is pinned on map and unpinned on unmap,
>> so a reference count or map is unnecessary.  We can also assume that we
>> can always regenerate the pfn with get_user_pages() from the vaddr, so
>> we don't need to track that.  
> 
> Hi Alex,
> 
> Thanks for pointing this out, we will not track those in our next rev and
> get_user_pages will be used from the vaddr as you suggested to handle the
> single VM with both passthru + mediated device case.
>

Just a gut feeling:

Calling GUP every time for a particular vaddr, means locking mm->mmap_sem
every time for a particular process. If the VM has dozens of VCPU, which
is not rare, the semaphore is likely to be the bottleneck.


--
Thanks,
Jike




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Jike Song
On 05/13/2016 04:31 PM, Neo Jia wrote:
> On Fri, May 13, 2016 at 07:45:14AM +, Tian, Kevin wrote:
>>
>> We use page tracking framework, which is newly added to KVM recently,
>> to mark RAM pages as read-only so write accesses are intercepted to 
>> device model.
> 
> Yes, I am aware of that patchset from Guangrong. So far the interface are all
> requiring struct *kvm, copied from https://lkml.org/lkml/2015/11/30/644
> 
> - kvm_page_track_add_page(): add the page to the tracking pool after
>   that later specified access on that page will be tracked
> 
> - kvm_page_track_remove_page(): remove the page from the tracking pool,
>   the specified access on the page is not tracked after the last user is
>   gone
> 
> void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
> enum kvm_page_track_mode mode);
> void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
>enum kvm_page_track_mode mode);
> 
> Really curious how you are going to have access to the struct kvm *kvm, or you
> are relying on the userfaultfd to track the write faults only as part of the
> QEMU userfault thread?
>

Hi Neo,

For the vGPU used as a device for KVM guest, there will be interfaces
wrapped or implemented in KVM layer, as a rival thing diverted from
the interfaces for Xen. That is where the KVM related code supposed to be.

--
Thanks,
Jike



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 04:39:37PM +0800, Dong Jia wrote:
> On Fri, 13 May 2016 00:24:34 -0700
> Neo Jia  wrote:
> 
> > On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> > > On Thu, 12 May 2016 13:05:52 -0600
> > > Alex Williamson  wrote:
> > > 
> > > > On Thu, 12 May 2016 08:00:36 +
> > > > "Tian, Kevin"  wrote:
> > > > 
> > > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > > > 
> > > > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > > > Jike Song  wrote:
> > > > > >   
> > > > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > > > >  From: Song, Jike
> > > > > > > 
> > > > > > >  IIUC, an api-only domain is a VFIO domain *without* 
> > > > > > >  underlying IOMMU
> > > > > > >  hardware. It just, as you said in another mail, "rather than
> > > > > > >  programming them into an IOMMU for a device, it simply 
> > > > > > >  stores the
> > > > > > >  translations for use by later requests".
> > > > > > > 
> > > > > > >  That imposes a constraint on gfx driver: hardware IOMMU must 
> > > > > > >  be disabled.
> > > > > > >  Otherwise, if IOMMU is present, the gfx driver eventually 
> > > > > > >  programs
> > > > > > >  the hardware IOMMU with IOVA returned by pci_map_page or 
> > > > > > >  dma_map_page;
> > > > > > >  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> 
> > > > > > >  HPA
> > > > > > >  translations without any knowledge about hardware IOMMU, how 
> > > > > > >  is the
> > > > > > >  device model supposed to do to get an IOVA for a given GPA 
> > > > > > >  (thereby HPA
> > > > > > >  by the IOMMU backend here)?
> > > > > > > 
> > > > > > >  If things go as guessed above, as vfio_pin_pages() 
> > > > > > >  indicates, it
> > > > > > >  pin & translate vaddr to PFN, then it will be very difficult 
> > > > > > >  for the
> > > > > > >  device model to figure out:
> > > > > > > 
> > > > > > >   1, for a given GPA, how to avoid calling dma_map_page 
> > > > > > >  multiple times?
> > > > > > >   2, for which page to call dma_unmap_page?
> > > > > > > 
> > > > > > >  --  
> > > > > > > >>>
> > > > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > > > >>> Then in this file we only need to cache GPA to whatever 
> > > > > > > >>> dmadr_t
> > > > > > > >>> returned by dma_map_page.
> > > > > > > >>>  
> > > > > > > >>
> > > > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility 
> > > > > > > >> here?  
> > > > > > > >
> > > > > > > > Hi Jike,
> > > > > > > >
> > > > > > > > With mediated passthru, you still can use hardware iommu, but 
> > > > > > > > more important
> > > > > > > > that part is actually orthogonal to what we are discussing here 
> > > > > > > > as we will only
> > > > > > > > cache the mapping between  > > > > > > > (qemu) va>, once we
> > > > > > > > have pinned pages later with the help of above info, you can 
> > > > > > > > map it into the
> > > > > > > > proper iommu domain if the system has configured so.
> > > > > > > >  
> > > > > > >
> > > > > > > Hi Neo,
> > > > > > >
> > > > > > > Technically yes you can map a pfn into the proper IOMMU domain 
> > > > > > > elsewhere,
> > > > > > > but to find out whether a pfn was previously mapped or not, you 
> > > > > > > have to
> > > > > > > track it with another rbtree-alike data structure (the IOMMU 
> > > > > > > driver simply
> > > > > > > doesn't bother with tracking), that seems somehow duplicate with 
> > > > > > > the vGPU
> > > > > > > IOMMU backend we are discussing here.
> > > > > > >
> > > > > > > And it is also semantically correct for an IOMMU backend to 
> > > > > > > handle both w/
> > > > > > > and w/o an IOMMU hardware? :)  
> > > > > > 
> > > > > > A problem with the iommu doing the dma_map_page() though is for what
> > > > > > device does it do this?  In the mediated case the vfio 
> > > > > > infrastructure
> > > > > > is dealing with a software representation of a device.  For all we
> > > > > > know that software model could transparently migrate from one 
> > > > > > physical
> > > > > > GPU to another.  There may not even be a physical device backing
> > > > > > the mediated device.  Those are details left to the vgpu driver 
> > > > > > itself.  
> > > > > 
> > > > > This is a fair argument. VFIO iommu driver simply serves user space
> > > > > requests, where 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 08:02:41AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Friday, May 13, 2016 3:38 PM
> > 
> > On Fri, May 13, 2016 at 07:13:44AM +, Tian, Kevin wrote:
> > > > From: Neo Jia [mailto:c...@nvidia.com]
> > > > Sent: Friday, May 13, 2016 2:42 PM
> > > >
> > > >
> > > > >
> > > > > We possibly have the same requirement from the mediate driver backend:
> > > > >
> > > > >   a) get a GFN, when guest try to tell hardware;
> > > > >   b) consult the vfio iommu with that GFN[1]: will you find me a 
> > > > > proper dma_addr?
> > > >
> > > > We will provide you the pfn via vfio_pin_pages, so you can map it for 
> > > > dma
> > > > purpose in your i915 driver, which is what we are doing today.
> > > >
> > >
> > > Can such 'map' operation be consolidated in vGPU core driver? I don't 
> > > think
> > > Intel vGPU driver has any feature proactively relying on iommu. The reason
> > > why we keep talking iommu is just because the kernel may enable iommu
> > > for physical GPU so we need make sure our device model can work in such
> > > configuration. And this requirement should apply to all vendors, not Intel
> > > specific (like you said you are doing it already today).
> > 
> > Hi Kevin,
> > 
> > Actually, such requirement is already satisfied today as all vendor drivers
> > should transparently work with and without system iommu on bare-metal, 
> > right?
> > 
> > So I don't see any new requirement here, also such consolidation doesn't 
> > help
> > any but adding complexity to the system as vendor driver will not remove
> > their own dma_map_xxx functions as they are still required to support
> > non-mediated cases.
> > 
> 
> Thanks for your information, which makes it clearer where the difference is. 
> :-)
> 
> Based on your description, looks you treat guest pages same as normal process
> pages, which all share the same code path when mapping as DMA target, so it
> is pointless to separate guest page map out to vGPU core driver. Is this
> understanding correct?

Yes.

It is Linux's responsibility to allocate the physical pages for the QEMU process
which will happen to be the guest physical memory that we might use as DMA
target. From the device point of view, it is just some physical location he
needs to hit.

> 
> In our side, so far guest pages are treated differently from normal process
> pages, which is the main reason why I asked whether we can consolidate that
> part. Looks now it's not necessary since it's already not a common 
> requirement.

> 
> One additional question though. Jike already mentioned the need to shadow
> GPU MMU (called GTT table in Intel side) in our device model. 'shadow' here
> basically means we need translate from 'gfn' in guest pte to 'dmadr_t'
> as returned by dma_map_xxx. Based on gfn->pfn translation provided by
> VFIO (in your v3 driver), gfn->dmadr_t mapping can be constructed accordingly
> in the vendor driver. So do you have similar requirement like this? If yes, do
> you think any value to unify that translation structure or prefer to maintain
> it by vendor driver?

Yes, I think it would make sense to do this in the vendor driver as it keeps the
iommu type1 clean - it will only track the gfn to pfn translation/pinning (on
CPU). Then, you can reuse your existing driver code to map the pfn as DMA
target.

Also you can do some kind of optimization such as keeping a small cache within
your device driver, if the gfn is already translated, no need to query again.

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Dong Jia
On Fri, 13 May 2016 00:24:34 -0700
Neo Jia  wrote:

> On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> > On Thu, 12 May 2016 13:05:52 -0600
> > Alex Williamson  wrote:
> > 
> > > On Thu, 12 May 2016 08:00:36 +
> > > "Tian, Kevin"  wrote:
> > > 
> > > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > > 
> > > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > > Jike Song  wrote:
> > > > >   
> > > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > > >  From: Song, Jike
> > > > > > 
> > > > > >  IIUC, an api-only domain is a VFIO domain *without* underlying 
> > > > > >  IOMMU
> > > > > >  hardware. It just, as you said in another mail, "rather than
> > > > > >  programming them into an IOMMU for a device, it simply stores 
> > > > > >  the
> > > > > >  translations for use by later requests".
> > > > > > 
> > > > > >  That imposes a constraint on gfx driver: hardware IOMMU must 
> > > > > >  be disabled.
> > > > > >  Otherwise, if IOMMU is present, the gfx driver eventually 
> > > > > >  programs
> > > > > >  the hardware IOMMU with IOVA returned by pci_map_page or 
> > > > > >  dma_map_page;
> > > > > >  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> 
> > > > > >  HPA
> > > > > >  translations without any knowledge about hardware IOMMU, how 
> > > > > >  is the
> > > > > >  device model supposed to do to get an IOVA for a given GPA 
> > > > > >  (thereby HPA
> > > > > >  by the IOMMU backend here)?
> > > > > > 
> > > > > >  If things go as guessed above, as vfio_pin_pages() indicates, 
> > > > > >  it
> > > > > >  pin & translate vaddr to PFN, then it will be very difficult 
> > > > > >  for the
> > > > > >  device model to figure out:
> > > > > > 
> > > > > > 1, for a given GPA, how to avoid calling dma_map_page 
> > > > > >  multiple times?
> > > > > > 2, for which page to call dma_unmap_page?
> > > > > > 
> > > > > >  --  
> > > > > > >>>
> > > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > > > >>> returned by dma_map_page.
> > > > > > >>>  
> > > > > > >>
> > > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility 
> > > > > > >> here?  
> > > > > > >
> > > > > > > Hi Jike,
> > > > > > >
> > > > > > > With mediated passthru, you still can use hardware iommu, but 
> > > > > > > more important
> > > > > > > that part is actually orthogonal to what we are discussing here 
> > > > > > > as we will only
> > > > > > > cache the mapping between  > > > > > > va>, once we
> > > > > > > have pinned pages later with the help of above info, you can map 
> > > > > > > it into the
> > > > > > > proper iommu domain if the system has configured so.
> > > > > > >  
> > > > > >
> > > > > > Hi Neo,
> > > > > >
> > > > > > Technically yes you can map a pfn into the proper IOMMU domain 
> > > > > > elsewhere,
> > > > > > but to find out whether a pfn was previously mapped or not, you 
> > > > > > have to
> > > > > > track it with another rbtree-alike data structure (the IOMMU driver 
> > > > > > simply
> > > > > > doesn't bother with tracking), that seems somehow duplicate with 
> > > > > > the vGPU
> > > > > > IOMMU backend we are discussing here.
> > > > > >
> > > > > > And it is also semantically correct for an IOMMU backend to handle 
> > > > > > both w/
> > > > > > and w/o an IOMMU hardware? :)  
> > > > > 
> > > > > A problem with the iommu doing the dma_map_page() though is for what
> > > > > device does it do this?  In the mediated case the vfio infrastructure
> > > > > is dealing with a software representation of a device.  For all we
> > > > > know that software model could transparently migrate from one physical
> > > > > GPU to another.  There may not even be a physical device backing
> > > > > the mediated device.  Those are details left to the vgpu driver 
> > > > > itself.  
> > > > 
> > > > This is a fair argument. VFIO iommu driver simply serves user space
> > > > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > > > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > > > 
> > > > > 
> > > > > Perhaps one possibility would be to allow the vgpu driver to register
> > > > > map and unmap callbacks.  The unmap callback might provide the
> > > > > 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 07:45:14AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Friday, May 13, 2016 3:42 PM
> > 
> > On Fri, May 13, 2016 at 03:30:27PM +0800, Jike Song wrote:
> > > On 05/13/2016 02:43 PM, Neo Jia wrote:
> > > > On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> > > >> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> > >  From: Neo Jia [mailto:c...@nvidia.com] Sent: Friday, May 13,
> > >  2016 3:49 AM
> > > 
> > > >
> > > >> Perhaps one possibility would be to allow the vgpu driver
> > > >> to register map and unmap callbacks.  The unmap callback
> > > >> might provide the invalidation interface that we're so far
> > > >> missing.  The combination of map and unmap callbacks might
> > > >> simplify the Intel approach of pinning the entire VM memory
> > > >> space, ie. for each map callback do a translation (pin) and
> > > >> dma_map_page, for each unmap do a dma_unmap_page and
> > > >> release the translation.
> > > >
> > > > Yes adding map/unmap ops in pGPU drvier (I assume you are
> > > > refering to gpu_device_ops as implemented in Kirti's patch)
> > > > sounds a good idea, satisfying both: 1) keeping vGPU purely
> > > > virtual; 2) dealing with the Linux DMA API to achive hardware
> > > > IOMMU compatibility.
> > > >
> > > > PS, this has very little to do with pinning wholly or
> > > > partially. Intel KVMGT has once been had the whole guest
> > > > memory pinned, only because we used a spinlock, which can't
> > > > sleep at runtime.  We have removed that spinlock in our
> > > > another upstreaming effort, not here but for i915 driver, so
> > > > probably no biggie.
> > > >
> > > 
> > >  OK, then you guys don't need to pin everything. The next
> > >  question will be if you can send the pinning request from your
> > >  mediated driver backend to request memory pinning like we have
> > >  demonstrated in the v3 patch, function vfio_pin_pages and
> > >  vfio_unpin_pages?
> > > 
> > > >>>
> > > >>> Jike can you confirm this statement? My feeling is that we don't
> > > >>> have such logic in our device model to figure out which pages
> > > >>> need to be pinned on demand. So currently pin-everything is same
> > > >>> requirement in both KVM and Xen side...
> > > >>
> > > >> [Correct me in case of any neglect:)]
> > > >>
> > > >> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
> > > >> from a GPU is certainly a DMA operation. The DMA facility of most
> > > >> platforms, IGD and NVIDIA GPU included, is not capable of
> > > >> faulting-handling-retrying.
> > > >>
> > > >> As for vGPU solutions like Nvidia and Intel provide, the memory
> > > >> address region used by Guest for GPU access, whenever Guest sets
> > > >> the mappings, it is intercepted by Host, so it's safe to only pin
> > > >> the page before it get used by Guest. This probably doesn't need
> > > >> device model to change :)
> > > >
> > > > Hi Jike
> > > >
> > > > Just out of curiosity, how does the host intercept this before it
> > > > goes on the bus?
> > > >
> > >
> > > Hi Neo,
> > >
> > > [prologize if I mis-expressed myself, bad English ..]
> > >
> > > I was talking about intercepting the setting-up of GPU page tables,
> > > not the DMA itself.  For currently Intel GPU, the page tables are
> > > MMIO registers or simply RAM pages, called GTT (Graphics Translation
> > > Table), the writing event to an GTT entry from Guest, is always
> > > intercepted by Host.
> > 
> > Hi Jike,
> > 
> > Thanks for the details, one more question if the page tables are guest RAM, 
> > how do you
> > intercept it from host? I can see it get intercepted when it is in MMIO 
> > range.
> > 
> 
> We use page tracking framework, which is newly added to KVM recently,
> to mark RAM pages as read-only so write accesses are intercepted to 
> device model.

Yes, I am aware of that patchset from Guangrong. So far the interface are all
requiring struct *kvm, copied from https://lkml.org/lkml/2015/11/30/644

- kvm_page_track_add_page(): add the page to the tracking pool after
  that later specified access on that page will be tracked

- kvm_page_track_remove_page(): remove the page from the tracking pool,
  the specified access on the page is not tracked after the last user is
  gone

void kvm_page_track_add_page(struct kvm *kvm, gfn_t gfn,
enum kvm_page_track_mode mode);
void kvm_page_track_remove_page(struct kvm *kvm, gfn_t gfn,
   enum kvm_page_track_mode mode);

Really curious how you are going to have access to the struct kvm *kvm, or you
are relying on the userfaultfd to track the write faults only as part of the
QEMU userfault thread?

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Tian, Kevin
> From: Neo Jia [mailto:c...@nvidia.com]
> Sent: Friday, May 13, 2016 3:38 PM
> 
> On Fri, May 13, 2016 at 07:13:44AM +, Tian, Kevin wrote:
> > > From: Neo Jia [mailto:c...@nvidia.com]
> > > Sent: Friday, May 13, 2016 2:42 PM
> > >
> > >
> > > >
> > > > We possibly have the same requirement from the mediate driver backend:
> > > >
> > > > a) get a GFN, when guest try to tell hardware;
> > > > b) consult the vfio iommu with that GFN[1]: will you find me a 
> > > > proper dma_addr?
> > >
> > > We will provide you the pfn via vfio_pin_pages, so you can map it for dma
> > > purpose in your i915 driver, which is what we are doing today.
> > >
> >
> > Can such 'map' operation be consolidated in vGPU core driver? I don't think
> > Intel vGPU driver has any feature proactively relying on iommu. The reason
> > why we keep talking iommu is just because the kernel may enable iommu
> > for physical GPU so we need make sure our device model can work in such
> > configuration. And this requirement should apply to all vendors, not Intel
> > specific (like you said you are doing it already today).
> 
> Hi Kevin,
> 
> Actually, such requirement is already satisfied today as all vendor drivers
> should transparently work with and without system iommu on bare-metal, right?
> 
> So I don't see any new requirement here, also such consolidation doesn't help
> any but adding complexity to the system as vendor driver will not remove
> their own dma_map_xxx functions as they are still required to support
> non-mediated cases.
> 

Thanks for your information, which makes it clearer where the difference is. :-)

Based on your description, looks you treat guest pages same as normal process
pages, which all share the same code path when mapping as DMA target, so it
is pointless to separate guest page map out to vGPU core driver. Is this
understanding correct?

In our side, so far guest pages are treated differently from normal process
pages, which is the main reason why I asked whether we can consolidate that
part. Looks now it's not necessary since it's already not a common requirement.

One additional question though. Jike already mentioned the need to shadow
GPU MMU (called GTT table in Intel side) in our device model. 'shadow' here
basically means we need translate from 'gfn' in guest pte to 'dmadr_t'
as returned by dma_map_xxx. Based on gfn->pfn translation provided by
VFIO (in your v3 driver), gfn->dmadr_t mapping can be constructed accordingly
in the vendor driver. So do you have similar requirement like this? If yes, do
you think any value to unify that translation structure or prefer to maintain
it by vendor driver?

Thanks
Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Tian, Kevin
> From: Neo Jia [mailto:c...@nvidia.com]
> Sent: Friday, May 13, 2016 3:42 PM
> 
> On Fri, May 13, 2016 at 03:30:27PM +0800, Jike Song wrote:
> > On 05/13/2016 02:43 PM, Neo Jia wrote:
> > > On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> > >> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> >  From: Neo Jia [mailto:c...@nvidia.com] Sent: Friday, May 13,
> >  2016 3:49 AM
> > 
> > >
> > >> Perhaps one possibility would be to allow the vgpu driver
> > >> to register map and unmap callbacks.  The unmap callback
> > >> might provide the invalidation interface that we're so far
> > >> missing.  The combination of map and unmap callbacks might
> > >> simplify the Intel approach of pinning the entire VM memory
> > >> space, ie. for each map callback do a translation (pin) and
> > >> dma_map_page, for each unmap do a dma_unmap_page and
> > >> release the translation.
> > >
> > > Yes adding map/unmap ops in pGPU drvier (I assume you are
> > > refering to gpu_device_ops as implemented in Kirti's patch)
> > > sounds a good idea, satisfying both: 1) keeping vGPU purely
> > > virtual; 2) dealing with the Linux DMA API to achive hardware
> > > IOMMU compatibility.
> > >
> > > PS, this has very little to do with pinning wholly or
> > > partially. Intel KVMGT has once been had the whole guest
> > > memory pinned, only because we used a spinlock, which can't
> > > sleep at runtime.  We have removed that spinlock in our
> > > another upstreaming effort, not here but for i915 driver, so
> > > probably no biggie.
> > >
> > 
> >  OK, then you guys don't need to pin everything. The next
> >  question will be if you can send the pinning request from your
> >  mediated driver backend to request memory pinning like we have
> >  demonstrated in the v3 patch, function vfio_pin_pages and
> >  vfio_unpin_pages?
> > 
> > >>>
> > >>> Jike can you confirm this statement? My feeling is that we don't
> > >>> have such logic in our device model to figure out which pages
> > >>> need to be pinned on demand. So currently pin-everything is same
> > >>> requirement in both KVM and Xen side...
> > >>
> > >> [Correct me in case of any neglect:)]
> > >>
> > >> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
> > >> from a GPU is certainly a DMA operation. The DMA facility of most
> > >> platforms, IGD and NVIDIA GPU included, is not capable of
> > >> faulting-handling-retrying.
> > >>
> > >> As for vGPU solutions like Nvidia and Intel provide, the memory
> > >> address region used by Guest for GPU access, whenever Guest sets
> > >> the mappings, it is intercepted by Host, so it's safe to only pin
> > >> the page before it get used by Guest. This probably doesn't need
> > >> device model to change :)
> > >
> > > Hi Jike
> > >
> > > Just out of curiosity, how does the host intercept this before it
> > > goes on the bus?
> > >
> >
> > Hi Neo,
> >
> > [prologize if I mis-expressed myself, bad English ..]
> >
> > I was talking about intercepting the setting-up of GPU page tables,
> > not the DMA itself.  For currently Intel GPU, the page tables are
> > MMIO registers or simply RAM pages, called GTT (Graphics Translation
> > Table), the writing event to an GTT entry from Guest, is always
> > intercepted by Host.
> 
> Hi Jike,
> 
> Thanks for the details, one more question if the page tables are guest RAM, 
> how do you
> intercept it from host? I can see it get intercepted when it is in MMIO range.
> 

We use page tracking framework, which is newly added to KVM recently,
to mark RAM pages as read-only so write accesses are intercepted to 
device model.

Thanks
Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 03:30:27PM +0800, Jike Song wrote:
> On 05/13/2016 02:43 PM, Neo Jia wrote:
> > On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> >> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
>  From: Neo Jia [mailto:c...@nvidia.com] Sent: Friday, May 13,
>  2016 3:49 AM
>  
> > 
> >> Perhaps one possibility would be to allow the vgpu driver
> >> to register map and unmap callbacks.  The unmap callback
> >> might provide the invalidation interface that we're so far
> >> missing.  The combination of map and unmap callbacks might
> >> simplify the Intel approach of pinning the entire VM memory
> >> space, ie. for each map callback do a translation (pin) and
> >> dma_map_page, for each unmap do a dma_unmap_page and
> >> release the translation.
> > 
> > Yes adding map/unmap ops in pGPU drvier (I assume you are
> > refering to gpu_device_ops as implemented in Kirti's patch)
> > sounds a good idea, satisfying both: 1) keeping vGPU purely 
> > virtual; 2) dealing with the Linux DMA API to achive hardware
> > IOMMU compatibility.
> > 
> > PS, this has very little to do with pinning wholly or
> > partially. Intel KVMGT has once been had the whole guest
> > memory pinned, only because we used a spinlock, which can't
> > sleep at runtime.  We have removed that spinlock in our
> > another upstreaming effort, not here but for i915 driver, so
> > probably no biggie.
> > 
>  
>  OK, then you guys don't need to pin everything. The next
>  question will be if you can send the pinning request from your
>  mediated driver backend to request memory pinning like we have
>  demonstrated in the v3 patch, function vfio_pin_pages and 
>  vfio_unpin_pages?
>  
> >>> 
> >>> Jike can you confirm this statement? My feeling is that we don't
> >>> have such logic in our device model to figure out which pages
> >>> need to be pinned on demand. So currently pin-everything is same
> >>> requirement in both KVM and Xen side...
> >> 
> >> [Correct me in case of any neglect:)]
> >> 
> >> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
> >> from a GPU is certainly a DMA operation. The DMA facility of most
> >> platforms, IGD and NVIDIA GPU included, is not capable of
> >> faulting-handling-retrying.
> >> 
> >> As for vGPU solutions like Nvidia and Intel provide, the memory
> >> address region used by Guest for GPU access, whenever Guest sets
> >> the mappings, it is intercepted by Host, so it's safe to only pin
> >> the page before it get used by Guest. This probably doesn't need
> >> device model to change :)
> > 
> > Hi Jike
> > 
> > Just out of curiosity, how does the host intercept this before it
> > goes on the bus?
> > 
> 
> Hi Neo,
> 
> [prologize if I mis-expressed myself, bad English ..] 
> 
> I was talking about intercepting the setting-up of GPU page tables,
> not the DMA itself.  For currently Intel GPU, the page tables are
> MMIO registers or simply RAM pages, called GTT (Graphics Translation
> Table), the writing event to an GTT entry from Guest, is always
> intercepted by Host.

Hi Jike,

Thanks for the details, one more question if the page tables are guest RAM, how 
do you
intercept it from host? I can see it get intercepted when it is in MMIO range.

Thanks,
Neo

> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 07:13:44AM +, Tian, Kevin wrote:
> > From: Neo Jia [mailto:c...@nvidia.com]
> > Sent: Friday, May 13, 2016 2:42 PM
> > 
> > 
> > >
> > > We possibly have the same requirement from the mediate driver backend:
> > >
> > >   a) get a GFN, when guest try to tell hardware;
> > >   b) consult the vfio iommu with that GFN[1]: will you find me a proper 
> > > dma_addr?
> > 
> > We will provide you the pfn via vfio_pin_pages, so you can map it for dma
> > purpose in your i915 driver, which is what we are doing today.
> > 
> 
> Can such 'map' operation be consolidated in vGPU core driver? I don't think 
> Intel vGPU driver has any feature proactively relying on iommu. The reason 
> why we keep talking iommu is just because the kernel may enable iommu 
> for physical GPU so we need make sure our device model can work in such
> configuration. And this requirement should apply to all vendors, not Intel
> specific (like you said you are doing it already today).

Hi Kevin,

Actually, such requirement is already satisfied today as all vendor drivers
should transparently work with and without system iommu on bare-metal, right?

So I don't see any new requirement here, also such consolidation doesn't help
any but adding complexity to the system as vendor driver will not remove
their own dma_map_xxx functions as they are still required to support
non-mediated cases. 

Thanks,
Neo

> 
> Thanks
> Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Jike Song
On 05/13/2016 02:43 PM, Neo Jia wrote:
> On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
>> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
 From: Neo Jia [mailto:c...@nvidia.com] Sent: Friday, May 13,
 2016 3:49 AM
 
> 
>> Perhaps one possibility would be to allow the vgpu driver
>> to register map and unmap callbacks.  The unmap callback
>> might provide the invalidation interface that we're so far
>> missing.  The combination of map and unmap callbacks might
>> simplify the Intel approach of pinning the entire VM memory
>> space, ie. for each map callback do a translation (pin) and
>> dma_map_page, for each unmap do a dma_unmap_page and
>> release the translation.
> 
> Yes adding map/unmap ops in pGPU drvier (I assume you are
> refering to gpu_device_ops as implemented in Kirti's patch)
> sounds a good idea, satisfying both: 1) keeping vGPU purely 
> virtual; 2) dealing with the Linux DMA API to achive hardware
> IOMMU compatibility.
> 
> PS, this has very little to do with pinning wholly or
> partially. Intel KVMGT has once been had the whole guest
> memory pinned, only because we used a spinlock, which can't
> sleep at runtime.  We have removed that spinlock in our
> another upstreaming effort, not here but for i915 driver, so
> probably no biggie.
> 
 
 OK, then you guys don't need to pin everything. The next
 question will be if you can send the pinning request from your
 mediated driver backend to request memory pinning like we have
 demonstrated in the v3 patch, function vfio_pin_pages and 
 vfio_unpin_pages?
 
>>> 
>>> Jike can you confirm this statement? My feeling is that we don't
>>> have such logic in our device model to figure out which pages
>>> need to be pinned on demand. So currently pin-everything is same
>>> requirement in both KVM and Xen side...
>> 
>> [Correct me in case of any neglect:)]
>> 
>> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM
>> from a GPU is certainly a DMA operation. The DMA facility of most
>> platforms, IGD and NVIDIA GPU included, is not capable of
>> faulting-handling-retrying.
>> 
>> As for vGPU solutions like Nvidia and Intel provide, the memory
>> address region used by Guest for GPU access, whenever Guest sets
>> the mappings, it is intercepted by Host, so it's safe to only pin
>> the page before it get used by Guest. This probably doesn't need
>> device model to change :)
> 
> Hi Jike
> 
> Just out of curiosity, how does the host intercept this before it
> goes on the bus?
> 

Hi Neo,

[prologize if I mis-expressed myself, bad English ..] 

I was talking about intercepting the setting-up of GPU page tables,
not the DMA itself.  For currently Intel GPU, the page tables are
MMIO registers or simply RAM pages, called GTT (Graphics Translation
Table), the writing event to an GTT entry from Guest, is always
intercepted by Host.

--
Thanks,
Jike




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 03:10:22PM +0800, Dong Jia wrote:
> On Thu, 12 May 2016 13:05:52 -0600
> Alex Williamson  wrote:
> 
> > On Thu, 12 May 2016 08:00:36 +
> > "Tian, Kevin"  wrote:
> > 
> > > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > > Sent: Thursday, May 12, 2016 6:06 AM
> > > > 
> > > > On Wed, 11 May 2016 17:15:15 +0800
> > > > Jike Song  wrote:
> > > >   
> > > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > > >  From: Song, Jike
> > > > > 
> > > > >  IIUC, an api-only domain is a VFIO domain *without* underlying 
> > > > >  IOMMU
> > > > >  hardware. It just, as you said in another mail, "rather than
> > > > >  programming them into an IOMMU for a device, it simply stores the
> > > > >  translations for use by later requests".
> > > > > 
> > > > >  That imposes a constraint on gfx driver: hardware IOMMU must be 
> > > > >  disabled.
> > > > >  Otherwise, if IOMMU is present, the gfx driver eventually 
> > > > >  programs
> > > > >  the hardware IOMMU with IOVA returned by pci_map_page or 
> > > > >  dma_map_page;
> > > > >  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > > >  translations without any knowledge about hardware IOMMU, how is 
> > > > >  the
> > > > >  device model supposed to do to get an IOVA for a given GPA 
> > > > >  (thereby HPA
> > > > >  by the IOMMU backend here)?
> > > > > 
> > > > >  If things go as guessed above, as vfio_pin_pages() indicates, it
> > > > >  pin & translate vaddr to PFN, then it will be very difficult for 
> > > > >  the
> > > > >  device model to figure out:
> > > > > 
> > > > >   1, for a given GPA, how to avoid calling dma_map_page multiple 
> > > > >  times?
> > > > >   2, for which page to call dma_unmap_page?
> > > > > 
> > > > >  --  
> > > > > >>>
> > > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > > >>> returned by dma_map_page.
> > > > > >>>  
> > > > > >>
> > > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility 
> > > > > >> here?  
> > > > > >
> > > > > > Hi Jike,
> > > > > >
> > > > > > With mediated passthru, you still can use hardware iommu, but more 
> > > > > > important
> > > > > > that part is actually orthogonal to what we are discussing here as 
> > > > > > we will only
> > > > > > cache the mapping between  > > > > > va>, once we
> > > > > > have pinned pages later with the help of above info, you can map it 
> > > > > > into the
> > > > > > proper iommu domain if the system has configured so.
> > > > > >  
> > > > >
> > > > > Hi Neo,
> > > > >
> > > > > Technically yes you can map a pfn into the proper IOMMU domain 
> > > > > elsewhere,
> > > > > but to find out whether a pfn was previously mapped or not, you have 
> > > > > to
> > > > > track it with another rbtree-alike data structure (the IOMMU driver 
> > > > > simply
> > > > > doesn't bother with tracking), that seems somehow duplicate with the 
> > > > > vGPU
> > > > > IOMMU backend we are discussing here.
> > > > >
> > > > > And it is also semantically correct for an IOMMU backend to handle 
> > > > > both w/
> > > > > and w/o an IOMMU hardware? :)  
> > > > 
> > > > A problem with the iommu doing the dma_map_page() though is for what
> > > > device does it do this?  In the mediated case the vfio infrastructure
> > > > is dealing with a software representation of a device.  For all we
> > > > know that software model could transparently migrate from one physical
> > > > GPU to another.  There may not even be a physical device backing
> > > > the mediated device.  Those are details left to the vgpu driver itself. 
> > > >  
> > > 
> > > This is a fair argument. VFIO iommu driver simply serves user space
> > > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > > 
> > > > 
> > > > Perhaps one possibility would be to allow the vgpu driver to register
> > > > map and unmap callbacks.  The unmap callback might provide the
> > > > invalidation interface that we're so far missing.  The combination of
> > > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > > entire VM memory space, ie. for each map callback do a translation
> > > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > > the translation.  There's still 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Tian, Kevin
> From: Neo Jia [mailto:c...@nvidia.com]
> Sent: Friday, May 13, 2016 2:42 PM
> 
> 
> >
> > We possibly have the same requirement from the mediate driver backend:
> >
> > a) get a GFN, when guest try to tell hardware;
> > b) consult the vfio iommu with that GFN[1]: will you find me a proper 
> > dma_addr?
> 
> We will provide you the pfn via vfio_pin_pages, so you can map it for dma
> purpose in your i915 driver, which is what we are doing today.
> 

Can such 'map' operation be consolidated in vGPU core driver? I don't think 
Intel vGPU driver has any feature proactively relying on iommu. The reason 
why we keep talking iommu is just because the kernel may enable iommu 
for physical GPU so we need make sure our device model can work in such
configuration. And this requirement should apply to all vendors, not Intel
specific (like you said you are doing it already today).

Thanks
Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Dong Jia
On Thu, 12 May 2016 13:05:52 -0600
Alex Williamson  wrote:

> On Thu, 12 May 2016 08:00:36 +
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Thursday, May 12, 2016 6:06 AM
> > > 
> > > On Wed, 11 May 2016 17:15:15 +0800
> > > Jike Song  wrote:
> > >   
> > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > >  From: Song, Jike
> > > > 
> > > >  IIUC, an api-only domain is a VFIO domain *without* underlying 
> > > >  IOMMU
> > > >  hardware. It just, as you said in another mail, "rather than
> > > >  programming them into an IOMMU for a device, it simply stores the
> > > >  translations for use by later requests".
> > > > 
> > > >  That imposes a constraint on gfx driver: hardware IOMMU must be 
> > > >  disabled.
> > > >  Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > >  the hardware IOMMU with IOVA returned by pci_map_page or 
> > > >  dma_map_page;
> > > >  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > >  translations without any knowledge about hardware IOMMU, how is the
> > > >  device model supposed to do to get an IOVA for a given GPA 
> > > >  (thereby HPA
> > > >  by the IOMMU backend here)?
> > > > 
> > > >  If things go as guessed above, as vfio_pin_pages() indicates, it
> > > >  pin & translate vaddr to PFN, then it will be very difficult for 
> > > >  the
> > > >  device model to figure out:
> > > > 
> > > > 1, for a given GPA, how to avoid calling dma_map_page multiple 
> > > >  times?
> > > > 2, for which page to call dma_unmap_page?
> > > > 
> > > >  --  
> > > > >>>
> > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > >>> returned by dma_map_page.
> > > > >>>  
> > > > >>
> > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here? 
> > > > >>  
> > > > >
> > > > > Hi Jike,
> > > > >
> > > > > With mediated passthru, you still can use hardware iommu, but more 
> > > > > important
> > > > > that part is actually orthogonal to what we are discussing here as we 
> > > > > will only
> > > > > cache the mapping between , 
> > > > > once we
> > > > > have pinned pages later with the help of above info, you can map it 
> > > > > into the
> > > > > proper iommu domain if the system has configured so.
> > > > >  
> > > >
> > > > Hi Neo,
> > > >
> > > > Technically yes you can map a pfn into the proper IOMMU domain 
> > > > elsewhere,
> > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > track it with another rbtree-alike data structure (the IOMMU driver 
> > > > simply
> > > > doesn't bother with tracking), that seems somehow duplicate with the 
> > > > vGPU
> > > > IOMMU backend we are discussing here.
> > > >
> > > > And it is also semantically correct for an IOMMU backend to handle both 
> > > > w/
> > > > and w/o an IOMMU hardware? :)  
> > > 
> > > A problem with the iommu doing the dma_map_page() though is for what
> > > device does it do this?  In the mediated case the vfio infrastructure
> > > is dealing with a software representation of a device.  For all we
> > > know that software model could transparently migrate from one physical
> > > GPU to another.  There may not even be a physical device backing
> > > the mediated device.  Those are details left to the vgpu driver itself.  
> > 
> > This is a fair argument. VFIO iommu driver simply serves user space
> > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > 
> > > 
> > > Perhaps one possibility would be to allow the vgpu driver to register
> > > map and unmap callbacks.  The unmap callback might provide the
> > > invalidation interface that we're so far missing.  The combination of
> > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > entire VM memory space, ie. for each map callback do a translation
> > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > the translation.  There's still the problem of where that dma_addr_t
> > > from the dma_map_page is stored though.  Someone would need to keep
> > > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > > that since we're already tracking information based on iova, possibly
> > > in an opaque data element provided by the vgpu 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 02:22:37PM +0800, Jike Song wrote:
> On 05/13/2016 10:41 AM, Tian, Kevin wrote:
> >> From: Neo Jia [mailto:c...@nvidia.com]
> >> Sent: Friday, May 13, 2016 3:49 AM
> >>
> >>>
>  Perhaps one possibility would be to allow the vgpu driver to register
>  map and unmap callbacks.  The unmap callback might provide the
>  invalidation interface that we're so far missing.  The combination of
>  map and unmap callbacks might simplify the Intel approach of pinning the
>  entire VM memory space, ie. for each map callback do a translation
>  (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
>  the translation.
> >>>
> >>> Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
> >>> gpu_device_ops as
> >>> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
> >>> keeping vGPU purely
> >>> virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
> >>> compatibility.
> >>>
> >>> PS, this has very little to do with pinning wholly or partially. Intel 
> >>> KVMGT has
> >>> once been had the whole guest memory pinned, only because we used a 
> >>> spinlock,
> >>> which can't sleep at runtime.  We have removed that spinlock in our 
> >>> another
> >>> upstreaming effort, not here but for i915 driver, so probably no biggie.
> >>>
> >>
> >> OK, then you guys don't need to pin everything. The next question will be 
> >> if you
> >> can send the pinning request from your mediated driver backend to request 
> >> memory
> >> pinning like we have demonstrated in the v3 patch, function vfio_pin_pages 
> >> and
> >> vfio_unpin_pages?
> >>
> > 
> > Jike can you confirm this statement? My feeling is that we don't have such 
> > logic
> > in our device model to figure out which pages need to be pinned on demand. 
> > So
> > currently pin-everything is same requirement in both KVM and Xen side...
> 
> [Correct me in case of any neglect:)]
> 
> IMO the ultimate reason to pin a page, is for DMA. Accessing RAM from a GPU is
> certainly a DMA operation. The DMA facility of most platforms, IGD and NVIDIA
> GPU included, is not capable of faulting-handling-retrying.
> 
> As for vGPU solutions like Nvidia and Intel provide, the memory address region
> used by Guest for GPU access, whenever Guest sets the mappings, it is
> intercepted by Host, so it's safe to only pin the page before it get used by
> Guest. This probably doesn't need device model to change :)

Hi Jike

Just out of curiosity, how does the host intercept this before it goes on the
bus?

Thanks,
Neo

> 
> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Neo Jia
On Fri, May 13, 2016 at 02:08:36PM +0800, Jike Song wrote:
> On 05/13/2016 03:49 AM, Neo Jia wrote:
> > On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
> >> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
> >>  wrote:
> >>> On Wed, 11 May 2016 17:15:15 +0800
> >>> Jike Song  wrote:
> >>>
>  On 05/11/2016 12:02 AM, Neo Jia wrote:
> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
>  From: Song, Jike
> 
>  IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>  hardware. It just, as you said in another mail, "rather than
>  programming them into an IOMMU for a device, it simply stores the
>  translations for use by later requests".
> 
>  That imposes a constraint on gfx driver: hardware IOMMU must be 
>  disabled.
>  Otherwise, if IOMMU is present, the gfx driver eventually programs
>  the hardware IOMMU with IOVA returned by pci_map_page or 
>  dma_map_page;
>  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
>  translations without any knowledge about hardware IOMMU, how is the
>  device model supposed to do to get an IOVA for a given GPA (thereby 
>  HPA
>  by the IOMMU backend here)?
> 
>  If things go as guessed above, as vfio_pin_pages() indicates, it
>  pin & translate vaddr to PFN, then it will be very difficult for the
>  device model to figure out:
> 
>   1, for a given GPA, how to avoid calling dma_map_page multiple 
>  times?
>   2, for which page to call dma_unmap_page?
> 
>  --
> >>>
> >>> We have to support both w/ iommu and w/o iommu case, since
> >>> that fact is out of GPU driver control. A simple way is to use
> >>> dma_map_page which internally will cope with w/ and w/o iommu
> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> >>> Then in this file we only need to cache GPA to whatever dmadr_t
> >>> returned by dma_map_page.
> >>>
> >>
> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> >
> > Hi Jike,
> >
> > With mediated passthru, you still can use hardware iommu, but more 
> > important
> > that part is actually orthogonal to what we are discussing here as we 
> > will only
> > cache the mapping between , 
> > once we
> > have pinned pages later with the help of above info, you can map it 
> > into the
> > proper iommu domain if the system has configured so.
> >
> 
>  Hi Neo,
> 
>  Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
>  but to find out whether a pfn was previously mapped or not, you have to
>  track it with another rbtree-alike data structure (the IOMMU driver 
>  simply
>  doesn't bother with tracking), that seems somehow duplicate with the vGPU
>  IOMMU backend we are discussing here.
> 
>  And it is also semantically correct for an IOMMU backend to handle both 
>  w/
>  and w/o an IOMMU hardware? :)
> >>>
> >>> A problem with the iommu doing the dma_map_page() though is for what
> >>> device does it do this?  In the mediated case the vfio infrastructure
> >>> is dealing with a software representation of a device.  For all we
> >>> know that software model could transparently migrate from one physical
> >>> GPU to another.  There may not even be a physical device backing
> >>> the mediated device.  Those are details left to the vgpu driver itself.
> >>>
> >>
> >> Great point :) Yes, I agree it's a bit intrusive to do the mapping for
> >> a particular
> >> pdev in an vGPU IOMMU BE.
> >>
> >>> Perhaps one possibility would be to allow the vgpu driver to register
> >>> map and unmap callbacks.  The unmap callback might provide the
> >>> invalidation interface that we're so far missing.  The combination of
> >>> map and unmap callbacks might simplify the Intel approach of pinning the
> >>> entire VM memory space, ie. for each map callback do a translation
> >>> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> >>> the translation.
> >>
> >> Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
> >> gpu_device_ops as
> >> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
> >> keeping vGPU purely
> >> virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
> >> compatibility.
> >>
> >> PS, this has very little to do with pinning wholly or partially. Intel 
> >> KVMGT has
> >> once been had the whole guest memory pinned, only because we used a 
> >> spinlock,
> >> which can't sleep at runtime.  We have removed that spinlock in our another
> >> upstreaming effort, not here but for i915 driver, so probably no biggie.
> >>
> > 
> 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Jike Song
On 05/13/2016 10:41 AM, Tian, Kevin wrote:
>> From: Neo Jia [mailto:c...@nvidia.com]
>> Sent: Friday, May 13, 2016 3:49 AM
>>
>>>
 Perhaps one possibility would be to allow the vgpu driver to register
 map and unmap callbacks.  The unmap callback might provide the
 invalidation interface that we're so far missing.  The combination of
 map and unmap callbacks might simplify the Intel approach of pinning the
 entire VM memory space, ie. for each map callback do a translation
 (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
 the translation.
>>>
>>> Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
>>> gpu_device_ops as
>>> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
>>> keeping vGPU purely
>>> virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
>>> compatibility.
>>>
>>> PS, this has very little to do with pinning wholly or partially. Intel 
>>> KVMGT has
>>> once been had the whole guest memory pinned, only because we used a 
>>> spinlock,
>>> which can't sleep at runtime.  We have removed that spinlock in our another
>>> upstreaming effort, not here but for i915 driver, so probably no biggie.
>>>
>>
>> OK, then you guys don't need to pin everything. The next question will be if 
>> you
>> can send the pinning request from your mediated driver backend to request 
>> memory
>> pinning like we have demonstrated in the v3 patch, function vfio_pin_pages 
>> and
>> vfio_unpin_pages?
>>
> 
> Jike can you confirm this statement? My feeling is that we don't have such 
> logic
> in our device model to figure out which pages need to be pinned on demand. So
> currently pin-everything is same requirement in both KVM and Xen side...

[Correct me in case of any neglect:)]

IMO the ultimate reason to pin a page, is for DMA. Accessing RAM from a GPU is
certainly a DMA operation. The DMA facility of most platforms, IGD and NVIDIA
GPU included, is not capable of faulting-handling-retrying.

As for vGPU solutions like Nvidia and Intel provide, the memory address region
used by Guest for GPU access, whenever Guest sets the mappings, it is
intercepted by Host, so it's safe to only pin the page before it get used by
Guest. This probably doesn't need device model to change :)


--
Thanks,
Jike




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-13 Thread Jike Song
On 05/13/2016 03:49 AM, Neo Jia wrote:
> On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
>> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
>>  wrote:
>>> On Wed, 11 May 2016 17:15:15 +0800
>>> Jike Song  wrote:
>>>
 On 05/11/2016 12:02 AM, Neo Jia wrote:
> On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
>> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
 From: Song, Jike

 IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
 hardware. It just, as you said in another mail, "rather than
 programming them into an IOMMU for a device, it simply stores the
 translations for use by later requests".

 That imposes a constraint on gfx driver: hardware IOMMU must be 
 disabled.
 Otherwise, if IOMMU is present, the gfx driver eventually programs
 the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
 Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
 translations without any knowledge about hardware IOMMU, how is the
 device model supposed to do to get an IOVA for a given GPA (thereby HPA
 by the IOMMU backend here)?

 If things go as guessed above, as vfio_pin_pages() indicates, it
 pin & translate vaddr to PFN, then it will be very difficult for the
 device model to figure out:

  1, for a given GPA, how to avoid calling dma_map_page multiple times?
  2, for which page to call dma_unmap_page?

 --
>>>
>>> We have to support both w/ iommu and w/o iommu case, since
>>> that fact is out of GPU driver control. A simple way is to use
>>> dma_map_page which internally will cope with w/ and w/o iommu
>>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
>>> Then in this file we only need to cache GPA to whatever dmadr_t
>>> returned by dma_map_page.
>>>
>>
>> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
>
> Hi Jike,
>
> With mediated passthru, you still can use hardware iommu, but more 
> important
> that part is actually orthogonal to what we are discussing here as we 
> will only
> cache the mapping between , 
> once we
> have pinned pages later with the help of above info, you can map it into 
> the
> proper iommu domain if the system has configured so.
>

 Hi Neo,

 Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
 but to find out whether a pfn was previously mapped or not, you have to
 track it with another rbtree-alike data structure (the IOMMU driver simply
 doesn't bother with tracking), that seems somehow duplicate with the vGPU
 IOMMU backend we are discussing here.

 And it is also semantically correct for an IOMMU backend to handle both w/
 and w/o an IOMMU hardware? :)
>>>
>>> A problem with the iommu doing the dma_map_page() though is for what
>>> device does it do this?  In the mediated case the vfio infrastructure
>>> is dealing with a software representation of a device.  For all we
>>> know that software model could transparently migrate from one physical
>>> GPU to another.  There may not even be a physical device backing
>>> the mediated device.  Those are details left to the vgpu driver itself.
>>>
>>
>> Great point :) Yes, I agree it's a bit intrusive to do the mapping for
>> a particular
>> pdev in an vGPU IOMMU BE.
>>
>>> Perhaps one possibility would be to allow the vgpu driver to register
>>> map and unmap callbacks.  The unmap callback might provide the
>>> invalidation interface that we're so far missing.  The combination of
>>> map and unmap callbacks might simplify the Intel approach of pinning the
>>> entire VM memory space, ie. for each map callback do a translation
>>> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
>>> the translation.
>>
>> Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
>> gpu_device_ops as
>> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
>> keeping vGPU purely
>> virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
>> compatibility.
>>
>> PS, this has very little to do with pinning wholly or partially. Intel KVMGT 
>> has
>> once been had the whole guest memory pinned, only because we used a spinlock,
>> which can't sleep at runtime.  We have removed that spinlock in our another
>> upstreaming effort, not here but for i915 driver, so probably no biggie.
>>
> 
> OK, then you guys don't need to pin everything.

Yes :)

> The next question will be if you
> can send the pinning request from your mediated driver backend to request 
> memory
> pinning like we have demonstrated in the v3 patch, function vfio_pin_pages and
> vfio_unpin_pages?

Kind of yes, not exactly.

IMO the 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-12 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Friday, May 13, 2016 3:06 AM
> 
> > >
> >
> > Based on above thought I'm thinking whether below would work:
> > (let's use gpa to replace existing iova in type1 driver, while using iova
> > for the one actually used in vGPU driver. Assume 'pin-all' scenario first
> > which matches existing vfio logic)
> >
> > - No change to existing vfio_dma structure. VFIO still maintains gpa<->vaddr
> > mapping, in coarse-grained regions;
> >
> > - Leverage same page accounting/pinning logic in type1 driver, which
> > should be enough for 'pin-all' usage;
> >
> > - Then main divergence point for vGPU would be in vfio_unmap_unpin
> > and vfio_iommu_map. I'm not sure whether it's easy to fake an
> > iommu_domain for vGPU so same iommu_map/unmap can be reused.
> 
> This seems troublesome.  Kirti's version used numerous api-only tests
> to avoid these which made the code difficult to trace.  Clearly one
> option is to split out the common code so that a new mediated-type1
> backend skips this, but they thought they could clean it up without
> this, so we'll see what happens in the next version.
> 
> > If not, we may introduce two new map/unmap callbacks provided
> > specifically by vGPU core driver, as you suggested:
> >
> > * vGPU core driver uses dma_map_page to map specified pfns:
> >
> > o When IOMMU is enabled, we'll get an iova returned different
> > from pfn;
> > o When IOMMU is disabled, returned iova is same as pfn;
> 
> Either way each iova needs to be stored and we have a worst case of one
> iova per page of guest memory.
> 
> > * Then vGPU core driver just maintains its own gpa<->iova lookup
> > table (e.g. called vgpu_dma)
> >
> > * Because each vfio_iommu_map invocation is about a contiguous
> > region, we can expect same number of vgpu_dma entries as maintained
> > for vfio_dma list;
> >
> > Then it's vGPU core driver's responsibility to provide gpa<->iova
> > lookup for vendor specific GPU driver. And we don't need worry about
> > tens of thousands of entries. Once we get this simple 'pin-all' model
> > ready, then it can be further extended to support 'pin-sparse'
> > scenario. We still maintain a top-level vgpu_dma list with each entry to
> > further link its own sparse mapping structure. In reality I don't expect
> > we really need to maintain per-page translation even with sparse pinning.
> 
> If you're trying to equate the scale of what we need to track vs what
> type1 currently tracks, they're significantly different.  Possible
> things we need to track include the pfn, the iova, and possibly a
> reference count or some sort of pinned page map.  In the pin-all model
> we can assume that every page is pinned on map and unpinned on unmap,
> so a reference count or map is unnecessary.  We can also assume that we
> can always regenerate the pfn with get_user_pages() from the vaddr, so
> we don't need to track that.  I don't see any way around tracking the
> iova.  The iommu can't tell us this like it can with the normal type1
> model because the pfn is the result of the translation, not the key for
> the translation. So we're always going to have between 1 and
> (size/PAGE_SIZE) iova entries per vgpu_dma entry.  You might be able to
> manage the vgpu_dma with an rb-tree, but each vgpu_dma entry needs some
> data structure tracking every iova.

There is one option. We may use alloc_iova to reserve continuous iova
range for each vgpu_dma range and then use iommu_map/unmap to
write iommu ptes later upon map request (then could be same #entries
as vfio_dma compared to unbounded entries when using dma_map_page). 
Of course this needs to be done in vGPU core driver, since vfio type1 only 
sees a faked iommu domain.

> 
> Sparse mapping has the same issue but of course the tree of iovas is
> potentially incomplete and we need a way to determine where it's
> incomplete.  A page table rooted in the vgpu_dma and indexed by the
> offset from the start vaddr seems like the way to go here.  It's also
> possible that some mediated device models might store the iova in the
> command sent to the device and therefore be able to parse those entries
> back out to unmap them without storing them separately.  This might be
> how the s390 channel-io model would prefer to work.  That seems like
> further validation that such tracking is going to be dependent on the
> mediated driver itself and probably not something to centralize in a
> mediated iommu driver.  Thanks,
> 

Another simpler way might be allocate an array for each memory
regions registered from user space. For a 512MB region, it means
512K*4=2MB array to track pfn or iova mapping corresponding to
a gfn. It may consume more resource than rb tree when not many
pages need to be pinned, but could be less when rb tree increases
a lot. 

Is such array-based approach considered ugly in kernel? :-)

Thanks
Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-12 Thread Tian, Kevin
> From: Neo Jia [mailto:c...@nvidia.com]
> Sent: Friday, May 13, 2016 3:49 AM
> 
> >
> > > Perhaps one possibility would be to allow the vgpu driver to register
> > > map and unmap callbacks.  The unmap callback might provide the
> > > invalidation interface that we're so far missing.  The combination of
> > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > entire VM memory space, ie. for each map callback do a translation
> > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > the translation.
> >
> > Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
> > gpu_device_ops as
> > implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
> > keeping vGPU purely
> > virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
> > compatibility.
> >
> > PS, this has very little to do with pinning wholly or partially. Intel 
> > KVMGT has
> > once been had the whole guest memory pinned, only because we used a 
> > spinlock,
> > which can't sleep at runtime.  We have removed that spinlock in our another
> > upstreaming effort, not here but for i915 driver, so probably no biggie.
> >
> 
> OK, then you guys don't need to pin everything. The next question will be if 
> you
> can send the pinning request from your mediated driver backend to request 
> memory
> pinning like we have demonstrated in the v3 patch, function vfio_pin_pages and
> vfio_unpin_pages?
> 

Jike can you confirm this statement? My feeling is that we don't have such logic
in our device model to figure out which pages need to be pinned on demand. So
currently pin-everything is same requirement in both KVM and Xen side...

Thanks
Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-12 Thread Neo Jia
On Thu, May 12, 2016 at 01:05:52PM -0600, Alex Williamson wrote:
> On Thu, 12 May 2016 08:00:36 +
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Thursday, May 12, 2016 6:06 AM
> > > 
> > > On Wed, 11 May 2016 17:15:15 +0800
> > > Jike Song  wrote:
> > >   
> > > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > > >  From: Song, Jike
> > > > 
> > > >  IIUC, an api-only domain is a VFIO domain *without* underlying 
> > > >  IOMMU
> > > >  hardware. It just, as you said in another mail, "rather than
> > > >  programming them into an IOMMU for a device, it simply stores the
> > > >  translations for use by later requests".
> > > > 
> > > >  That imposes a constraint on gfx driver: hardware IOMMU must be 
> > > >  disabled.
> > > >  Otherwise, if IOMMU is present, the gfx driver eventually programs
> > > >  the hardware IOMMU with IOVA returned by pci_map_page or 
> > > >  dma_map_page;
> > > >  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > > >  translations without any knowledge about hardware IOMMU, how is the
> > > >  device model supposed to do to get an IOVA for a given GPA 
> > > >  (thereby HPA
> > > >  by the IOMMU backend here)?
> > > > 
> > > >  If things go as guessed above, as vfio_pin_pages() indicates, it
> > > >  pin & translate vaddr to PFN, then it will be very difficult for 
> > > >  the
> > > >  device model to figure out:
> > > > 
> > > > 1, for a given GPA, how to avoid calling dma_map_page multiple 
> > > >  times?
> > > > 2, for which page to call dma_unmap_page?
> > > > 
> > > >  --  
> > > > >>>
> > > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > > >>> that fact is out of GPU driver control. A simple way is to use
> > > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > > >>> returned by dma_map_page.
> > > > >>>  
> > > > >>
> > > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here? 
> > > > >>  
> > > > >
> > > > > Hi Jike,
> > > > >
> > > > > With mediated passthru, you still can use hardware iommu, but more 
> > > > > important
> > > > > that part is actually orthogonal to what we are discussing here as we 
> > > > > will only
> > > > > cache the mapping between , 
> > > > > once we
> > > > > have pinned pages later with the help of above info, you can map it 
> > > > > into the
> > > > > proper iommu domain if the system has configured so.
> > > > >  
> > > >
> > > > Hi Neo,
> > > >
> > > > Technically yes you can map a pfn into the proper IOMMU domain 
> > > > elsewhere,
> > > > but to find out whether a pfn was previously mapped or not, you have to
> > > > track it with another rbtree-alike data structure (the IOMMU driver 
> > > > simply
> > > > doesn't bother with tracking), that seems somehow duplicate with the 
> > > > vGPU
> > > > IOMMU backend we are discussing here.
> > > >
> > > > And it is also semantically correct for an IOMMU backend to handle both 
> > > > w/
> > > > and w/o an IOMMU hardware? :)  
> > > 
> > > A problem with the iommu doing the dma_map_page() though is for what
> > > device does it do this?  In the mediated case the vfio infrastructure
> > > is dealing with a software representation of a device.  For all we
> > > know that software model could transparently migrate from one physical
> > > GPU to another.  There may not even be a physical device backing
> > > the mediated device.  Those are details left to the vgpu driver itself.  
> > 
> > This is a fair argument. VFIO iommu driver simply serves user space
> > requests, where only vaddr<->iova (essentially gpa in kvm case) is
> > mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> > 
> > > 
> > > Perhaps one possibility would be to allow the vgpu driver to register
> > > map and unmap callbacks.  The unmap callback might provide the
> > > invalidation interface that we're so far missing.  The combination of
> > > map and unmap callbacks might simplify the Intel approach of pinning the
> > > entire VM memory space, ie. for each map callback do a translation
> > > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > > the translation.  There's still the problem of where that dma_addr_t
> > > from the dma_map_page is stored though.  Someone would need to keep
> > > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > > that since we're already tracking information based on iova, possibly
> > > in an opaque data element provided by the vgpu driver.  However, 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-12 Thread Neo Jia
On Thu, May 12, 2016 at 12:11:00PM +0800, Jike Song wrote:
> On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
>  wrote:
> > On Wed, 11 May 2016 17:15:15 +0800
> > Jike Song  wrote:
> >
> >> On 05/11/2016 12:02 AM, Neo Jia wrote:
> >> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> >> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >>  From: Song, Jike
> >> 
> >>  IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >>  hardware. It just, as you said in another mail, "rather than
> >>  programming them into an IOMMU for a device, it simply stores the
> >>  translations for use by later requests".
> >> 
> >>  That imposes a constraint on gfx driver: hardware IOMMU must be 
> >>  disabled.
> >>  Otherwise, if IOMMU is present, the gfx driver eventually programs
> >>  the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >>  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >>  translations without any knowledge about hardware IOMMU, how is the
> >>  device model supposed to do to get an IOVA for a given GPA (thereby 
> >>  HPA
> >>  by the IOMMU backend here)?
> >> 
> >>  If things go as guessed above, as vfio_pin_pages() indicates, it
> >>  pin & translate vaddr to PFN, then it will be very difficult for the
> >>  device model to figure out:
> >> 
> >>   1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >>   2, for which page to call dma_unmap_page?
> >> 
> >>  --
> >> >>>
> >> >>> We have to support both w/ iommu and w/o iommu case, since
> >> >>> that fact is out of GPU driver control. A simple way is to use
> >> >>> dma_map_page which internally will cope with w/ and w/o iommu
> >> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> >> >>> Then in this file we only need to cache GPA to whatever dmadr_t
> >> >>> returned by dma_map_page.
> >> >>>
> >> >>
> >> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> >> >
> >> > Hi Jike,
> >> >
> >> > With mediated passthru, you still can use hardware iommu, but more 
> >> > important
> >> > that part is actually orthogonal to what we are discussing here as we 
> >> > will only
> >> > cache the mapping between , 
> >> > once we
> >> > have pinned pages later with the help of above info, you can map it into 
> >> > the
> >> > proper iommu domain if the system has configured so.
> >> >
> >>
> >> Hi Neo,
> >>
> >> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> >> but to find out whether a pfn was previously mapped or not, you have to
> >> track it with another rbtree-alike data structure (the IOMMU driver simply
> >> doesn't bother with tracking), that seems somehow duplicate with the vGPU
> >> IOMMU backend we are discussing here.
> >>
> >> And it is also semantically correct for an IOMMU backend to handle both w/
> >> and w/o an IOMMU hardware? :)
> >
> > A problem with the iommu doing the dma_map_page() though is for what
> > device does it do this?  In the mediated case the vfio infrastructure
> > is dealing with a software representation of a device.  For all we
> > know that software model could transparently migrate from one physical
> > GPU to another.  There may not even be a physical device backing
> > the mediated device.  Those are details left to the vgpu driver itself.
> >
> 
> Great point :) Yes, I agree it's a bit intrusive to do the mapping for
> a particular
> pdev in an vGPU IOMMU BE.
> 
> > Perhaps one possibility would be to allow the vgpu driver to register
> > map and unmap callbacks.  The unmap callback might provide the
> > invalidation interface that we're so far missing.  The combination of
> > map and unmap callbacks might simplify the Intel approach of pinning the
> > entire VM memory space, ie. for each map callback do a translation
> > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > the translation.
> 
> Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
> gpu_device_ops as
> implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
> keeping vGPU purely
> virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
> compatibility.
> 
> PS, this has very little to do with pinning wholly or partially. Intel KVMGT 
> has
> once been had the whole guest memory pinned, only because we used a spinlock,
> which can't sleep at runtime.  We have removed that spinlock in our another
> upstreaming effort, not here but for i915 driver, so probably no biggie.
> 

OK, then you guys don't need to pin everything. The next question will be if you
can send the pinning request from your mediated driver backend to request memory
pinning like we have demonstrated in the v3 patch, function vfio_pin_pages and
vfio_unpin_pages?

Thanks,
Neo

> 
> > There's still the problem of where that 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-12 Thread Alex Williamson
On Thu, 12 May 2016 08:00:36 +
"Tian, Kevin"  wrote:

> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Thursday, May 12, 2016 6:06 AM
> > 
> > On Wed, 11 May 2016 17:15:15 +0800
> > Jike Song  wrote:
> >   
> > > On 05/11/2016 12:02 AM, Neo Jia wrote:  
> > > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> > > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
> > >  From: Song, Jike
> > > 
> > >  IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> > >  hardware. It just, as you said in another mail, "rather than
> > >  programming them into an IOMMU for a device, it simply stores the
> > >  translations for use by later requests".
> > > 
> > >  That imposes a constraint on gfx driver: hardware IOMMU must be 
> > >  disabled.
> > >  Otherwise, if IOMMU is present, the gfx driver eventually programs
> > >  the hardware IOMMU with IOVA returned by pci_map_page or 
> > >  dma_map_page;
> > >  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> > >  translations without any knowledge about hardware IOMMU, how is the
> > >  device model supposed to do to get an IOVA for a given GPA (thereby 
> > >  HPA
> > >  by the IOMMU backend here)?
> > > 
> > >  If things go as guessed above, as vfio_pin_pages() indicates, it
> > >  pin & translate vaddr to PFN, then it will be very difficult for the
> > >  device model to figure out:
> > > 
> > >   1, for a given GPA, how to avoid calling dma_map_page multiple 
> > >  times?
> > >   2, for which page to call dma_unmap_page?
> > > 
> > >  --  
> > > >>>
> > > >>> We have to support both w/ iommu and w/o iommu case, since
> > > >>> that fact is out of GPU driver control. A simple way is to use
> > > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > > >>> returned by dma_map_page.
> > > >>>  
> > > >>
> > > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > > >
> > > > Hi Jike,
> > > >
> > > > With mediated passthru, you still can use hardware iommu, but more 
> > > > important
> > > > that part is actually orthogonal to what we are discussing here as we 
> > > > will only
> > > > cache the mapping between , 
> > > > once we
> > > > have pinned pages later with the help of above info, you can map it 
> > > > into the
> > > > proper iommu domain if the system has configured so.
> > > >  
> > >
> > > Hi Neo,
> > >
> > > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > > but to find out whether a pfn was previously mapped or not, you have to
> > > track it with another rbtree-alike data structure (the IOMMU driver simply
> > > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > > IOMMU backend we are discussing here.
> > >
> > > And it is also semantically correct for an IOMMU backend to handle both w/
> > > and w/o an IOMMU hardware? :)  
> > 
> > A problem with the iommu doing the dma_map_page() though is for what
> > device does it do this?  In the mediated case the vfio infrastructure
> > is dealing with a software representation of a device.  For all we
> > know that software model could transparently migrate from one physical
> > GPU to another.  There may not even be a physical device backing
> > the mediated device.  Those are details left to the vgpu driver itself.  
> 
> This is a fair argument. VFIO iommu driver simply serves user space
> requests, where only vaddr<->iova (essentially gpa in kvm case) is
> mattered. How iova is mapped into real IOMMU is not VFIO's interest.
> 
> > 
> > Perhaps one possibility would be to allow the vgpu driver to register
> > map and unmap callbacks.  The unmap callback might provide the
> > invalidation interface that we're so far missing.  The combination of
> > map and unmap callbacks might simplify the Intel approach of pinning the
> > entire VM memory space, ie. for each map callback do a translation
> > (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> > the translation.  There's still the problem of where that dma_addr_t
> > from the dma_map_page is stored though.  Someone would need to keep
> > track of iova to dma_addr_t.  The vfio iommu might be a place to do
> > that since we're already tracking information based on iova, possibly
> > in an opaque data element provided by the vgpu driver.  However, we're
> > going to need to take a serious look at whether an rb-tree is the right
> > data structure for the job.  It works well for the current type1
> > functionality where we typically have tens of entries.  I think the
> > NVIDIA model of sparse pinning the VM is pushing that up to tens of
> > thousands.  If Intel intends to pin 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-12 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Thursday, May 12, 2016 6:06 AM
> 
> On Wed, 11 May 2016 17:15:15 +0800
> Jike Song  wrote:
> 
> > On 05/11/2016 12:02 AM, Neo Jia wrote:
> > > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> > >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >  From: Song, Jike
> > 
> >  IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >  hardware. It just, as you said in another mail, "rather than
> >  programming them into an IOMMU for a device, it simply stores the
> >  translations for use by later requests".
> > 
> >  That imposes a constraint on gfx driver: hardware IOMMU must be 
> >  disabled.
> >  Otherwise, if IOMMU is present, the gfx driver eventually programs
> >  the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >  translations without any knowledge about hardware IOMMU, how is the
> >  device model supposed to do to get an IOVA for a given GPA (thereby HPA
> >  by the IOMMU backend here)?
> > 
> >  If things go as guessed above, as vfio_pin_pages() indicates, it
> >  pin & translate vaddr to PFN, then it will be very difficult for the
> >  device model to figure out:
> > 
> > 1, for a given GPA, how to avoid calling dma_map_page multiple 
> >  times?
> > 2, for which page to call dma_unmap_page?
> > 
> >  --
> > >>>
> > >>> We have to support both w/ iommu and w/o iommu case, since
> > >>> that fact is out of GPU driver control. A simple way is to use
> > >>> dma_map_page which internally will cope with w/ and w/o iommu
> > >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > >>> Then in this file we only need to cache GPA to whatever dmadr_t
> > >>> returned by dma_map_page.
> > >>>
> > >>
> > >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> > >
> > > Hi Jike,
> > >
> > > With mediated passthru, you still can use hardware iommu, but more 
> > > important
> > > that part is actually orthogonal to what we are discussing here as we 
> > > will only
> > > cache the mapping between , 
> > > once we
> > > have pinned pages later with the help of above info, you can map it into 
> > > the
> > > proper iommu domain if the system has configured so.
> > >
> >
> > Hi Neo,
> >
> > Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> > but to find out whether a pfn was previously mapped or not, you have to
> > track it with another rbtree-alike data structure (the IOMMU driver simply
> > doesn't bother with tracking), that seems somehow duplicate with the vGPU
> > IOMMU backend we are discussing here.
> >
> > And it is also semantically correct for an IOMMU backend to handle both w/
> > and w/o an IOMMU hardware? :)
> 
> A problem with the iommu doing the dma_map_page() though is for what
> device does it do this?  In the mediated case the vfio infrastructure
> is dealing with a software representation of a device.  For all we
> know that software model could transparently migrate from one physical
> GPU to another.  There may not even be a physical device backing
> the mediated device.  Those are details left to the vgpu driver itself.

This is a fair argument. VFIO iommu driver simply serves user space
requests, where only vaddr<->iova (essentially gpa in kvm case) is
mattered. How iova is mapped into real IOMMU is not VFIO's interest.

> 
> Perhaps one possibility would be to allow the vgpu driver to register
> map and unmap callbacks.  The unmap callback might provide the
> invalidation interface that we're so far missing.  The combination of
> map and unmap callbacks might simplify the Intel approach of pinning the
> entire VM memory space, ie. for each map callback do a translation
> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> the translation.  There's still the problem of where that dma_addr_t
> from the dma_map_page is stored though.  Someone would need to keep
> track of iova to dma_addr_t.  The vfio iommu might be a place to do
> that since we're already tracking information based on iova, possibly
> in an opaque data element provided by the vgpu driver.  However, we're
> going to need to take a serious look at whether an rb-tree is the right
> data structure for the job.  It works well for the current type1
> functionality where we typically have tens of entries.  I think the
> NVIDIA model of sparse pinning the VM is pushing that up to tens of
> thousands.  If Intel intends to pin the entire guest, that's
> potentially tens of millions of tracked entries and I don't know that
> an rb-tree is the right tool for that job.  Thanks,
> 

Based on above thought I'm thinking whether below would work:
(let's use gpa to replace existing iova in type1 driver, while using iova
for the one actually 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-11 Thread Jike Song
On Thu, May 12, 2016 at 6:06 AM, Alex Williamson
 wrote:
> On Wed, 11 May 2016 17:15:15 +0800
> Jike Song  wrote:
>
>> On 05/11/2016 12:02 AM, Neo Jia wrote:
>> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
>> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
>>  From: Song, Jike
>> 
>>  IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>>  hardware. It just, as you said in another mail, "rather than
>>  programming them into an IOMMU for a device, it simply stores the
>>  translations for use by later requests".
>> 
>>  That imposes a constraint on gfx driver: hardware IOMMU must be 
>>  disabled.
>>  Otherwise, if IOMMU is present, the gfx driver eventually programs
>>  the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
>>  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
>>  translations without any knowledge about hardware IOMMU, how is the
>>  device model supposed to do to get an IOVA for a given GPA (thereby HPA
>>  by the IOMMU backend here)?
>> 
>>  If things go as guessed above, as vfio_pin_pages() indicates, it
>>  pin & translate vaddr to PFN, then it will be very difficult for the
>>  device model to figure out:
>> 
>>   1, for a given GPA, how to avoid calling dma_map_page multiple times?
>>   2, for which page to call dma_unmap_page?
>> 
>>  --
>> >>>
>> >>> We have to support both w/ iommu and w/o iommu case, since
>> >>> that fact is out of GPU driver control. A simple way is to use
>> >>> dma_map_page which internally will cope with w/ and w/o iommu
>> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
>> >>> Then in this file we only need to cache GPA to whatever dmadr_t
>> >>> returned by dma_map_page.
>> >>>
>> >>
>> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
>> >
>> > Hi Jike,
>> >
>> > With mediated passthru, you still can use hardware iommu, but more 
>> > important
>> > that part is actually orthogonal to what we are discussing here as we will 
>> > only
>> > cache the mapping between , once 
>> > we
>> > have pinned pages later with the help of above info, you can map it into 
>> > the
>> > proper iommu domain if the system has configured so.
>> >
>>
>> Hi Neo,
>>
>> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
>> but to find out whether a pfn was previously mapped or not, you have to
>> track it with another rbtree-alike data structure (the IOMMU driver simply
>> doesn't bother with tracking), that seems somehow duplicate with the vGPU
>> IOMMU backend we are discussing here.
>>
>> And it is also semantically correct for an IOMMU backend to handle both w/
>> and w/o an IOMMU hardware? :)
>
> A problem with the iommu doing the dma_map_page() though is for what
> device does it do this?  In the mediated case the vfio infrastructure
> is dealing with a software representation of a device.  For all we
> know that software model could transparently migrate from one physical
> GPU to another.  There may not even be a physical device backing
> the mediated device.  Those are details left to the vgpu driver itself.
>

Great point :) Yes, I agree it's a bit intrusive to do the mapping for
a particular
pdev in an vGPU IOMMU BE.

> Perhaps one possibility would be to allow the vgpu driver to register
> map and unmap callbacks.  The unmap callback might provide the
> invalidation interface that we're so far missing.  The combination of
> map and unmap callbacks might simplify the Intel approach of pinning the
> entire VM memory space, ie. for each map callback do a translation
> (pin) and dma_map_page, for each unmap do a dma_unmap_page and release
> the translation.

Yes adding map/unmap ops in pGPU drvier (I assume you are refering to
gpu_device_ops as
implemented in Kirti's patch) sounds a good idea, satisfying both: 1)
keeping vGPU purely
virtual; 2) dealing with the Linux DMA API to achive hardware IOMMU
compatibility.

PS, this has very little to do with pinning wholly or partially. Intel KVMGT has
once been had the whole guest memory pinned, only because we used a spinlock,
which can't sleep at runtime.  We have removed that spinlock in our another
upstreaming effort, not here but for i915 driver, so probably no biggie.


> There's still the problem of where that dma_addr_t
> from the dma_map_page is stored though.  Someone would need to keep
> track of iova to dma_addr_t.  The vfio iommu might be a place to do
> that since we're already tracking information based on iova, possibly
> in an opaque data element provided by the vgpu driver.

Any reason to keep it opaque? Given that vfio iommu is already tracking
PFN for iova (vaddr as vGPU is), seems adding dma_addr_t as another field is
simple. But I don't have a strong opinion here, opaque definitely
works for me :)

> However, we're
> going to need to 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-11 Thread Alex Williamson
On Wed, 11 May 2016 17:15:15 +0800
Jike Song  wrote:

> On 05/11/2016 12:02 AM, Neo Jia wrote:
> > On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:  
> >> On 05/05/2016 05:27 PM, Tian, Kevin wrote:  
>  From: Song, Jike
> 
>  IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>  hardware. It just, as you said in another mail, "rather than
>  programming them into an IOMMU for a device, it simply stores the
>  translations for use by later requests".
> 
>  That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
>  Otherwise, if IOMMU is present, the gfx driver eventually programs
>  the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
>  Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
>  translations without any knowledge about hardware IOMMU, how is the
>  device model supposed to do to get an IOVA for a given GPA (thereby HPA
>  by the IOMMU backend here)?
> 
>  If things go as guessed above, as vfio_pin_pages() indicates, it
>  pin & translate vaddr to PFN, then it will be very difficult for the
>  device model to figure out:
> 
>   1, for a given GPA, how to avoid calling dma_map_page multiple times?
>   2, for which page to call dma_unmap_page?
> 
>  --  
> >>>
> >>> We have to support both w/ iommu and w/o iommu case, since
> >>> that fact is out of GPU driver control. A simple way is to use
> >>> dma_map_page which internally will cope with w/ and w/o iommu
> >>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> >>> Then in this file we only need to cache GPA to whatever dmadr_t
> >>> returned by dma_map_page.
> >>>  
> >>
> >> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?  
> > 
> > Hi Jike,
> > 
> > With mediated passthru, you still can use hardware iommu, but more important
> > that part is actually orthogonal to what we are discussing here as we will 
> > only
> > cache the mapping between , once 
> > we 
> > have pinned pages later with the help of above info, you can map it into the
> > proper iommu domain if the system has configured so.
> >  
> 
> Hi Neo,
> 
> Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
> but to find out whether a pfn was previously mapped or not, you have to
> track it with another rbtree-alike data structure (the IOMMU driver simply
> doesn't bother with tracking), that seems somehow duplicate with the vGPU
> IOMMU backend we are discussing here.
> 
> And it is also semantically correct for an IOMMU backend to handle both w/
> and w/o an IOMMU hardware? :)

A problem with the iommu doing the dma_map_page() though is for what
device does it do this?  In the mediated case the vfio infrastructure
is dealing with a software representation of a device.  For all we
know that software model could transparently migrate from one physical
GPU to another.  There may not even be a physical device backing
the mediated device.  Those are details left to the vgpu driver itself.

Perhaps one possibility would be to allow the vgpu driver to register
map and unmap callbacks.  The unmap callback might provide the
invalidation interface that we're so far missing.  The combination of
map and unmap callbacks might simplify the Intel approach of pinning the
entire VM memory space, ie. for each map callback do a translation
(pin) and dma_map_page, for each unmap do a dma_unmap_page and release
the translation.  There's still the problem of where that dma_addr_t
from the dma_map_page is stored though.  Someone would need to keep
track of iova to dma_addr_t.  The vfio iommu might be a place to do
that since we're already tracking information based on iova, possibly
in an opaque data element provided by the vgpu driver.  However, we're
going to need to take a serious look at whether an rb-tree is the right
data structure for the job.  It works well for the current type1
functionality where we typically have tens of entries.  I think the
NVIDIA model of sparse pinning the VM is pushing that up to tens of
thousands.  If Intel intends to pin the entire guest, that's
potentially tens of millions of tracked entries and I don't know that
an rb-tree is the right tool for that job.  Thanks,

Alex



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-11 Thread Jike Song
On 05/11/2016 12:02 AM, Neo Jia wrote:
> On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
>> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
 From: Song, Jike

 IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
 hardware. It just, as you said in another mail, "rather than
 programming them into an IOMMU for a device, it simply stores the
 translations for use by later requests".

 That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
 Otherwise, if IOMMU is present, the gfx driver eventually programs
 the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
 Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
 translations without any knowledge about hardware IOMMU, how is the
 device model supposed to do to get an IOVA for a given GPA (thereby HPA
 by the IOMMU backend here)?

 If things go as guessed above, as vfio_pin_pages() indicates, it
 pin & translate vaddr to PFN, then it will be very difficult for the
 device model to figure out:

1, for a given GPA, how to avoid calling dma_map_page multiple times?
2, for which page to call dma_unmap_page?

 --
>>>
>>> We have to support both w/ iommu and w/o iommu case, since
>>> that fact is out of GPU driver control. A simple way is to use
>>> dma_map_page which internally will cope with w/ and w/o iommu
>>> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
>>> Then in this file we only need to cache GPA to whatever dmadr_t
>>> returned by dma_map_page.
>>>
>>
>> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?
> 
> Hi Jike,
> 
> With mediated passthru, you still can use hardware iommu, but more important
> that part is actually orthogonal to what we are discussing here as we will 
> only
> cache the mapping between , once we 
> have pinned pages later with the help of above info, you can map it into the
> proper iommu domain if the system has configured so.
>

Hi Neo,

Technically yes you can map a pfn into the proper IOMMU domain elsewhere,
but to find out whether a pfn was previously mapped or not, you have to
track it with another rbtree-alike data structure (the IOMMU driver simply
doesn't bother with tracking), that seems somehow duplicate with the vGPU
IOMMU backend we are discussing here.

And it is also semantically correct for an IOMMU backend to handle both w/
and w/o an IOMMU hardware? :)


--
Thanks,
Jike




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-10 Thread Neo Jia
On Tue, May 10, 2016 at 03:52:27PM +0800, Jike Song wrote:
> On 05/05/2016 05:27 PM, Tian, Kevin wrote:
> >> From: Song, Jike
> >>
> >> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> >> hardware. It just, as you said in another mail, "rather than
> >> programming them into an IOMMU for a device, it simply stores the
> >> translations for use by later requests".
> >>
> >> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> >> Otherwise, if IOMMU is present, the gfx driver eventually programs
> >> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> >> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> >> translations without any knowledge about hardware IOMMU, how is the
> >> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> >> by the IOMMU backend here)?
> >>
> >> If things go as guessed above, as vfio_pin_pages() indicates, it
> >> pin & translate vaddr to PFN, then it will be very difficult for the
> >> device model to figure out:
> >>
> >>1, for a given GPA, how to avoid calling dma_map_page multiple times?
> >>2, for which page to call dma_unmap_page?
> >>
> >> --
> > 
> > We have to support both w/ iommu and w/o iommu case, since
> > that fact is out of GPU driver control. A simple way is to use
> > dma_map_page which internally will cope with w/ and w/o iommu
> > case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> > Then in this file we only need to cache GPA to whatever dmadr_t
> > returned by dma_map_page.
> > 
> 
> Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?

Hi Jike,

With mediated passthru, you still can use hardware iommu, but more important
that part is actually orthogonal to what we are discussing here as we will only
cache the mapping between , once we 
have pinned pages later with the help of above info, you can map it into the
proper iommu domain if the system has configured so.

Thanks,
Neo

> 
> --
> Thanks,
> Jike
> 



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-10 Thread Jike Song
On 05/05/2016 05:27 PM, Tian, Kevin wrote:
>> From: Song, Jike
>>
>> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
>> hardware. It just, as you said in another mail, "rather than
>> programming them into an IOMMU for a device, it simply stores the
>> translations for use by later requests".
>>
>> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
>> Otherwise, if IOMMU is present, the gfx driver eventually programs
>> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
>> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
>> translations without any knowledge about hardware IOMMU, how is the
>> device model supposed to do to get an IOVA for a given GPA (thereby HPA
>> by the IOMMU backend here)?
>>
>> If things go as guessed above, as vfio_pin_pages() indicates, it
>> pin & translate vaddr to PFN, then it will be very difficult for the
>> device model to figure out:
>>
>>  1, for a given GPA, how to avoid calling dma_map_page multiple times?
>>  2, for which page to call dma_unmap_page?
>>
>> --
> 
> We have to support both w/ iommu and w/o iommu case, since
> that fact is out of GPU driver control. A simple way is to use
> dma_map_page which internally will cope with w/ and w/o iommu
> case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
> Then in this file we only need to cache GPA to whatever dmadr_t
> returned by dma_map_page.
> 

Hi Alex, Kirti and Neo, any thought on the IOMMU compatibility here?

--
Thanks,
Jike




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-05 Thread Tian, Kevin
> From: Song, Jike
> Sent: Thursday, May 05, 2016 2:56 PM
> >
> > The only reason I can come up with for why we'd want to integrate an
> > api-only domain into the existing type1 code would be to avoid page
> > accounting issues where we count locked pages once for a normal
> > assigned device and again for a vgpu, but that's not what we're doing
> > here.  We're not only locking the pages again regardless of them
> > already being locked, we're counting every time we lock them through
> > this new interface.  So there's really no point at all to making type1
> > become this unsupportable.  In that case we should be pulling out the
> > common code that we want to share from type1 and making a new type1
> > compatible vfio iommu backend rather than conditionalizing everything
> > here.
> >
> >> +
> >> +  // add to pfn_list
> >> +  lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
> >> +  if (!lpfn) {
> >> +  ret = -ENOMEM;
> >> +  mutex_unlock(_vgpu->lock);
> >> +  goto pin_done;
> >> +  }
> >> +  lpfn->vmm_va = remote_vaddr;
> >> +  lpfn->iova = iova;
> >> +  lpfn->pfn = pfn[i];
> >> +  lpfn->npage = 1;
> >> +  lpfn->prot = prot;
> >> +  atomic_inc(>ref_count);
> >> +  vfio_link_vgpu_pfn(domain_vgpu, lpfn);
> >> +  mutex_unlock(_vgpu->lock);
> >> +  }
> >> +
> >> +  ret = i;
> >> +
> >> +pin_done:
> >> +  return ret;
> >> +}
> >> +EXPORT_SYMBOL(vfio_pin_pages);
> >> +
> 
> IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
> hardware. It just, as you said in another mail, "rather than
> programming them into an IOMMU for a device, it simply stores the
> translations for use by later requests".
> 
> That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
> Otherwise, if IOMMU is present, the gfx driver eventually programs
> the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
> Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
> translations without any knowledge about hardware IOMMU, how is the
> device model supposed to do to get an IOVA for a given GPA (thereby HPA
> by the IOMMU backend here)?
> 
> If things go as guessed above, as vfio_pin_pages() indicates, it
> pin & translate vaddr to PFN, then it will be very difficult for the
> device model to figure out:
> 
>   1, for a given GPA, how to avoid calling dma_map_page multiple times?
>   2, for which page to call dma_unmap_page?
> 
> --

We have to support both w/ iommu and w/o iommu case, since
that fact is out of GPU driver control. A simple way is to use
dma_map_page which internally will cope with w/ and w/o iommu
case gracefully, i.e. return HPA w/o iommu and IOVA w/ iommu.
Then in this file we only need to cache GPA to whatever dmadr_t
returned by dma_map_page.

Thanks
Kevin


Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-05 Thread Kirti Wankhede


On 5/4/2016 4:13 AM, Alex Williamson wrote:
> On Tue, 3 May 2016 00:10:41 +0530
> Kirti Wankhede  wrote:
>
[..]


>> +  if (domain->vfio_iommu_api_only)
>> +  mm = domain->vmm_mm;
>> +  else
>> +  mm = current->mm;
>> +
>> +  if (!mm)
>> +  return -ENODEV;
>> +
>> +  ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
>
> We could pass domain->mm unconditionally to vaddr_get_pfn(), let it be
> NULL in the !api_only case and use it as a cue to vaddr_get_pfn() which
> gup variant to use.  Of course we need to deal with mmap_sem somewhere
> too without turning the code into swiss cheese.
>

Yes, I missed that. Thanks for pointing out. I'll fix it.

> Correct me if I'm wrong, but I assume the main benefit of interweaving
> this into type1 vs pulling out common code and making a new vfio iommu
> backend is the page accounting, ie. not over accounting locked pages.
> TBH, I don't know if it's worth it.  Any idea what the high water mark
> of pinned pages for a vgpu might be?
>

It depends in which VM (Linux/Windows) is running and what workload is 
running. On Windows DMA pages are managed by WDDM model. On Linux each 
user space application can DMA to pages and there are not restrictions.



> The only reason I can come up with for why we'd want to integrate an
> api-only domain into the existing type1 code would be to avoid page
> accounting issues where we count locked pages once for a normal
> assigned device and again for a vgpu, but that's not what we're doing
> here.  We're not only locking the pages again regardless of them
> already being locked, we're counting every time we lock them through
> this new interface.  So there's really no point at all to making type1
> become this unsupportable.  In that case we should be pulling out the
> common code that we want to share from type1 and making a new type1
> compatible vfio iommu backend rather than conditionalizing everything
> here.
>

I tried to add pfn tracking logic and use already locked pages, but that 
didn't worked somehow, I'll revisit it again.
With this there will be additional pfn tracking logic for the case where 
device is directly assigned and vGPU device is not present.




>> +  // verify if pfn exist in pfn_list
>> +  if (!(p = vfio_find_vgpu_pfn(domain_vgpu, *(pfn + i {
>> +  continue;
>
> How does the caller deal with this, the function returns number of
> pages unpinned which will not match the requested number of pages to
> unpin if there are any missing.  Also, no setting variables within a
> test when easily avoidable please, separate to a set then test.
>

Here we are following the current code logic. Do you have any suggestion 
how to deal with that?



>> +  (iommu, , _vgpu);
>> +
>> +  if (!domain)
>> +  return;
>> +
>> +  d = domain;
>>list_for_each_entry_continue(d, >domain_list, next) {
>> -  iommu_unmap(d->domain, dma->iova, dma->size);
>> -  cond_resched();
>> +  if (!d->vfio_iommu_api_only) {
>> +  iommu_unmap(d->domain, dma->iova, dma->size);
>> +  cond_resched();
>> +  }get_first_domains
>>}
>>
>>while (iova < end) {
>
> How do api-only domain not blowup on the iommu API code in this next
> code block?  Are you just getting lucky that the api-only domain is
> first in the list and the real domain is last?
>

Control will not come here if there is no domain with IOMMU due to below 
change:


>> +  if (!domain)
>> +  return;

get_first_domains() returns the first domain with IOMMU and first domain 
with api_only.



>> +  if (d->vfio_iommu_api_only)
>> +  continue;
>> +
>
> Really disliking all these switches everywhere, too many different code
> paths.
>

I'll move such APIs to inline functions such that this check would be 
within inline functions and code would look much cleaner.



>> +  // Skip pin and map only if domain without IOMMU is present
>> +  if (!domain_with_iommu_present) {
>> +  dma->size = size;
>> +  goto map_done;
>> +  }
>> +
>
> Yet more special cases, the code is getting unsupportable.

From vfio_dma_do_map(), if there is no devices pass-throughed then we 
don't want to pin all pages upfront. and that is the reason of this check.


>
> I'm really not convinced that pushing this into the type1 code is the
> right approach vs pulling out shareable code chunks where it makes
> sense and creating a separate iommu backend.  We're not getting
> anything but code complexity out of this approach it seems.

I find pulling out shared code is also not simple. I would like to 
revisit this again and sort out the concerns you raised rather than 
making separate module.


Thanks,
Kirti.





Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-05 Thread Jike Song
On 05/04/2016 06:43 AM, Alex Williamson wrote:
> On Tue, 3 May 2016 00:10:41 +0530
> Kirti Wankhede  wrote:
>> +
>> +/*
>> + * Pin a set of guest PFNs and return their associated host PFNs for vGPU.
>> + * @vaddr [in]: array of guest PFNs
>> + * @npage [in]: count of array elements
>> + * @prot [in] : protection flags
>> + * @pfn_base[out] : array of host PFNs
>> + */
>> +int vfio_pin_pages(void *iommu_data, dma_addr_t *vaddr, long npage,
>> +   int prot, dma_addr_t *pfn_base)
>> +{
>> +struct vfio_iommu *iommu = iommu_data;
>> +struct vfio_domain *domain = NULL, *domain_vgpu = NULL;
>> +int i = 0, ret = 0;
>> +long retpage;
>> +dma_addr_t remote_vaddr = 0;
>> +dma_addr_t *pfn = pfn_base;
>> +struct vfio_dma *dma;
>> +
>> +if (!iommu || !pfn_base)
>> +return -EINVAL;
>> +
>> +if (list_empty(>domain_list)) {
>> +ret = -EINVAL;
>> +goto pin_done;
>> +}
>> +
>> +get_first_domains(iommu, , _vgpu);
>> +
>> +// Return error if vGPU domain doesn't exist
> 
> No c++ style comments please.
> 
>> +if (!domain_vgpu) {
>> +ret = -EINVAL;
>> +goto pin_done;
>> +}
>> +
>> +for (i = 0; i < npage; i++) {
>> +struct vfio_vgpu_pfn *p;
>> +struct vfio_vgpu_pfn *lpfn;
>> +unsigned long tpfn;
>> +dma_addr_t iova;
>> +
>> +mutex_lock(>lock);
>> +
>> +iova = vaddr[i] << PAGE_SHIFT;
>> +
>> +dma = vfio_find_dma(iommu, iova, 0 /*  size */);
>> +if (!dma) {
>> +mutex_unlock(>lock);
>> +ret = -EINVAL;
>> +goto pin_done;
>> +}
>> +
>> +remote_vaddr = dma->vaddr + iova - dma->iova;
>> +
>> +retpage = vfio_pin_pages_internal(domain_vgpu, remote_vaddr,
>> +  (long)1, prot, );
>> +mutex_unlock(>lock);
>> +if (retpage <= 0) {
>> +WARN_ON(!retpage);
>> +ret = (int)retpage;
>> +goto pin_done;
>> +}
>> +
>> +pfn[i] = tpfn;
>> +
>> +mutex_lock(_vgpu->lock);
>> +
>> +// search if pfn exist
>> +if ((p = vfio_find_vgpu_pfn(domain_vgpu, tpfn))) {
>> +atomic_inc(>ref_count);
>> +mutex_unlock(_vgpu->lock);
>> +continue;
>> +}
> 
> The only reason I can come up with for why we'd want to integrate an
> api-only domain into the existing type1 code would be to avoid page
> accounting issues where we count locked pages once for a normal
> assigned device and again for a vgpu, but that's not what we're doing
> here.  We're not only locking the pages again regardless of them
> already being locked, we're counting every time we lock them through
> this new interface.  So there's really no point at all to making type1
> become this unsupportable.  In that case we should be pulling out the
> common code that we want to share from type1 and making a new type1
> compatible vfio iommu backend rather than conditionalizing everything
> here.
> 
>> +
>> +// add to pfn_list
>> +lpfn = kzalloc(sizeof(*lpfn), GFP_KERNEL);
>> +if (!lpfn) {
>> +ret = -ENOMEM;
>> +mutex_unlock(_vgpu->lock);
>> +goto pin_done;
>> +}
>> +lpfn->vmm_va = remote_vaddr;
>> +lpfn->iova = iova;
>> +lpfn->pfn = pfn[i];
>> +lpfn->npage = 1;
>> +lpfn->prot = prot;
>> +atomic_inc(>ref_count);
>> +vfio_link_vgpu_pfn(domain_vgpu, lpfn);
>> +mutex_unlock(_vgpu->lock);
>> +}
>> +
>> +ret = i;
>> +
>> +pin_done:
>> +return ret;
>> +}
>> +EXPORT_SYMBOL(vfio_pin_pages);
>> +

IIUC, an api-only domain is a VFIO domain *without* underlying IOMMU
hardware. It just, as you said in another mail, "rather than
programming them into an IOMMU for a device, it simply stores the
translations for use by later requests".

That imposes a constraint on gfx driver: hardware IOMMU must be disabled.
Otherwise, if IOMMU is present, the gfx driver eventually programs
the hardware IOMMU with IOVA returned by pci_map_page or dma_map_page;
Meanwhile, the IOMMU backend for vgpu only maintains GPA <-> HPA
translations without any knowledge about hardware IOMMU, how is the
device model supposed to do to get an IOVA for a given GPA (thereby HPA
by the IOMMU backend here)?

If things go as guessed above, as vfio_pin_pages() indicates, it
pin & translate vaddr to PFN, then it will be very difficult for the
device model to figure out:

1, for a given GPA, how to avoid calling dma_map_page multiple times?
2, for which page to call dma_unmap_page?

--
Thanks,
Jike




Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-03 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, May 04, 2016 6:43 AM
> > +int prot, unsigned long *pfn_base)
> >  {
> > +   struct vfio_domain *domain = domain_data;
> > unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > bool lock_cap = capable(CAP_IPC_LOCK);
> > long ret, i;
> > bool rsvd;
> > +   struct mm_struct *mm;
> >
> > -   if (!current->mm)
> > +   if (!domain)
> > return -ENODEV;
> >
> > -   ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> > +   if (domain->vfio_iommu_api_only)
> > +   mm = domain->vmm_mm;
> > +   else
> > +   mm = current->mm;
> > +
> > +   if (!mm)
> > +   return -ENODEV;
> > +
> > +   ret = vaddr_get_pfn(mm, vaddr, prot, pfn_base);
> 
> We could pass domain->mm unconditionally to vaddr_get_pfn(), let it be
> NULL in the !api_only case and use it as a cue to vaddr_get_pfn() which
> gup variant to use.  Of course we need to deal with mmap_sem somewhere
> too without turning the code into swiss cheese.
> 
> Correct me if I'm wrong, but I assume the main benefit of interweaving
> this into type1 vs pulling out common code and making a new vfio iommu
> backend is the page accounting, ie. not over accounting locked pages.
> TBH, I don't know if it's worth it.  Any idea what the high water mark
> of pinned pages for a vgpu might be?

The baseline is same as today's PCI device passthrough, i.e. we need to
pin all memory pages allocated to the said VM at least for current KVMGT. 
Ideally we may reduce the pinned set based on fined-grained resource 
tracking within vGPU device model (then it might be in 100MBs based on 
active graphics memory working set). 

Thanks
Kevin



Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-03 Thread Alex Williamson
On Tue, 3 May 2016 00:10:41 +0530
Kirti Wankhede  wrote:

> VFIO Type1 IOMMU driver is designed for the devices which are IOMMU capable.
> vGPU device are only using IOMMU TYPE1 API, the underlying hardware can be
> managed by an IOMMU domain. To use most of the code of IOMMU driver for vGPU
> devices, type1 IOMMU driver is modified to support vGPU devices. This change
> exports functions to pin and unpin pages for vGPU devices.
> It maintains data of pinned pages for vGPU domain. This data is used to verify
> unpinning request and also used to unpin pages from detach_group().
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - two GPU pass through
>
> Signed-off-by: Kirti Wankhede 
> Signed-off-by: Neo Jia 
> Change-Id: I6e35e9fc7f14049226365e9ecef3814dc4ca1738
> ---
>  drivers/vfio/vfio_iommu_type1.c |  427 
> ---
>  include/linux/vfio.h|6 +
>  include/linux/vgpu.h|4 +-
>  3 files changed, 403 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 75b24e9..a970854 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -36,6 +36,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DRIVER_VERSION  "0.2"
>  #define DRIVER_AUTHOR   "Alex Williamson "
> @@ -67,6 +68,11 @@ struct vfio_domain {
>   struct list_headgroup_list;
>   int prot;   /* IOMMU_CACHE */
>   boolfgsp;   /* Fine-grained super pages */
> + boolvfio_iommu_api_only;/* Domain for device 
> which
> +is without physical 
> IOMMU */
> + struct mm_struct*vmm_mm;/* VMM's mm */

I really don't like assuming a VMM, vfio is a userspace driver
interface, this is just an mm associated with this set of mappings.

> + struct rb_root  pfn_list;   /* Host pfn list for requested 
> gfns */

This is just an iova to pfn list, whether it's a guest running on a
VMM, we don't care.

> + struct mutexlock;   /* mutex for pfn_list */

So pfn_list_lock might be a better name for it.

>  };
>  
>  struct vfio_dma {
> @@ -83,6 +89,19 @@ struct vfio_group {
>  };
>  
>  /*
> + * Guest RAM pinning working set or DMA target
> + */
> +struct vfio_vgpu_pfn {
> + struct rb_node  node;
> + unsigned long   vmm_va; /* VMM virtual addr */

vaddr

> + dma_addr_t  iova;   /* IOVA */
> + unsigned long   npage;  /* number of pages */
> + unsigned long   pfn;/* Host pfn */
> + int prot;
> + atomic_tref_count;
> + struct list_headnext;
> +};

Why is any of this vgpu specific?  It's just a data structure for
tracking iova to vaddr pins.

> +/*
>   * This code handles mapping and unmapping of user data buffers
>   * into DMA'ble space using the IOMMU
>   */
> @@ -130,6 +149,53 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, 
> struct vfio_dma *old)
>   rb_erase(>node, >dma_list);
>  }
>  
> +/*
> + * Helper Functions for host pfn list
> + */
> +
> +static struct vfio_vgpu_pfn *vfio_find_vgpu_pfn(struct vfio_domain *domain,
> + unsigned long pfn)
> +{
> + struct rb_node *node = domain->pfn_list.rb_node;
> +
> + while (node) {
> + struct vfio_vgpu_pfn *vgpu_pfn = rb_entry(node, struct 
> vfio_vgpu_pfn, node);
> +
> + if (pfn <= vgpu_pfn->pfn)
> + node = node->rb_left;
> + else if (pfn >= vgpu_pfn->pfn)
> + node = node->rb_right;
> + else
> + return vgpu_pfn;
> + }
> +
> + return NULL;
> +}
> +
> +static void vfio_link_vgpu_pfn(struct vfio_domain *domain, struct 
> vfio_vgpu_pfn *new)
> +{
> + struct rb_node **link = >pfn_list.rb_node, *parent = NULL;
> + struct vfio_vgpu_pfn *vgpu_pfn;
> +
> + while (*link) {
> + parent = *link;
> + vgpu_pfn = rb_entry(parent, struct vfio_vgpu_pfn, node);
> +
> + if (new->pfn <= vgpu_pfn->pfn)
> + link = &(*link)->rb_left;
> + else
> + link = &(*link)->rb_right;
> + }
> +
> + rb_link_node(>node, parent, link);
> + rb_insert_color(>node, >pfn_list);
> +}
> +
> +static void vfio_unlink_vgpu_pfn(struct vfio_domain *domain, struct 
> vfio_vgpu_pfn *old)
> +{
> + rb_erase(>node, >pfn_list);
> +}
> +

None of the above is really vgpu specific either, just managing a tree
of mappings.  Name things based on what 

Re: [Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-03 Thread Jike Song
On 05/03/2016 02:40 AM, Kirti Wankhede wrote:
> VFIO Type1 IOMMU driver is designed for the devices which are IOMMU capable.
> vGPU device are only using IOMMU TYPE1 API, the underlying hardware can be
> managed by an IOMMU domain. To use most of the code of IOMMU driver for vGPU
> devices, type1 IOMMU driver is modified to support vGPU devices. This change
> exports functions to pin and unpin pages for vGPU devices.
> It maintains data of pinned pages for vGPU domain. This data is used to verify
> unpinning request and also used to unpin pages from detach_group().
> 
> Tested by assigning below combinations of devices to a single VM:
> - GPU pass through only
> - vGPU device only
> - One GPU pass through and one vGPU device
> - two GPU pass through

{the patch trimmed}

Hi Kirti,

 I have a question: in the scenario above, how many PCI BDFs do your vGPUs 
consume?

 Per my understanding, you take the GPA of a KVM guest as the IOVA of IOMMU 
domain,
and if there are multiple guests with vGPU assigned, the vGPUs must belong to
different vGPUs (thereby having different BDFs).

 Do I miss anything?


--
Thanks,
Jike



[Qemu-devel] [RFC PATCH v3 3/3] VFIO Type1 IOMMU change: to support with iommu and without iommu

2016-05-02 Thread Kirti Wankhede
VFIO Type1 IOMMU driver is designed for the devices which are IOMMU capable.
vGPU device are only using IOMMU TYPE1 API, the underlying hardware can be
managed by an IOMMU domain. To use most of the code of IOMMU driver for vGPU
devices, type1 IOMMU driver is modified to support vGPU devices. This change
exports functions to pin and unpin pages for vGPU devices.
It maintains data of pinned pages for vGPU domain. This data is used to verify
unpinning request and also used to unpin pages from detach_group().

Tested by assigning below combinations of devices to a single VM:
- GPU pass through only
- vGPU device only
- One GPU pass through and one vGPU device
- two GPU pass through

Signed-off-by: Kirti Wankhede 
Signed-off-by: Neo Jia 
Change-Id: I6e35e9fc7f14049226365e9ecef3814dc4ca1738
---
 drivers/vfio/vfio_iommu_type1.c |  427 ---
 include/linux/vfio.h|6 +
 include/linux/vgpu.h|4 +-
 3 files changed, 403 insertions(+), 34 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 75b24e9..a970854 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson "
@@ -67,6 +68,11 @@ struct vfio_domain {
struct list_headgroup_list;
int prot;   /* IOMMU_CACHE */
boolfgsp;   /* Fine-grained super pages */
+   boolvfio_iommu_api_only;/* Domain for device 
which
+  is without physical 
IOMMU */
+   struct mm_struct*vmm_mm;/* VMM's mm */
+   struct rb_root  pfn_list;   /* Host pfn list for requested 
gfns */
+   struct mutexlock;   /* mutex for pfn_list */
 };
 
 struct vfio_dma {
@@ -83,6 +89,19 @@ struct vfio_group {
 };
 
 /*
+ * Guest RAM pinning working set or DMA target
+ */
+struct vfio_vgpu_pfn {
+   struct rb_node  node;
+   unsigned long   vmm_va; /* VMM virtual addr */
+   dma_addr_t  iova;   /* IOVA */
+   unsigned long   npage;  /* number of pages */
+   unsigned long   pfn;/* Host pfn */
+   int prot;
+   atomic_tref_count;
+   struct list_headnext;
+};
+/*
  * This code handles mapping and unmapping of user data buffers
  * into DMA'ble space using the IOMMU
  */
@@ -130,6 +149,53 @@ static void vfio_unlink_dma(struct vfio_iommu *iommu, 
struct vfio_dma *old)
rb_erase(>node, >dma_list);
 }
 
+/*
+ * Helper Functions for host pfn list
+ */
+
+static struct vfio_vgpu_pfn *vfio_find_vgpu_pfn(struct vfio_domain *domain,
+   unsigned long pfn)
+{
+   struct rb_node *node = domain->pfn_list.rb_node;
+
+   while (node) {
+   struct vfio_vgpu_pfn *vgpu_pfn = rb_entry(node, struct 
vfio_vgpu_pfn, node);
+
+   if (pfn <= vgpu_pfn->pfn)
+   node = node->rb_left;
+   else if (pfn >= vgpu_pfn->pfn)
+   node = node->rb_right;
+   else
+   return vgpu_pfn;
+   }
+
+   return NULL;
+}
+
+static void vfio_link_vgpu_pfn(struct vfio_domain *domain, struct 
vfio_vgpu_pfn *new)
+{
+   struct rb_node **link = >pfn_list.rb_node, *parent = NULL;
+   struct vfio_vgpu_pfn *vgpu_pfn;
+
+   while (*link) {
+   parent = *link;
+   vgpu_pfn = rb_entry(parent, struct vfio_vgpu_pfn, node);
+
+   if (new->pfn <= vgpu_pfn->pfn)
+   link = &(*link)->rb_left;
+   else
+   link = &(*link)->rb_right;
+   }
+
+   rb_link_node(>node, parent, link);
+   rb_insert_color(>node, >pfn_list);
+}
+
+static void vfio_unlink_vgpu_pfn(struct vfio_domain *domain, struct 
vfio_vgpu_pfn *old)
+{
+   rb_erase(>node, >pfn_list);
+}
+
 struct vwork {
struct mm_struct*mm;
longnpage;
@@ -228,20 +294,22 @@ static int put_pfn(unsigned long pfn, int prot)
return 0;
 }
 
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
+int prot, unsigned long *pfn)
 {
struct page *page[1];
struct vm_area_struct *vma;
int ret = -EFAULT;
 
-   if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+   if (get_user_pages_remote(NULL, mm, vaddr, 1, !!(prot & IOMMU_WRITE),
+   0, page, NULL) == 1) {
*pfn =