Re: [virtio-dev] RE: [RFC] virtio-iommu version 0.4

2017-09-25 Thread Jean-Philippe Brucker
On 21/09/17 07:27, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker
>> Sent: Wednesday, September 6, 2017 7:55 PM
>>
>> Hi Kevin,
>>
>> On 28/08/17 08:39, Tian, Kevin wrote:
>>> Here are some comments:
>>>
>>> 1.1 Motivation
>>>
>>> You describe I/O page fault handling as future work. Seems you considered
>>> only recoverable faults (since "aka. PCI PRI" is used). What about other
>>> unrecoverable faults, e.g. what to do if a virtual DMA request doesn't find
>>> a valid mapping? Even when there is no PRI support, we need some basic
>>> form of fault reporting mechanism to indicate such errors to the guest.
>>
>> I am considering recoverable faults as the end goal, but reporting
>> unrecoverable faults should use the same queue, with slightly different
>> fields and no need for the driver to reply to the device.
> 
> What about adding a placeholder for now? Though the same mechanism
> can be reused, it's an essential part of making the virtio-iommu
> architecture complete, even before talking about support for recoverable
> faults. :-)

I'll see if I can come up with something simple for v0.5, but it seems
like a big chunk of work. I don't really know what to report to the guest
at the moment. I don't want to report vendor-specific details about the
fault, but it should still be useful enough content to let the guest
decide whether it needs to reset/kill the device or just print something.
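
To make the discussion concrete, here is a minimal sketch of what such a
report could look like, as an element the device pushes on a shared event
queue. All names and fields below are illustrative only, not from any
version of the draft:

    #include <linux/types.h>

    /* Hypothetical unrecoverable-fault report, pushed by the device;
     * the guest would not reply to it */
    struct virtio_iommu_fault_report {
            __u8    reason;         /* e.g. no mapping, permission error */
            __u8    reserved[3];
            __le32  flags;          /* e.g. read/write, address valid */
            __le32  endpoint;       /* faulting endpoint ID */
            __le64  address;        /* faulting virtual address, if known */
    };

The device would simply drop the faulting transaction and report it, and
the guest could decide what to do with the endpoint from there.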

[...]
>> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
>> used for both iGPU and USB controllers on my x86 machines. Do you know
>> more precisely what they are used for by the firmware?
> 
> The VT-d spec has a clear description:
> 
> 3.14 Handling Requests to Reserved System Memory
> Reserved system memory regions are typically allocated by BIOS at boot 
> time and reported to OS as reserved address ranges in the system memory 
> map. Requests-without-PASID to these reserved regions may either occur 
> as a result of operations performed by the system software driver (for 
> example in the case of DMA from unified memory access (UMA) graphics 
> controllers to graphics reserved memory), or may be initiated by non 
> system software (for example in case of DMA performed by a USB 
> controller under BIOS SMM control for legacy keyboard emulation). 
> For proper functioning of these legacy reserved memory usages, when 
> system software enables DMA remapping, the second-level translation 
> structures for the respective devices are expected to be set up to provide
> identity mapping for the specified reserved memory regions with read 
> and write permissions.
> 
> (one specific example for the GPU is legacy VGA usage in early boot,
> before the actual graphics driver is loaded)

Thanks for the explanation. So it is only legacy, and enabling nested mode
would be forbidden for a device with Reserved System Memory regions? I'm
wondering if virtio-iommu RESV regions will be extended to affect specific
PASIDs (or all requests-with-PASID) in the future.

>> It's not necessary with the base virtio-iommu device though (v0.4),
>> because the device can create the identity mappings itself and report them
>> to the guest as MEM_T_BYPASS. However, when we start handing page
> 
> When you say "the device can create ...", I think you really meant
> "the host iommu driver can create identity mappings for the assigned
> device", correct?
> 
> Then yes, I think the above works.

Yes it can be the host IOMMU driver, or simply Qemu sending VFIO ioctls to
create those identity mappings (they are reported in sysfs reserved_regions).
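
For illustration, this is roughly what replaying one such region through
the existing VFIO type1 API could look like (a sketch only; it assumes an
already-configured container and host memory backing the range, and omits
error handling):

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Map guest-physical range [start, end] at IOVA == GPA, so that
     * legacy DMA (e.g. to an RMRR region) keeps working under remapping */
    static int map_identity(int container, __u64 start, __u64 end,
                            void *vaddr)
    {
            struct vfio_iommu_type1_dma_map map = {
                    .argsz = sizeof(map),
                    .flags = VFIO_DMA_MAP_FLAG_READ |
                             VFIO_DMA_MAP_FLAG_WRITE,
                    .vaddr = (__u64)(uintptr_t)vaddr, /* host backing memory */
                    .iova  = start,                   /* identity: IOVA == GPA */
                    .size  = end - start + 1,
            };

            return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }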

>> table
>> control over to the guest, the host won't be in control of IOVA->GPA
>> mappings and will need to gracefully ask the guest to do it.
>>
>> I'm not aware of any firmware description resembling Intel RMRR or AMD
>> IVMD on ARM platforms. I do think ARM platforms could need MEM_T_IDENTITY
>> for requesting the guest to map MSI windows when page-table handover is
>> in use (MSI addresses are translated by the physical SMMU, so an IOVA->GPA
>> mapping must be installed by the guest). But since a vSMMU would need a
>> solution as well, I think I'll try to implement something more generic.
> 
> Curious whether you need identity mapping for the full IOVA->GPA->HPA
> translation, or is the GPA->HPA stage sufficient for the above MSI
> scenario?

It has to be IOVA->GPA->HPA, so it'll be a bit complicated to implement
for us. I think we're going to need a VFIO ioctl to tell the host which
IOVA the guest allocated for its MSI, which isn't ideal.
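
Nothing like this exists in VFIO today, so purely as a strawman, such an
ioctl could carry something like the following (every name here is
hypothetical):

    #include <linux/types.h>

    /* Strawman only -- NOT a real VFIO ioctl. It would tell the host
     * which IOVA the guest reserved for its MSI doorbell mapping. */
    struct vfio_iommu_msi_binding {
            __u32   argsz;
            __u32   flags;
            __u64   iova;   /* guest-chosen IOVA of the MSI window */
            __u64   size;
    };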

Thanks,
Jean


Re: [RFC] virtio-iommu version 0.4

2017-09-21 Thread Jean-Philippe Brucker
On 20/09/17 10:37, Auger Eric wrote:
> Hi Jean,
> On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
>> Hi Eric,
>>
>> On 12/09/17 18:13, Auger Eric wrote:
>>> 2.6.7
>>> - As I am currently integrating v0.4 in QEMU, here are some other comments:
>>> At the moment the struct virtio_iommu_req_probe flags field is missing in
>>> your header. As such I understood the ACK protocol was not implemented by
>>> the driver in your branch.
>>
>> Uh indeed. And yet I could swear I've written that code... somewhere. I
>> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
>>
>>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>>> header too.
>>
>> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
>> (though it is a mouthful).
>>
>>> 2.6.8.2:
>>> - I am really confused about what the device should report as resv
>>> regions depending on the PE nature (VFIO or not VFIO)
>>>
>>> In other iommu drivers, the resv regions are populated by the iommu
>>> driver through its get_resv_regions callback. They are usually composed
>>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>>> specific (device specific) reserved regions:
>>> iommu_dma_get_resv_regions(). In the case of the virtio-iommu driver,
>>> those are the guest reserved regions.
>>>
>>> First, in the current virtio-iommu driver I don't see the
>>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>>> driver should compute the non IOMMU specific MSI regions, i.e. this is
>>> not the responsibility of the virtio-iommu device?
>>
>> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
>> mapping the MSI doorbell. But the driver has to know whether the doorbell
>> region is translated or bypassed.
> Sorry, I was talking about *non* IOMMU specific MSI regions, typically
> the regions corresponding to guest PCI host bridge windows. This is
> normally computed in the iommu driver and I didn't see that in the
> existing virtio-iommu driver.

Ah right, I don't think the virtio-iommu device has to report the windows
of the emulated host bridge. RESV is useful for things that are not
obvious to the guest, for instance the physical PCI bridges that are
hidden from the guest.

It's an interesting point though: I can imagine non-Linux guests that are
not as well equipped to deal with things like PCI bridge windows, and
could benefit from the device reporting them.

In the end it is up to the device implementation to decide what regions to
report, and make sure that the guest is aware of the various traps in IOVA
space. For things like emulated bridges, the device can expect the guest
to find out about them. For physical bridges/SW_MSI of the host, it should
report the region and make sure that the guest doesn't map them. I'll add
a few more examples to the Implementation Notes, but I suspect reading
your Qemu source code will always be more helpful to people.

>>> Then why is it more the job of the device to return the guest iommu
>>> specific region rather than the driver itself?
>>
>> The MSI region is architectural on x86 IOMMUs, but
>> implementation-defined on virtio-iommu. It depends which platform the host
>> is emulating. In Linux, x86 IOMMU drivers register the bypass region
>> because there always is an IOAPIC on the other end, with a fixed MSI
>> address. But virtio-iommu may be either behind a GIC, an APIC or some
>> other IRQ chip.
>>
>> The driver *could* go over all the irqchips/platforms it knows and try to
>> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
>> them, but it would look horrible. I much prefer having a well-defined way
>> of doing this, so a description from the device.
> 
> This means I must have target-specific code in the virtio-iommu device,
> which is unusual, right? I was initially thinking you could handle that
> on the driver side using a config set for ARM|ARM64. But on the other
> hand you should communicate the info to the device ...

But the device has to know that it has a region that DMA transactions
bypass, right? If you want to implement MSI bypass, then you already have
to add a special case to your device code, and reporting it in a probe
shouldn't require a lot more work. For example, amdvi_translate() has a
special case for amdvi_is_interrupt_addr().
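
i.e. something along these lines in the translate path (a sketch of the
general shape, not QEMU's actual code; the window values and
viommu_lookup_mapping() are assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    #define MSI_WINDOW_START 0xfee00000ULL /* example: x86 MSI doorbell */
    #define MSI_WINDOW_END   0xfeefffffULL

    /* Assumed helper performing the normal mapping lookup */
    bool viommu_lookup_mapping(uint64_t iova, uint64_t *out);

    static bool viommu_translate(uint64_t iova, uint64_t *out)
    {
            /* Special case: interrupt writes go through untranslated */
            if (iova >= MSI_WINDOW_START && iova <= MSI_WINDOW_END) {
                    *out = iova;
                    return true;
            }
            return viommu_lookup_mapping(iova, out);
    }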

>>> Then I understand it is the responsibility of the virtio-iommu device
>>> to gather information about the host resv regions in the case of a VFIO
>>> EP. Typically the host PCIe host bridge windows cannot be used for IOVA.
>>> Also the host MSI reserved IOVA window cannot be used. Do you agree?
>>
>> Yes, all regions reported in sysfs reserved_regions in the host would be
>> reported as RESV_T_RESERVED by virtio-iommu.
> So to summarize: if the probe request is sent to an emulated device, we
> should return the target-specific MSI window. We can't and don't return
> the non IOMMU specific guest reserved windows.
> 
> For a VFIO device, we would return all reserved 

RE: [RFC] virtio-iommu version 0.4

2017-09-21 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Wednesday, September 6, 2017 7:55 PM
> 
> Hi Kevin,
> 
> On 28/08/17 08:39, Tian, Kevin wrote:
> > Here are some comments:
> >
> > 1.1 Motivation
> >
> > You describe I/O page fault handling as future work. Seems you considered
> > only recoverable faults (since "aka. PCI PRI" is used). What about other
> > unrecoverable faults, e.g. what to do if a virtual DMA request doesn't find
> > a valid mapping? Even when there is no PRI support, we need some basic
> > form of fault reporting mechanism to indicate such errors to the guest.
> 
> I am considering recoverable faults as the end goal, but reporting
> unrecoverable faults should use the same queue, with slightly different
> fields and no need for the driver to reply to the device.

What about adding a placeholder for now? Though the same mechanism
can be reused, it's an essential part of making the virtio-iommu architecture
complete, even before talking about support for recoverable faults. :-)

> 
> > 2.6.8.2 Property RESV_MEM
> >
> > I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> > should be explicitly reported. Is there any real example on bare metal
> > IOMMU? Usually reserved memory is reported to the CPU through other
> > methods (e.g. e820 on x86 platforms). Of course MSI is a special case
> > which is covered by BYPASS and the MSI flag... If yes, maybe you can also
> > include an example in the implementation notes.
> 
> The RESV_MEM regions only describe IOVA space for the moment, not
> guest-physical, so I guess it provides different information than e820.
> 
> I think a useful example is the PCI bridge windows reported by the Linux
> host to userspace using RESV_RESERVED regions (see
> iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
> addresses that shouldn't be accessed by endpoints because they won't reach
> the IOMMU. These are specific to the physical topology: a device will have
> different reserved regions depending on the PCI slot it occupies.
> 
> When handled properly, PCI bridge windows quickly become a nuisance. With
> kvmtool we observed that carving out their addresses globally removes a
> lot of useful GPA space from the guest. Without a virtual IOMMU we can
> either ignore them and hope everything will be fine, or remove all
> reserved regions from the GPA space (which currently means editing the
> static guest-physical map by hand...)
> 
> That's where RESV_MEM_T_ABORT comes in handy with virtio-iommu. It
> describes reserved IOVAs for a specific endpoint, and therefore removes
> the need to carve the window out of the whole guest.

Understood, and thanks for the elaboration.

> 
> > Another thing I want to ask your opinion on: whether there is value in
> > adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> > in the address space. It's similar to the Reserved Memory Region
> > Reporting (RMRR) structure defined in VT-d, to indicate BIOS-allocated
> > reserved memory ranges which may be DMA targets and have to be identity
> > mapped when DMA remapping is enabled. I'm not sure whether ARM has a
> > similar capability and whether there might be a general usage beyond
> > VT-d. For now the only usage in my mind is to assign a device with an
> > RMRR associated on VT-d (Intel GPU, or some USB controllers), where the
> > RMRR info needs to be propagated to the guest (since identity mapping
> > also means reservation of virtual address space).
> 
> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
> used for both iGPU and USB controllers on my x86 machines. Do you know
> more precisely what they are used for by the firmware?

The VT-d spec has a clear description:

3.14 Handling Requests to Reserved System Memory
Reserved system memory regions are typically allocated by BIOS at boot 
time and reported to OS as reserved address ranges in the system memory 
map. Requests-without-PASID to these reserved regions may either occur 
as a result of operations performed by the system software driver (for 
example in the case of DMA from unified memory access (UMA) graphics 
controllers to graphics reserved memory), or may be initiated by non 
system software (for example in case of DMA performed by a USB 
controller under BIOS SMM control for legacy keyboard emulation). 
For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the second-level translation 
structures for the respective devices are expected to be set up to provide
identity mapping for the specified reserved memory regions with read 
and write permissions.

(one specific example for the GPU is legacy VGA usage in early boot,
before the actual graphics driver is loaded)

> 
> It's not necessary with the base virtio-iommu device though (v0.4),
> because the device can create the identity mappings itself and report them
> to the guest as MEM_T_BYPASS. However, when we start handing page

when you say "the device can create ...", I think 

Re: [RFC] virtio-iommu version 0.4

2017-09-20 Thread Auger Eric
Hi Jean,
On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
> Hi Eric,
> 
> On 12/09/17 18:13, Auger Eric wrote:
>> 2.6.7
>> - As I am currently integrating v0.4 in QEMU, here are some other comments:
>> At the moment the struct virtio_iommu_req_probe flags field is missing in
>> your header. As such I understood the ACK protocol was not implemented by
>> the driver in your branch.
> 
> Uh indeed. And yet I could swear I've written that code... somewhere. I
> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
> 
>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>> header too.
> 
> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
> (though it is a mouthful).
> 
>> 2.6.8.2:
>> - I am really confused about what the device should report as resv
>> regions depending on the PE nature (VFIO or not VFIO)
>>
>> In other iommu drivers, the resv regions are populated by the iommu
>> driver through its get_resv_regions callback. They are usually composed
>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>> specific (device specific) reserved regions:
>> iommu_dma_get_resv_regions(). In the case of the virtio-iommu driver,
>> those are the guest reserved regions.
>>
>> First, in the current virtio-iommu driver I don't see the
>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>> driver should compute the non IOMMU specific MSI regions, i.e. this is
>> not the responsibility of the virtio-iommu device?
> 
> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
> mapping the MSI doorbell. But the driver has to know whether the doorbell
> region is translated or bypassed.
Sorry, I was talking about *non* IOMMU specific MSI regions, typically
the regions corresponding to guest PCI host bridge windows. This is
normally computed in the iommu driver and I didn't see that in the
existing virtio-iommu driver.
> 
>> Then why is it more the job of the device to return the guest iommu
>> specific region rather than the driver itself?
> 
> The MSI region is architectural on x86 IOMMUs, but
> implementation-defined on virtio-iommu. It depends which platform the host
> is emulating. In Linux, x86 IOMMU drivers register the bypass region
> because there always is an IOAPIC on the other end, with a fixed MSI
> address. But virtio-iommu may be either behind a GIC, an APIC or some
> other IRQ chip.
> 
> The driver *could* go over all the irqchips/platforms it knows and try to
> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
> them, but it would look horrible. I much prefer having a well-defined way
> of doing this, so a description from the device.

This means I must have target-specific code in the virtio-iommu device,
which is unusual, right? I was initially thinking you could handle that
on the driver side using a config set for ARM|ARM64. But on the other
hand you should communicate the info to the device ...

> 
>> Then I understand it is the responsibility of the virtio-iommu device
>> to gather information about the host resv regions in the case of a VFIO
>> EP. Typically the host PCIe host bridge windows cannot be used for IOVA.
>> Also the host MSI reserved IOVA window cannot be used. Do you agree?
> 
> Yes, all regions reported in sysfs reserved_regions in the host would be
> reported as RESV_T_RESERVED by virtio-iommu.
So to summarize: if the probe request is sent to an emulated device, we
should return the target-specific MSI window. We can't and don't return
the non IOMMU specific guest reserved windows.

For a VFIO device, we would return all reserved regions of the group the
device belongs to. Is that correct?
> 
>> I really think the spec should clarify what exact resv regions the
>> device should return in the case of VFIO devices and non-VFIO devices.
> 
> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
> example in Implementation Notes. Do you think the MSI examples at the end
> need improvement as well? I can try to explain that RESV_MSI regions in
> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
> regions from the host.

I think I would expect an explanation detailing returned reserved
regions for pure emulated devices and HW/VFIO devices.

Two other unrelated remarks:
- you should add a permission violation error.
- wrt the probe request ACK protocol, this looks pretty heavy, as both
the driver and the device need to parse the req_probe buffer. The device
needs to fill in the output buffer and then read the same info back from
the input buffer. Couldn't we imagine something simpler?
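
For reference, the parsing itself is a plain type/length walk on both
sides. A sketch of the driver side, assuming a TLV-style property header
(the exact field layout in the draft may differ):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Assumed TLV header preceding each probe property value */
    struct viommu_probe_property {
            uint16_t type;          /* masked with ..._PROPERTY_TYPE_MASK */
            uint16_t length;        /* size of the value that follows */
    };

    /* Assumed per-property callback */
    void handle_property(const struct viommu_probe_property *prop,
                         const uint8_t *value);

    static void parse_probe_buffer(const uint8_t *buf, size_t len)
    {
            size_t off = 0;

            while (off + sizeof(struct viommu_probe_property) <= len) {
                    struct viommu_probe_property prop;

                    memcpy(&prop, buf + off, sizeof(prop));
                    if (!prop.type)         /* NONE terminates the list */
                            break;
                    if (off + sizeof(prop) + prop.length > len)
                            break;          /* malformed, stop parsing */
                    handle_property(&prop, buf + off + sizeof(prop));
                    off += sizeof(prop) + prop.length;
            }
    }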

Thanks

Eric
> 
> Thanks,
> Jean
> 


Re: [RFC] virtio-iommu version 0.4

2017-09-19 Thread Jean-Philippe Brucker
Hi Eric,

On 12/09/17 18:13, Auger Eric wrote:
> 2.6.7
> - As I am currently integrating v0.4 in QEMU, here are some other comments:
> At the moment the struct virtio_iommu_req_probe flags field is missing in
> your header. As such I understood the ACK protocol was not implemented by
> the driver in your branch.

Uh indeed. And yet I could swear I've written that code... somewhere. I
will add it to the batch of v0.5 changes, it shouldn't be too invasive.

> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
> header too.

Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
(though it is a mouthful).

> 2.6.8.2:
> - I am really confused about what the device should report as resv
> regions depending on the PE nature (VFIO or not VFIO)
>
> In other iommu drivers, the resv regions are populated by the iommu
> driver through its get_resv_regions callback. They are usually composed
> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
> specific (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of the virtio-iommu driver,
> those are the guest reserved regions.
>
> First, in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions, i.e. this is
> not the responsibility of the virtio-iommu device?

For SW_MSI, certainly. The driver allocates a fixed IOVA region for
mapping the MSI doorbell. But the driver has to know whether the doorbell
region is translated or bypassed.

> Then why is it more the job of the device to return the guest iommu
> specific region rather than the driver itself?

The MSI region is architectural on x86 IOMMUs, but
implementation-defined on virtio-iommu. It depends which platform the host
is emulating. In Linux, x86 IOMMU drivers register the bypass region
because there always is an IOAPIC on the other end, with a fixed MSI
address. But virtio-iommu may be either behind a GIC, an APIC or some
other IRQ chip.

The driver *could* go over all the irqchips/platforms it knows and try to
guess if there is a fixed doorbell or if it needs to reserve an IOVA for
them, but it would look horrible. I much prefer having a well-defined way
of doing this, so a description from the device.

> Then I understand it is the responsibility of the virtio-iommu device
> to gather information about the host resv regions in the case of a VFIO
> EP. Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree?

Yes, all regions reported in sysfs reserved_regions in the host would be
reported as RESV_T_RESERVED by virtio-iommu.
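
The host exposes them as one "start end type" line per region under
/sys/kernel/iommu_groups/<group>/reserved_regions, so collecting them is
straightforward (a sketch; the exact formatting depends on the kernel
version):

    #include <stdio.h>

    /* Dump host reserved regions for an IOMMU group, e.g. from
     * /sys/kernel/iommu_groups/<n>/reserved_regions */
    static void dump_resv_regions(const char *path)
    {
            unsigned long long start, end;
            char type[32];
            FILE *f = fopen(path, "r");

            if (!f)
                    return;
            while (fscanf(f, "%llx %llx %31s", &start, &end, type) == 3)
                    printf("resv [0x%llx - 0x%llx] %s\n", start, end, type);
            fclose(f);
    }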

> I really think the spec should clarify what exact resv regions the
> device should return in the case of VFIO devices and non-VFIO devices.

Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
example in Implementation Notes. Do you think the MSI examples at the end
need improvement as well? I can try to explain that RESV_MSI regions in
virtio-iommu are only those of the emulated platform, not the HW or SW MSI
regions from the host.

Thanks,
Jean


RE: [RFC] virtio-iommu version 0.4

2017-09-12 Thread Bharat Bhushan
Hi Eric,

> -Original Message-
> From: Auger Eric [mailto:eric.au...@redhat.com]
> Sent: Tuesday, September 12, 2017 10:43 PM
> To: Jean-Philippe Brucker <jean-philippe.bruc...@arm.com>;
> iommu@lists.linux-foundation.org; k...@vger.kernel.org;
> virtualizat...@lists.linux-foundation.org; virtio-...@lists.oasis-open.org
> Cc: will.dea...@arm.com; robin.mur...@arm.com;
> lorenzo.pieral...@arm.com; m...@redhat.com; jasow...@redhat.com;
> marc.zyng...@arm.com; eric.auger@gmail.com; Bharat Bhushan
> <bharat.bhus...@nxp.com>; pet...@redhat.com; kevin.t...@intel.com
> Subject: Re: [RFC] virtio-iommu version 0.4
> 
> Hi Jean,
> 
> On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> > This is the continuation of my proposal for virtio-iommu, the para-
> > virtualized IOMMU. Here is a summary of the changes since last time [1]:
> >
> > * The virtio-iommu document now resembles an actual specification. It is
> >   split into a formal description of the virtio device, and implementation
> >   notes. Please find sources and binaries at [2].
> >
> > * Added a probe request to describe to the guest different properties that
> >   do not fit in firmware or in the virtio config space. This is a
> >   necessary stepping stone for extending the virtio-iommu.
> >
> > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
> >   Bhushan.
> >
> > You can find the Linux driver and kvmtool device at [4] and [5]. I
> > plan to rework driver and kvmtool device slightly before sending the
> > patches.
> >
> > To understand the virtio-iommu, I advise first reading the introduction
> > and motivation, then skimming through the implementation notes, and
> > finally looking at the device specification.
> >
> > I wasn't sure how to organize the review. For those who prefer to
> > comment inline, I attached v0.4 of device-operations.tex and
> > topology.tex+MSI.tex to this thread. They are the biggest chunks of
> > the document. But LaTeX isn't very pleasant to read, so you can simply
> > send a list of comments in relation to section numbers and a few words
> > of context, we'll manage.
> >
> > ---
> > Version numbers 0.1-0.4 are arbitrary. I'm hoping they make it easier
> > to compare differences since the RFC (see [6]), but they haven't been
> > made public so far. This is the first public posting since the initial
> > proposal [1], and the following describes all changes.
> >
> > ## v0.1 ##
> >
> > Content is the same as the RFC, but formatted to LaTeX. 'make'
> > generates one PDF and one HTML document.
> >
> > ## v0.2 ##
> >
> > Add introductions, improve topology example and firmware description
> > based on feedback and a number of useful discussions.
> >
> > ## v0.3 ##
> >
> > Add normative sections (MUST, SHOULD, etc). Clarify some things,
> > tighten the device and driver behaviour. Unmap semantics are
> > consolidated; they are now closer to VFIO Type1 v2 semantics.
> >
> > ## v0.4 ##
> >
> > Introduce PROBE requests. They provide per-endpoint information to the
> > driver that couldn't be described otherwise.
> >
> > For the moment, they allow handling MSIs on x86 virtual platforms
> > (see 3.2). To do that we communicate reserved IOVA regions, that will
> > also be useful for describing regions that cannot be mapped for a
> > given endpoint, for instance addresses that correspond to a PCI bridge
> > window.
> >
> > Introducing such a large framework for this tiny feature may seem
> > overkill, but it is needed for future extensions of the virtio-iommu
> > and I believe it really is worth the effort.
> 
> 2.6.7
> - As I am currently integrating v0.4 in QEMU, here are some other comments:
> At the moment the struct virtio_iommu_req_probe flags field is missing in
> your header. As such I understood the ACK protocol was not implemented by
> the driver in your branch.
> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in
> your header too.
> 2.6.8.2:
> - I am really confused about what the device should report as resv regions
> depending on the PE nature (VFIO or not VFIO)
> 
> In other iommu drivers, the resv regions are populated by the iommu driver
> through its get_resv_regions callback. They are usually composed of an
> iommu specific MSI region (mapped or bypassed) and non IOMMU specific
> (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of the virtio-iommu driver,
> those are the guest reserved regions.
> 
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_r

Re: [RFC] virtio-iommu version 0.4

2017-09-12 Thread Auger Eric
Hi Jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise first reading the introduction
> and motivation, then skimming through the implementation notes, and
> finally looking at the device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they make it easier to
> compare differences since the RFC (see [6]), but they haven't been made
> public so far. This is the first public posting since the initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow handling MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.

2.6.7
- As I am currently integrating v0.4 in QEMU, here are some other comments:
At the moment the struct virtio_iommu_req_probe flags field is missing in
your header. As such I understood the ACK protocol was not implemented by
the driver in your branch.
- VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
header too.
2.6.8.2:
- I am really confused about what the device should report as resv
regions depending on the PE nature (VFIO or not VFIO)

In other iommu drivers, the resv regions are populated by the iommu
driver through its get_resv_regions callback. They are usually composed
of an iommu specific MSI region (mapped or bypassed) and non IOMMU
specific (device specific) reserved regions:
iommu_dma_get_resv_regions(). In the case of the virtio-iommu driver,
those are the guest reserved regions.

First, in the current virtio-iommu driver I don't see the
iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
driver should compute the non IOMMU specific MSI regions, i.e. this is
not the responsibility of the virtio-iommu device?

Then why is it more the job of the device to return the guest iommu
specific region rather than the driver itself?

Then I understand it is the responsibility of the virtio-iommu device
to gather information about the host resv regions in the case of a VFIO
EP. Typically the host PCIe host bridge windows cannot be used for IOVA.
Also the host MSI reserved IOVA window cannot be used. Do you agree?

I really think the spec should clarify what exact resv regions the
device should return in the case of VFIO devices and non-VFIO devices.

Thanks

Eric

> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use 

Re: [RFC] virtio-iommu version 0.4

2017-09-06 Thread Jean-Philippe Brucker
Hi Kevin,

On 28/08/17 08:39, Tian, Kevin wrote:
> Here are some comments:
> 
> 1.1 Motivation
> 
> You describe I/O page fault handling as future work. Seems you considered
> only recoverable faults (since "aka. PCI PRI" is used). What about other
> unrecoverable faults, e.g. what to do if a virtual DMA request doesn't find
> a valid mapping? Even when there is no PRI support, we need some basic
> form of fault reporting mechanism to indicate such errors to the guest.

I am considering recoverable faults as the end goal, but reporting
unrecoverable faults should use the same queue, with slightly different
fields and no need for the driver to reply to the device.

> 2.6.8.2 Property RESV_MEM
> 
> I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> should be explicitly reported. Is there any real example on bare metal
> IOMMU? Usually reserved memory is reported to the CPU through other
> methods (e.g. e820 on x86 platforms). Of course MSI is a special case
> which is covered by BYPASS and the MSI flag... If yes, maybe you can also
> include an example in the implementation notes.
The RESV_MEM regions only describe IOVA space for the moment, not
guest-physical, so I guess it provides different information than e820.

I think a useful example is the PCI bridge windows reported by the Linux
host to userspace using RESV_RESERVED regions (see
iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
addresses that shouldn't be accessed by endpoints because they won't reach
the IOMMU. These are specific to the physical topology: a device will have
different reserved regions depending on the PCI slot it occupies.

When handled properly, PCI bridge windows quickly become a nuisance. With
kvmtool we observed that carving out their addresses globally removes a
lot of useful GPA space from the guest. Without a virtual IOMMU we can
either ignore them and hope everything will be fine, or remove all
reserved regions from the GPA space (which currently means editing the
static guest-physical map by hand...)

That's where RESV_MEM_T_ABORT comes in handy with virtio-iommu. It describes
reserved IOVAs for a specific endpoint, and therefore removes the need to
carve the window out of the whole guest.
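
As a refresher on the shape of that property, it could look something like
this (an illustrative layout; the actual v0.4 structure may differ in field
names and sizes):

    #include <linux/types.h>

    /* Illustrative RESV_MEM probe property -- not copied from the draft */
    struct virtio_iommu_probe_resv_mem {
            __le16  type;           /* property type: RESV_MEM */
            __le16  length;
            __u8    subtype;        /* T_RESERVED, T_ABORT, T_BYPASS, T_MSI */
            __u8    reserved[3];
            __le64  addr;           /* start of the reserved virtual range */
            __le64  size;
    };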

> Another thing I want to ask your opinion on: whether there is value in
> adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> in the address space. It's similar to the Reserved Memory Region Reporting
> (RMRR) structure defined in VT-d, to indicate BIOS-allocated reserved
> memory ranges which may be DMA targets and have to be identity mapped
> when DMA remapping is enabled. I'm not sure whether ARM has a similar
> capability and whether there might be a general usage beyond VT-d. For
> now the only usage in my mind is to assign a device with an RMRR
> associated on VT-d (Intel GPU, or some USB controllers), where the RMRR
> info needs to be propagated to the guest (since identity mapping also
> means reservation of virtual address space).

Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
used for both iGPU and USB controllers on my x86 machines. Do you know
more precisely what they are used for by the firmware?

It's not necessary with the base virtio-iommu device though (v0.4),
because the device can create the identity mappings itself and report them
to the guest as MEM_T_BYPASS. However, when we start handing page table
control over to the guest, the host won't be in control of IOVA->GPA
mappings and will need to gracefully ask the guest to do it.

I'm not aware of any firmware description resembling Intel RMRR or AMD
IVMD on ARM platforms. I do think ARM platforms could need MEM_T_IDENTITY
for requesting the guest to map MSI windows when page-table handover is in
use (MSI addresses are translated by the physical SMMU, so an IOVA->GPA
mapping must be installed by the guest). But since a vSMMU would need a
solution as well, I think I'll try to implement something more generic.

> 2.6.8.2.3 Device Requirements: Property RESV_MEM
> 
> --citation start--
> If an endpoint is attached to an address space, the device SHOULD leave 
> any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
> regions pass through untranslated. In other words, the device SHOULD 
> handle such a region as if it was identity-mapped (virtual address equal to
> physical address). If the endpoint is not attached to any address space, 
> then the device MAY abort the transaction.
> --citation end
> 
> I have a question about the last sentence. From the definition of BYPASS,
> it's orthogonal to whether there is an address space attached, so should
> we still allow the "MAY abort" behavior?

The behavior is left as an implementation choice, and I'm not sure it's
worth enforcing in the architecture. If the endpoint isn't attached to any
domain (and unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
necessarily able to do DMA at all. The virtio-iommu device may set up DMA
mastering lazily, in which 

RE: [RFC] virtio-iommu version 0.4

2017-08-28 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, August 23, 2017 6:01 PM
> 
> On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> > Other extensions are in preparation. I won't detail them here because
> > v0.4 already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu manages
> >   both physical and emulated endpoints, with different underlying IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> 
> In order to extend requests with PASIDs and (later) nested mode, I intend
> to rename the "address_space" field to "domain", since it is a lot more
> precise about what the field is referring to and the current name would
> make these extensions confusing. Please find the rationale at [1].
> "ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS" will
> be "VIRTIO_IOMMU_F_DOMAIN_BITS".
> 
> For those that had time to read this version, do you have other comments
> and suggestions about v0.4? Otherwise it is the only update I have for
> v0.5 (along with fine-grained address range and page size properties from
> the quoted text) and I will send it soon.
> 
> In particular, please tell me now if you see the need for other
> destructive changes like this one. They will be impossible to introduce
> once a driver or device is upstream.
> 
> Thanks,
> Jean
> 
> [1] https://www.spinics.net/lists/kvm/msg154573.html

Here are some comments:

1.1 Motivation

You describe I/O page fault handling as future work. Seems you considered
only recoverable faults (since "aka. PCI PRI" is used). What about other
unrecoverable faults, e.g. what to do if a virtual DMA request doesn't find
a valid mapping? Even when there is no PRI support, we need some basic
form of fault reporting mechanism to indicate such errors to the guest.

2.6.8.2 Property RESV_MEM

I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
should be explicitly reported. Is there any real example on bare metal
IOMMU? Usually reserved memory is reported to the CPU through other
methods (e.g. e820 on x86 platforms). Of course MSI is a special case
which is covered by BYPASS and the MSI flag... If yes, maybe you can also
include an example in the implementation notes.

Another thing I want to ask your opinion on: whether there is value in
adding another subtype (MEM_T_IDENTITY), asking for identity mapping
in the address space. It's similar to the Reserved Memory Region Reporting
(RMRR) structure defined in VT-d, to indicate BIOS-allocated reserved
memory ranges which may be DMA targets and have to be identity mapped
when DMA remapping is enabled. I'm not sure whether ARM has a similar
capability and whether there might be a general usage beyond VT-d. For
now the only usage in my mind is to assign a device with an RMRR
associated on VT-d (Intel GPU, or some USB controllers), where the RMRR
info needs to be propagated to the guest (since identity mapping also
means reservation of virtual address space).

2.6.8.2.3 Device Requirements: Property RESV_MEM

--citation start--
If an endpoint is attached to an address space, the device SHOULD leave 
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
regions pass through untranslated. In other words, the device SHOULD 
handle such a region as if it was identity-mapped (virtual address equal to
physical address). If the endpoint is not attached to any address space, 
then the device MAY abort the transaction.
--citation end

I have a question about the last sentence. From the definition of BYPASS,
it's orthogonal to whether there is an address space attached, so should
we still allow the "MAY abort" behavior?

Thanks
Kevin 


Re: [RFC] virtio-iommu version 0.4

2017-08-23 Thread Jean-Philippe Brucker
On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

In order to extend requests with PASIDs and (later) nested mode, I intend
to rename the "address_space" field to "domain", since it is a lot more
precise about what the field is referring to and the current name would
make these extensions confusing. Please find the rationale at [1].
"ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS" will
be "VIRTIO_IOMMU_F_DOMAIN_BITS".

For those that had time to read this version, do you have other comments
and suggestions about v0.4? Otherwise it is the only update I have for
v0.5 (along with fine-grained address range and page size properties from
the quoted text) and I will send it soon.

In particular, please tell me now if you see the need for other
destructive changes like this one. They will be impossible to introduce
once a driver or device is upstream.

Thanks,
Jean

[1] https://www.spinics.net/lists/kvm/msg154573.html


Re: [RFC] virtio-iommu version 0.4

2017-08-14 Thread Jean-Philippe Brucker
On 14/08/17 09:27, Tian, Kevin wrote:
>> * First, since the IOMMU is paravirtualized, the device can expose some
>>   properties of the physical topology to the guest, and let it allocate
>>   resources more efficiently. For example, when the virtio-iommu manages
>>   both physical and emulated endpoints, with different underlying IOMMUs,
>>   we now have a way to describe multiple page and block granularities,
>>   instead of forcing the guest to use the most restricted one for all
>>   endpoints. This will most likely be in v0.5.
> 
> An emulated IOMMU has similar requirements, e.g. available PASID bits,
> address widths, etc., which may break guest usage if not aligned to the
> physical limitations. Perhaps we can introduce a general interface
> through VFIO for all vIOMMU incarnations.

A nice location for this kind of info would be sysfs, as discussed in the
SVM virtualization thread [1]. Properties of an IOMMU could be described
in /sys/class/iommu/. Properties of a PCI device are available in its
PASID/PRI capabilities. For platform devices we'll have to look at DT and
ACPI properties in /sys/firmware.

>> * Then on top of that, a major improvement will describe hardware
>>   acceleration features available to the guest. There is what I call "Page
>>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>>   for the guest to manipulate its own page tables instead of sending
>>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>>   reporting, will also permit SVM virtualization on different platforms.
> 
> What's your planned cadence for future versions? :-)

Hard to say, it depends on a number of things. I have various other tasks
eating up my bandwidth at the moment and I may have to considerably rework
this version depending on the feedback it gets. Ideally, I would like to
get the base driver merged and a proposal for hardware acceleration out by
the end of the year, but I obviously can't make any guarantee.

Thanks,
Jean

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg05731.html


RE: [RFC] virtio-iommu version 0.4

2017-08-14 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Saturday, August 5, 2017 2:19 AM
> 
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise first reading the introduction
> and motivation, then skimming through the implementation notes, and
> finally looking at the device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they make it easier to
> compare differences since the RFC (see [6]), but they haven't been made
> public so far. This is the first public posting since the initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description
> based on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow handling MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

An emulated IOMMU has similar requirements, e.g. available PASID bits,
address widths, etc., which may break guest usage if not aligned to the
physical limitations. Perhaps we can introduce a general interface
through VFIO for all vIOMMU incarnations.

> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.

What's your planned cadence for future versions? :-)

> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> I reiterate the disclaimers: don't use this document as a reference,
> it's a draft. It's also not an OASIS document yet. It may be riddled
> with mistakes. As this is a working draft, it is unstable and I do not
> guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg4.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> Warning: UAPI headers have changed! They didn't follow the spec,
> please