Re: [virtio-dev] RE: [RFC] virtio-iommu version 0.4
On 21/09/17 07:27, Tian, Kevin wrote: >> From: Jean-Philippe Brucker >> Sent: Wednesday, September 6, 2017 7:55 PM >> >> Hi Kevin, >> >> On 28/08/17 08:39, Tian, Kevin wrote: >>> Here comes some comments: >>> >>> 1.1 Motivation >>> >>> You describe I/O page faults handling as future work. Seems you >> considered >>> only recoverable fault (since "aka. PCI PRI" being used). What about other >>> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find >>> a valid mapping? Even when there is no PRI support, we need some basic >>> form of fault reporting mechanism to indicate such errors to guest. >> >> I am considering recoverable faults as the end goal, but reporting >> unrecoverable faults should use the same queue, with slightly different >> fields and no need for the driver to reply to the device. > > what about adding a placeholder for now? Though same mechanism > can be reused, it's an essential part to make virtio-iommu architecture > complete even before talking support for recoverable faults. :-) I'll see if I can come up with something simple for v0.5, but it seems like a big chunk of work. I don't really know what to report to the guest at the moment. I don't want to report vendor-specific details about the fault, but it should still be useful content, to let the guest decide whether they need to reset/kill the device or just print something [...] >> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are >> used for both iGPU and USB controllers on my x86 machines. Do you know >> more precisely what they are used for by the firmware? > > VTd spec has a clear description: > > 3.14 Handling Requests to Reserved System Memory > Reserved system memory regions are typically allocated by BIOS at boot > time and reported to OS as reserved address ranges in the system memory > map. 
Requests-without-PASID to these reserved regions may either occur > as a result of operations performed by the system software driver (for > example in the case of DMA from unified memory access (UMA) graphics > controllers to graphics reserved memory), or may be initiated by non > system software (for example in case of DMA performed by a USB > controller under BIOS SMM control for legacy keyboard emulation). > For proper functioning of these legacy reserved memory usages, when > system software enables DMA remapping, the second-level translation > structures for the respective devices are expected to be set up to provide > identity mapping for the specified reserved memory regions with read > and write permissions. > > (one specific example for GPU happens in legacy VGA usage in early > boot time before actual graphics driver is loaded) Thanks for the explanation. So it is only legacy, and enabling nested mode would be forbidden for a device with Reserved System Memory regions? I'm wondering if virtio-iommu RESV regions will be extended to affect a specific PASIDs (or all requests-with-PASID) in the future. >> It's not necessary with the base virtio-iommu device though (v0.4), >> because the device can create the identity mappings itself and report them >> to the guest as MEM_T_BYPASS. However, when we start handing page > > when you say "the device can create ...", I think you really meant > "host iommu driver can create identity mapping for assigned device", > correct? > > Then yes, I think above works. Yes it can be the host IOMMU driver, or simply Qemu sending VFIO ioctls to create those identity mappings (they are reported in sysfs reserved_regions). >> table >> control over to the guest, the host won't be in control of IOVA->GPA >> mappings and will need to gracefully ask the guest to do it. >> >> I'm not aware of any firmware description resembling Intel RMRR or AMD >> IVMD on ARM platforms. 
>> I do think ARM platforms could need MEM_T_IDENTITY for requesting the
>> guest to map MSI windows when page-table handover is in use (MSI
>> addresses are translated by the physical SMMU, so an IOVA->GPA mapping
>> must be installed by the guest). But since a vSMMU would need a
>> solution as well, I think I'll try to implement something more generic.
>
> curious do you need identity mapping full IOVA->GPA->HPA translation,
> or just in GPA->HPA stage sufficient for above MSI scenario?

It has to be IOVA->GPA->HPA. So it'll be a bit complicated to implement for us: I think we're going to need a VFIO ioctl to tell the host what IOVA the guest allocated for its MSI, which is not ideal.

Thanks,
Jean

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] virtio-iommu version 0.4
On 20/09/17 10:37, Auger Eric wrote: > Hi Jean, > On 19/09/2017 12:47, Jean-Philippe Brucker wrote: >> Hi Eric, >> >> On 12/09/17 18:13, Auger Eric wrote: >>> 2.6.7 >>> - As I am currently integrating v0.4 in QEMU here are some other comments: >>> At the moment struct virtio_iommu_req_probe flags is missing in your >>> header. As such I understood the ACK protocol was not implemented by the >>> driver in your branch. >> >> Uh indeed. And yet I could swear I've written that code... somewhere. I >> will add it to the batch of v0.5 changes, it shouldn't be too invasive. >> >>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your >>> header too. >> >> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best >> (though it is a mouthful). >> >>> 2.6.8.2: >>> - I am really confused about what the device should report as resv >>> regions depending on the PE nature (VFIO or not VFIO) >>> >>> In other iommu drivers, the resv regions are populated by the iommu >>> driver through its get_resv_regions callback. They are usually composed >>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU >>> specific (device specific) reserved regions: >>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those >>> are the guest reserved regions. >>> >>> First in the current virtio-iommu driver I don't see the >>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu >>> driver should compute the non IOMMU specific MSI regions ie. this is >>> not the responsability of the virtio-iommu device. >> >> For SW_MSI, certainly. The driver allocates a fixed IOVA region for >> mapping the MSI doorbell. But the driver has to know whether the doorbell >> region is translated or bypassed. > Sorry I was talking about *non* IOMMU specific MSI regions, typically > the regions corresponding to guest PCI host bridge windows. This is > normally computed in the iommu driver and I didn't that that in the > existing virtio-iommu driver. 
Ah right, I don't think the virtio-iommu device has to report the windows of the emulated host bridge. RESV is useful for things that are not obvious to the guest, for instance the physical PCI bridges that are hidden from the guest. It's an interesting point though, I can imagine non-Linux guest that are not as well-equipped with dealing with things like PCI bridge windows, and could benefit from the device reporting them. In the end it is up to the device implementation to decide what regions to report, and make sure that the guest is aware of the various traps in IOVA space. For things like emulated bridges, the device can expect the guest to find out about them. For physical bridges/SW_MSI of the host, it should report the region and make sure that the guest doesn't map them. I'll add a few more examples to the Implementation Notes, but I suspect reading your Qemu source code will always be more helpful to people. >>> Then why is it more the job of the device to return the guest iommu >>> specific region rather than the driver itself? >> >> The MSI region is architectural on x86 IOMMUs, but >> implementation-defined on virtio-iommu. It depends which platform the host >> is emulating. In Linux, x86 IOMMU drivers register the bypass region >> because there always is an IOAPIC on the other end, with a fixed MSI >> address. But virtio-iommu may be either behind a GIC, an APIC or some >> other IRQ chip. >> >> The driver *could* go over all the irqchips/platforms it knows and try to >> guess if there is a fixed doorbell or if it needs to reserve an IOVA for >> them, but it would look horrible. I much prefer having a well-defined way >> of doing this, so a description from the device. > > This means I must have target specific code in the virtio-iommu device > which is unusual, right? I was initially thinking you could handle that > on the driver side using a config set for ARM|ARM64. But on the other > hand you should communicate the info to the device ... 
But the device has to know that it has a region that DMA transactions bypass, right? If you want to implement MSI bypass, then you already have to add a special case to your device code, and reporting it in a probe shouldn't require a lot more work. For example, amdvi_translate() has a special case for amdvi_is_interrupt_addr().

>>> Then I understand this is the responsibility of the virtio-iommu device
>>> to gather information about the host resv regions in case of a VFIO EP.
>>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>>> Also the host MSI reserved IOVA window cannot be used. Do you agree?
>>
>> Yes, all regions reported in sysfs reserved_regions in the host would be
>> reported as RESV_T_RESERVED by virtio-iommu.
>
> So to summarize, if the probe request is sent to an emulated device, we
> should return the target-specific MSI window. We can't and don't return
> the non IOMMU specific guest reserved windows.
>
> For a VFIO device, we would return all reserved regions of the group the
> device belongs to. Is that correct?
RE: [RFC] virtio-iommu version 0.4
> From: Jean-Philippe Brucker > Sent: Wednesday, September 6, 2017 7:55 PM > > Hi Kevin, > > On 28/08/17 08:39, Tian, Kevin wrote: > > Here comes some comments: > > > > 1.1 Motivation > > > > You describe I/O page faults handling as future work. Seems you > considered > > only recoverable fault (since "aka. PCI PRI" being used). What about other > > unrecoverable faults e.g. what to do if a virtual DMA request doesn't find > > a valid mapping? Even when there is no PRI support, we need some basic > > form of fault reporting mechanism to indicate such errors to guest. > > I am considering recoverable faults as the end goal, but reporting > unrecoverable faults should use the same queue, with slightly different > fields and no need for the driver to reply to the device. what about adding a placeholder for now? Though same mechanism can be reused, it's an essential part to make virtio-iommu architecture complete even before talking support for recoverable faults. :-) > > > 2.6.8.2 Property RESV_MEM > > > > I'm not immediately clear when > VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT > > should be explicitly reported. Is there any real example on bare metal > IOMMU? > > usually reserved memory is reported to CPU through other method (e.g. > e820 > > on x86 platform). Of course MSI is a special case which is covered by > BYPASS > > and MSI flag... If yes, maybe you can also include an example in > implementation > > notes. > > The RESV_MEM regions only describes IOVA space for the moment, not > guest-physical, so I guess it provides different information than e820. > > I think a useful example is the PCI bridge windows reported by the Linux > host to userspace using RESV_RESERVED regions (see > iommu_dma_get_resv_regions). If I understand correctly, they represent > DMA > addresses that shouldn't be accessed by endpoints because they won't > reach > the IOMMU. 
These are specific to the physical topology: a device will have > different reserved regions depending on the PCI slot it occupies. > > When handled properly, PCI bridge windows quickly become a nuisance. > With > kvmtool we observed that carving out their addresses globally removes a > lot of useful GPA space from the guest. Without a virtual IOMMU we can > either ignore them and hope everything will be fine, or remove all > reserved regions from the GPA space (which currently means editing by > hand > the static guest-physical map...) > > That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It > describes > reserved IOVAs for a specific endpoint, and therefore removes the need to > carve the window out of the whole guest. Understand and thanks for elaboration. > > > Another thing I want to ask your opinion, about whether there is value of > > adding another subtype (MEM_T_IDENTITY), asking for identity mapping > > in the address space. It's similar to Reserved Memory Region Reporting > > (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved > > memory ranges which may be DMA target and has to be identity mapped > > when DMA remapping is enabled. I'm not sure whether ARM has similar > > capability and whether there might be a general usage beyond VT-d. For > > now the only usage in my mind is to assign a device with RMRR associated > > on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs > > propagated to the guest (since identity mapping also means reservation > > of virtual address space). > > Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are > used for both iGPU and USB controllers on my x86 machines. Do you know > more precisely what they are used for by the firmware? VTd spec has a clear description: 3.14 Handling Requests to Reserved System Memory Reserved system memory regions are typically allocated by BIOS at boot time and reported to OS as reserved address ranges in the system memory map. 
Requests-without-PASID to these reserved regions may either occur as a result of operations performed by the system software driver (for example in the case of DMA from unified memory access (UMA) graphics controllers to graphics reserved memory), or may be initiated by non system software (for example in case of DMA performed by a USB controller under BIOS SMM control for legacy keyboard emulation). For proper functioning of these legacy reserved memory usages, when system software enables DMA remapping, the second-level translation structures for the respective devices are expected to be set up to provide identity mapping for the specified reserved memory regions with read and write permissions.

(one specific example for GPU happens in legacy VGA usage in early boot time before actual graphics driver is loaded)

> It's not necessary with the base virtio-iommu device though (v0.4),
> because the device can create the identity mappings itself and report them
> to the guest as MEM_T_BYPASS. However, when we start handing page table
> control over to the guest, the host won't be in control of IOVA->GPA
> mappings and will need to gracefully ask the guest to do it.

when you say "the device can create ...", I think you really meant "host iommu driver can create identity mapping for assigned device", correct?

Then yes, I think above works.
Re: [RFC] virtio-iommu version 0.4
Hi Jean,

On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
> Hi Eric,
>
> On 12/09/17 18:13, Auger Eric wrote:
>> 2.6.7
>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>> At the moment struct virtio_iommu_req_probe flags is missing in your
>> header. As such I understood the ACK protocol was not implemented by the
>> driver in your branch.
>
> Uh indeed. And yet I could swear I've written that code... somewhere. I
> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
>
>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>> header too.
>
> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
> (though it is a mouthful).
>
>> 2.6.8.2:
>> - I am really confused about what the device should report as resv
>> regions depending on the PE nature (VFIO or not VFIO)
>>
>> In other iommu drivers, the resv regions are populated by the iommu
>> driver through its get_resv_regions callback. They are usually composed
>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>> specific (device specific) reserved regions:
>> iommu_dma_get_resv_regions(). In the case of the virtio-iommu driver,
>> those are the guest reserved regions.
>>
>> First, in the current virtio-iommu driver I don't see the
>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>> driver should compute the non IOMMU specific MSI regions, i.e. this is
>> not the responsibility of the virtio-iommu device?
>
> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
> mapping the MSI doorbell. But the driver has to know whether the doorbell
> region is translated or bypassed.

Sorry, I was talking about *non* IOMMU specific MSI regions, typically the regions corresponding to guest PCI host bridge windows. This is normally computed in the iommu driver and I didn't see that in the existing virtio-iommu driver.
> >> Then why is it more the job of the device to return the guest iommu >> specific region rather than the driver itself? > > The MSI region is architectural on x86 IOMMUs, but > implementation-defined on virtio-iommu. It depends which platform the host > is emulating. In Linux, x86 IOMMU drivers register the bypass region > because there always is an IOAPIC on the other end, with a fixed MSI > address. But virtio-iommu may be either behind a GIC, an APIC or some > other IRQ chip. > > The driver *could* go over all the irqchips/platforms it knows and try to > guess if there is a fixed doorbell or if it needs to reserve an IOVA for > them, but it would look horrible. I much prefer having a well-defined way > of doing this, so a description from the device. This means I must have target specific code in the virtio-iommu device which is unusual, right? I was initially thinking you could handle that on the driver side using a config set for ARM|ARM64. But on the other hand you should communicate the info to the device ... > >> Then I understand this is the responsability of the virtio-iommu device >> to gather information about the host resv regions in case of VFIO EP. >> Typically the host PCIe host bridge windows cannot be used for IOVA. >> Also the host MSI reserved IOVA window cannot be used. Do you agree. > > Yes, all regions reported in sysfs reserved_regions in the host would be > reported as RESV_T_RESERVED by virtio-iommu. So to summarize if the probe request is sent to an emulated device, we should return the target specific MSI window. We can't and don't return the non IOMMU specific guest reserved windows. For a VFIO device, we would return all reserved regions of the group the device belongs to. Is that correct? > >> I really think the spec should clarify what exact resv regions the >> device should return in case of VFIO device and non VFIO device. > > Agreed. 
> I will add something about RESV_T_RESERVED with the PCI bridge example
> in Implementation Notes. Do you think the MSI examples at the end need
> improvement as well? I can try to explain that RESV_MSI regions in
> virtio-iommu are only those of the emulated platform, not the HW or SW
> MSI regions from the host.

I think I would expect an explanation detailing the returned reserved regions for pure emulated devices and HW/VFIO devices.

Another unrelated remark:
- you should add a permission violation error.
- wrt the probe request ACK protocol, this looks pretty heavy as both the driver and the device need to parse the req_probe buffer. The device needs to fill in the output buffer and then read the same info back from the input buffer. Couldn't we imagine something simpler?

Thanks

Eric
Re: [RFC] virtio-iommu version 0.4
Hi Eric, On 12/09/17 18:13, Auger Eric wrote: > 2.6.7 > - As I am currently integrating v0.4 in QEMU here are some other comments: > At the moment struct virtio_iommu_req_probe flags is missing in your > header. As such I understood the ACK protocol was not implemented by the > driver in your branch. Uh indeed. And yet I could swear I've written that code... somewhere. I will add it to the batch of v0.5 changes, it shouldn't be too invasive. > - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your > header too. Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best (though it is a mouthful). > 2.6.8.2: > - I am really confused about what the device should report as resv > regions depending on the PE nature (VFIO or not VFIO) > > In other iommu drivers, the resv regions are populated by the iommu > driver through its get_resv_regions callback. They are usually composed > of an iommu specific MSI region (mapped or bypassed) and non IOMMU > specific (device specific) reserved regions: > iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those > are the guest reserved regions. > > First in the current virtio-iommu driver I don't see the > iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu > driver should compute the non IOMMU specific MSI regions. ie. this is > not the responsability of the virtio-iommu device. For SW_MSI, certainly. The driver allocates a fixed IOVA region for mapping the MSI doorbell. But the driver has to know whether the doorbell region is translated or bypassed. > Then why is it more the job of the device to return the guest iommu > specific region rather than the driver itself? The MSI region is architectural on x86 IOMMUs, but implementation-defined on virtio-iommu. It depends which platform the host is emulating. In Linux, x86 IOMMU drivers register the bypass region because there always is an IOAPIC on the other end, with a fixed MSI address. 
But virtio-iommu may be either behind a GIC, an APIC or some other IRQ chip.

The driver *could* go over all the irqchips/platforms it knows and try to guess if there is a fixed doorbell or if it needs to reserve an IOVA for them, but it would look horrible. I much prefer having a well-defined way of doing this, so a description from the device.

> Then I understand this is the responsibility of the virtio-iommu device
> to gather information about the host resv regions in case of a VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree?

Yes, all regions reported in sysfs reserved_regions in the host would be reported as RESV_T_RESERVED by virtio-iommu.

> I really think the spec should clarify what exact resv regions the
> device should return in case of VFIO device and non VFIO device.

Agreed. I will add something about RESV_T_RESERVED with the PCI bridge example in Implementation Notes. Do you think the MSI examples at the end need improvement as well? I can try to explain that RESV_MSI regions in virtio-iommu are only those of the emulated platform, not the HW or SW MSI regions from the host.

Thanks,
Jean
RE: [RFC] virtio-iommu version 0.4
Hi Eric, > -Original Message- > From: Auger Eric [mailto:eric.au...@redhat.com] > Sent: Tuesday, September 12, 2017 10:43 PM > To: Jean-Philippe Brucker ; > iommu@lists.linux-foundation.org; k...@vger.kernel.org; > virtualizat...@lists.linux-foundation.org; virtio-...@lists.oasis-open.org > Cc: will.dea...@arm.com; robin.mur...@arm.com; > lorenzo.pieral...@arm.com; m...@redhat.com; jasow...@redhat.com; > marc.zyng...@arm.com; eric.auger@gmail.com; Bharat Bhushan > ; pet...@redhat.com; kevin.t...@intel.com > Subject: Re: [RFC] virtio-iommu version 0.4 > > Hi jean, > > On 04/08/2017 20:19, Jean-Philippe Brucker wrote: > > This is the continuation of my proposal for virtio-iommu, the para- > > virtualized IOMMU. Here is a summary of the changes since last time [1]: > > > > * The virtio-iommu document now resembles an actual specification. It is > > split into a formal description of the virtio device, and implementation > > notes. Please find sources and binaries at [2]. > > > > * Added a probe request to describe to the guest different properties that > > do not fit in firmware or in the virtio config space. This is a > > necessary stepping stone for extending the virtio-iommu. > > > > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat > > Bhushan. > > > > You can find the Linux driver and kvmtool device at [4] and [5]. I > > plan to rework driver and kvmtool device slightly before sending the > > patches. > > > > To understand the virtio-iommu, I advise to first read introduction > > and motivation, then skim through implementation notes and finally > > look at the device specification. > > > > I wasn't sure how to organize the review. For those who prefer to > > comment inline, I attached v0.4 of device-operations.tex and > > topology.tex+MSI.tex to this thread. They are the biggest chunks of > > the document. 
But LaTeX isn't very pleasant to read, so you can simply > > send a list of comments in relation to section numbers and a few words of > context, we'll manage. > > > > --- > > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to > > compare more easily differences since the RFC (see [6]), but haven't > > been made public so far. This is the first public posting since > > initial proposal [1], and the following describes all changes. > > > > ## v0.1 ## > > > > Content is the same as the RFC, but formatted to LaTeX. 'make' > > generates one PDF and one HTML document. > > > > ## v0.2 ## > > > > Add introductions, improve topology example and firmware description > > based on feedback and a number of useful discussions. > > > > ## v0.3 ## > > > > Add normative sections (MUST, SHOULD, etc). Clarify some things, > > tighten the device and driver behaviour. Unmap semantics are > > consolidated; they are now closer to VFIO Type1 v2 semantics. > > > > ## v0.4 ## > > > > Introduce PROBE requests. They provide per-endpoint information to the > > driver that couldn't be described otherwise. > > > > For the moment, they allow to handle MSIs on x86 virtual platforms > > (see 3.2). To do that we communicate reserved IOVA regions, that will > > also be useful for describing regions that cannot be mapped for a > > given endpoint, for instance addresses that correspond to a PCI bridge > window. > > > > Introducing such a large framework for this tiny feature may seem > > overkill, but it is needed for future extensions of the virtio-iommu > > and I believe it really is worth the effort. > > 2.6.7 > - As I am currently integrating v0.4 in QEMU here are some other comments: > At the moment struct virtio_iommu_req_probe flags is missing in your > header. As such I understood the ACK protocol was not implemented by the > driver in your branch. > - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is > VIRTIO_IOMMU_T_MASK in your header too. 
> 2.6.8.2:
> - I am really confused about what the device should report as resv regions
> depending on the PE nature (VFIO or not VFIO)
>
> In other iommu drivers, the resv regions are populated by the iommu driver
> through its get_resv_regions callback. They are usually composed of an
> iommu specific MSI region (mapped or bypassed) and non IOMMU specific
> (device specific) reserved regions: iommu_dma_get_resv_regions(). In the
> case of the virtio-iommu driver, those are the guest reserved regions.
>
> First, in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions, i.e. this is
> not the responsibility of the virtio-iommu device?
Re: [RFC] virtio-iommu version 0.4
Hi jean, On 04/08/2017 20:19, Jean-Philippe Brucker wrote: > This is the continuation of my proposal for virtio-iommu, the para- > virtualized IOMMU. Here is a summary of the changes since last time [1]: > > * The virtio-iommu document now resembles an actual specification. It is > split into a formal description of the virtio device, and implementation > notes. Please find sources and binaries at [2]. > > * Added a probe request to describe to the guest different properties that > do not fit in firmware or in the virtio config space. This is a > necessary stepping stone for extending the virtio-iommu. > > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat > Bhushan. > > You can find the Linux driver and kvmtool device at [4] and [5]. I > plan to rework driver and kvmtool device slightly before sending the > patches. > > To understand the virtio-iommu, I advise to first read introduction and > motivation, then skim through implementation notes and finally look at the > device specification. > > I wasn't sure how to organize the review. For those who prefer to comment > inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex > to this thread. They are the biggest chunks of the document. But LaTeX > isn't very pleasant to read, so you can simply send a list of comments in > relation to section numbers and a few words of context, we'll manage. > > --- > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare > more easily differences since the RFC (see [6]), but haven't been made > public so far. This is the first public posting since initial proposal > [1], and the following describes all changes. > > ## v0.1 ## > > Content is the same as the RFC, but formatted to LaTeX. 'make' generates > one PDF and one HTML document. > > ## v0.2 ## > > Add introductions, improve topology example and firmware description based > on feedback and a number of useful discussions. 
>
> ## v0.3 ##
>
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
>
> ## v0.4 ##
>
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
>
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
>
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.

2.6.7
- As I am currently integrating v0.4 in QEMU here are some other comments: At the moment struct virtio_iommu_req_probe flags is missing in your header. As such I understood the ACK protocol was not implemented by the driver in your branch.
- VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your header too.

2.6.8.2:
- I am really confused about what the device should report as resv regions depending on the PE nature (VFIO or not VFIO)

In other iommu drivers, the resv regions are populated by the iommu driver through its get_resv_regions callback. They are usually composed of an iommu specific MSI region (mapped or bypassed) and non IOMMU specific (device specific) reserved regions: iommu_dma_get_resv_regions(). In the case of the virtio-iommu driver, those are the guest reserved regions.

First, in the current virtio-iommu driver I don't see the iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu driver should compute the non IOMMU specific MSI regions, i.e. this is not the responsibility of the virtio-iommu device?
Then why is it more the job of the device to return the guest iommu specific region rather than the driver itself?

Then I understand this is the responsibility of the virtio-iommu device to gather information about the host resv regions in case of a VFIO EP. Typically the host PCIe host bridge windows cannot be used for IOVA. Also the host MSI reserved IOVA window cannot be used. Do you agree?

I really think the spec should clarify what exact resv regions the device should return in case of VFIO device and non VFIO device.

Thanks

Eric

> ## Future ##
>
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
>
> * First, since the IOMMU is paravirtualized, the device can expose some
> properties of the physical topology to the guest, and let it allocate
> resources more efficiently. For example, when the virtio-iommu manages
> both physical and emulated endpoints, with different underlying IOMMUs,
> we now have a way to describe multiple page and block granularities,
> instead of forcing the guest to use the
Re: [RFC] virtio-iommu version 0.4
Hi Kevin,

On 28/08/17 08:39, Tian, Kevin wrote:
> Here comes some comments:
>
> 1.1 Motivation
>
> You describe I/O page faults handling as future work. Seems you considered
> only recoverable fault (since "aka. PCI PRI" being used). What about other
> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
> a valid mapping? Even when there is no PRI support, we need some basic
> form of fault reporting mechanism to indicate such errors to guest.

I am considering recoverable faults as the end goal, but reporting
unrecoverable faults should use the same queue, with slightly different
fields and no need for the driver to reply to the device.

> 2.6.8.2 Property RESV_MEM
>
> I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> should be explicitly reported. Is there any real example on bare metal
> IOMMU? usually reserved memory is reported to CPU through other method
> (e.g. e820 on x86 platform). Of course MSI is a special case which is
> covered by BYPASS and MSI flag... If yes, maybe you can also include an
> example in implementation notes.

The RESV_MEM regions only describe IOVA space for the moment, not
guest-physical, so I guess it provides different information than e820.

I think a useful example is the PCI bridge windows reported by the Linux
host to userspace using RESV_RESERVED regions (see
iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
addresses that shouldn't be accessed by endpoints because they won't reach
the IOMMU. These are specific to the physical topology: a device will have
different reserved regions depending on the PCI slot it occupies.

When handled properly, PCI bridge windows quickly become a nuisance. With
kvmtool we observed that carving out their addresses globally removes a
lot of useful GPA space from the guest.
Without a virtual IOMMU we can either ignore them and hope everything will
be fine, or remove all reserved regions from the GPA space (which currently
means editing the static guest-physical map by hand...)

That's where RESV_MEM_T_ABORT comes in handy with virtio-iommu. It
describes reserved IOVAs for a specific endpoint, and therefore removes the
need to carve the window out of the whole guest.

> Another thing I want to ask your opinion, about whether there is value of
> adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> in the address space. It's similar to Reserved Memory Region Reporting
> (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> memory ranges which may be DMA target and has to be identity mapped
> when DMA remapping is enabled. I'm not sure whether ARM has similar
> capability and whether there might be a general usage beyond VT-d. For
> now the only usage in my mind is to assign a device with RMRR associated
> on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> to be propagated to the guest (since identity mapping also means
> reservation of virtual address space).

Yes, I think adding MEM_T_IDENTITY will be necessary. I can see they are
used for both iGPU and USB controllers on my x86 machines. Do you know
more precisely what they are used for by the firmware?

It's not necessary with the base virtio-iommu device though (v0.4),
because the device can create the identity mappings itself and report them
to the guest as MEM_T_BYPASS. However, when we start handing page table
control over to the guest, the host won't be in control of IOVA->GPA
mappings and will need to gracefully ask the guest to do it.

I'm not aware of any firmware description resembling Intel RMRR or AMD
IVMD on ARM platforms.
I do think ARM platforms could need MEM_T_IDENTITY for requesting the
guest to map MSI windows when page-table handover is in use (MSI addresses
are translated by the physical SMMU, so an IOVA->GPA mapping must be
installed by the guest). But since a vSMMU would need a solution as well,
I think I'll try to implement something more generic.

> 2.6.8.2.3 Device Requirements: Property RESV_MEM
>
> --citation start--
> If an endpoint is attached to an address space, the device SHOULD leave
> any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
> regions pass through untranslated. In other words, the device SHOULD
> handle such a region as if it was identity-mapped (virtual address equal
> to physical address). If the endpoint is not attached to any address
> space, then the device MAY abort the transaction.
> --citation end
>
> I have a question for the last sentence. From definition of BYPASS, it's
> orthogonal to whether there is an address space attached, then should
> we still allow "May abort" behavior?

The behavior is left as an implementation choice, and I'm not sure it's
worth enforcing in the architecture. If the endpoint isn't attached to any
domain (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't necessarily
able to do DMA at all. The virtio-iommu device may set up DMA mastering
lazily, in which
RE: [RFC] virtio-iommu version 0.4
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com] > Sent: Wednesday, August 23, 2017 6:01 PM > > On 04/08/17 19:19, Jean-Philippe Brucker wrote: > > Other extensions are in preparation. I won't detail them here because > v0.4 > > already is a lot to digest, but in short, building on top of PROBE: > > > > * First, since the IOMMU is paravirtualized, the device can expose some > > properties of the physical topology to the guest, and let it allocate > > resources more efficiently. For example, when the virtio-iommu > manages > > both physical and emulated endpoints, with different underlying > IOMMUs, > > we now have a way to describe multiple page and block granularities, > > instead of forcing the guest to use the most restricted one for all > > endpoints. This will most likely be in v0.5. > > In order to extend requests with PASIDs and (later) nested mode, I intend > to rename "address_space" field to "domain", since it is a lot more > precise about what the field is referring to and the current name would > make these extensions confusing. Please find the rationale at [1]. > "ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS" > will > be "VIRTIO_IOMMU_F_DOMAIN_BITS". > > For those that had time to read this version, do you have other comments > and suggestions about v0.4? Otherwise it is the only update I have for > v0.5 (along with fine-grained address range and page size properties from > the quoted text) and I will send it soon. > > In particular, please tell me now if you see the need for other > destructive changes like this one. They will be impossible to introduce > once a driver or device is upstream. > > Thanks, > Jean > > [1] https://www.spinics.net/lists/kvm/msg154573.html Here comes some comments: 1.1 Motivation You describe I/O page faults handling as future work. Seems you considered only recoverable fault (since "aka. PCI PRI" being used). What about other unrecoverable faults e.g. 
what to do if a virtual DMA request doesn't find a valid mapping? Even
when there is no PRI support, we need some basic form of fault reporting
mechanism to indicate such errors to the guest.

2.6.8.2 Property RESV_MEM

I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT should
be explicitly reported. Is there any real example on bare metal IOMMU?
Usually reserved memory is reported to the CPU through other methods (e.g.
e820 on x86 platforms). Of course MSI is a special case which is covered
by BYPASS and the MSI flag... If yes, maybe you can also include an
example in the implementation notes.

Another thing I want to ask your opinion about is whether there is value
in adding another subtype (MEM_T_IDENTITY), asking for identity mapping in
the address space. It's similar to the Reserved Memory Region Reporting
(RMRR) structure defined in VT-d, to indicate BIOS-allocated reserved
memory ranges which may be DMA targets and have to be identity mapped when
DMA remapping is enabled. I'm not sure whether ARM has a similar
capability and whether there might be a general usage beyond VT-d. For now
the only usage in my mind is to assign a device with an RMRR associated on
VT-d (Intel GPU, or some USB controllers) where the RMRR info needs to be
propagated to the guest (since identity mapping also means reservation of
virtual address space).

2.6.8.2.3 Device Requirements: Property RESV_MEM

--citation start--
If an endpoint is attached to an address space, the device SHOULD leave
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
regions pass through untranslated. In other words, the device SHOULD
handle such a region as if it was identity-mapped (virtual address equal
to physical address). If the endpoint is not attached to any address
space, then the device MAY abort the transaction.
--citation end

I have a question about the last sentence. From the definition of BYPASS,
it's orthogonal to whether there is an address space attached, so should
we still allow the "MAY abort" behavior?
Thanks
Kevin

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC] virtio-iommu version 0.4
On 04/08/17 19:19, Jean-Philippe Brucker wrote: > Other extensions are in preparation. I won't detail them here because v0.4 > already is a lot to digest, but in short, building on top of PROBE: > > * First, since the IOMMU is paravirtualized, the device can expose some > properties of the physical topology to the guest, and let it allocate > resources more efficiently. For example, when the virtio-iommu manages > both physical and emulated endpoints, with different underlying IOMMUs, > we now have a way to describe multiple page and block granularities, > instead of forcing the guest to use the most restricted one for all > endpoints. This will most likely be in v0.5. In order to extend requests with PASIDs and (later) nested mode, I intend to rename "address_space" field to "domain", since it is a lot more precise about what the field is referring to and the current name would make these extensions confusing. Please find the rationale at [1]. "ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS" will be "VIRTIO_IOMMU_F_DOMAIN_BITS". For those that had time to read this version, do you have other comments and suggestions about v0.4? Otherwise it is the only update I have for v0.5 (along with fine-grained address range and page size properties from the quoted text) and I will send it soon. In particular, please tell me now if you see the need for other destructive changes like this one. They will be impossible to introduce once a driver or device is upstream. Thanks, Jean [1] https://www.spinics.net/lists/kvm/msg154573.html
Re: [RFC] virtio-iommu version 0.4
On 14/08/17 09:27, Tian, Kevin wrote: >> * First, since the IOMMU is paravirtualized, the device can expose some >> properties of the physical topology to the guest, and let it allocate >> resources more efficiently. For example, when the virtio-iommu manages >> both physical and emulated endpoints, with different underlying IOMMUs, >> we now have a way to describe multiple page and block granularities, >> instead of forcing the guest to use the most restricted one for all >> endpoints. This will most likely be in v0.5. > > emulated IOMMU has similar requirement, e.g. available PASID bits, > address widths, etc. which may break guest usage if not aligned to > physical limitation. Suppose we can introduce a general interface > through VFIO for all vIOMMU incarnations. A nice location for this kind of info would be sysfs, as discussed in the SVM virtualization thread [1]. Properties of an IOMMU could be described in /sys/class/iommu/. Properties of a PCI device are available in its PASID/PRI capabilities. For platform devices we'll have to look at DT and ACPI properties in /sys/firmware. >> * Then on top of that, a major improvement will describe hardware >> acceleration features available to the guest. There is what I call "Page >> Table Handover" (or simply, from the host POV, "Nested"), the ability >> for the guest to manipulate its own page tables instead of sending >> MAP/UNMAP requests to the host. This, along with IO Page Fault >> reporting, will also permit SVM virtualization on different platforms. > > what's your planned cadence for future versions? :-) Hard to say, it depends on a number of things. I have various other tasks eating up my bandwidth at the moment and I may have to considerably rework this version depending on the feedback it gets. Ideally, I would like to get the base driver merged and a proposal for hardware acceleration out by the end of the year, but I obviously can't make any guarantee. 
Thanks,
Jean

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg05731.html
RE: [RFC] virtio-iommu version 0.4
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com] > Sent: Saturday, August 5, 2017 2:19 AM > > This is the continuation of my proposal for virtio-iommu, the para- > virtualized IOMMU. Here is a summary of the changes since last time [1]: > > * The virtio-iommu document now resembles an actual specification. It is > split into a formal description of the virtio device, and implementation > notes. Please find sources and binaries at [2]. > > * Added a probe request to describe to the guest different properties that > do not fit in firmware or in the virtio config space. This is a > necessary stepping stone for extending the virtio-iommu. > > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat > Bhushan. > > You can find the Linux driver and kvmtool device at [4] and [5]. I > plan to rework driver and kvmtool device slightly before sending the > patches. > > To understand the virtio-iommu, I advise to first read introduction and > motivation, then skim through implementation notes and finally look at the > device specification. > > I wasn't sure how to organize the review. For those who prefer to comment > inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex > to this thread. They are the biggest chunks of the document. But LaTeX > isn't very pleasant to read, so you can simply send a list of comments in > relation to section numbers and a few words of context, we'll manage. > > --- > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare > more easily differences since the RFC (see [6]), but haven't been made > public so far. This is the first public posting since initial proposal > [1], and the following describes all changes. > > ## v0.1 ## > > Content is the same as the RFC, but formatted to LaTeX. 'make' generates > one PDF and one HTML document. 
> > ## v0.2 ## > > Add introductions, improve topology example and firmware description > based > on feedback and a number of useful discussions. > > ## v0.3 ## > > Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten > the device and driver behaviour. Unmap semantics are consolidated; they > are now closer to VFIO Type1 v2 semantics. > > ## v0.4 ## > > Introduce PROBE requests. They provide per-endpoint information to the > driver that couldn't be described otherwise. > > For the moment, they allow to handle MSIs on x86 virtual platforms (see > 3.2). To do that we communicate reserved IOVA regions, that will also be > useful for describing regions that cannot be mapped for a given endpoint, > for instance addresses that correspond to a PCI bridge window. > > Introducing such a large framework for this tiny feature may seem > overkill, but it is needed for future extensions of the virtio-iommu and I > believe it really is worth the effort. > > ## Future ## > > Other extensions are in preparation. I won't detail them here because v0.4 > already is a lot to digest, but in short, building on top of PROBE: > > * First, since the IOMMU is paravirtualized, the device can expose some > properties of the physical topology to the guest, and let it allocate > resources more efficiently. For example, when the virtio-iommu manages > both physical and emulated endpoints, with different underlying IOMMUs, > we now have a way to describe multiple page and block granularities, > instead of forcing the guest to use the most restricted one for all > endpoints. This will most likely be in v0.5. emulated IOMMU has similar requirement, e.g. available PASID bits, address widths, etc. which may break guest usage if not aligned to physical limitation. Suppose we can introduce a general interface through VFIO for all vIOMMU incarnations. > > * Then on top of that, a major improvement will describe hardware > acceleration features available to the guest. 
There is what I call "Page > Table Handover" (or simply, from the host POV, "Nested"), the ability > for the guest to manipulate its own page tables instead of sending > MAP/UNMAP requests to the host. This, along with IO Page Fault > reporting, will also permit SVM virtualization on different platforms. what's your planned cadence for future versions? :-) > > Thanks, > Jean > > [1] http://www.spinics.net/lists/kvm/msg147990.html > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4 > http://www.linux-arm.org/git?p=virtio- > iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf > I reiterate the disclaimers: don't use this document as a reference, > it's a draft. It's also not an OASIS document yet. It may be riddled > with mistakes. As this is a working draft, it is unstable and I do not > guarantee backward compatibility of future versions. > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg4.html > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4 > Warning: UAPI headers have changed! They didn't follow the spec, > please upda