[virtio-dev] RE: [PATCH v3] virtio-iommu: Rework the bypass feature

2021-11-02 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Friday, October 22, 2021 8:12 PM
> 
> The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete.
> Although it is implemented by QEMU, it is not supported by any driver as
> far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG
> feature.
> 
> Two features are missing from virtio-iommu:
> 
> * The ability for a hypervisor to start the device in bypass mode. The
>   wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at
>   the moment, because it only specifies the behavior after feature
>   negotiation.
> 
> * The ability for a guest to set individual endpoints in bypass mode
>   when bypass is globally disabled. An OS should have the ability to
>   allow only endpoints it trusts to bypass the IOMMU, while keeping DMA
>   disabled for endpoints it isn't even aware of. At the moment this can
>   only be emulated by creating identity mappings.
> 
> The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field
> that allows to enable and disable bypass globally. It also adds a new
> flag for the ATTACH request.
> 
> * The hypervisor can start the VM with bypass enabled or, if it knows
>   that the software stack supports it, disabled. The 'bypass' config
>   field is initialized to 0 or 1. It is sticky and isn't affected by
>   device reset.
> 
> * Generally the firmware won't have an IOMMU driver and will need to be
>   started in bypass mode, so the bootloader and kernel can be loaded
>   from a storage endpoint.
> 
>   For more security, the firmware could implement a minimal virtio-iommu
>   driver that reuses existing virtio support and only touches the config
>   space. It could enable PCI bus mastering in bridges only for the
>   endpoints that need it, enable global IOMMU bypass by flipping a bit,
>   then tear everything down before handing control over to the OS. This
>   prevents vulnerability windows where a malicious endpoint reprograms
>   the IOMMU while the OS is configuring it [1].
> 
>   The isolation provided by vIOMMUs has mainly been used for securely
>   assigning endpoints to untrusted applications so far, while kernel DMA
>   bypasses the IOMMU. But we can expect boot security to become as
>   important in virtualization as it presently is on bare-metal systems,
>   where some devices are untrusted and must never be able to access
>   memory that wasn't assigned to them.
> 
> * The OS can enable and disable bypass globally. It can then enable
>   bypass for individual endpoints by attaching them to bypass domains,
>   using the new VIRTIO_IOMMU_ATTACH_F_BYPASS flag. It can disable
>   bypass by attaching them to normal domains.
> 
> [1] IOMMU protection against I/O attacks: a vulnerability and a proof of
> concept
> Morgan, B., Alata, É., Nicomette, V. et al.
> https://link.springer.com/article/10.1186/s13173-017-0066-7
> 
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/119
> Signed-off-by: Jean-Philippe Brucker 

Reviewed-by: Kevin Tian 
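
For reference, the 'bypass' field and ATTACH flag described above ended up
with the following layout; a sketch following the Linux UAPI header
<linux/virtio_iommu.h> for the merged feature (types as in <linux/types.h>):

  struct virtio_iommu_range_64 { __le64 start; __le64 end; };
  struct virtio_iommu_range_32 { __le32 start; __le32 end; };
  struct virtio_iommu_req_head { __u8 type; __u8 reserved[3]; };
  struct virtio_iommu_req_tail { __u8 status; __u8 reserved[3]; };

  struct virtio_iommu_config {
          __le64                          page_size_mask;
          struct virtio_iommu_range_64    input_range;
          struct virtio_iommu_range_32    domain_range;
          __le32                          probe_size;
          __u8                            bypass;         /* sticky global bypass toggle */
          __u8                            reserved[3];
  };

  #define VIRTIO_IOMMU_ATTACH_F_BYPASS    (1 << 0)        /* per-endpoint bypass */

  struct virtio_iommu_req_attach {
          struct virtio_iommu_req_head    head;
          __le32                          domain;
          __le32                          endpoint;
          __le32                          flags;          /* ATTACH_F_BYPASS */
          __u8                            reserved[4];
          struct virtio_iommu_req_tail    tail;
  };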

> ---
> 
> The virtio-iommu spec with colored diff is available at
> https://jpbrucker.net/virtio-iommu/spec-bypass/virtio-iommu-f-bypass-
> config-v3-diff.pdf
> 
> v1: https://www.mail-archive.com/virtio-dev@lists.oasis-
> open.org/msg07817.html
> v2: https://www.mail-archive.com/virtio-dev@lists.oasis-
> open.org/msg07898.html
> v3:
> * Normative statement about device reset vs. system reset - the bypass
>   bit is sticky across device reset to avoid the vulnerability described
>   above, but restored on system reset (a term also used by the virtio
>   memory device).
> * Explain that the field bypass is in effect as long as the new feature
>   is offered, even when not accepted by the driver
> * Another clarification about the state of an endpoint after detach.
> ---
>  conformance.tex  |   1 -
>  virtio-iommu.tex | 111 ++-
>  2 files changed, 90 insertions(+), 22 deletions(-)
> 
> diff --git a/conformance.tex b/conformance.tex
> index c2d7959..5c10ab9 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -504,7 +504,6 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>  \begin{itemize}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Feature bits}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Device configuration layout}
> -\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Device operations}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / ATTACH request}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / DETACH request}
> diff --git a/virtio-iommu.tex b/virtio-iommu.tex
> index 08b358a..f8cbe88 100644
> --- a/virtio-iommu.tex
> +++ b/virtio-iommu.tex
> @@ -59,14 +59,21 @@ \subsection{Feature bits}\label{sec:Device Types / IOMMU Device / Feature bits}
>VIRTIO_IOMMU_F_MAP_UNMAP is s

RE: [virtio-dev] [PATCH v2] virtio-iommu: Rework the bypass feature

2021-10-13 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Wednesday, October 13, 2021 7:11 PM
> 
> >
> > >
> > > * Generally the firmware won't have an IOMMU driver and will need to be
> > >   started in bypass mode, so the bootloader and kernel can be loaded
> > >   from a storage endpoint.
> >
> > How does the firmware know whether storage load is allowed w/o looking
> > at the bypass info in virtio-iommu?
> 
> Either the firmware can't drive virtio-iommu, and must be booted in
> bypass, or it can and then it sets/clears bypass. That's why we're going
> to default qemu to boot-bypass [1], because in general the firmware won't
> support anything else. An admin that has more knowledge in the software
> stack, and knows that the firmware is able to drive the IOMMU, can then
> disable boot-bypass for more security.
> 
> [1] https://lore.kernel.org/qemu-devel/20210930185050.262759-1-jean-
> phili...@linaro.org/

ok. Earlier I was thinking about how firmware w/o an IOMMU driver could
know that boot-bypass had been disabled inadvertently, and then directly
warn the user. But this looks like a chicken-and-egg problem; the user
will probably have to run some diagnostics in that situation to figure
out the reason.

> 
> >
> > >
> > >   For more security, the firmware could implement a minimal virtio-iommu
> > >   driver that reuses existing virtio support and only touches the config
> > >   space. It could enable PCI bus mastering in bridges only for the
> > >   endpoints that need it, enable global IOMMU bypass by flipping a bit,
> >
> > then untrusted devices can do DMA after global bypass is enabled?
> 
> Not necessarily, I think firmware could configure PCI switch ports
> upstream of untrusted devices, by keeping Bus Master Enable disabled in
> their config space, which blocks Memory and I/O Requests.

The guest firmware only manages virtual switch ports; physically they are
supposed to be already enabled in the process of device passthrough.
And given that a device is untrusted, it may ignore the Bus Master Enable
bit, inadvertently or deliberately, and issue DMA anyway. I feel that
concept-wise it's not good for the firmware to open this window when
boot-bypass is disabled...
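
For context, the mechanism under discussion is the Bus Master Enable bit in
the bridge port's PCI command register: with BME clear, the port must not
forward Memory or I/O Requests from its secondary side. A minimal sketch,
assuming hypothetical firmware config-space accessors:

  #include <stdint.h>

  #define PCI_COMMAND             0x04    /* command register offset */
  #define PCI_COMMAND_MASTER      0x04    /* Bus Master Enable (bit 2) */

  /* Hypothetical firmware helpers for PCI config space access. */
  extern uint16_t pci_cfg_read16(uint8_t bus, uint8_t dev, uint8_t fn,
                                 unsigned int off);
  extern void pci_cfg_write16(uint8_t bus, uint8_t dev, uint8_t fn,
                              unsigned int off, uint16_t val);

  /* Keep BME clear on a switch port upstream of an untrusted endpoint,
   * so the port doesn't forward the endpoint's memory requests. As noted
   * above, this only constrains the virtual topology. */
  static void block_upstream_dma(uint8_t bus, uint8_t dev, uint8_t fn)
  {
          uint16_t cmd = pci_cfg_read16(bus, dev, fn, PCI_COMMAND);

          pci_cfg_write16(bus, dev, fn, PCI_COMMAND, cmd & ~PCI_COMMAND_MASTER);
  }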

> 
> > Suppose we just need to attach the storage device to a normal domain,
> > leaving other devices still blocked from doing DMA.
> 
> Yes, that's the safest route, but also more complicated to implement.
> If firmware can set up the PCI topology as above, its virtio-iommu driver
> just needs to poke the config space for feature negotiation and enabling
> bypass, with no need to set up virtqueues.

I see the value of this simplified approach. But if it cannot isolate
untrusted devices, then we may have to follow the more complicated route.
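
To make the "poke the config space only" idea concrete, here is a minimal
sketch of such a firmware driver. The transport helpers are hypothetical;
the feature bit and the byte offset of 'bypass' (36) follow the layout that
was eventually merged:

  #include <stdint.h>

  #define VIRTIO_IOMMU_F_BYPASS_CONFIG    (1ULL << 8)
  #define VIRTIO_IOMMU_CFG_BYPASS         36      /* offset of 'bypass' */

  /* Hypothetical wrappers around the virtio transport (PCI or MMIO). */
  extern uint64_t virtio_get_features(void *dev);
  extern int virtio_set_features(void *dev, uint64_t features);
  extern void virtio_cfg_write8(void *dev, unsigned int offset, uint8_t val);

  /* Negotiate only BYPASS_CONFIG and enable global bypass, without
   * setting up a single virtqueue. */
  static int firmware_enable_bypass(void *dev)
  {
          uint64_t features = virtio_get_features(dev);

          if (!(features & VIRTIO_IOMMU_F_BYPASS_CONFIG))
                  return -1;      /* feature not offered by the device */
          if (virtio_set_features(dev, VIRTIO_IOMMU_F_BYPASS_CONFIG))
                  return -1;      /* device rejected the feature subset */

          virtio_cfg_write8(dev, VIRTIO_IOMMU_CFG_BYPASS, 1);
          return 0;
  }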

> 
> 
> [...]
> > >  Detach an endpoint from a domain. When this request completes, the
> > > -endpoint cannot access any mapping from that domain anymore. If feature
> > > -VIRTIO_IOMMU_F_BYPASS has been negotiated, then once this request
> > > -completes all accesses from the endpoint are allowed and translated by the
> > > -IOMMU using the identity function.
> > > +endpoint cannot access any mapping from that domain anymore.
> >
> > Related to my earlier question, what about when global bypass is enabled? If
> > domain attach is allowed in that configuration, then above statement
> > is not accurate.
> 
> The statement only refers to domain mappings, so when global bypass is
> enabled, after detach the endpoint can access everything. Eric also
> commented on this, I could add:
> 
>   However the endpoint may then be in bypass mode and access the
>   guest-physical address space.
> 
> So the reader can refer to the definition of bypass mode and see what
> happens to unattached endpoints.
> 

sounds good.

Thanks
Kevin




RE: [virtio-dev] [PATCH v2] virtio-iommu: Rework the bypass feature

2021-10-13 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Monday, October 11, 2021 11:09 PM
> 
> The VIRTIO_IOMMU_F_BYPASS feature is awkward to use and incomplete.
> Although it is implemented by QEMU, it is not supported by any driver as
> far as I know. Replace it with a new VIRTIO_IOMMU_F_BYPASS_CONFIG
> feature.
> 
> Two features are missing from virtio-iommu:
> 
> * The ability for a hypervisor to start the device in bypass mode. The
>   wording for VIRTIO_IOMMU_F_BYPASS is not clear enough to allow it at
>   the moment.

Unclear because it's worded around driver negotiation, so the behavior
is undefined before the driver is loaded?

> 
> * The ability for a guest to set individual endpoints in bypass mode
>   when bypass is globally disabled. An OS should have the ability to
>   allow only endpoints it trusts to bypass the IOMMU, while keeping DMA
>   disabled for endpoints it isn't even aware of. At the moment this can
>   only be emulated by creating identity mappings.

Is the driver required to first disable global bypass before it can attach
endpoints to normal domains?

> 
> The VIRTIO_IOMMU_F_BYPASS_CONFIG feature adds a 'bypass' config field
> that allows to enable and disable bypass globally. It also adds a new
> flag for the ATTACH request.
> 
> * The hypervisor can start the VM with bypass enabled or, if it knows
>   that the software stack supports it, disabled. The 'bypass' config
>   field resets to 0 or 1.

is sticky and unchanged by reset

> 
> * Generally the firmware won't have an IOMMU driver and will need to be
>   started in bypass mode, so the bootloader and kernel can be loaded
>   from a storage endpoint.

How does the firmware know whether storage load is allowed w/o looking
at the bypass info in virtio-iommu? 

> 
>   For more security, the firmware could implement a minimal virtio-iommu
>   driver that reuses existing virtio support and only touches the config
>   space. It could enable PCI bus mastering in bridges only for the
>   endpoints that need it, enable global IOMMU bypass by flipping a bit,

then untrusted devices can do DMA after global bypass is enabled? Suppose
we just need to attach the storage device to a normal domain, leaving other
devices still blocked from doing DMA.

>   then tear everything down before handing control over to the OS. This
>   prevents vulnerability windows where a malicious endpoint reprograms
>   the IOMMU while the OS is configuring it [1].
> 
>   The isolation provided by vIOMMUs has mainly been used for securely
>   assigning endpoints to untrusted applications so far, while kernel DMA
>   bypasses the IOMMU. But we can expect boot security to become as
>   important in virtualization as it presently is on bare-metal systems,
>   where some devices are untrusted and must never be able to access
>   memory that wasn't assigned to them.
> 
> * The OS can enable and disable bypass globally. It can then enable
>   bypass for individual endpoints by attaching them to bypass domains,
>   using the new VIRTIO_IOMMU_ATTACH_F_BYPASS flag. It can disable
>   bypass by attaching them to normal domains.
> 
> [1] IOMMU protection against I/O attacks: a vulnerability and a proof of
> concept
> Morgan, B., Alata, É., Nicomette, V. et al.
> https://link.springer.com/article/10.1186/s13173-017-0066-7
> 
> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/119
> Signed-off-by: Jean-Philippe Brucker 
> ---
> 
> The virtio-iommu spec with colored diff is available at
> https://jpbrucker.net/virtio-iommu/spec-bypass/virtio-iommu-f-bypass-
> config-v2-diff.pdf
> 
> v1: https://www.mail-archive.com/virtio-dev@lists.oasis-
> open.org/msg07817.html
> v2:
> * Keep description of VIRTIO_IOMMU_F_BYPASS, moved to Device operations
>   and merged with VIRTIO_IOMMU_F_BYPASS_CONFIG.
>   Added normative statements about which feature devices should offer.
> * Describe initial and reset bypass value with normative statements.
> * Addressed the other comments.
> ---
>  conformance.tex  |   1 -
>  virtio-iommu.tex | 102 +--
>  2 files changed, 82 insertions(+), 21 deletions(-)
> 
> diff --git a/conformance.tex b/conformance.tex
> index c52f1a4..7f70fc1 100644
> --- a/conformance.tex
> +++ b/conformance.tex
> @@ -503,7 +503,6 @@ \section{Conformance Targets}\label{sec:Conformance / Conformance Targets}
>  \begin{itemize}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Feature bits}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Device configuration layout}
> -\item \ref{devicenormative:Device Types / IOMMU Device / Device Initialization}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Device operations}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / ATTACH request}
>  \item \ref{devicenormative:Device Types / IOMMU Device / Device operations / DETACH request}
> diff --git a/virtio-iommu.tex b/virtio-iommu.tex
> index 08b358a..fcda138 100644
> --- a/virtio-i

RE: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification

2019-05-14 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Wednesday, May 15, 2019 1:08 PM
> To: Jean-Philippe Brucker ; Michael S.
> Tsirkin 
> >
> >
> > >>>> +\begin{description}
> > >>>> +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
> > >>>> +Accesses to virtual addresses in this region have undefined behavior.
> > >>>
> > >>> undefined?  how can platforms with such a property coexist with untrusted
> > >>> devices?
> > >>
> > >> It is "undefined" because I can't enumerate all cases (don't know about
> > >> all of them). This would contain for example the RMRR regions of VT-d
> > >> (the BIOS reserving some DMA regions for itself), regions allocated for
> > >> host-managed MSIs on arm platforms, and some PCI bridge windows
> > >> (although they removed those from the Linux resv regions recently, not
> > >> certain why). Ideally we wouldn't have any, but some special cases make
> > >> it necessary.

+Alex.

Actually I don't think RMRR is the right example here (sorry if it was my
input in our previous discussion). There was a concern about RMRR providing
a shared memory region between the device and the platform controller,
which may cause an isolation issue:
https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf

So the basic policy for devices with RMRR is to exclude them from device
assignment, except for Intel GPUs and USB controllers. The latter two have
well-understood behavior on RMRR, so they can be safely assigned with RMRR
ignored (i.e. the RMRR region can be reused as a valid IOVA region).

As I commented below, I think it's safe to describe reserved region behavior
as either succeeding with undefined behavior (e.g. in the ARM MSI doorbell
case, where a spurious interrupt on the doorbell allocated for this device
is triggered instead of the desired DMA access), or causing an unrecoverable
vIOMMU fault due to invalid mappings. An untrusted device can never use a
reserved region to escape, since the physical IOMMU only contains verified
mappings to a granted resource (MSI) or nothing at all.

> > >>
> > >> In general having reserved regions is incompatible with untrusted
> > >> devices, and in some cases incompatible with untrusted guests. But
> > >> choosing the policy about which devices to present to guest is up to the
> > >> platform. The host knows what the resv regions are used for, and if they
> > >> are safe enough. The guest just need to knows what to avoid.
> > >
> > > Yes but ... if you trust driver and device then what's the
> > > point of the iommu?
> >
> > Improving memory allocation. The guest can do scatter-gather for large
> > buffers, instead of allocating big contiguous chunks of guest-physical
> > addresses. And with device pass-through, any memory used for DMA has to
> > be pinned (since devices generally don't support page faults). So when
> > assigning an endpoint to a guest, without a vIOMMU all of the guest
> > memory has to be pinned upfront. With an IOMMU only the memory that will
> > actually be used for DMA is pinned.
> 
> I wouldn't vote for using the current vIOMMU just to avoid pinning rather
> than for isolation... the map/unmap overhead is high, unless you are only
> talking about static mappings, e.g. in a guest user space driver such as
> DPDK (but then it's actually for isolation between user and kernel).
> Btw we are working on a lighter-weight approach (aiming at a <5% perf drop
> for most cases) while allowing the host to fully introspect guest DMA
> activities. Likely we'll introduce a new virtio-iommu capability for this
> purpose, instead of using the current MAP/UNMAP primitives. Stay tuned and
> we'll send out an RFC soon once we get good results. :-)
> 
> And I don't think having reserved regions is incompatible with untrusted
> devices. Honoring reserved regions is a functional requirement so the
> guest driver can operate the assigned device correctly. Violating the
> reserved region requirement (e.g. mapping a guest memory page into a
> reserved region) just means device accesses to that page don't provide
> the desired behavior, since the host will always guarantee safe behavior
> in the physical IOMMU on those regions:
> 
> - it is either mapped to a fixed resource (e.g. the ARM MSI doorbell)
>   which is associated with this device and thus safe
> - or mapped to nothing
> 
> I don't think we architecturally break any isolation implication of the
> IOMMU here.
> 
> Thanks
> Kevin


RE: [virtio-dev] Re: [PATCH v3] Add virtio-iommu device specification

2019-05-14 Thread Tian, Kevin
> From: Jean-Philippe Brucker 
> Sent: Wednesday, May 15, 2019 12:08 AM
> 
> On 10/05/2019 17:52, Michael S. Tsirkin wrote:
>  +/* Flags are: */
>  +#define VIRTIO_IOMMU_MAP_F_READ   (1 << 0)
>  +#define VIRTIO_IOMMU_MAP_F_WRITE  (1 << 1)
>  +#define VIRTIO_IOMMU_MAP_F_EXEC   (1 << 2)
> >>>
> >>> what is exec? pls add some comments near all flags.
> >>
> >> Exec means instruction fetch. Some systems have different transaction
> >> flags for read and exec (PCI only supports that in the PASID TLP prefix)
> >> and some page tables have a "no exec" bit.
> >
> > So let's say I don't set EXEC on such a system. Is EXEC allowed or not?
> >
> > Maybe we can ask the user to always set EXEC if exec is needed,
> > but then say we can not promise instruction fetch will fail
> > if exec is not set.
> 
> So the semantics could be:
> * device sets EXEC feature bit
>   - driver sets EXEC flag in MAP, instruction fetch succeeds
>   - driver doesn't set EXEC flag in MAP, instructions fetch will fail
> * device doesn't set EXEC feature bit, then driver cannot set EXEC flag
> but instruction fetch may not fail.
> 
> But I'm tempted to just remove the EXEC flag for now, I don't think
> anyone needs it at the moment (and VFIO doesn't support it). I'd add it
> back as NO_EXEC which is closer to what IOMMUs offer.

Agree. It's more for PASID-based usage, e.g. sharing the CPU page table
with the device in SVA. This version is purely about MAP/UNMAP.
On Intel VT-d, EXEC is ignored for requests-without-PASID.
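
For reference, EXEC was indeed dropped: the MAP flags that ended up in the
Linux UAPI header <linux/virtio_iommu.h> are READ, WRITE and MMIO (bit 2
was later reused for MMIO rather than a NO_EXEC flag). A sketch of the
request layout:

  struct virtio_iommu_req_head { __u8 type; __u8 reserved[3]; };
  struct virtio_iommu_req_tail { __u8 status; __u8 reserved[3]; };

  #define VIRTIO_IOMMU_MAP_F_READ         (1 << 0)
  #define VIRTIO_IOMMU_MAP_F_WRITE        (1 << 1)
  #define VIRTIO_IOMMU_MAP_F_MMIO         (1 << 2)

  struct virtio_iommu_req_map {
          struct virtio_iommu_req_head    head;
          __le32                          domain;
          __le64                          virt_start;
          __le64                          virt_end;
          __le64                          phys_start;
          __le32                          flags;
          struct virtio_iommu_req_tail    tail;
  };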

> 
> 
>  +\begin{description}
>  +  \item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
>  +Accesses to virtual addresses in this region have undefined 
>  behavior.
> >>>
> >>> undefined?  how can platforms with such a property coexist with untrusted
> >>> devices?
> >>
> >> It is "undefined" because I can't enumerate all cases (don't know about
> >> all of them). This would contain for example the RMRR regions of VT-d
> >> (the BIOS reserving some DMA regions for itself), regions allocated for
> >> host-managed MSIs on arm platforms, and some PCI bridge windows
> >> (although they removed those from the Linux resv regions recently, not
> >> certain why). Ideally we wouldn't have any, but some special cases make
> >> it necessary.
> >>
> >> In general having reserved regions is incompatible with untrusted
> >> devices, and in some cases incompatible with untrusted guests. But
> >> choosing the policy about which devices to present to guest is up to the
> >> platform. The host knows what the resv regions are used for, and if they
> >> are safe enough. The guest just need to knows what to avoid.
> >
> > Yes but ... if you trust driver and device then what's the
> > point of the iommu?
> 
> Improving memory allocation. The guest can do scatter-gather for large
> buffers, instead of allocating big contiguous chunks of guest-physical
> addresses. And with device pass-through, any memory used for DMA has to
> be pinned (since devices generally don't support page faults). So when
> assigning an endpoint to a guest, without a vIOMMU all of the guest
> memory has to be pinned upfront. With an IOMMU only the memory that will
> actually be used for DMA is pinned.

I wouldn't vote for using the current vIOMMU just to avoid pinning rather
than for isolation... the map/unmap overhead is high, unless you are only
talking about static mappings, e.g. in a guest user space driver such as
DPDK (but then it's actually for isolation between user and kernel).
Btw we are working on a lighter-weight approach (aiming at a <5% perf drop
for most cases) while allowing the host to fully introspect guest DMA
activities. Likely we'll introduce a new virtio-iommu capability for this
purpose, instead of using the current MAP/UNMAP primitives. Stay tuned and
we'll send out an RFC soon once we get good results. :-)

And I don't think having reserved regions is incompatible with untrusted
devices. Honoring reserved regions is a functional requirement so the
guest driver can operate the assigned device correctly. Violating the
reserved region requirement (e.g. mapping a guest memory page into a
reserved region) just means device accesses to that page don't provide
the desired behavior, since the host will always guarantee safe behavior
in the physical IOMMU on those regions:

- it is either mapped to a fixed resource (e.g. the ARM MSI doorbell)
  which is associated with this device and thus safe
- or mapped to nothing

I don't think we architecturally break any isolation implication of the
IOMMU here.

Thanks
Kevin


RE: [virtio-dev] Would there be virtio-audio spec in future for clean audio interface

2019-03-26 Thread Tian, Kevin
CC the ACRN community.

There are quite a few new virtio devices (audio, gpio, i2c, etc.) included
in the ACRN project, as required by automotive/embedded usages. I know they
plan to upstream those bits, and now might be a good time to talk about
that plan. :-)

Thanks
Kevin

> -Original Message-
> From: virtio-dev@lists.oasis-open.org [mailto:virtio-dev@lists.oasis-open.org]
> On Behalf Of Paolo Bonzini
> Sent: Tuesday, March 26, 2019 5:39 PM
> To: vlse 
> Cc: virtio-dev@lists.oasis-open.org; xionghu@intel.com
> Subject: Re: [virtio-dev] Would there be virtio-audio spec in future for clean
> audio interface
> 
> On 26/03/19 01:48, vlse wrote:
> > On Mon, Mar 25, 2019 at 02:13:51PM +0100, Paolo Bonzini wrote:
> > Hi,
> >
> > No I don't have a contact with Project ACRN.
> >
> > ACRN is a Linux Foundation Project. Initially it was by Intel.
> > It is a hypervisor for embedded devices.
> > Website is https://projectacrn.org/.
> >
> > Their supported virtio-audio details:
> >  Vendor ID: 0x8086
> >  Device ID: 0x8603
> >  Subvendor ID: 0x8086
> >  Subdevice ID: 0xFFFD
> >
> > Another implementation from another developer from Intel.
> > Website: https://patchwork.freedesktop.org/patch/174013/
> > Author: Luo Xionghu 
> >
> > Comments by Author: this patch only add audio device pci registration.
> > need add audio part to call audio Backend driver later.
> 
> Thanks, that's useful.  Xionghu, is there a public spec for
> virtio-audio, and/or does Intel intend to contribute it to the virtio
> specification?
> 
> Unfortunately device id 21 is already taken.  I'll send a patch to the
> specification that reserves id 25 for virtio-audio.
> 
> Thanks,
> 
> Paolo
> 



RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Tian, Kevin
> From: Liang, Cunming
> Sent: Tuesday, April 10, 2018 10:24 PM
> 
[...]
> >
> > As others said, we do not need to go overboard. A couple of small
> > vendor-specific quirks in qemu isn't a big deal.
> 
> It's quite a challenge to identify whether it's small or not; there's no uniform metric.
> 
> Being dependent only on QEMU itself is the obvious benefit. The tradeoff is
> having to build the entire device driver. We don't object to doing that in
> QEMU, but want to make sure we understand the boundary size clearly.
> 

It might be helpful if you could post some sample code using the proposed
framework - then people would have a clear feeling for what size is being
talked about here and whether it falls under the concept of 'small quirks'.

Thanks
Kevin




[virtio-dev] RE: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-23 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Thursday, March 22, 2018 6:06 PM
> 
> > From: Robin Murphy [mailto:robin.mur...@arm.com]
> > Sent: Wednesday, March 21, 2018 10:24 PM
> >
> > On 21/03/18 13:14, Jean-Philippe Brucker wrote:
> > > On 21/03/18 06:43, Tian, Kevin wrote:
> > > [...]
> > >>> +
> > >>> +#include 
> > >>> +
> > >>> +#define MSI_IOVA_BASE   0x8000000
> > >>> +#define MSI_IOVA_LENGTH 0x100000
> > >>
> > >> this is ARM-specific; according to the virtio-iommu spec, isn't it
> > >> better probed on the endpoint instead of hard-coded here?
> > >
> > > These values are arbitrary, not really ARM-specific even if ARM is the
> > > only user yet: we're just reserving a random IOVA region for mapping
> > MSIs.
> > > It is hard-coded because of the way iommu-dma.c works, but I don't
> > quite
> > > remember why that allocation isn't dynamic.
> >
> > The host kernel needs to have *some* MSI region in place before the
> > guest can start configuring interrupts, otherwise it won't know what
> > address to give to the underlying hardware. However, as soon as the host
> > kernel has picked a region, host userspace needs to know that it can no
> > longer use addresses in that region for DMA-able guest memory. It's a
> > lot easier when the address is fixed in hardware and the host userspace
> > will never be stupid enough to try and VFIO_IOMMU_DMA_MAP it, but in
> > the
> > more general case where MSI writes undergo IOMMU address translation
> > so
> > it's an arbitrary IOVA, this has the potential to conflict with stuff
> > like guest memory hotplug.
> >
> > What we currently have is just the simplest option, with the host kernel
> > just picking something up-front and pretending to host userspace that
> > it's a fixed hardware address. There's certainly scope for it to be a
> > bit more dynamic in the sense of adding an interface to let userspace
> > move it around (before attaching any devices, at least), but I don't
> > think it's feasible for the host kernel to second-guess userspace enough
> > to make it entirely transparent like it is in the DMA API domain case.
> >
> > Of course, that's all assuming the host itself is using a virtio-iommu
> > (e.g. in a nested virt or emulation scenario). When it's purely within a
> > guest then an MSI reservation shouldn't matter so much, since the guest
> > won't be anywhere near the real hardware configuration anyway.
> >
> > Robin.
> 
> Curious: since we are anyway defining a new IOMMU architecture,
> is it possible to avoid this ARM-specific burden completely?
> 

OK, after some study of those tricks, below is my understanding:

- The MSI_IOVA window is used only on request (iommu_dma_get_msi_page),
not meant to take effect on all architectures once initialized,
e.g. the ARM GIC does it but x86 doesn't. So it is reasonable for
the virtio-iommu driver to implement such a capability;

- I wondered whether the hardware MSI doorbell could always be reported
on virtio-iommu since it's newly defined. It looks like there is a problem
if the underlying IOMMU uses software-managed MSI: a valid mapping is
expected at every level of translation, meaning the guest has to manage
the stage-1 mapping in a nested configuration, since stage-1 is owned
by the guest.

Then virtio-iommu is naturally expected to report the same MSI
model as supported by the underlying hardware. Below are some
further thoughts along this route (using 'IOMMU' for the physical
one and 'virtio-iommu' for the virtual one):



Within the scope of the current virtio-iommu spec v0.6, there is no nested
consideration yet. The guest driver is expected to use the MAP/UNMAP
interface on assigned endpoints. In this case the MAP requests
(IOVA->GPA) are caught and maintained within QEMU, which then
talks to VFIO to map IOVA->HPA in the IOMMU.

QEMU can learn the MSI model of the IOMMU from sysfs.

For a hardware MSI doorbell (x86 and some ARM):
* Host kernel reports to QEMU as IOMMU_RESV_MSI
* QEMU reports to guest as VIRTIO_IOMMU_RESV_MEM_T_MSI
* Guest takes the range as IOMMU_RESV_MSI, i.e. reserved
* QEMU's MAP database has no mapping for the doorbell
* The physical IOMMU page table has no mapping for the doorbell
* MSIs from passthrough devices bypass the IOMMU
* MSIs from emulated devices bypass the virtio-iommu

For a software MSI doorbell (most ARM):
* Host kernel reports to QEMU as IOMMU_RESV_SW_MSI
* QEMU reports to guest as VIRTIO_IOMMU_RESV_MEM_T_RESERVED
* Guest takes the range as IOMMU_RESV_RESERVED
* vGIC requests to map the 'GPA of the virtual doorbell'
* a map
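
For reference, the RESV_MEM probe property that carries these region types
to the guest can be sketched as follows (layout per the eventual Linux UAPI
header <linux/virtio_iommu.h>):

  struct virtio_iommu_probe_property {
          __le16          type;           /* VIRTIO_IOMMU_PROBE_T_RESV_MEM */
          __le16          length;
  };

  #define VIRTIO_IOMMU_RESV_MEM_T_RESERVED        0
  #define VIRTIO_IOMMU_RESV_MEM_T_MSI             1

  struct virtio_iommu_probe_resv_mem {
          struct virtio_iommu_probe_property      head;
          __u8            subtype;        /* T_RESERVED or T_MSI */
          __u8            reserved[3];
          __le64          start;          /* first byte of the region */
          __le64          end;            /* last byte of the region */
  };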

[virtio-dev] RE: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-22 Thread Tian, Kevin
> From: Robin Murphy [mailto:robin.mur...@arm.com]
> Sent: Wednesday, March 21, 2018 10:24 PM
> 
> On 21/03/18 13:14, Jean-Philippe Brucker wrote:
> > On 21/03/18 06:43, Tian, Kevin wrote:
> > [...]
> >>> +
> >>> +#include 
> >>> +
> >>> +#define MSI_IOVA_BASE   0x8000000
> >>> +#define MSI_IOVA_LENGTH 0x100000
> >>
> >> this is ARM-specific; according to the virtio-iommu spec, isn't it
> >> better probed on the endpoint instead of hard-coded here?
> >
> > These values are arbitrary, not really ARM-specific even if ARM is the
> > only user yet: we're just reserving a random IOVA region for mapping
> MSIs.
> > It is hard-coded because of the way iommu-dma.c works, but I don't
> quite
> > remember why that allocation isn't dynamic.
> 
> The host kernel needs to have *some* MSI region in place before the
> guest can start configuring interrupts, otherwise it won't know what
> address to give to the underlying hardware. However, as soon as the host
> kernel has picked a region, host userspace needs to know that it can no
> longer use addresses in that region for DMA-able guest memory. It's a
> lot easier when the address is fixed in hardware and the host userspace
> will never be stupid enough to try and VFIO_IOMMU_DMA_MAP it, but in
> the
> more general case where MSI writes undergo IOMMU address translation
> so
> it's an arbitrary IOVA, this has the potential to conflict with stuff
> like guest memory hotplug.
> 
> What we currently have is just the simplest option, with the host kernel
> just picking something up-front and pretending to host userspace that
> it's a fixed hardware address. There's certainly scope for it to be a
> bit more dynamic in the sense of adding an interface to let userspace
> move it around (before attaching any devices, at least), but I don't
> think it's feasible for the host kernel to second-guess userspace enough
> to make it entirely transparent like it is in the DMA API domain case.
> 
> Of course, that's all assuming the host itself is using a virtio-iommu
> (e.g. in a nested virt or emulation scenario). When it's purely within a
> guest then an MSI reservation shouldn't matter so much, since the guest
> won't be anywhere near the real hardware configuration anyway.
> 
> Robin.

Curious: since we are anyway defining a new IOMMU architecture,
is it possible to avoid this ARM-specific burden completely?

Thanks
Kevin



RE: [virtio-dev] [RFC] virtio-iommu version 0.6

2018-03-22 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, March 21, 2018 9:14 PM
> 
> Hi Kevin,
> 
> Thanks for the comments
> 
> On 19/03/18 10:03, Tian, Kevin wrote:
> > BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it
> > intended?
> 
> In my opinion BYPASS is a bit different from the other features: while the
> others are needed for correctness, this one is optional and even if the
> guest supports BYPASS, it should be allowed not to accept it. For security
> reasons it may not want to let endpoints access the whole guest-physical
> address space.

ok, possibly because I'm not familiar with the virtio spec conventions.
My original expectation was that each feature bit would have a behavior
description covering whether the device reports it and whether the
driver accepts it. If there's no need to cover optional features, then
it's fine with me. :-)

> 
> > --2.6.2 Device Requirements: Device operations--
> >
> > "If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered,
> > the device MUST truncate the range described by virt_start
> > and virt_end in requests to fit in the range described by
> > input_range."
> >
> > "If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device
> > MUST ignore bits above domain_bits in field domain of requests."
> >
> > shouldn't the device return an error in the above situation instead
> > of continuing the operation with a modified value?
> 
> The intent was to allow device and driver to pass metadata in the top bits
> of pointers and domain IDs, perhaps for a future extension or for
> debugging. I'm not convinced it's useful anymore (and do wonder what my
> initial reasoning was...) Such extension would be determined by a new
> feature bit and the device would know that the fields contain additional
> info, if the driver accepted that feature.
> 
> > --2.6.4 DETACH request--
> >
> > " Detach an endpoint from its domain. When this request
> > completes, the endpoint cannot access any mapping from that
> > domain anymore "
> >
> > Does it make sense to mention BYPASS situation upon this request?
> 
> Sure, I'll add something to clarify that situation
> 
> > --2.6.8.2 Property RESV_MEM --
> >
> > "VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0) Accesses to
> > virtual addresses in this region are not translated by the device.
> > They may either be aborted by the device (or the underlying
> > IOMMU), bypass it, or never even reach it"
> >
> > in 3.3 hardware device assignment you described an example
> > that a reserved range is stolen by host to map the MSI
> > doorbell... such map behavior is not described above.
> 
> Right, we can say that accesses to the region may be used for host
> translation
> 
> > Then comes a question about VIRTIO_IOMMU_RESV_MEM_T_MSI.
> > I know there was quite some discussion around this flag before,
> > but from my mental picture it is still a bit difficult to understand its
> > usage based on the examples in the implementation notes:
> >
> > - for x86, you describe it as indicating MSI bypass for
> > 0xfeex. However the guest doesn't need to know this fact. The only
> > requirement is to treat it as a reserved range (as on bare metal),
> > so T_RESERVED is sufficient for this purpose
> >
> > - for ARM, either the guest or the host chooses a virtual
> > address for mapping the MSI doorbell. The former doesn't require
> > a reserved range; for the latter T_RESERVED is also enough, as in
> > the example in the hardware device assignment section.
> 
> It might be nicer to have the host decide it, but when the physical IOMMU
> is ARM SMMU, nesting translation complicates things, because the guest
> *has* to create a mapping:

Let me confirm one thing first: v0.6 doesn't support binding to the guest
IOVA page table yet. So based on the current map/unmap interface, there is
no such complexity, right? Stage-1 is just a shadow translation (IOVA->HPA)
of the guest side (IOVA->GPA) with nesting disabled. In that case
the default behavior is the host-reserved style.

Then comes nested scenario:

> 
> * The guest is in charge of stage-1. It creates IOVA->GPA mapping for the
>   MSI doorbell. The GPA is mapped to the physical MSI doorbell at
>   stage-2 by the host.

Is it a must that the above GPA is mapped to the physical MSI doorbell?

Ideally:

1) Host reserves some IOVA range for mapping the MSI doorbell
2) The range is reported to user space
3) QEMU reports the range as reserved on endpoints, marked
as T_IDENTITY (a new type to be introduced), meaning the guest
needs to set up an identity mapping in stage-1
4) Then device should be able to ri

[virtio-dev] RE: [PATCH 1/4] iommu: Add virtio-iommu driver

2018-03-20 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, February 14, 2018 10:54 PM
> 
> The virtio IOMMU is a para-virtualized device, allowing to send IOMMU
> requests such as map/unmap over virtio-mmio transport without
> emulating
> page tables. This implementation handles ATTACH, DETACH, MAP and
> UNMAP
> requests.
> 
> The bulk of the code transforms calls coming from the IOMMU API into
> corresponding virtio requests. Mappings are kept in an interval tree
> instead of page tables.
> 
> Signed-off-by: Jean-Philippe Brucker 

[...]
> diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
> new file mode 100644
> index ..a9c9245e8ba2
> --- /dev/null
> +++ b/drivers/iommu/virtio-iommu.c
> @@ -0,0 +1,960 @@
> +/*
> + * Virtio driver for the paravirtualized IOMMU
> + *
> + * Copyright (C) 2018 ARM Limited
> + * Author: Jean-Philippe Brucker 
> + *
> + * SPDX-License-Identifier: GPL-2.0
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +#define MSI_IOVA_BASE   0x8000000
> +#define MSI_IOVA_LENGTH 0x100000

This is ARM-specific; according to the virtio-iommu spec, isn't it
better probed on the endpoint instead of hard-coded here?
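
For context, the posted driver pairs these constants with the IOMMU core's
reserved-region hook; a simplified sketch of that pattern (helper signature
as of that kernel era, trimmed from the patch):

  static void viommu_get_resv_regions(struct device *dev,
                                      struct list_head *head)
  {
          int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
          struct iommu_resv_region *region;

          /* Report a software-managed MSI window so the DMA layer
           * never allocates these IOVAs for regular mappings. */
          region = iommu_alloc_resv_region(MSI_IOVA_BASE, MSI_IOVA_LENGTH,
                                           prot, IOMMU_RESV_SW_MSI);
          if (region)
                  list_add_tail(&region->list, head);
  }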

Thanks
Kevin




RE: [virtio-dev] [RFC] virtio-iommu version 0.6

2018-03-19 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Wednesday, February 7, 2018 2:11 AM
> 
> Please find version 0.6 of the virtio-iommu specification at the following
> locations.
> 
> Document: http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.pdf
>   http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.html
> 
> Sources: git://linux-arm.org/virtio-iommu.git viommu/v0.6
> 
> I didn't receive any comment for v0.5 [1], so this update is fairly light:
> * Changed range definition in map and unmap requests
> * Use virtio-iommu device ID
> * Update IORT node ID and simplify layout
> 
> The following document shows the difference between v0.5 and v0.6:
> http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-pdf-diff-v0.5-
> v0.6.pdf
> 
> Next version will hopefully add vSVA (PASID+PRI) and/or architectural
> optimizations, but I can't provide a timeline yet. I'll probably send a
> small draft first.
> 
> I will send the Linux driver soon. In the meantime you can fetch it on my
> development branches, based on next-20180206.
> 
> git://linux-arm.org/linux-jpb.git virtio-iommu/devel
> git://linux-arm.org/kvmtool-jpb.git virtio-iommu/devel
> 
> Any comments welcome!
> 
> Thanks,
> Jean
> 
> [1] https://www.spinics.net/lists/kvm/msg157402.html

some comments here:

--

BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it 
intended?

--2.6.2 Device Requirements: Device operations--

"If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, 
the device MUST truncate the range described by virt_start 
and virt_end in requests to fit in the range described by
input_range."

"If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device 
MUST ignore bits above domain_bits in field domain of requests."

shouldn't the device return an error in the above situation instead
of continuing the operation with a modified value?

--2.6.4 DETACH request--

" Detach an endpoint from its domain. When this request 
completes, the endpoint cannot access any mapping from that 
domain anymore "

Does it make sense to mention the BYPASS situation for this request?

--2.6.8.2 Property RESV_MEM --

"VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0) Accesses to 
virtual addresses in this region are not translated by the device. 
They may either be aborted by the device (or the underlying 
IOMMU), bypass it, or never even reach it"

in 3.3 hardware device assignment you describe an example
where a reserved range is stolen by the host to map the MSI
doorbell... such mapping behavior is not described above.

Then comes a question about VIRTIO_IOMMU_RESV_MEM_T_MSI.
I know there was quite some discussion around this flag before,
but from my mental picture it is still a bit difficult to understand its
usage based on the examples in the implementation notes:

- for x86, you describe it as indicating MSI bypass for
0xfeex. However the guest doesn't need to know this fact. The only
requirement is to treat it as a reserved range (as on bare metal),
so T_RESERVED is sufficient for this purpose

- for ARM, either the guest or the host chooses a virtual
address for mapping the MSI doorbell. The former doesn't require
a reserved range; for the latter T_RESERVED is also enough, as in
the example in the hardware device assignment section.

Then what's the real point of differentiating MSI bypass from
a normal reserved range? Is there another example where the guest
may use such info to do something special?

-- 3.2 Message Signaled Interrupts --

" Section 3.2.2 describes the added complexity (from the host 
point of view) of translating the IRQ chip doorbell "

however 3.2.2 later has only one sentence about host
complexity - "However, this mode of operations may add
significant complexity in the host implementation". If you plan
to elaborate on that part in the future, please add a note. :-)

Thanks
Kevin


[virtio-dev] RE: [PATCH] pci-iov: Add support for unmanaged SR-IOV

2018-03-02 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Saturday, March 3, 2018 2:14 AM
> 
> On Fri, 2 Mar 2018 06:54:17 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson
> > > Sent: Friday, March 2, 2018 4:22 AM
> > > >
> > > > I am pretty sure that what you are describing is true of some, but not for
> > > > all. I think the Amazon solutions and the virtio solution are doing
> > > > hard partitioning of the part. I will leave it to those guys to speak
> > > > for themselves since I don't know anything about the hardware design
> > > > of those parts.
> > >
> > > I think we'd need device specific knowledge and enablement to be able
> > > to take advantage of any hardware partitioning, otherwise we need to
> > > assume the pf is privileged, as implemented in other sriov devices.
> > >
> > > I'm also trying to imagine whether there's a solution via the new
> > > vfio/mdev interface, where the mdev vendor driver would bind to the pf
> > > and effectively present itself as the mdev device.  The vendor driver
> > > could provide sriov capabilities and bear the burden of ensuring that
> > > the pf is used cooperatively.  The only existing mdev vendor drivers are
> > > vGPUs and rely on on-device DMA translation and isolation, such as
> > > through GTTs, but there have been some thoughts on providing IOMMU
> > > based
> > > isolation of mdev/sriov mixed devices (assuming DMA is even required
> > > for userspace management of the pf in this use case).  [Cc Kirti]
> > > Thanks,
> > >
> >
> > Hope not distracting this thread, but above sounds like an interesting
> > idea. Actually we ever brainstormed similar thought for another
> > potential usage - supporting VF live migration. We are already working
> > on a generic extension to allow state save/restore of an mdev instance.
> > If vendor driver could further wrap pf as a mdev instance, it could
> > leverage the same framework for a clean state migration on VF. based
> > on mmap callback the vendor driver can easily switch back-and-forth
> > between pass through and trap/emulation of the VF resources. Of
> > course doing so alone doesn't address all the demands of VF live
> > migration (e.g. dirty page tracking still requires other techniques),
> > but it does pave a way toward a general framework to support VF
> > live migration.
> >
> > If above is feasible, finally we could use one mdev framework to
> > manage both mdev and pf/vf assignment, while providing added
> > values which are difficult to achieve today. :-)
> 
> mdev drivers may be the first to support migration, but I wonder if a
> full mdev implementation is necessary for it.  Once the vfio api is
> define, device specific extensions to vfio-pci might be able to
> implement migration more directly.  Thanks,
> 
> Alex

Yes, technically a full mdev implementation is not necessary. If
device-specific extensions are placed within the vfio module, it's
obviously straightforward. What I thought earlier was that in case vfio
wants to stay device-agnostic, we probably want device-specific
extensions in the vendor driver, which is loaded in a dummy mode that
simply does basic PCI initialization like vfio-pci and then wraps the
VF as an mdev (since vfio-pci is not the VF driver in this scenario).
It's especially useful for vendor drivers which aim to support both
mdev and SR-IOV by sharing common state management knowledge, but it
looks like overkill for other vendor drivers. Possibly we'll finally
allow both - simple device extensions in vfio-pci and complex device
extensions in vendor drivers through vfio-mdev.

Thanks
Kevin




[virtio-dev] RE: [PATCH] pci-iov: Add support for unmanaged SR-IOV

2018-03-01 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Friday, March 2, 2018 2:54 PM
> 
> > From: Alex Williamson
> > Sent: Friday, March 2, 2018 4:22 AM
> > >
> > > > I am pretty sure that what you are describing is true of some, but not for
> > > all. I think the Amazon solutions and the virtio solution are doing
> > > hard partitioning of the part. I will leave it to those guys to speak
> > > for themselves since I don't know anything about the hardware design
> > > of those parts.
> >
> > I think we'd need device specific knowledge and enablement to be able
> > to take advantage of any hardware partitioning, otherwise we need to
> > assume the pf is privileged, as implemented in other sriov devices.
> >
> > I'm also trying to imagine whether there's a solution via the new
> > vfio/mdev interface, where the mdev vendor driver would bind to the pf
> > and effectively present itself as the mdev device.  The vendor driver
> > could provide sriov capabilities and bear the burden of ensuring that
> > the pf is used cooperatively.  The only existing mdev vendor drivers are
> > vGPUs and rely on on-device DMA translation and isolation, such as
> > through GTTs, but there have been some thoughts on providing IOMMU
> > based
> > isolation of mdev/sriov mixed devices (assuming DMA is even required
> > for userspace management of the pf in this use case).  [Cc Kirti]
> > Thanks,
> >
> 
> Hope not distracting this thread, but above sounds like an interesting
> idea. Actually we ever brainstormed similar thought for another
> potential usage - supporting VF live migration. We are already working
> on a generic extension to allow state save/restore of an mdev instance.
> If vendor driver could further wrap pf as a mdev instance, it could

I meant "wrap vf as a mdev instance" here.

> leverage the same framework for a clean state migration on VF. based
> on mmap callback the vendor driver can easily switch back-and-forth
> between pass through and trap/emulation of the VF resources. Of
> course doing so alone doesn't address all the demands of VF live
> migration (e.g. dirty page tracking still requires other techniques),
> but it does pave a way toward a general framework to support VF
> live migration.
> 
> If above is feasible, finally we could use one mdev framework to
> manage both mdev and pf/vf assignment, while providing added
> values which are difficult to achieve today. :-)
> 
> Thanks
> Kevin




[virtio-dev] RE: [PATCH] pci-iov: Add support for unmanaged SR-IOV

2018-03-01 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Friday, March 2, 2018 4:22 AM
> >
> > I am pretty sure that what you are describing is true of some, but not for
> > all. I think the Amazon solutions and the virtio solution are doing
> > hard partitioning of the part. I will leave it to those guys to speak
> > for themselves since I don't know anything about the hardware design
> > of those parts.
> 
> I think we'd need device specific knowledge and enablement to be able
> to take advantage of any hardware partitioning, otherwise we need to
> assume the pf is privileged, as implemented in other sriov devices.
> 
> I'm also trying to imagine whether there's a solution via the new
> vfio/mdev interface, where the mdev vendor driver would bind to the pf
> and effectively present itself as the mdev device.  The vendor driver
> could provide sriov capabilities and bear the burden of ensuring that
> the pf is used cooperatively.  The only existing mdev vendor drivers are
> vGPUs and rely on on-device DMA translation and isolation, such as
> through GTTs, but there have been some thoughts on providing IOMMU
> based
> isolation of mdev/sriov mixed devices (assuming DMA is even required
> for userspace management of the pf in this use case).  [Cc Kirti]
> Thanks,
> 

Hope I'm not distracting this thread, but the above sounds like an
interesting idea. Actually we once brainstormed a similar thought for
another potential usage - supporting VF live migration. We are already
working on a generic extension to allow state save/restore of an mdev
instance. If the vendor driver could further wrap the pf as an mdev
instance, it could leverage the same framework for a clean state
migration on the VF. Based on the mmap callback, the vendor driver can
easily switch back-and-forth between passthrough and trap/emulation of
the VF resources. Of course doing so alone doesn't address all the
demands of VF live migration (e.g. dirty page tracking still requires
other techniques), but it does pave a way toward a general framework
to support VF live migration.

If the above is feasible, we could finally use one mdev framework to
manage both mdev and PF/VF assignment, while providing added
value that is difficult to achieve today. :-)

Thanks
Kevin




RE: [virtio-dev] [RFC] virtio-iommu version 0.4

2017-09-20 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, September 6, 2017 7:49 PM
> 
> 
> > 2.6.8.2.1
> > Multiple overlapping RESV_MEM properties are merged together. Device
> > requirement? if same types I assume?
> 
> Combination rules apply to device and driver. When the driver puts
> multiple endpoints in the same domain, combination rules apply. They will
> become important once the guest attempts to do things like combining
> endpoints with different PASID capabilities in the same domain. I'm
> considering these endpoints to be behind different physical IOMMUs.
> 
> For reserved regions, we specify what the driver should do and what the
> device should enforce with regard to mapping IOVAs of overlapping regions.
> For example, if a device has some virtual address RESV_T_MSI and an other
> device has the same virtual address RESV_T_IDENTITY, what should the
> driver do?
> 
> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
> case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
> have a meaning within a domain. From DMA mappings point of view, it is
> effectively the same as RESV_T_RESERVED. When merging
> RESV_T_RESERVED and
> RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is
> required
> for one endpoint to work (the one with IDENTITY) and is not accessed by
> the other (the driver must not allocate this IOVA range for DMA).
> 
> More generally, I'd like to add the following combination table to the
> spec, that shows the resulting reserved region type after merging regions
> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
> bottom half.
> 
>     | N | R | M | I
> ----+---+---+---+---
>   N | N | R | M | ?
> ----+---+---+---+---
>   R |   | R | R | I
> ----+---+---+---+---
>   M |   |   | M | I
> ----+---+---+---+---
>   I |   |   |   | I
> 
> The 'I' column is problematic. If one endpoint has an IDENTITY region and
> the other doesn't have anything at that virtual address, then making that
> region an identity mapping will result in the second endpoint being able
> to access firmware memory. If using nested translation, stage-2 can help
> us here. But in general we should allow the device to reject an attach
> that would result in a N+I combination if the host considers it too dodgy.
> I think the other combinations are safe enough.
> 

Will overlap happen in reality? Can we simplify the spec by having the
device not report overlapping regions?
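
A minimal sketch of the merge table above in C (names hypothetical), which
also makes the problematic N+I cell explicit:

  enum resv_type { RESV_NONE, RESV_RESERVED, RESV_MSI, RESV_IDENTITY };

  /* Returns the merged region type, or -1 for the N+I combination that
   * the device may want to reject at attach time. */
  static int merge_resv(enum resv_type a, enum resv_type b)
  {
          if (a == b)
                  return a;                       /* diagonal of the table */
          if (a == RESV_IDENTITY || b == RESV_IDENTITY)
                  return (a == RESV_NONE || b == RESV_NONE) ? -1
                                                            : RESV_IDENTITY;
          if (a == RESV_NONE || b == RESV_NONE)
                  return a == RESV_NONE ? b : a;  /* N+R = R, N+M = M */
          return RESV_RESERVED;                   /* R+M = R */
  }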

Thanks
Kevin




[virtio-dev] RE: [RFC] virtio-iommu version 0.4

2017-09-20 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Wednesday, September 6, 2017 7:55 PM
> 
> Hi Kevin,
> 
> On 28/08/17 08:39, Tian, Kevin wrote:
> > Here comes some comments:
> >
> > 1.1 Motivation
> >
> > You describe I/O page faults handling as future work. Seems you
> considered
> > only recoverable fault (since "aka. PCI PRI" being used). What about other
> > unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
> > a valid mapping? Even when there is no PRI support, we need some basic
> > form of fault reporting mechanism to indicate such errors to guest.
> 
> I am considering recoverable faults as the end goal, but reporting
> unrecoverable faults should use the same queue, with slightly different
> fields and no need for the driver to reply to the device.

What about adding a placeholder for now? Though the same mechanism
can be reused, it's an essential part of making the virtio-iommu
architecture complete, even before talking about support for
recoverable faults. :-)

> 
> > 2.6.8.2 Property RESV_MEM
> >
> > I'm not immediately clear on when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> > should be explicitly reported. Is there any real example on a bare metal
> > IOMMU? Usually reserved memory is reported to the CPU through other
> > methods (e.g. e820 on the x86 platform). Of course MSI is a special case
> > which is covered by BYPASS and the MSI flag... If yes, maybe you can also
> > include an example in the implementation notes.
> 
> The RESV_MEM regions only describes IOVA space for the moment, not
> guest-physical, so I guess it provides different information than e820.
> 
> I think a useful example is the PCI bridge windows reported by the Linux
> host to userspace using RESV_RESERVED regions (see
> iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
> addresses that shouldn't be accessed by endpoints because they won't reach
> the IOMMU. These are specific to the physical topology: a device will have
> different reserved regions depending on the PCI slot it occupies.
> 
> When handled properly, PCI bridge windows quickly become a nuisance. With
> kvmtool we observed that carving out their addresses globally removes a
> lot of useful GPA space from the guest. Without a virtual IOMMU we can
> either ignore them and hope everything will be fine, or remove all
> reserved regions from the GPA space (which currently means editing by hand
> the static guest-physical map...)
> 
> That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It
> describes
> reserved IOVAs for a specific endpoint, and therefore removes the need to
> carve the window out of the whole guest.
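For illustration, a RESV_MEM property returned in a PROBE reply might carry one such region per range, along these lines (a sketch in the spec's struct notation; layout and names are assumptions based on this draft):

struct virtio_iommu_probe_resv_mem {
        le16 type;          /* property type: RESV_MEM */
        le16 length;        /* size of this property in bytes */
        u8   subtype;       /* e.g. RESV_MEM_T_ABORT for a bridge window */
        u8   reserved[3];
        le64 start;         /* first IOVA of the reserved range */
        le64 end;           /* last IOVA of the reserved range */
};

A driver would then carve [start, end] out of the IOVA allocator of that endpoint's address space, instead of carving it out of the whole guest-physical map.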

Understood, and thanks for the elaboration.

> 
> > Another thing I want to ask your opinion on: whether there is value in
> > adding another subtype (MEM_T_IDENTITY) asking for an identity mapping
> > in the address space. It's similar to the Reserved Memory Region
> > Reporting (RMRR) structure defined in VT-d, which indicates
> > BIOS-allocated reserved memory ranges that may be DMA targets and have
> > to be identity-mapped when DMA remapping is enabled. I'm not sure
> > whether ARM has a similar capability, or whether there might be general
> > usage beyond VT-d. For now the only usage in my mind is assigning a
> > device with an associated RMRR on VT-d (an Intel GPU, or some USB
> > controllers), where the RMRR info needs to be propagated to the guest
> > (since an identity mapping also means reservation of virtual address
> > space).
> 
> Yes, I think adding MEM_T_IDENTITY will be necessary. I can see they are
> used for both the iGPU and USB controllers on my x86 machines. Do you
> know more precisely what they are used for by the firmware?

The VT-d spec has a clear description:

3.14 Handling Requests to Reserved System Memory
Reserved system memory regions are typically allocated by BIOS at boot 
time and reported to OS as reserved address ranges in the system memory 
map. Requests-without-PASID to these reserved regions may either occur 
as a result of operations performed by the system software driver (for 
example in the case of DMA from unified memory access (UMA) graphics 
controllers to graphics reserved memory), or may be initiated by non 
system software (for example in case of DMA performed by a USB 
controller under BIOS SMM control for legacy keyboard emulation). 
For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the second-level translation 
structures for the respective devices are expected to be set up to provide
identity mapping for the specified reserved memory regions with read 
and write permissions.
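
With the current draft, a driver receiving such a MEM_T_IDENTITY hint could satisfy it with the existing MAP request by making the virtual and physical addresses equal. A hedged sketch, where the request layout loosely follows the v0.4 draft and as_id, rmrr_base and rmrr_size are assumed variables:

/* Identity-map an RMRR-like region: VA == PA, read+write.
 * Field and constant names are illustrative, not normative. */
struct virtio_iommu_req_map map = {
        .head.type     = VIRTIO_IOMMU_T_MAP,
        .address_space = cpu_to_le32(as_id),
        .virt_addr     = cpu_to_le64(rmrr_base),
        .phys_addr     = cpu_to_le64(rmrr_base),   /* identity mapping */
        .size          = cpu_to_le64(rmrr_size),
        .flags         = cpu_to_le32(VIRTIO_IOMMU_MAP_F_READ |
                                     VIRTIO_IOMMU_MAP_F_WRITE),
};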

(one specific example for

[virtio-dev] RE: [RFC] virtio-iommu version 0.4

2017-08-28 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, August 23, 2017 6:01 PM
> 
> On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> > Other extensions are in preparation. I won't detail them here because
> > v0.4 already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu
> >   manages both physical and emulated endpoints, with different
> >   underlying IOMMUs, we now have a way to describe multiple page and
> >   block granularities, instead of forcing the guest to use the most
> >   restricted one for all endpoints. This will most likely be in v0.5.
> 
> In order to extend requests with PASIDs and (later) nested mode, I intend
> to rename the "address_space" field to "domain", since it is a lot more
> precise about what the field is referring to, and the current name would
> make these extensions confusing. Please find the rationale at [1].
> "ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS"
> will be "VIRTIO_IOMMU_F_DOMAIN_BITS".
> 
> For those who had time to read this version, do you have other comments
> or suggestions about v0.4? Otherwise it is the only update I have for
> v0.5 (along with the fine-grained address range and page size properties
> from the quoted text) and I will send it soon.
> 
> In particular, please tell me now if you see the need for other
> destructive changes like this one. They will be impossible to introduce
> once a driver or device is upstream.
> 
> Thanks,
> Jean
> 
> [1] https://www.spinics.net/lists/kvm/msg154573.html

Here are some comments:

1.1 Motivation

You describe I/O page fault handling as future work. It seems you considered
only recoverable faults (since "aka. PCI PRI" is used). What about other
unrecoverable faults, e.g. what to do if a virtual DMA request doesn't find
a valid mapping? Even when there is no PRI support, we need some basic
form of fault reporting mechanism to indicate such errors to the guest.

2.6.8.2 Property RESV_MEM

It's not immediately clear to me when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
should be explicitly reported. Is there any real example on a bare-metal IOMMU?
Usually reserved memory is reported to the CPU through other methods (e.g. e820
on x86 platforms). Of course MSI is a special case, covered by the BYPASS
and MSI flags... If yes, maybe you can also include an example in the
implementation notes.

Another thing I want to ask your opinion on: whether there is value in
adding another subtype (MEM_T_IDENTITY) asking for an identity mapping
in the address space. It's similar to the Reserved Memory Region
Reporting (RMRR) structure defined in VT-d, which indicates
BIOS-allocated reserved memory ranges that may be DMA targets and have
to be identity-mapped when DMA remapping is enabled. I'm not sure
whether ARM has a similar capability, or whether there might be general
usage beyond VT-d. For now the only usage in my mind is assigning a
device with an associated RMRR on VT-d (an Intel GPU, or some USB
controllers), where the RMRR info needs to be propagated to the guest
(since an identity mapping also means reservation of virtual address space).

2.6.8.2.3 Device Requirements: Property RESV_MEM

--citation start--
If an endpoint is attached to an address space, the device SHOULD leave 
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
regions pass through untranslated. In other words, the device SHOULD 
handle such a region as if it was identity-mapped (virtual address equal to
physical address). If the endpoint is not attached to any address space, 
then the device MAY abort the transaction.
--citation end--

I have a question about the last sentence. By definition, BYPASS is
orthogonal to whether an address space is attached, so should we
still allow the "MAY abort" behavior?

Thanks
Kevin 


[virtio-dev] RE: [RFC 2/3] virtio-iommu: device probing and operations

2017-08-22 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Tuesday, August 22, 2017 10:19 PM
> 
> On 22/08/17 07:24, Tian, Kevin wrote:
> >>> (Sorry to pick up this old thread; the .tex one is not good for
> >>> review, and this thread provides the necessary background for
> >>> IOASID.)
> >>>
> >>> Hi, Jean,
> >>>
> >>> I'd like to hear more clarification regarding the relationship
> >>> between IOASID and PASID. Reading back the above explanation, it
> >>> looks confusing to me now (though I might have gotten the meaning
> >>> months ago :/). At least Intel VT-d only understands PASIDs (or you
> >>> can think IOASID = PASID). There is no such layered address space
> >>> concept. Then for map/unmap type requests, do you intend to steal
> >>> some PASIDs for that purpose on such an architecture (since IOASID
> >>> is a mandatory field in map/unmap requests)?
> >>
> >> IOASID is a logical ID; it isn't used by hardware. The address space
> >> concept in virtio-iommu allows grouping endpoints together, so that
> >> they have the same address space. I thought it was pretty much the
> >> same as "domains" in VT-d? In any case, it is the same as domains in
> >> Linux. An IOASID provides a handle for communication between the
> >> virtio-iommu device and driver, but unlike a PASID, the IOASID number
> >> doesn't mean anything outside of virtio-iommu.
> >
> > Thanks. It's clear to me then.
> >
> > Btw, does it make more sense to use "domain id" instead of "IO address
> > space id"? For one, when talking about layered address spaces, the
> > parent address space is usually a superset of all child address spaces,
> > which doesn't apply to this case, since anonymous address spaces and
> > PASID-tagged address spaces are completely isolated. Instead, 'domain'
> > is a more inclusive terminology that embraces multiple address spaces.
> > For two, 'domain' is better aligned with software terminology (e.g.
> > iommu_domain) and is easier for people to catch on to. :-)
> 
> I do agree that the naming isn't great. I didn't use "domain" for various
> reasons (it also has a different meaning on ARM) but I keep regretting
> it. As there is no virtio-iommu code upstream yet, it is still possible to
> change this one.
> 
> I find that "address space" was a good fit for the baseline device, but
> the name doesn't scale. When introducing PASIDs, the address space
> moves one level down in the programming model. It is contexts, anonymous
> or PASID-tagged, that should be called address spaces. I was considering
> replacing it with "domain", "container", "partition"...
> 
> Even though I don't want to use too much Linux terminology (virtio isn't
> just Linux), "domain" is a better fit, somewhat neutral, and gets the
> point across. A domain has one or more input address spaces and a single
> output address space.
> 
> When introducing nested translation to virtio-iommu (for the guest to
> have virtual machines itself), there will be one or more intermediate
> address spaces. Domains will be nested, with the terminology "parent
> domain" and "child domain". I only briefly looked at a programming model
> for this but I think we can nest virtio-iommus without much hassle.
> 
> If there is no objection, the next version will use "domain" in place of
> "address_space". The change is quite invasive at this point, but I believe
> that it will make things clearer down the road.
> 

Sounds good to me. Thanks.
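
To illustrate why the handle is purely logical, attaching two endpoints to the same domain (the renamed "address space") would just reuse one driver-chosen identifier; the request layout below is a sketch loosely based on the draft, with ep_id an assumed variable:

/* Group two endpoints in one translation domain. The identifier is a
 * logical handle chosen by the driver, not a PASID; names are
 * illustrative. */
struct virtio_iommu_req_attach attach = {
        .head.type     = VIRTIO_IOMMU_T_ATTACH,
        .address_space = cpu_to_le32(1),      /* "domain" after the rename */
        .device        = cpu_to_le32(ep_id),  /* send once per endpoint */
};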