Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-02-03 Thread Christoffer Dall
On Wed, Feb 03, 2016 at 01:10:58PM +0000, Will Deacon wrote:
> On Wed, Feb 03, 2016 at 01:50:47PM +0100, Christoffer Dall wrote:
> > On Mon, Feb 01, 2016 at 02:03:51PM +0000, Will Deacon wrote:
> > > On Fri, Jan 29, 2016 at 10:25:52PM +0100, Eric Auger wrote:
> > > > On 01/29/2016 08:33 PM, Alex Williamson wrote:
> > > > >>> We know that x86 handles MSI vectors specially, so there is some
> > > > >>> hardware that helps the situation.  It's not just that x86 has a
> > > > >>> fixed range for MSI, it's how it manages that range when interrupt
> > > > >>> remapping hardware is enabled.  A device table indexed by source-ID
> > > > >>> references a per device table indexed by data from the MSI write
> > > > >>> itself.  So we get much, much finer granularity,
> > > > >> About the granularity, I think ARM GICv3 now provides a similar
> > > > >> capability with the GICv3 ITS (Interrupt Translation Service). Along
> > > > >> with the MSI message write transaction, the device outputs a DeviceID
> > > > >> conveyed on the bus. This DeviceID (~ your source-ID) is used to
> > > > >> index a device table. The entry in the device table points to a
> > > > >> per-DeviceID interrupt translation table indexed by the EventID found
> > > > >> in the MSI message, and the entry in that table eventually gives you
> > > > >> the interrupt ID targeted by the MSI message.
> > > > >> This translation capability is not available in GICv2M though, i.e.
> > > > >> the one I am currently using.
> > > > >>  
> > > > >> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
> > > 
> > > That's right. GICv3/ITS disambiguates the interrupt source using the
> > > DeviceID, which for PCI is derived from the Requester ID of the endpoint.
> > > GICv2m is less flexible and requires a separate physical frame per guest
> > > to achieve isolation.
> > > 
> > We should still support MSI passthrough with a single MSI frame host
> > system though, right?
> 
> I think we should treat the frame as an exclusive resource and assign it
> to a single VM.

so on a single-frame GICv2m system, either your host or a single VM gets
to do MSIs...

> 
> > (Users should just be aware that guests are not fully protected against
> > misbehaving hardware in that case).
> 
> Is it confined to misbehaving hardware? What if a malicious/buggy guest
> configures its device to DMA all over the doorbell?
> 
I guess not; I suppose we can't trap any configuration access and
mediate that for any device.  Bummer.

-Christoffer


Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-02-03 Thread Will Deacon
On Wed, Feb 03, 2016 at 01:50:47PM +0100, Christoffer Dall wrote:
> On Mon, Feb 01, 2016 at 02:03:51PM +0000, Will Deacon wrote:
> > On Fri, Jan 29, 2016 at 10:25:52PM +0100, Eric Auger wrote:
> > > On 01/29/2016 08:33 PM, Alex Williamson wrote:
> > > >>> We know that x86 handles MSI vectors specially, so there is some
> > > >>> hardware that helps the situation.  It's not just that x86 has a fixed
> > > >>> range for MSI, it's how it manages that range when interrupt remapping
> > > >>> hardware is enabled.  A device table indexed by source-ID references a
> > > >>> per device table indexed by data from the MSI write itself.  So we get
> > > >>> much, much finer granularity,
> > > >> About the granularity, I think ARM GICv3 now provides a similar
> > > >> capability with the GICv3 ITS (Interrupt Translation Service). Along
> > > >> with the MSI message write transaction, the device outputs a DeviceID
> > > >> conveyed on the bus. This DeviceID (~ your source-ID) is used to index
> > > >> a device table. The entry in the device table points to a per-DeviceID
> > > >> interrupt translation table indexed by the EventID found in the MSI
> > > >> message, and the entry in that table eventually gives you the
> > > >> interrupt ID targeted by the MSI message.
> > > >> This translation capability is not available in GICv2M though, i.e.
> > > >> the one I am currently using.
> > > >>  
> > > >> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
> > 
> > That's right. GICv3/ITS disambiguates the interrupt source using the
> > DeviceID, which for PCI is derived from the Requester ID of the endpoint.
> > GICv2m is less flexible and requires a separate physical frame per guest
> > to achieve isolation.
> > 
> We should still support MSI passthrough with a single MSI frame host
> system though, right?

I think we should treat the frame as an exclusive resource and assign it
to a single VM.

> (Users should just be aware that guests are not fully protected against
> misbehaving hardware in that case).

Is it confined to misbehaving hardware? What if a malicious/buggy guest
configures its device to DMA all over the doorbell?

Will


Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-02-03 Thread Christoffer Dall
On Mon, Feb 01, 2016 at 02:03:51PM +0000, Will Deacon wrote:
> On Fri, Jan 29, 2016 at 10:25:52PM +0100, Eric Auger wrote:
> > On 01/29/2016 08:33 PM, Alex Williamson wrote:
> > >>> We know that x86 handles MSI vectors specially, so there is some
> > >>> hardware that helps the situation.  It's not just that x86 has a fixed
> > >>> range for MSI, it's how it manages that range when interrupt remapping
> > >>> hardware is enabled.  A device table indexed by source-ID references a
> > >>> per device table indexed by data from the MSI write itself.  So we get
> > >>> much, much finer granularity,
> > >> About the granularity, I think ARM GICv3 now provides a similar
> > >> capability with the GICv3 ITS (Interrupt Translation Service). Along
> > >> with the MSI message write transaction, the device outputs a DeviceID
> > >> conveyed on the bus. This DeviceID (~ your source-ID) is used to index a
> > >> device table. The entry in the device table points to a per-DeviceID
> > >> interrupt translation table indexed by the EventID found in the MSI
> > >> message, and the entry in that table eventually gives you the interrupt
> > >> ID targeted by the MSI message.
> > >> This translation capability is not available in GICv2M though, i.e. the
> > >> one I am currently using.
> > >>  
> > >> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
> 
> That's right. GICv3/ITS disambiguates the interrupt source using the
> DeviceID, which for PCI is derived from the Requester ID of the endpoint.
> GICv2m is less flexible and requires a separate physical frame per guest
> to achieve isolation.
> 
We should still support MSI passthrough with a single MSI frame host
system though, right?

(Users should just be aware that guests are not fully protected against
misbehaving hardware in that case).

-Christoffer


Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-02-01 Thread Will Deacon
On Fri, Jan 29, 2016 at 10:25:52PM +0100, Eric Auger wrote:
> On 01/29/2016 08:33 PM, Alex Williamson wrote:
> >>> We know that x86 handles MSI vectors specially, so there is some
> >>> hardware that helps the situation.  It's not just that x86 has a fixed
> >>> range for MSI, it's how it manages that range when interrupt remapping
> >>> hardware is enabled.  A device table indexed by source-ID references a
> >>> per device table indexed by data from the MSI write itself.  So we get
> >>> much, much finer granularity,
> >> About the granularity, I think ARM GICv3 now provides a similar
> >> capability with the GICv3 ITS (Interrupt Translation Service). Along with
> >> the MSI message write transaction, the device outputs a DeviceID conveyed
> >> on the bus. This DeviceID (~ your source-ID) is used to index a device
> >> table. The entry in the device table points to a per-DeviceID interrupt
> >> translation table indexed by the EventID found in the MSI message, and
> >> the entry in that table eventually gives you the interrupt ID targeted by
> >> the MSI message.
> >> This translation capability is not available in GICv2M though, i.e. the
> >> one I am currently using.
> >>  
> >> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)

That's right. GICv3/ITS disambiguates the interrupt source using the
DeviceID, which for PCI is derived from the Requester ID of the endpoint.
GICv2m is less flexible and requires a separate physical frame per guest
to achieve isolation.

Will


Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-01-29 Thread Eric Auger
Hi Alex,
On 01/29/2016 08:33 PM, Alex Williamson wrote:
> On Fri, 2016-01-29 at 15:35 +0100, Eric Auger wrote:
>> Hi Alex,
>> On 01/28/2016 10:51 PM, Alex Williamson wrote:
>>> On Tue, 2016-01-26 at 13:12 +0000, Eric Auger wrote:
>>>> This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
>>>> It pursues the efforts done on [1], [2], [3]. It also aims at covering
>>>> the same need on some PowerPC platforms.
>>>>  
>>>> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_0000h] are
>>>> directed as interrupt messages: accesses to this special PA window
>>>> directly target the APIC configuration space and not DRAM, meaning the
>>>> downstream IOMMU is bypassed.
>>>>  
>>>> This is not the case on the above-mentioned platforms, where MSI messages
>>>> emitted by devices are conveyed through the IOMMU. This means an
>>>> IOVA/host PA mapping must exist for the MSI to reach the MSI controller.
>>>> The normal way to create IOVA bindings consists in using the VFIO DMA MAP
>>>> API. However in this case the MSI IOVA is not mapped onto guest RAM but
>>>> onto a host physical page (the MSI controller frame).
>>>>  
>>>> Following first comments, the spirit of [2] is kept: the guest registers
>>>> an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver
>>>> allocates its MSI vectors, it overwrites the MSI controller physical
>>>> address with an IOVA, allocated within the window provided by the
>>>> userspace. This IOVA is mapped onto the MSI controller frame physical
>>>> page.
>>>>  
>>>> The series does not yet address the problem of telling userspace how
>>>> much IOVA it should provision.
>>>  
>>> I'm sort of on a think-different approach today, so bear with me; how is
>>> it that x86 can make interrupt remapping so transparent to drivers like
>>> vfio-pci while for ARM and ppc we seem to be stuck with doing these
>>> fixups of the physical vector ourselves, implying ugly (no offense)
>>> paths bouncing through vfio to connect the driver and iommu backends?
>>>  
>>> We know that x86 handles MSI vectors specially, so there is some
>>> hardware that helps the situation.  It's not just that x86 has a fixed
>>> range for MSI, it's how it manages that range when interrupt remapping
>>> hardware is enabled.  A device table indexed by source-ID references a
>>> per device table indexed by data from the MSI write itself.  So we get
>>> much, much finer granularity,
>> About the granularity, I think ARM GICv3 now provides a similar
>> capability with the GICv3 ITS (Interrupt Translation Service). Along with
>> the MSI message write transaction, the device outputs a DeviceID conveyed
>> on the bus. This DeviceID (~ your source-ID) is used to index a device
>> table. The entry in the device table points to a per-DeviceID interrupt
>> translation table indexed by the EventID found in the MSI message, and
>> the entry in that table eventually gives you the interrupt ID targeted by
>> the MSI message.
>> This translation capability is not available in GICv2M though, i.e. the
>> one I am currently using.
>>  
>> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)
> 
> So it sounds like the interrupt remapping plumbing needs to be
> implemented for those chips.  How does ITS identify an MSI versus any
> other DMA write?  Does it need to be within a preconfigured address
> space like on x86 or does it know this implicitly by the transaction
> (which doesn't seem possible on PCIe)?

It seems there is a kind of misunderstanding here. Assuming a "simple"
system with a single ITS, all devices likely to produce MSIs must write
those messages to a single register, located in the ITS MSI 64kB frame
(this register is called GITS_TRANSLATER). The ITS then discriminates
between senders using the DeviceID conveyed out-of-band on the bus (or
by other implementation-defined means). For each such DeviceID, a
per-DeviceID interrupt translation table is supposed to exist, otherwise
the translation faults. If any "undeclared" device writes into that
register, its DeviceID will be unknown. It looks like on Intel the
interrupt remapping HW is rather abstracted on the IOMMU side; I did not
take time yet to carefully read the VT-d spec, but maybe the Intel
interrupt remapping HW rather acts as an IOMMU that takes an input MSI
address within the famous window and applies a translation scheme based
on the MSI address & data? On ARM the input MSI address always is
GITS_TRANSLATER and the translation scheme is then based on out-of-band
info (DeviceID) + data content (EventID). I hope this clarifies.
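
To make the lookup concrete, here is a rough model of the translation
path just described (illustrative pseudo-C only; the structure and
function names are made up and this is not the actual irq-gic-v3-its.c
code):

#include <stddef.h>
#include <stdint.h>

struct its_device {
	uint32_t device_id;	/* conveyed out-of-band on the bus */
	uint32_t nr_events;	/* size of this device's ITT */
	uint32_t *itt;		/* interrupt translation table: EventID -> LPI */
};

/*
 * Every MSI write lands on the single GITS_TRANSLATER doorbell; the
 * written data is the EventID, and the DeviceID arrives as bus sideband.
 */
static int its_translate(struct its_device **dev_table, size_t nr_devices,
			 uint32_t device_id, uint32_t event_id,
			 uint32_t *out_lpi)
{
	struct its_device *dev;

	if (device_id >= nr_devices)
		return -1;		/* "undeclared" device: fault */
	dev = dev_table[device_id];
	if (!dev || event_id >= dev->nr_events)
		return -1;
	*out_lpi = dev->itt[event_id];	/* physical interrupt ID (LPI) */
	return 0;
}

So the key difference from x86 is only where the lookup key comes from:
one fixed doorbell address plus (DeviceID, EventID), rather than the MSI
address & data pair.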
> 
> Along with this discussion, we should probably be revisiting whether
> existing ARM SMMUs should be exposing the IOMMU_CAP_INTR_REMAP
> capability.

so according to the above explanation I am not sure it is relevant.
Will/Marc might correct me if I got something wrong.
> This capability is meant to indicate interrupt isolation,
> but if an entire page of IOVA space is 

Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-01-29 Thread Alex Williamson
On Fri, 2016-01-29 at 15:35 +0100, Eric Auger wrote:
> Hi Alex,
> On 01/28/2016 10:51 PM, Alex Williamson wrote:
> > On Tue, 2016-01-26 at 13:12 +0000, Eric Auger wrote:
> > > This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
> > > It pursues the efforts done on [1], [2], [3]. It also aims at covering
> > > the same need on some PowerPC platforms.
> > >  
> > > On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_0000h] are
> > > directed as interrupt messages: accesses to this special PA window
> > > directly target the APIC configuration space and not DRAM, meaning the
> > > downstream IOMMU is bypassed.
> > >  
> > > This is not the case on the above-mentioned platforms, where MSI messages
> > > emitted by devices are conveyed through the IOMMU. This means an
> > > IOVA/host PA mapping must exist for the MSI to reach the MSI controller.
> > > The normal way to create IOVA bindings consists in using the VFIO DMA MAP
> > > API. However in this case the MSI IOVA is not mapped onto guest RAM but
> > > onto a host physical page (the MSI controller frame).
> > >  
> > > Following first comments, the spirit of [2] is kept: the guest registers
> > > an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver
> > > allocates its MSI vectors, it overwrites the MSI controller physical
> > > address with an IOVA, allocated within the window provided by the
> > > userspace. This IOVA is mapped onto the MSI controller frame physical
> > > page.
> > >  
> > > The series does not yet address the problem of telling userspace how
> > > much IOVA it should provision.
> > 
> > I'm sort of on a think-different approach today, so bear with me; how is
> > it that x86 can make interrupt remapping so transparent to drivers like
> > vfio-pci while for ARM and ppc we seem to be stuck with doing these
> > fixups of the physical vector ourselves, implying ugly (no offense)
> > paths bouncing through vfio to connect the driver and iommu backends?
> > 
> > We know that x86 handles MSI vectors specially, so there is some
> > hardware that helps the situation.  It's not just that x86 has a fixed
> > range for MSI, it's how it manages that range when interrupt remapping
> > hardware is enabled.  A device table indexed by source-ID references a
> > per device table indexed by data from the MSI write itself.  So we get
> > much, much finer granularity,
> About the granularity, I think ARM GICv3 now provides a similar
> capability with the GICv3 ITS (Interrupt Translation Service). Along with
> the MSI message write transaction, the device outputs a DeviceID conveyed
> on the bus. This DeviceID (~ your source-ID) is used to index a device
> table. The entry in the device table points to a per-DeviceID interrupt
> translation table indexed by the EventID found in the MSI message, and
> the entry in that table eventually gives you the interrupt ID targeted by
> the MSI message.
> This translation capability is not available in GICv2M though, i.e. the
> one I am currently using.
> 
> Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)

So it sounds like the interrupt remapping plumbing needs to be
implemented for those chips.  How does ITS identify an MSI versus any
other DMA write?  Does it need to be within a preconfigured address
space like on x86 or does it know this implicitly by the transaction
(which doesn't seem possible on PCIe)?

Along with this discussion, we should probably be revisiting whether
existing ARM SMMUs should be exposing the IOMMU_CAP_INTR_REMAP
capability.  This capability is meant to indicate interrupt isolation,
but if an entire page of IOVA space is mapped through the IOMMU to a
range of interrupts and some of those interrupts are shared with host
devices or other VMs, then we really don't have that isolation and the
system is susceptible to one VM interfering with another or with the
host.  If that's the case, the SMMU should not be claiming
IOMMU_CAP_INTR_REMAP.
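
For reference, a simplified sketch of how a kernel consumer such as VFIO
can act on that capability; the function below is illustrative, but
iommu_capable() and IOMMU_CAP_INTR_REMAP are the real kernel-internal
interface:

#include <linux/errno.h>
#include <linux/iommu.h>

/* Simplified sketch; the real policy check lives in vfio_iommu_type1.c. */
static int check_intr_isolation(struct bus_type *bus, bool allow_unsafe)
{
	/*
	 * Refuse to attach unless the IOMMU guarantees interrupt isolation
	 * or the admin explicitly opted in to unsafe interrupts.
	 */
	if (!allow_unsafe && !iommu_capable(bus, IOMMU_CAP_INTR_REMAP))
		return -EPERM;
	return 0;
}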

> > but there's still effectively an interrupt
> > domain per device that's being transparently managed under the covers
> > whenever we request an MSI vector for a device.
> > 
> > So why can't we do something more like that here?  There's no predefined
> > MSI vector range, so defining an interface for the user to specify that
> > is unavoidable.
> Do you confirm that the VFIO user API is still the right choice to
> provide that IOVA range?

I don't see that we have an option there unless ARM wants to
retroactively reserve a range of IOVA space in the spec, which is
certainly not going to happen.  The only other thing that comes to mind
would be if there was an existing address space which could never be
backed by RAM or other DMA capable targets.  But that seems far fetched
as well.

> > But why shouldn't everything else be transparent?  We
> > could add an interface to the IOMMU API that allows us to register that
> > reserved range for the 

Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-01-29 Thread Eric Auger
Hi Alex,
On 01/28/2016 10:51 PM, Alex Williamson wrote:
>> On Tue, 2016-01-26 at 13:12 +0000, Eric Auger wrote:
>> This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
>> It pursues the efforts done on [1], [2], [3]. It also aims at covering
>> the same need on some PowerPC platforms.
>>  
>> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_0000h] are
>> directed as interrupt messages: accesses to this special PA window
>> directly target the APIC configuration space and not DRAM, meaning the
>> downstream IOMMU is bypassed.
>>  
>> This is not the case on the above-mentioned platforms, where MSI messages
>> emitted by devices are conveyed through the IOMMU. This means an IOVA/host
>> PA mapping must exist for the MSI to reach the MSI controller. The normal
>> way to create IOVA bindings consists in using the VFIO DMA MAP API.
>> However in this case the MSI IOVA is not mapped onto guest RAM but onto a
>> host physical page (the MSI controller frame).
>>  
>> Following first comments, the spirit of [2] is kept: the guest registers
>> an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver
>> allocates its MSI vectors, it overwrites the MSI controller physical
>> address with an IOVA, allocated within the window provided by the
>> userspace. This IOVA is mapped onto the MSI controller frame physical
>> page.
>>  
>> The series does not yet address the problem of telling userspace how much
>> IOVA it should provision.
> 
> I'm sort of on a think-different approach today, so bear with me; how is
> it that x86 can make interrupt remapping so transparent to drivers like
> vfio-pci while for ARM and ppc we seem to be stuck with doing these
> fixups of the physical vector ourselves, implying ugly (no offense)
> paths bouncing through vfio to connect the driver and iommu backends?
> 
> We know that x86 handles MSI vectors specially, so there is some
> hardware that helps the situation.  It's not just that x86 has a fixed
> range for MSI, it's how it manages that range when interrupt remapping
> hardware is enabled.  A device table indexed by source-ID references a
> per device table indexed by data from the MSI write itself.  So we get
> much, much finer granularity,
About the granularity, I think ARM GICv3 now provides a similar
capability with the GICv3 ITS (Interrupt Translation Service). Along with
the MSI message write transaction, the device outputs a DeviceID conveyed
on the bus. This DeviceID (~ your source-ID) is used to index a device
table. The entry in the device table points to a per-DeviceID interrupt
translation table indexed by the EventID found in the MSI message, and
the entry in that table eventually gives you the interrupt ID targeted by
the MSI message.
This translation capability is not available in GICv2M though, i.e. the
one I am currently using.

Those tables currently are built by the ITS irqchip (irq-gic-v3-its.c)

> but there's still effectively an interrupt
> domain per device that's being transparently managed under the covers
> whenever we request an MSI vector for a device.
> 
> So why can't we do something more like that here?  There's no predefined
> MSI vector range, so defining an interface for the user to specify that
> is unavoidable.
Do you confirm that the VFIO user API is still the right choice to
provide that IOVA range?
> But why shouldn't everything else be transparent?  We
> could add an interface to the IOMMU API that allows us to register that
> reserved range for the IOMMU domain.  IOMMU-core (or maybe interrupt
> remapping) code might allocate an IOVA domain for this just as you've
> done in the type1 code here.
I have no objection to moving that iova allocation scheme somewhere else.
I just need to figure out how to deal with the fact that iova.c is not
compiled everywhere, as I noticed too late ;-)

> But rather than having any interaction
> with vfio-pci, why not do this at lower levels such that the platform
> interrupt vector allocation code automatically uses one of those IOVA
> ranges and returns the IOVA rather than the physical address for the PCI
> code to program into the device?  I think we know what needs to be done,
> but we're taking the approach of managing the space ourselves and doing
> a fixup of the device after the core code has done its job when we
> really ought to be letting the core code manage a space that we define
> and programming the device so that it doesn't need a fixup in the
> vfio-pci code.  Wouldn't it be nicer if pci_enable_msix_range() returned
> with the device properly programmed, or generated an error if there's not
> enough reserved mapping space in the IOMMU domain?  Can it be done?
I agree with you that it would be cleaner to manage that natively at the
MSI controller level instead of patching the address value in
vfio_pci_intrs.c. I will investigate in that direction, but I need some
more time to understand the links between the MSI controller, the PCI
device and the IOMMU.
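
To summarize the current flow before moving anything, here is a condensed
sketch of what the series does today (the window struct and allocation
helper are hypothetical stand-ins for the type1 code; iommu_map() is the
real API):

#include <linux/errno.h>
#include <linux/iommu.h>
#include <linux/mm.h>

struct msi_iova_window;				/* hypothetical type */
extern dma_addr_t alloc_iova_from_window(struct msi_iova_window *win,
					 size_t size);	/* hypothetical helper */

/* Condensed sketch of the series' flow, not the actual patch code. */
static int vfio_map_msi_doorbell(struct iommu_domain *domain,
				 struct msi_iova_window *win,
				 phys_addr_t doorbell_pa,	/* e.g. GICv2m/ITS frame */
				 dma_addr_t *iova)
{
	/* Carve one page out of the userspace-registered MSI IOVA window. */
	*iova = alloc_iova_from_window(win, PAGE_SIZE);
	if (!*iova)
		return -ENOSPC;

	/* Map it onto the MSI controller frame so device MSI writes reach it. */
	return iommu_map(domain, *iova, doorbell_pa & PAGE_MASK, PAGE_SIZE,
			 IOMMU_WRITE | IOMMU_MMIO);
}

vfio_pci_intrs.c then programs *iova rather than doorbell_pa into the
device, which is exactly the fixup Alex suggests pushing down into the
core MSI path.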

Best Regards

Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-01-28 Thread Alex Williamson
On Tue, 2016-01-26 at 13:12 +0000, Eric Auger wrote:
> This series addresses KVM PCIe passthrough with MSI enabled on ARM/ARM64.
> It pursues the efforts done on [1], [2], [3]. It also aims at covering the
> same need on some PowerPC platforms.
> 
> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_0000h] are
> directed as interrupt messages: accesses to this special PA window directly
> target the APIC configuration space and not DRAM, meaning the downstream
> IOMMU is bypassed.
> 
> This is not the case on the above-mentioned platforms, where MSI messages
> emitted by devices are conveyed through the IOMMU. This means an IOVA/host
> PA mapping must exist for the MSI to reach the MSI controller. The normal
> way to create IOVA bindings consists in using the VFIO DMA MAP API. However
> in this case the MSI IOVA is not mapped onto guest RAM but onto a host
> physical page (the MSI controller frame).
> 
> Following first comments, the spirit of [2] is kept: the guest registers
> an IOVA range reserved for MSI mapping. When the VFIO-PCIe driver allocates
> its MSI vectors, it overwrites the MSI controller physical address with an
> IOVA, allocated within the window provided by the userspace. This IOVA is
> mapped onto the MSI controller frame physical page.
> 
> The series does not yet address the problem of telling userspace how much
> IOVA it should provision.

I'm sort of on a think-different approach today, so bear with me; how is
it that x86 can make interrupt remapping so transparent to drivers like
vfio-pci while for ARM and ppc we seem to be stuck with doing these
fixups of the physical vector ourselves, implying ugly (no offense)
paths bouncing through vfio to connect the driver and iommu backends?

We know that x86 handles MSI vectors specially, so there is some
hardware that helps the situation.  It's not just that x86 has a fixed
range for MSI, it's how it manages that range when interrupt remapping
hardware is enabled.  A device table indexed by source-ID references a
per device table indexed by data from the MSI write itself.  So we get
much, much finer granularity, but there's still effectively an interrupt
domain per device that's being transparently managed under the covers
whenever we request an MSI vector for a device.

So why can't we do something more like that here?  There's no predefined
MSI vector range, so defining an interface for the user to specify that
is unavoidable.  But why shouldn't everything else be transparent?  We
could add an interface to the IOMMU API that allows us to register that
reserved range for the IOMMU domain.  IOMMU-core (or maybe interrupt
remapping) code might allocate an IOVA domain for this just as you've
done in the type1 code here.  But rather than having any interaction
with vfio-pci, why not do this at lower levels such that the platform
interrupt vector allocation code automatically uses one of those IOVA
ranges and returns the IOVA rather than the physical address for the PCI
code to program into the device?  I think we know what needs to be done,
but we're taking the approach of managing the space ourselves and doing
a fixup of the device after the core code has done its job when we
really ought to be letting the core code manage a space that we define
and programming the device so that it doesn't need a fixup in the
vfio-pci code.  Wouldn't it be nicer if pci_enable_msix_range() returned
with the device properly programmed, or generated an error if there's not
enough reserved mapping space in the IOMMU domain?  Can it be done?  Thanks,

Alex
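
A sketch of what such an IOMMU-API extension could look like, with
entirely hypothetical function names, just to make the proposal concrete:

#include <linux/types.h>

struct iommu_domain;

/*
 * Hypothetical: VFIO registers the user-provided reserved IOVA window
 * once per domain.
 */
int iommu_domain_set_msi_window(struct iommu_domain *domain,
				dma_addr_t base, size_t size);

/*
 * Hypothetical hook for the MSI layer: when composing a message for a
 * device attached to @domain, return a doorbell IOVA allocated from the
 * window above instead of the physical address, so that
 * pci_enable_msix_range() completes with the device already programmed
 * and vfio-pci needs no fixup.
 */
int iommu_msi_get_doorbell_iova(struct iommu_domain *domain,
				phys_addr_t doorbell_pa, dma_addr_t *iova);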



Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-01-28 Thread Eric Auger
Hi Pavel,
On 01/28/2016 08:13 AM, Pavel Fedin wrote:
>  Hello!
> 
>> x86 isn't problem-free in this space.  An x86 VM is going to know that
>> the 0xfee00000 address range is special, it won't be backed by RAM and
>> won't be a DMA target, thus we'll never attempt to map it for an iova
>> address.  However, if we run a non-x86 VM or a userspace driver, it
>> doesn't necessarily know that there's anything special about that range
>> of iovas.  I intend to resolve this with an extension to the iommu info
>> ioctl that describes the available iova space for the iommu.  The
>> interrupt region would simply be excluded.
> 
>  I see now, but I still don't understand how it would work. How can we
> tell the guest OS that we cannot do DMA to this particular area? Just
> exclude it from RAM at all? But this means we would have to modify the
> machine's model...
>  I know that this is a bit different story from what we are implementing
> now. Just curious.

Well, in QEMU mach-virt we have a static guest PA memory map. Maybe in
some other virt machines this is different and it is possible to take
into account the fact that an IOVA range cannot be used?
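
One hypothetical shape for the iommu info ioctl extension Alex mentions
(a sketch, not existing UAPI): userspace would read a list of usable IOVA
ranges, with the MSI window simply excluded, and lay out guest RAM and
DMA targets accordingly:

#include <linux/types.h>

/* Sketch of a capability-chained VFIO_IOMMU_GET_INFO extension. */
struct vfio_info_cap_header {
	__u16	id;		/* capability ID */
	__u16	version;
	__u32	next;		/* offset of next capability, 0 if last */
};

struct vfio_iova_range {
	__u64	start;
	__u64	end;
};

struct vfio_iommu_info_cap_iova_ranges {
	struct vfio_info_cap_header	header;
	__u32				nr_iovas;
	__u32				reserved;
	struct vfio_iova_range		iova_ranges[];	/* MSI window excluded */
};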

Regards

Eric
> 
> Kind regards,
> Pavel Fedin
> Senior Engineer
> Samsung Electronics Research center Russia
> 
> 




Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-01-28 Thread Eric Auger
Hi Pavel,
On 01/28/2016 08:13 AM, Pavel Fedin wrote:
>  Hello!
> 
>> x86 isn't problem-free in this space.  An x86 VM is going to know that
>> the 0xfee0 address range is special, it won't be backed by RAM and
>> won't be a DMA target, thus we'll never attempt to map it for an iova
>> address.  However, if we run a non-x86 VM or a userspace driver, it
>> doesn't necessarily know that there's anything special about that range
>> of iovas.  I intend to resolve this with an extension to the iommu info
>> ioctl that describes the available iova space for the iommu.  The
>> interrupt region would simply be excluded.
> 
>  I see now, but I still don't understand how it would work. How can we tell
> the guest OS that we cannot do DMA to this particular area? Just exclude it
> from RAM altogether? But this means we would have to modify the machine
> model...
>  I know that this is a somewhat different story from what we are
> implementing now. Just curious.

Well, in QEMU's mach-virt we have a static guest PA memory map. Maybe this
is different in some other virt machines, and it is possible to take into
account the fact that an IOVA range cannot be used?

Regards

Eric
> 
> Kind regards,
> Pavel Fedin
> Senior Engineer
> Samsung Electronics Research center Russia
> 
> 



RE: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-01-27 Thread Pavel Fedin
 Hello!

> x86 isn't problem-free in this space.  An x86 VM is going to know that
> the 0xfee00000 address range is special, it won't be backed by RAM and
> won't be a DMA target, thus we'll never attempt to map it for an iova
> address.  However, if we run a non-x86 VM or a userspace driver, it
> doesn't necessarily know that there's anything special about that range
> of iovas.  I intend to resolve this with an extension to the iommu info
> ioctl that describes the available iova space for the iommu.  The
> interrupt region would simply be excluded.

 I see now, but I still don't understand how it would work. How can we tell
the guest OS that we cannot do DMA to this particular area? Just exclude it
from RAM altogether? But this means we would have to modify the machine
model...
 I know that this is a somewhat different story from what we are implementing
now. Just curious.

Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia




Re: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-01-27 Thread Eric Auger
Hi Pavel,
On 01/26/2016 06:25 PM, Pavel Fedin wrote:
>  Hello!
>  I'd just like to clarify some things for myself and better wrap my head
> around it...
> 
>> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_0000h] are
>> directed as interrupt messages: accesses to this special PA window
>> directly target the APIC configuration space and not DRAM, meaning the
>> downstream IOMMU is bypassed.
> 
>  So, this is effectively the same as always having hardwired 1:1 mappings
> on all IOMMUs, isn't it?
>  If so, then why can't we just do the same, by forcing a similar 1:1
> mapping? This is what I tried to do in my patchset. All of you are talking
> about a situation which arises when we are emulating a different machine
> with a different physical address layout. E.g. if our host has MSI at
> 0xABADCAFE, our target could have valid RAM at the same location, and we
> need to handle it somehow, therefore we have to move our MSI window out of
> the target's RAM. But how does this work on a PC then? What if our host is
> a PC, and we want to emulate some ARM board which has RAM at FE00 0000? Or
> does it mean that the PC architecture is flawed and can reliably handle
> PCI passthrough only for itself?
Alex answered this, I think:
"
x86 isn't problem-free in this space.  An x86 VM is going to know that
the 0xfee00000 address range is special, it won't be backed by RAM and
won't be a DMA target, thus we'll never attempt to map it for an iova
address.  However, if we run a non-x86 VM or a userspace driver, it
doesn't necessarily know that there's anything special about that range
of iovas.  I intend to resolve this with an extension to the iommu info
ioctl that describes the available iova space for the iommu.  The
interrupt region would simply be excluded.
"

I am not sure I've addressed this requirement yet, but it seems more
future-proof to have an IOMMU mapping for those addresses.
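
A sketch of what such an info extension might look like (hypothetical
structure and field names; the actual proposal may differ):

/* hypothetical uapi sketch: report the usable IOVA ranges to userspace */
struct vfio_iommu_iova_range {
	__u64	start;
	__u64	end;	/* inclusive; the MSI window is simply not listed */
};

struct vfio_iommu_type1_info_iova {
	__u32	argsz;
	__u32	flags;
	__u32	nr_ranges;	/* number of ranges that follow */
	struct vfio_iommu_iova_range ranges[];
};

Userspace would then know not to place guest RAM, or any other DMA target,
over the excluded region.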

For the ARM use case I think Marc gave guidance:
"
We want userspace to be in control of the memory map, and it
is the kernel's job to tell us whether or not this matches the HW
capabilities or not. A fixed mapping may completely clash with the
memory map I want (think emulating HW x on platform y), and there is no
reason why we should have the restrictions x86 has.
"

That's the rationale behind respinning it that way.

While waiting for further comments & discussion, I am going to address the
iova and dma_addr_t compilation issues reported by kbuild. Apologies for
those.

Best Regards

Eric


> 
> Kind regards,
> Pavel Fedin
> Senior Engineer
> Samsung Electronics Research center Russia
> 
> 



RE: [PATCH 00/10] KVM PCIe/MSI passthrough on ARM/ARM64

2016-01-26 Thread Pavel Fedin
 Hello!
 I'd just like to clarify some things for myself and better wrap my head
around it...

> On x86 all accesses to the 1MB PA region [FEE0_0000h - FEF0_0000h] are
> directed as interrupt messages: accesses to this special PA window directly
> target the APIC configuration space and not DRAM, meaning the downstream
> IOMMU is bypassed.

 So, this is effectively the same as always having hardwired 1:1 mappings on
all IOMMUs, isn't it?
 If so, then why can't we just do the same, by forcing a similar 1:1 mapping?
This is what I tried to do in my patchset. All of you are talking about a
situation which arises when we are emulating a different machine with a
different physical address layout. E.g. if our host has MSI at 0xABADCAFE,
our target could have valid RAM at the same location, and we need to handle
it somehow, therefore we have to move our MSI window out of the target's
RAM. But how does this work on a PC then? What if our host is a PC, and we
want to emulate some ARM board which has RAM at FE00 0000? Or does it mean
that the PC architecture is flawed and can reliably handle PCI passthrough
only for itself?

Kind regards,
Pavel Fedin
Senior Engineer
Samsung Electronics Research center Russia



