RE: [virtio-dev] [RFC] virtio-iommu version 0.4

2017-09-20 Thread Tian, Kevin
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, September 6, 2017 7:49 PM
> 
> 
> > 2.6.8.2.1
> > Multiple overlapping RESV_MEM properties are merged together. Is that a
> > device requirement? Only if they have the same type, I assume?
> 
> Combination rules apply to both device and driver: they come into play
> when the driver puts multiple endpoints in the same domain. They will
> become important once the guest attempts to do things like combining
> endpoints with different PASID capabilities in the same domain. I'm
> considering these endpoints to be behind different physical IOMMUs.
> 
> For reserved regions, we specify what the driver should do and what the
> device should enforce with regard to mapping IOVAs of overlapping regions.
> For example, if one device has some virtual address marked RESV_T_MSI and
> another device has the same virtual address marked RESV_T_IDENTITY, what
> should the driver do?
> 
> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
> case of RESV_T_RESERVED: it's a hint for the IRQ subsystem and doesn't
> have a meaning within a domain. From the DMA mappings point of view, it is
> effectively the same as RESV_T_RESERVED. When merging RESV_T_RESERVED and
> RESV_T_IDENTITY, we should make it RESV_T_IDENTITY, because it is required
> for one endpoint to work (the one with IDENTITY) and is not accessed by
> the other (the driver must not allocate this IOVA range for DMA).
> 
> More generally, I'd like to add the following combination table to the
> spec, which shows the resulting reserved region type after merging regions
> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric, so I didn't fill in the
> bottom half.
> 
>     | N | R | M | I
>   --+---+---+---+---
>   N | N | R | M | ?
>   R |   | R | R | I
>   M |   |   | M | I
>   I |   |   |   | I
> 
> The 'I' column is problematic. If one endpoint has an IDENTITY region and
> the other doesn't have anything at that virtual address, then making that
> region an identity mapping will result in the second endpoint being able
> to access firmware memory. If using nested translation, stage-2 can help
> us here. But in general we should allow the device to reject an attach
> that would result in a N+I combination if the host considers it too dodgy.
> I think the other combinations are safe enough.
> 
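
The table above is equivalent to taking the "stronger" of the two types.
As a purely illustrative sketch (names made up, not spec text), with the
caveat that the device may reject the N+I case instead of honouring it:

enum resv_type { RESV_NONE, RESV_MSI, RESV_RESERVED, RESV_IDENTITY };

/* Merge two overlapping reserved-region types: higher precedence wins,
 * I > R > M > N, which reproduces every cell of the table above. */
static enum resv_type merge_resv_type(enum resv_type a, enum resv_type b)
{
	return a > b ? a : b;
}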

Will overlap happen in reality? Can we simplify the spec by having the
device not report overlapping regions?

Thanks
Kevin


RE: [RFC] virtio-iommu version 0.4

2017-09-20 Thread Tian, Kevin
> From: Jean-Philippe Brucker
> Sent: Wednesday, September 6, 2017 7:55 PM
> 
> Hi Kevin,
> 
> On 28/08/17 08:39, Tian, Kevin wrote:
> > Here comes some comments:
> >
> > 1.1 Motivation
> >
> > You describe I/O page fault handling as future work. It seems you
> > considered only recoverable faults (since "aka. PCI PRI" is used). What
> > about unrecoverable faults, e.g. what to do if a virtual DMA request
> > doesn't find a valid mapping? Even when there is no PRI support, we need
> > some basic form of fault reporting mechanism to indicate such errors to
> > the guest.
> 
> I am considering recoverable faults as the end goal, but reporting
> unrecoverable faults should use the same queue, with slightly different
> fields and no need for the driver to reply to the device.

What about adding a placeholder for now? Though the same mechanism can be
reused, it's an essential part of making the virtio-iommu architecture
complete, even before talking about support for recoverable faults. :-)
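
As a purely hypothetical placeholder (not part of the v0.4 spec; field
names and sizes are invented for illustration), an unrecoverable-fault
report pushed by the device could look something like:

struct virtio_iommu_fault {
	__u8	reason;		/* unknown endpoint, no mapping, permission, ... */
	__u8	reserved[3];
	__le32	flags;		/* read/write, privileged access, ... */
	__le32	endpoint;	/* faulting endpoint ID */
	__le64	address;	/* faulting virtual address, if known */
};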

> 
> > 2.6.8.2 Property RESV_MEM
> >
> > I'm not immediately clear on when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> > should be explicitly reported. Is there any real example on bare-metal
> > IOMMUs? Usually reserved memory is reported to the CPU through another
> > method (e.g. e820 on x86 platforms). Of course MSI is a special case,
> > which is covered by BYPASS and the MSI flag... If yes, maybe you can
> > also include an example in the implementation notes.
> 
> The RESV_MEM regions only describe IOVA space for the moment, not
> guest-physical, so I guess it provides different information than e820.
> 
> I think a useful example is the PCI bridge windows reported by the Linux
> host to userspace using RESV_RESERVED regions (see
> iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
> addresses that shouldn't be accessed by endpoints because they won't reach
> the IOMMU. These are specific to the physical topology: a device will have
> different reserved regions depending on the PCI slot it occupies.
> 
> When handled properly, PCI bridge windows quickly become a nuisance. With
> kvmtool we observed that carving out their addresses globally removes a
> lot of useful GPA space from the guest. Without a virtual IOMMU we can
> either ignore them and hope everything will be fine, or remove all
> reserved regions from the GPA space (which currently means editing the
> static guest-physical map by hand...)
> 
> That's where RESV_MEM_T_ABORT comes in handy with virtio-iommu. It
> describes reserved IOVAs for a specific endpoint, and therefore removes
> the need to carve the window out of the whole guest.

Understood, and thanks for the elaboration.
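
As a minimal sketch of how a guest driver could consume such per-endpoint
reserved regions, assuming a hypothetical endpoint structure, iterator and
property layout (only reserve_iova()/iova_pfn() are existing helpers):

static void viommu_reserve_regions(struct iova_domain *iovad,
				   struct viommu_endpoint *ep)
{
	struct virtio_iommu_probe_resv_mem *r;

	virtio_iommu_for_each_resv(ep, r) {
		u64 start = le64_to_cpu(r->addr);
		u64 end = start + le64_to_cpu(r->size) - 1;

		/* Take the IOVA range out of the allocator for this domain */
		reserve_iova(iovad, iova_pfn(iovad, start),
			     iova_pfn(iovad, end));
	}
}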

> 
> > Another thing I want to ask your opinion on: whether there is value in
> > adding another subtype (MEM_T_IDENTITY), asking for an identity mapping
> > in the address space. It's similar to the Reserved Memory Region
> > Reporting (RMRR) structure defined in VT-d, used to indicate
> > BIOS-allocated reserved memory ranges which may be DMA targets and have
> > to be identity mapped when DMA remapping is enabled. I'm not sure
> > whether ARM has a similar capability and whether there might be a
> > general usage beyond VT-d. For now the only usage in my mind is
> > assigning a device with an RMRR associated on VT-d (Intel GPU, or some
> > USB controllers), where the RMRR info needs to be propagated to the
> > guest (since identity mapping also means reservation of virtual address
> > space).
> 
> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
> used for both iGPU and USB controllers on my x86 machines. Do you know
> more precisely what they are used for by the firmware?

The VT-d spec has a clear description:

3.14 Handling Requests to Reserved System Memory
Reserved system memory regions are typically allocated by BIOS at boot 
time and reported to OS as reserved address ranges in the system memory 
map. Requests-without-PASID to these reserved regions may either occur 
as a result of operations performed by the system software driver (for 
example in the case of DMA from unified memory access (UMA) graphics 
controllers to graphics reserved memory), or may be initiated by non 
system software (for example in case of DMA performed by a USB 
controller under BIOS SMM control for legacy keyboard emulation). 
For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the second-level translation 
structures for the respective devices are expected to be set up to provide
identity mapping for the specified reserved memory regions with read 
and write permissions.

(One specific example for the GPU is legacy VGA usage early at boot, before
the actual graphics driver is loaded.)
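
Purely as a strawman (not spec text; the layout is invented for
illustration), a RESV_MEM probe property carrying such a subtype could
look like:

struct virtio_iommu_probe_resv_mem {
	__le16	type;		/* VIRTIO_IOMMU_PROBE_T_RESV_MEM */
	__le16	length;
	__u8	subtype;	/* RESERVED, MSI, IDENTITY, ... */
	__u8	reserved[3];
	__le64	addr;
	__le64	size;
};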

> 
> It's not necessary with the base virtio-iommu device though (v0.4),
> because the device can create the identity mappings itself and report them
> to the guest as MEM_T_BYPASS. However, when we start handing page

when you say "the device can create ...", I think 

RE: bind pasid table API

2017-09-20 Thread Liu, Yi L
Hi Jean,

> -Original Message-
> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
> Sent: Wednesday, September 20, 2017 8:10 PM
> To: Pan, Jacob jun ; iommu@lists.linux-foundation.org
> Cc: Liu, Yi L ; Raj, Ashok ; David
> Woodhouse ; Joerg Roedel ; Tian,
> Kevin ; Auger Eric 
> Subject: Re: bind pasid table API
> 
> Hi Jacob,
> 
> [Adding Eric as he might need pasid_table_info for vSVM at some point]
> 
> On 19/09/17 04:45, Jacob Pan wrote:
> > Hi Jean and All,
> >
> > This is a follow-up on the LPC discussion we had last week.
> > (https://linuxplumbersconf.org/2017/ocw/proposals/4748)
> >
> > My understanding is that the data structure below can satisfy the
> > needs from Intel (pointer + size) and AMD (pointer only). But ARM
> > pvIOMMU would need additional info to indicate the page table format.
> > Could you share your idea of the right addition for ARM such that we
> > can have a unified API?
> >
> > /**
> >  * PASID table data used to bind a guest PASID table to the host IOMMU.
> >  * This will enable guest-managed first-level page tables.
> >  * @ptr:        PASID table pointer
> >  * @size_order: number of bits supported in the guest PASID table, must
> >  *              be less than or equal to the host table size.
> >  */
> > struct pasid_table_info {
> > __u64   ptr;
> > __u64   size_order;
> > };
> 
> For the PASID table, Arm SMMUv3 would need two additional fields:
> * 'format' telling whether the table has 1 or 2 levels and their
>   dimensions,
> * 'default_substream' telling if PASID0 is reserved for non-pasid traffic.
> 
> I think that's it for the moment, but it does require leaving space for a
> vendor-specific structure at the end. It is one reason why I'd prefer
> having a 'model' field in the pasid_table_info structure telling what
> fields the whole structure actually contains.
> 
> Another reason is if some IOMMU is able to support multiple PASID table
> formats, it could advertise them all in sysfs and Qemu could tell which one it
> chose in 'model'. I'm not sure we'll ever see that in practice.

Regarding your idea, I think the whole flow may be:
* Qemu queries the underlying IOMMU for its capabilities (through sysfs)
* Combined with the requirements from the user (the person who starts the
  VM), Qemu chooses a suitable model, or exits if the HW capability is
  incompatible with the user's requirements
* In the subsequent bind_pasid_table() call, Qemu passes the chosen model
  info to the host
* The host checks the "model" and uses the corresponding model-specific
  structure to parse the model-specific data, maybe something like
  intel_pasid_table_info, amd_pasid_table_info or arm_pasid_table_info_v#

Does this flow show what you want? This would be a "model" + "model-specific
data" proposal. My concern is that the model-specific field may end up
looking kind of "opaque".

Besides the comments above, is there also a possibility of putting all the
possible info in a super-set, just as we plan to do for the tlb_invalidate()
API?
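
As a rough sketch of the "model" + "model-specific data" layout being
discussed (not an agreed-upon API; the field and structure names below are
only illustrative), the generic header could be fixed and followed by a
model-specific blob sized by a length field:

struct pasid_table_info {
	__u64	ptr;		/* PASID table pointer */
	__u64	size_order;	/* bits covered by the guest PASID table */
	__u32	model;		/* e.g. INTEL_VTD, AMD_IOMMU, ARM_SMMU_V3 */
	__u32	length;		/* bytes of model_data[] that follow */
	__u8	model_data[];
};

/* Example model-specific payload for Arm SMMUv3 (illustrative only) */
struct arm_smmu_v3_pasid_table_info {
	__u32	format;			/* 1 or 2 levels, and their dimensions */
	__u32	default_substream;	/* PASID0 reserved for non-PASID traffic? */
};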

> 
> For binding page tables instead of PASID tables (e.g. virtio-iommu), the 
> generic
> data would be:
> 
> struct pgtable_info {
>   __u32   pasid;
>   __u64   ptr;
>   __u32   model;
>   __u8model_data[];
> };

Besides the bind_pasid_table API, would you also want to propose an extra
API, likely named bind_pgtable(), for this page table binding?

What would the "model" field indicate? "vendor" or "vendor+version"? You may
want a length field to indicate the size of the "model_data" field.

And, same as with the bind_pasid_table API, would model_data look like an
opaque blob?

> Followed by a few arch-specific configuration values. For Arm we can summarize
> this to three registers, defined in the Armv8 Architecture Reference Manual:
> 
> struct arm_lpae_pgtable_info {
>   __u64   tcr;/* Translation Control Register */
>   __u64   mair;   /* Memory Attributes Indirection Register */
>   __u64   asid;   /* Address Space ID */
> };

Hmmm, just curious: what is the difference between "pasid" and "asid"?

Thanks,
Yi L

> Some data packed in the TCR might be common to most architectures, like page
> granularity and max VA size. Most fields of the TCR won't be used but it 
> provides
> a nice architected way to communicate Arm page table configuration.
> 
> Note that there might be an additional page directory in the arch-specific 
> info, as
> we can split the address space in two. I'm not sure whether we should allow it
> yet.
> 
> Thanks,
> Jean



Re: bind pasid table API

2017-09-20 Thread Jacob Pan
On Wed, 20 Sep 2017 13:09:47 +0100
Jean-Philippe Brucker  wrote:

> Hi Jacob,
> 
> [Adding Eric as he might need pasid_table_info for vSVM at some point]
> 
> On 19/09/17 04:45, Jacob Pan wrote:
> > Hi Jean and All,
> > 
> > This is a follow-up on the LPC discussion we had last week.
> > (https://linuxplumbersconf.org/2017/ocw/proposals/4748)
> > 
> > My understanding is that the data structure below can satisfy the
> > needs from Intel (pointer + size) and AMD (pointer only). But ARM
> > pvIOMMU would need additional info to indicate the page table
> > format. Could you share your idea of the right addition for ARM
> > such that we can have a unified API?
> > 
> > /**
> >  * PASID table data used to bind a guest PASID table to the host IOMMU.
> >  * This will enable guest-managed first-level page tables.
> >  * @ptr:        PASID table pointer
> >  * @size_order: number of bits supported in the guest PASID table, must
> >  *              be less than or equal to the host table size.
> >  */
> > struct pasid_table_info {
> > __u64   ptr;
> > __u64   size_order;
> > };  
> 
> For the PASID table, Arm SMMUv3 would need two additional fields:
> * 'format' telling whether the table has 1 or 2 levels and their
>   dimensions,
Just wondering: could the format be different from whatever is used on
the host, i.e. could there be a mismatch? I am OK with this field, just
wondering whether this could be resolved via the sysfs query interface if
that is a global thing.
> * 'default_substream' telling if PASID0 is reserved for non-pasid
> traffic.
> 
Could it be part of a feature flag? I.e.:

struct pasid_table_info {
	__u64	ptr;
	__u64	size_order;
	__u64	flags;
#define PASID0_FOR_DMA_WITHOUT_PASID	(1ULL << 0)
	enum pasid_table_format format;
};

> I think that's it for the moment, but it does require to leave space
> for a vendor-specific structure at the end. It is one reason why I'd
> prefer having a 'model' field in the pasid_table_info structure
> telling what fields the whole structure actually contains.
> 
I think we have been there before. The downside is that model-specific
knowledge is required in the generic VFIO layer if its content is to be
inspected.
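
As an illustration of the alternative, a sketch (not existing code) of how
VFIO could forward the blob without inspecting it, assuming the generic
header carries a length for the model-specific tail and a hypothetical
iommu_bind_pasid_table() entry point:

static int vfio_bind_pasid_table(struct iommu_domain *domain,
				 void __user *uptr)
{
	struct pasid_table_info hdr, *info;
	int ret;

	if (copy_from_user(&hdr, uptr, sizeof(hdr)))
		return -EFAULT;
	if (hdr.length > SZ_4K)		/* arbitrary sanity bound */
		return -EINVAL;

	info = memdup_user(uptr, sizeof(hdr) + hdr.length);
	if (IS_ERR(info))
		return PTR_ERR(info);

	/* The IOMMU driver checks 'model' and parses model_data itself */
	ret = iommu_bind_pasid_table(domain, info);

	kfree(info);
	return ret;
}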

> Another reason is if some IOMMU is able to support multiple PASID
> table formats, it could advertise them all in sysfs and Qemu could
> tell which one it chose in 'model'. I'm not sure we'll ever see that
> in practice.
> 
I would expect the query interface between QEMU and sysfs to ensure a
matching format prior to issuing the bind_pasid_table call.
> 
> 
> For binding page tables instead of PASID tables (e.g. virtio-iommu),
> the generic data would be:
> 
> struct pgtable_info {
>   __u32   pasid;
>   __u64   ptr;
>   __u32   model;
>   __u8model_data[];
> };
> 
> Followed by a few arch-specific configuration values. For Arm we can
> summarize this to three registers, defined in the Armv8 Architecture
> Reference Manual:
> 
> struct arm_lpae_pgtable_info {
>   __u64   tcr;    /* Translation Control Register */
>   __u64   mair;   /* Memory Attributes Indirection Register */
>   __u64   asid;   /* Address Space ID */
> };
> 
> Some data packed in the TCR might be common to most architectures,
> like page granularity and max VA size. Most fields of the TCR won't
> be used but it provides a nice architected way to communicate Arm
> page table configuration.
> 
> Note that there might be an additional page directory in the
> arch-specific info, as we can split the address space in two. I'm not
> sure whether we should allow it yet.
> 
This could also be combined with binding the mm of user tasks (e.g. DPDK
with SVM in the native case), right? In that case the PASID would be
allocated by the kernel instead of passed down in pgtable_info, and your
'ptr' field would be harvested from the current task's mm? There is also
complexity w.r.t. user task life-cycle management. My immediate goal of
enabling vSVM does not require binding page tables, so I try not to think
too far ahead.

> Thanks,
> Jean

[Jacob Pan]


[GIT PULL] a dma-mapping fix for 4.14-rc2

2017-09-20 Thread Christoph Hellwig
Hi Linus,

below is a single fix for an issue that crept in with the previous pull
request.


The following changes since commit 2bd6bf03f4c1c59381d62c61d03f6cc3fe71f66e:

  Linux 4.14-rc1 (2017-09-16 15:47:51 -0700)

are available in the git repository at:

  git://git.infradead.org/users/hch/dma-mapping.git tags/dma-mapping-4.14-2

for you to fetch changes up to 6d57339890c9a9dfcd4ba4f8931b911b962968e3:

  dma-coherent: fix rmem_dma_device_init regression (2017-09-17 08:20:02 -0700)


A fix for a fix that went in this merge window from Arnd.


Arnd Bergmann (1):
  dma-coherent: fix rmem_dma_device_init regression

 drivers/base/dma-coherent.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)


Re: [PATCH v4 5/6] iommu/iova: Extend rbtree node caching

2017-09-20 Thread Robin Murphy
On 20/09/17 13:59, Tomasz Nowicki wrote:
> Hi Robin,
> 
> On 19.09.2017 18:31, Robin Murphy wrote:
>> The cached node mechanism provides a significant performance benefit for
>> allocations using a 32-bit DMA mask, but in the case of non-PCI devices
>> or where the 32-bit space is full, the loss of this benefit can be
>> significant - on large systems there can be many thousands of entries in
>> the tree, such that walking all the way down to find free space every
>> time becomes increasingly awful.
>>
>> Maintain a similar cached node for the whole IOVA space as a superset of
>> the 32-bit space so that performance can remain much more consistent.
>>
>> Inspired by work by Zhen Lei .
>>
>> Tested-by: Ard Biesheuvel 
>> Tested-by: Zhen Lei 
>> Tested-by: Nate Watterson 
>> Signed-off-by: Robin Murphy 
>> ---
>>
>> v4:
>>   - Adjust to simplified __get_cached_rbnode() behaviour
>>   - Cosmetic tweaks
>>
>>   drivers/iommu/iova.c | 43 +--
>>   include/linux/iova.h |  3 ++-
>>   2 files changed, 23 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
>> index c93a6c46bcb1..a125a5786dbf 100644
>> --- a/drivers/iommu/iova.c
>> +++ b/drivers/iommu/iova.c
>> @@ -51,6 +51,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned
>> long granule,
>>     spin_lock_init(&iovad->iova_rbtree_lock);
>>   iovad->rbroot = RB_ROOT;
>> +    iovad->cached_node = NULL;
>>   iovad->cached32_node = NULL;
>>   iovad->granule = granule;
>>   iovad->start_pfn = start_pfn;
>> @@ -119,39 +120,38 @@ __get_cached_rbnode(struct iova_domain *iovad,
>> unsigned long limit_pfn)
>>   if (limit_pfn <= iovad->dma_32bit_pfn && iovad->cached32_node)
>>   return iovad->cached32_node;
>>   +    if (iovad->cached_node)
>> +    return iovad->cached_node;
>> +
>>   return &iovad->anchor.node;
>>   }
>>     static void
>> -__cached_rbnode_insert_update(struct iova_domain *iovad,
>> -    unsigned long limit_pfn, struct iova *new)
>> +__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova
>> *new)
>>   {
>> -    if (limit_pfn != iovad->dma_32bit_pfn)
>> -    return;
>> -    iovad->cached32_node = &new->node;
>> +    if (new->pfn_hi < iovad->dma_32bit_pfn)
>> +    iovad->cached32_node = &new->node;
>> +    else
>> +    iovad->cached_node = &new->node;
>>   }
>>     static void
>>   __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova
>> *free)
>>   {
>>   struct iova *cached_iova;
>> -    struct rb_node *curr;
>> +    struct rb_node **curr;
>>   -    if (!iovad->cached32_node)
>> +    if (free->pfn_hi < iovad->dma_32bit_pfn)
>> +    curr = &iovad->cached32_node;
>> +    else
>> +    curr = &iovad->cached_node;
>> +
>> +    if (!*curr)
>>   return;
>> -    curr = iovad->cached32_node;
>> -    cached_iova = rb_entry(curr, struct iova, node);
>>   -    if (free->pfn_lo >= cached_iova->pfn_lo) {
>> -    struct rb_node *node = rb_next(&free->node);
>> -    struct iova *iova = rb_entry(node, struct iova, node);
>> -
>> -    /* only cache if it's below 32bit pfn */
>> -    if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
>> -    iovad->cached32_node = node;
>> -    else
>> -    iovad->cached32_node = NULL;
>> -    }
>> +    cached_iova = rb_entry(*curr, struct iova, node);
>> +    if (free->pfn_lo >= cached_iova->pfn_lo)
>> +    *curr = rb_next(&free->node);
> 
> IMO, we may incorrectly update iovad->cached32_node here.
> 
>                    32-bit boundary
>                          |
>   +-------+   +-------+  |  +-------+   +--------+
>   | IOVA0 |---| IOVA1 |--+--| IOVA3 |---| anchor |
>   +-------+   +-------+  |  +-------+   +--------+
>                          |
> 
> If we free the last IOVA from the 32-bit space (IOVA1) we will update
> iovad->cached32_node to IOVA3.

That in itself is intentional and correct - the cached node is where we
*start* searching for free space, so if the topmost 32-bit pfn is free,
we want to start from whichever node represents the next-highest pfn.
Even if more 64-bit IOVAs were allocated below IOVA3 before the next
32-bit allocation, walking through those is better than walking all the
way down from the anchor.

However, you have made me realise that if IOVA3 were freed before the
next 32-bit allocation, we'd only update cached_node and leave
cached32_node with a stale pointer...
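
For reference, one possible shape for avoiding the stale pointer (a sketch
only, not necessarily what v5 will do) is to update both cached nodes
independently on free, assuming both are initialised to the anchor node so
they are never NULL:

static void
__cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
{
	struct iova *cached_iova;

	cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
	if (free->pfn_hi < iovad->dma_32bit_pfn &&
	    free->pfn_lo >= cached_iova->pfn_lo)
		iovad->cached32_node = rb_next(&free->node);

	cached_iova = rb_entry(iovad->cached_node, struct iova, node);
	if (free->pfn_lo >= cached_iova->pfn_lo)
		iovad->cached_node = rb_next(&free->node);
}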

I'll get a v5 out tomorrow (and keep the new stuff at the end of the
series this time)

Thanks,
Robin.

[PATCH] iommu/iova: Improve alloc_iova performance

2017-09-20 Thread David Woods
When allocating pages with alloc_iova() where limit_pfn > dma_32bit_pfn,
__alloc_and_insert_iova_range() does a linear traversal of the tree to
find a free block.  In the worst case it makes the alloc O(n) for each
page, where n is the number of pages allocated so far.  The worst case
turns out to be common for drivers that allocate lots of pages at start
up and never free them.

There is an optimization for allocating 32-bit addresses where it
starts the search at the point where the previous allocated page was
inserted.  This is sensible, since otherwise it would have to always
search through all the 64-bit pages first.

To improve this situation, extend the optimization to also keep track
of the point where the previous 64-bit page was inserted.  With this
change, the search for an available slot can normally be done in
constant time and the whole alloc_iova only costs O(log n) due to the
rb tree insert.

Reviewed-by: Chris Metcalf 
Signed-off-by: David Woods 
---
 drivers/iommu/iova.c | 82 +++-
 include/linux/iova.h |  1 +
 2 files changed, 56 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 33edfa7..3c82694 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -108,15 +108,26 @@ int init_iova_flush_queue(struct iova_domain *iovad,
 EXPORT_SYMBOL_GPL(init_iova_flush_queue);
 
 static struct rb_node *
-__get_cached_rbnode(struct iova_domain *iovad, unsigned long *limit_pfn)
+__get_cached_rbnode(struct iova_domain *iovad, unsigned long limit_pfn)
 {
-   if ((*limit_pfn > iovad->dma_32bit_pfn) ||
-   (iovad->cached32_node == NULL))
+   if (limit_pfn <= iovad->dma_32bit_pfn)
+   return iovad->cached32_node;
+   else
+   return iovad->cached64_node;
+}
+
+static struct rb_node *
+__get_cached_rbnode_and_limit(struct iova_domain *iovad,
+ unsigned long *limit_pfn)
+{
+   struct rb_node *cached_node = __get_cached_rbnode(iovad, *limit_pfn);
+
+   if (cached_node == NULL) {
return rb_last(&iovad->rbroot);
-   else {
-   struct rb_node *prev_node = rb_prev(iovad->cached32_node);
+   } else {
+   struct rb_node *prev_node = rb_prev(cached_node);
struct iova *curr_iova =
-   rb_entry(iovad->cached32_node, struct iova, node);
+   rb_entry(cached_node, struct iova, node);
*limit_pfn = curr_iova->pfn_lo;
return prev_node;
}
@@ -124,33 +135,50 @@ __get_cached_rbnode(struct iova_domain *iovad, unsigned 
long *limit_pfn)
 
 static void
 __cached_rbnode_insert_update(struct iova_domain *iovad,
-   unsigned long limit_pfn, struct iova *new)
+   unsigned long limit_pfn, struct rb_node *node)
 {
-   if (limit_pfn != iovad->dma_32bit_pfn)
-   return;
-   iovad->cached32_node = &new->node;
+   if (limit_pfn == iovad->dma_32bit_pfn)
+   iovad->cached32_node = node;
+   else if (limit_pfn > iovad->dma_32bit_pfn)
+   iovad->cached64_node = node;
 }
 
 static void
 __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
 {
struct iova *cached_iova;
-   struct rb_node *curr;
-
-   if (!iovad->cached32_node)
-   return;
-   curr = iovad->cached32_node;
-   cached_iova = rb_entry(curr, struct iova, node);
-
-   if (free->pfn_lo >= cached_iova->pfn_lo) {
-   struct rb_node *node = rb_next(&free->node);
-   struct iova *iova = rb_entry(node, struct iova, node);
+   struct rb_node *cached_node;
 
-   /* only cache if it's below 32bit pfn */
-   if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
-   iovad->cached32_node = node;
-   else
-   iovad->cached32_node = NULL;
+   if (free->pfn_lo <= iovad->dma_32bit_pfn) {
+   if (unlikely(iovad->cached64_node == &free->node))
+   iovad->cached64_node = NULL;
+   cached_node = iovad->cached32_node;
+   if (!cached_node)
+   return;
+   cached_iova = rb_entry(cached_node, struct iova, node);
+   if (free->pfn_lo >= cached_iova->pfn_lo) {
+   struct rb_node *next_node = rb_next(&free->node);
+
+   if (next_node &&
+   rb_entry(next_node, struct iova, node)->pfn_lo <=
+   iovad->dma_32bit_pfn)
+   iovad->cached32_node = next_node;
+   else
+   iovad->cached32_node = NULL;
+   }
+   } else {
+   cached_node = iovad->cached64_node;
+   if (!cached_node)
+   return;
+   cached_iova = container_of(cached_node, struct iova, nod

Re: [PATCH] pci: Add dummy for pci_acs_enabled() if CONFIG_PCI=n to fix iommmu build

2017-09-20 Thread Alex Williamson
On Mon, 11 Sep 2017 14:29:15 +0200
Geert Uytterhoeven  wrote:

> If CONFIG_PCI=n, and gcc (e.g. 4.1.2) decides not to inline
> get_pci_function_alias_group(), the build fails with:
> 
> drivers/iommu/iommu.o: In function `get_pci_function_alias_group':
> iommu.c:(.text+0xfdc): undefined reference to `pci_acs_enabled'
> 
> Due to the various dummies for PCI calls in the CONFIG_PCI=n case,
> pci_acs_enabled() isn't actually ever called, but not all versions of
> gcc are smart enough to realize that.
> 
> While explicitly marking get_pci_function_alias_group() inline would fix
> the build, this would inflate the code for the CONFIG_PCI=y case, as
> get_pci_function_alias_group() is a not-so-small function called from
> two places.
> 
> Hence fix the issue by introducing a dummy for pci_acs_enabled()
> instead.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
>  include/linux/pci.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index f68c58a93dd045b9..f4f8ee5a7362e982 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1685,6 +1685,8 @@ static inline int pci_get_new_domain_nr(void) { return 
> -ENOSYS; }
>  
>  #define dev_is_pci(d) (false)
>  #define dev_is_pf(d) (false)
> +static inline bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags)
> +{ return false; }
>  #endif /* CONFIG_PCI */
>  
>  /* Include architecture-dependent settings and functions */

Reviewed-by: Alex Williamson 


[PATCH v2] iommu/ipmmu-vmsa: Do not replace bus IOMMU ops on driver init.

2017-09-20 Thread Liviu Dudau
If the IPMMU driver is compiled into the kernel it will replace the
platform bus IOMMU ops when running the ipmmu_init() function, regardless
of whether any IPMMU hardware is present or not. This breaks systems
that just want to build a generic kernel that runs on multiple platforms
and uses a different IOMMU implementation.

Move the bus_set_iommu() call to the end of the ipmmu_probe() function,
when we know that hardware is present. With the current IOMMU framework
it should be safe (at least for the OF case).

Now that the ipmmu_init() and ipmmu_exit() functions are simple calls to
platform_driver_register() and platform_driver_unregister(), replace
them with the module_platform_driver() macro call.

Signed-off-by: Liviu Dudau 
Cc: Laurent Pinchart 
---
 drivers/iommu/ipmmu-vmsa.c | 29 +
 1 file changed, 5 insertions(+), 24 deletions(-)

diff --git a/drivers/iommu/ipmmu-vmsa.c b/drivers/iommu/ipmmu-vmsa.c
index 195d6e93ac718..31912997bffdf 100644
--- a/drivers/iommu/ipmmu-vmsa.c
+++ b/drivers/iommu/ipmmu-vmsa.c
@@ -966,10 +966,11 @@ static int ipmmu_probe(struct platform_device *pdev)
return ret;
 
/*
-* We can't create the ARM mapping here as it requires the bus to have
-* an IOMMU, which only happens when bus_set_iommu() is called in
-* ipmmu_init() after the probe function returns.
+* Now that we have validated the presence of the hardware, set
+* the bus IOMMU ops to enable future domain and device setup.
 */
+   if (!iommu_present(&platform_bus_type))
+   bus_set_iommu(&platform_bus_type, &ipmmu_ops);
 
platform_set_drvdata(pdev, mmu);
 
@@ -1006,27 +1007,7 @@ static struct platform_driver ipmmu_driver = {
.remove = ipmmu_remove,
 };
 
-static int __init ipmmu_init(void)
-{
-   int ret;
-
-   ret = platform_driver_register(&ipmmu_driver);
-   if (ret < 0)
-   return ret;
-
-   if (!iommu_present(&platform_bus_type))
-   bus_set_iommu(&platform_bus_type, &ipmmu_ops);
-
-   return 0;
-}
-
-static void __exit ipmmu_exit(void)
-{
-   return platform_driver_unregister(&ipmmu_driver);
-}
-
-subsys_initcall(ipmmu_init);
-module_exit(ipmmu_exit);
+module_platform_driver(ipmmu_driver);
 
 MODULE_DESCRIPTION("IOMMU API for Renesas VMSA-compatible IPMMU");
 MODULE_AUTHOR("Laurent Pinchart ");
-- 
2.14.1



Re: [PATCH v4 5/6] iommu/iova: Extend rbtree node caching

2017-09-20 Thread Tomasz Nowicki

Hi Robin,

On 19.09.2017 18:31, Robin Murphy wrote:

The cached node mechanism provides a significant performance benefit for
allocations using a 32-bit DMA mask, but in the case of non-PCI devices
or where the 32-bit space is full, the loss of this benefit can be
significant - on large systems there can be many thousands of entries in
the tree, such that walking all the way down to find free space every
time becomes increasingly awful.

Maintain a similar cached node for the whole IOVA space as a superset of
the 32-bit space so that performance can remain much more consistent.

Inspired by work by Zhen Lei .

Tested-by: Ard Biesheuvel 
Tested-by: Zhen Lei 
Tested-by: Nate Watterson 
Signed-off-by: Robin Murphy 
---

v4:
  - Adjust to simplified __get_cached_rbnode() behaviour
  - Cosmetic tweaks

  drivers/iommu/iova.c | 43 +--
  include/linux/iova.h |  3 ++-
  2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c93a6c46bcb1..a125a5786dbf 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -51,6 +51,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long 
granule,
  
  	spin_lock_init(&iovad->iova_rbtree_lock);

iovad->rbroot = RB_ROOT;
+   iovad->cached_node = NULL;
iovad->cached32_node = NULL;
iovad->granule = granule;
iovad->start_pfn = start_pfn;
@@ -119,39 +120,38 @@ __get_cached_rbnode(struct iova_domain *iovad, unsigned 
long limit_pfn)
if (limit_pfn <= iovad->dma_32bit_pfn && iovad->cached32_node)
return iovad->cached32_node;
  
+	if (iovad->cached_node)

+   return iovad->cached_node;
+
return &iovad->anchor.node;
  }
  
  static void

-__cached_rbnode_insert_update(struct iova_domain *iovad,
-   unsigned long limit_pfn, struct iova *new)
+__cached_rbnode_insert_update(struct iova_domain *iovad, struct iova *new)
  {
-   if (limit_pfn != iovad->dma_32bit_pfn)
-   return;
-   iovad->cached32_node = &new->node;
+   if (new->pfn_hi < iovad->dma_32bit_pfn)
+   iovad->cached32_node = &new->node;
+   else
+   iovad->cached_node = &new->node;
  }
  
  static void

  __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free)
  {
struct iova *cached_iova;
-   struct rb_node *curr;
+   struct rb_node **curr;
  
-	if (!iovad->cached32_node)

+   if (free->pfn_hi < iovad->dma_32bit_pfn)
+   curr = &iovad->cached32_node;
+   else
+   curr = &iovad->cached_node;
+
+   if (!*curr)
return;
-   curr = iovad->cached32_node;
-   cached_iova = rb_entry(curr, struct iova, node);
  
-	if (free->pfn_lo >= cached_iova->pfn_lo) {

-   struct rb_node *node = rb_next(&free->node);
-   struct iova *iova = rb_entry(node, struct iova, node);
-
-   /* only cache if it's below 32bit pfn */
-   if (node && iova->pfn_lo < iovad->dma_32bit_pfn)
-   iovad->cached32_node = node;
-   else
-   iovad->cached32_node = NULL;
-   }
+   cached_iova = rb_entry(*curr, struct iova, node);
+   if (free->pfn_lo >= cached_iova->pfn_lo)
+   *curr = rb_next(&free->node);


IMO, we may incorrectly update iovad->cached32_node here.

                   32-bit boundary
                         |
  +-------+   +-------+  |  +-------+   +--------+
  | IOVA0 |---| IOVA1 |--+--| IOVA3 |---| anchor |
  +-------+   +-------+  |  +-------+   +--------+
                         |

If we free the last IOVA from the 32-bit space (IOVA1) we will update
iovad->cached32_node to IOVA3.


Thanks,
Tomasz


Re: [PATCH] pci: Add dummy for pci_acs_enabled() if CONFIG_PCI=n to fix iommmu build

2017-09-20 Thread Joerg Roedel
Hey Bjorn,

On Mon, Sep 11, 2017 at 02:29:15PM +0200, Geert Uytterhoeven wrote:
> If CONFIG_PCI=n, and gcc (e.g. 4.1.2) decides not to inline
> get_pci_function_alias_group(), the build fails with:
> 
> drivers/iommu/iommu.o: In function `get_pci_function_alias_group':
> iommu.c:(.text+0xfdc): undefined reference to `pci_acs_enabled'
> 
> Due to the various dummies for PCI calls in the CONFIG_PCI=n case,
> pci_acs_enabled() isn't actually ever called, but not all versions of
> gcc are smart enough to realize that.
> 
> While explicitly marking get_pci_function_alias_group() inline would fix
> the build, this would inflate the code for the CONFIG_PCI=y case, as
> get_pci_function_alias_group() is a not-so-small function called from
> two places.
> 
> Hence fix the issue by introducing a dummy for pci_acs_enabled()
> instead.
> 
> Signed-off-by: Geert Uytterhoeven 
> ---
>  include/linux/pci.h | 2 ++
>  1 file changed, 2 insertions(+)

I am sending iommu-fixes upstream this week, and this would fit into it.
Could you please have a look and ack it if you are okay with that?

Thanks,

Joerg

> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index f68c58a93dd045b9..f4f8ee5a7362e982 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1685,6 +1685,8 @@ static inline int pci_get_new_domain_nr(void) { return 
> -ENOSYS; }
>  
>  #define dev_is_pci(d) (false)
>  #define dev_is_pf(d) (false)
> +static inline bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags)
> +{ return false; }
>  #endif /* CONFIG_PCI */
>  
>  /* Include architecture-dependent settings and functions */
> -- 
> 2.7.4


Re: bind pasid table API

2017-09-20 Thread Jean-Philippe Brucker
Hi Jacob,

[Adding Eric as he might need pasid_table_info for vSVM at some point]

On 19/09/17 04:45, Jacob Pan wrote:
> Hi Jean and All,
> 
> This is a follow-up on the LPC discussion we had last week.
> (https://linuxplumbersconf.org/2017/ocw/proposals/4748)
> 
> My understanding is that the data structure below can satisfy the
> needs from Intel (pointer + size) and AMD (pointer only). But ARM
> pvIOMMU would need additional info to indicate the page table format.
> Could you share your idea of the right addition for ARM such that we
> can have a unified API?
> 
> /**
>  * PASID table data used to bind a guest PASID table to the host IOMMU.
>  * This will enable guest-managed first-level page tables.
>  * @ptr:        PASID table pointer
>  * @size_order: number of bits supported in the guest PASID table, must
>  *              be less than or equal to the host table size.
>  */
> struct pasid_table_info {
>   __u64   ptr;
>   __u64   size_order;
> };

For the PASID table, Arm SMMUv3 would need two additional fields:
* 'format' telling whether the table has 1 or 2 levels and their
  dimensions,
* 'default_substream' telling if PASID0 is reserved for non-pasid traffic.

I think that's it for the moment, but it does require to leave space
for a vendor-specific structure at the end. It is one reason why I'd
prefer having a 'model' field in the pasid_table_info structure telling
what fields the whole structure actually contains.

Another reason is if some IOMMU is able to support multiple PASID table
formats, it could advertise them all in sysfs and Qemu could tell which
one it chose in 'model'. I'm not sure we'll ever see that in practice.



For binding page tables instead of PASID tables (e.g. virtio-iommu), the
generic data would be:

struct pgtable_info {
__u32   pasid;
__u64   ptr;
__u32   model;
__u8model_data[];
};

Followed by a few arch-specific configuration values. For Arm we can
summarize this to three registers, defined in the Armv8 Architecture
Reference Manual:

struct arm_lpae_pgtable_info {
__u64   tcr;/* Translation Control Register */
__u64   mair;   /* Memory Attributes Indirection Register */
__u64   asid;   /* Address Space ID */
};

Some data packed in the TCR might be common to most architectures, like
page granularity and max VA size. Most fields of the TCR won't be used but
it provides a nice architected way to communicate Arm page table
configuration.
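
For illustration (a sketch only, nothing here is proposed API), the
"common" parts could be pulled back out of an Armv8 TCR value like this,
using the TCR_EL1 layout for the TTBR0 half of the address space:

static void decode_arm_tcr(u64 tcr)
{
	unsigned int va_bits = 64 - (tcr & 0x3f);	/* T0SZ, bits [5:0] */
	unsigned int tg0 = (tcr >> 14) & 0x3;		/* TG0, bits [15:14] */
	unsigned int granule_kb;

	switch (tg0) {
	case 0: granule_kb = 4;  break;
	case 1: granule_kb = 64; break;
	case 2: granule_kb = 16; break;
	default: granule_kb = 0; /* reserved encoding */
	}

	pr_info("input address size %u bits, %uKB granule\n",
		va_bits, granule_kb);
}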

Note that there might be an additional page directory in the arch-specific
info, as we can split the address space in two. I'm not sure whether we
should allow it yet.

Thanks,
Jean


Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-20 Thread Harsh Jain


On 20-09-2017 13:31, Herbert Xu wrote:
> Harsh Jain  wrote:
>> While debugging a DMA mapping error in the chelsio crypto driver, we
>> observed that when the scatter/gather list received by the driver has
>> some entry with page->offset > 4096 (PAGE_SIZE), it starts giving DMA
>> errors.  Without the IOMMU it works fine.
> This is not a bug.  The network stack can and will feed us such
> SG lists.
>
>> 2) It cannot be the driver's responsibility to update the received sg
>> entries to adjust offset and page, because we are not the only ones who
>> directly use the received sg list.
> No the driver must deal with this.  Having said that, if we can
> improve our driver helper interface to make this easier then we
> should do that too.  What we certainly shouldn't do is to take a
> whack-a-mole approach like this patch does.
Agreed, I added that patch for understanding purposes only. Today I referred
to other crypto drivers for DMA-related code. Most of them use dma_map_sg()
except QAT. In QAT, they first update the page address using the offset,
then map each page in a for loop with dma_map_single(). I will try the same
in the chelsio driver and see the behaviour.
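
A minimal sketch of that approach (hypothetical helper, not chelsio or QAT
code): fold the whole pages contained in sg->offset into the page pointer
first, then map the remaining intra-page offset:

static dma_addr_t map_sg_entry_fixup(struct device *dev,
				     struct scatterlist *sg,
				     enum dma_data_direction dir)
{
	/* Advance to the page the large offset actually lands in */
	struct page *page = nth_page(sg_page(sg), sg->offset >> PAGE_SHIFT);
	unsigned int offset = sg->offset & ~PAGE_MASK;

	return dma_map_page(dev, page, offset, sg->length, dir);
}
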
>
> Cheers,


Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-20 Thread Harsh Jain


On 20-09-2017 15:42, Robin Murphy wrote:
> On 20/09/17 09:01, Herbert Xu wrote:
>> Harsh Jain  wrote:
>>> While debugging a DMA mapping error in the chelsio crypto driver, we
>>> observed that when the scatter/gather list received by the driver has
>>> some entry with page->offset > 4096 (PAGE_SIZE), it starts giving DMA
>>> errors.  Without the IOMMU it works fine.
>> This is not a bug.  The network stack can and will feed us such
>> SG lists.
>>
>>> 2) It cannot be the driver's responsibility to update the received sg
>>> entries to adjust offset and page, because we are not the only ones who
>>> directly use the received sg list.
>> No the driver must deal with this.  Having said that, if we can
>> improve our driver helper interface to make this easier then we
>> should do that too.  What we certainly shouldn't do is to take a
>> whack-a-mole approach like this patch does.
> AFAICS this is entirely on intel-iommu - from a brief look it appears
> that all the IOVA calculations would handle the offset correctly, but
> then __domain_mapping() blindly uses sg_page() for the physical address,
> so if offset is larger than a page it would end up with the DMA mapping
> covering the wrong part of the buffer.
>
> Does the diff below help?
>
> Robin.
>
> ->8-
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index b3914fce8254..2ed43d928135 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -2253,7 +2253,7 @@ static int __domain_mapping(struct dmar_domain *domain, 
> unsigned long iov_pfn,
>   sg_res = aligned_nrpages(sg->offset, sg->length);
>   sg->dma_address = ((dma_addr_t)iov_pfn << 
> VTD_PAGE_SHIFT) + sg->offset;
>   sg->dma_length = sg->length;
> - pteval = page_to_phys(sg_page(sg)) | prot;
> + pteval = (sg_phys(sg) & PAGE_MASK) | prot;
>   phys_pfn = pteval >> VTD_PAGE_SHIFT;
>   }
Robin,
I'm still getting the following error with the above change.

[  429.645492] DMAR: DRHD: handling fault status reg 2
[  429.650847] DMAR: [DMA Write] Request device [02:00.4] fault addr f2682000 [t


>  


[PATCH v4 7/6] iommu/iova: Make cached_node always valid

2017-09-20 Thread Robin Murphy
With the anchor node at the top of the rbtree, there is always a valid
node for rb_next() to return, such that cached_node is only ever NULL
until the first allocation. Initialising it to point at the anchor node
gets rid of that window and makes the NULL checking entirely redundant.

Signed-off-by: Robin Murphy 
---

Oops, spotted this one slightly too late. This could be squashed into
patch #5 (which I'll do myself if there's any cause to resend the
whole series again).

Robin.

 drivers/iommu/iova.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index a7af8273fa98..ec443c0a8319 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -51,7 +51,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long 
granule,
 
spin_lock_init(&iovad->iova_rbtree_lock);
iovad->rbroot = RB_ROOT;
-   iovad->cached_node = NULL;
+   iovad->cached_node = &iovad->anchor.node;
iovad->cached32_node = NULL;
iovad->granule = granule;
iovad->start_pfn = start_pfn;
@@ -120,10 +120,7 @@ __get_cached_rbnode(struct iova_domain *iovad, unsigned 
long limit_pfn)
if (limit_pfn <= iovad->dma_32bit_pfn && iovad->cached32_node)
return iovad->cached32_node;
 
-   if (iovad->cached_node)
-   return iovad->cached_node;
-
-   return &iovad->anchor.node;
+   return iovad->cached_node;
 }
 
 static void
@@ -141,14 +138,11 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, 
struct iova *free)
struct iova *cached_iova;
struct rb_node **curr;
 
-   if (free->pfn_hi < iovad->dma_32bit_pfn)
+   if (free->pfn_hi < iovad->dma_32bit_pfn && iovad->cached32_node)
curr = &iovad->cached32_node;
else
curr = &iovad->cached_node;
 
-   if (!*curr)
-   return;
-
cached_iova = rb_entry(*curr, struct iova, node);
if (free->pfn_lo >= cached_iova->pfn_lo)
*curr = rb_next(&free->node);
-- 
2.13.4.dirty



Re: [PATCH] iommu/arm-smmu-v3: Avoid ILLEGAL setting of STE.S1STALLD

2017-09-20 Thread Yisheng Xie
Hi Jean-Philippe,

Thanks for reviewing.
On 2017/9/19 19:43, Jean-Philippe Brucker wrote:
> On 14/09/17 06:08, Yisheng Xie wrote:
>> According to the spec, it is ILLEGAL to set STE.S1STALLD if STALL_MODEL
>> is not 0b00, which means we should not disable stall mode if stall
>> or terminate mode is not configurable.
>>
>> As per Jean-Philippe's suggestion, this patch introduces a feature bit
>> ARM_SMMU_FEAT_STALL_FORCE, which means the SMMU only supports forced
>> stalling. Therefore, we can avoid the ILLEGAL setting of STE.S1STALLD by
>> checking ARM_SMMU_FEAT_STALL_FORCE.
>>
>> This patch keeps ARM_SMMU_FEAT_STALLS meaning that stalling is supported
>> (forced or configurable), to make it easy to extend functionality in the
>> future, i.e. we can just use ARM_SMMU_FEAT_STALLS to check whether we
>> should register a fault handler or enable the master's can_stall, etc.,
>> to support platform SVM.
>>
>> After applying this patch, the feature bits and S1STALLD setting will be:
>> STALL_MODEL  FEATURE                                              S1STALLD
>> 0b00         ARM_SMMU_FEAT_STALLS                                 0b1
>> 0b01         !ARM_SMMU_FEAT_STALLS && !ARM_SMMU_FEAT_STALL_FORCE  0b0
>> 0b10         ARM_SMMU_FEAT_STALLS && ARM_SMMU_FEAT_STALL_FORCE    0b0
> 
> Thanks for the patch. Since it's the same problem, could you also fix the
> context descriptor value? The spec says, in 5.5 Fault configuration:
> 
> "A CD (Stage 1 translation enabled) is considered ILLEGAL if one of the
> following applies:
> * SMMU_(S_)IDR0.STALL_MODEL == 0b10 and CD.S == 0."
Sure, I will fix this in the next version.

> 
> So I think we should always set CD.S if the SMMU has STALL_FORCE.
> 
> As Will pointed out, more work is needed for STALL_FORCE. We can't enable
> translation at all for devices that don't support stalling (e.g. PCI). We
> should force them into bypass or abort mode depending on the config. Maybe
> we can fix that later, after the devicetree property is added.
That is fair.

> 
>> Signed-off-by: Yisheng Xie 
>> ---
>>  drivers/iommu/arm-smmu-v3.c | 11 +++
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>> index e67ba6c..d2a3627 100644
>> --- a/drivers/iommu/arm-smmu-v3.c
>> +++ b/drivers/iommu/arm-smmu-v3.c
>> @@ -603,7 +603,8 @@ struct arm_smmu_device {
>>  #define ARM_SMMU_FEAT_TRANS_S1  (1 << 9)
>>  #define ARM_SMMU_FEAT_TRANS_S2  (1 << 10)
>>  #define ARM_SMMU_FEAT_STALLS(1 << 11)
>> -#define ARM_SMMU_FEAT_HYP   (1 << 12)
>> +#define ARM_SMMU_FEAT_STALL_FORCE   (1 << 12)
>> +#define ARM_SMMU_FEAT_HYP   (1 << 13)
> 
> We probably should keep the feature bits backward compatible and only add
> new ones at the end. It's not ABI, but it's printed at boot time and I
> sometimes use them when inspecting the kernel output to see what an SMMU
> supports.
Taken; I will also fix this in the next version.

Thanks
Yisheng Xie
> 
> Thanks,
> Jean
> 
>>  u32 features;
>>  
>>  #define ARM_SMMU_OPT_SKIP_PREFETCH  (1 << 0)
>> @@ -1112,7 +1113,8 @@ static void arm_smmu_write_strtab_ent(struct 
>> arm_smmu_device *smmu, u32 sid,
>>  #endif
>>   STRTAB_STE_1_STRW_NSEL1 << STRTAB_STE_1_STRW_SHIFT);
>>  
>> -if (smmu->features & ARM_SMMU_FEAT_STALLS)
>> +if (smmu->features & ARM_SMMU_FEAT_STALLS &&
>> +   !(smmu->features & ARM_SMMU_FEAT_STALL_FORCE))
>>  dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
>>  
>>  val |= (ste->s1_cfg->cdptr_dma & STRTAB_STE_0_S1CTXPTR_MASK
>> @@ -2536,9 +2538,10 @@ static int arm_smmu_device_hw_probe(struct 
>> arm_smmu_device *smmu)
>>   coherent ? "true" : "false");
>>  
>>  switch (reg & IDR0_STALL_MODEL_MASK << IDR0_STALL_MODEL_SHIFT) {
>> -case IDR0_STALL_MODEL_STALL:
>> -/* Fallthrough */
>>  case IDR0_STALL_MODEL_FORCE:
>> +smmu->features |= ARM_SMMU_FEAT_STALL_FORCE;
>> +/* Fallthrough */
>> +case IDR0_STALL_MODEL_STALL:
>>  smmu->features |= ARM_SMMU_FEAT_STALLS;
>>  }
>>  
>>
> 
> 
> .
> 



Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-20 Thread Robin Murphy
On 20/09/17 09:01, Herbert Xu wrote:
> Harsh Jain  wrote:
>>
>> While debugging a DMA mapping error in the chelsio crypto driver, we
>> observed that when the scatter/gather list received by the driver has
>> some entry with page->offset > 4096 (PAGE_SIZE), it starts giving DMA
>> errors.  Without the IOMMU it works fine.
> 
> This is not a bug.  The network stack can and will feed us such
> SG lists.
> 
>> 2) It cannot be the driver's responsibility to update the received sg
>> entries to adjust offset and page, because we are not the only ones who
>> directly use the received sg list.
> 
> No the driver must deal with this.  Having said that, if we can
> improve our driver helper interface to make this easier then we
> should do that too.  What we certainly shouldn't do is to take a
> whack-a-mole approach like this patch does.

AFAICS this is entirely on intel-iommu - from a brief look it appears
that all the IOVA calculations would handle the offset correctly, but
then __domain_mapping() blindly uses sg_page() for the physical address,
so if offset is larger than a page it would end up with the DMA mapping
covering the wrong part of the buffer.

Does the diff below help?

Robin.

->8-
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index b3914fce8254..2ed43d928135 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2253,7 +2253,7 @@ static int __domain_mapping(struct dmar_domain *domain, 
unsigned long iov_pfn,
sg_res = aligned_nrpages(sg->offset, sg->length);
sg->dma_address = ((dma_addr_t)iov_pfn << 
VTD_PAGE_SHIFT) + sg->offset;
sg->dma_length = sg->length;
-   pteval = page_to_phys(sg_page(sg)) | prot;
+   pteval = (sg_phys(sg) & PAGE_MASK) | prot;
phys_pfn = pteval >> VTD_PAGE_SHIFT;
}
 

Re: [RFC] virtio-iommu version 0.4

2017-09-20 Thread Auger Eric
Hi Jean,
On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
> Hi Eric,
> 
> On 12/09/17 18:13, Auger Eric wrote:
>> 2.6.7
>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>> At the moment struct virtio_iommu_req_probe flags is missing in your
>> header. As such I understood the ACK protocol was not implemented by the
>> driver in your branch.
> 
> Uh indeed. And yet I could swear I've written that code... somewhere. I
> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
> 
>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>> header too.
> 
> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
> (though it is a mouthful).
> 
>> 2.6.8.2:
>> - I am really confused about what the device should report as resv
>> regions depending on the PE nature (VFIO or not VFIO)
>>
>> In other iommu drivers, the resv regions are populated by the iommu
>> driver through its get_resv_regions callback. They are usually composed
>> of an IOMMU-specific MSI region (mapped or bypassed) and
>> non-IOMMU-specific (device-specific) reserved regions:
>> iommu_dma_get_resv_regions(). In the case of the virtio-iommu driver,
>> those are the guest reserved regions.
>>
>> First, in the current virtio-iommu driver I don't see the
>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>> driver should compute the non-IOMMU-specific MSI regions, i.e. this is
>> not the responsibility of the virtio-iommu device?
> 
> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
> mapping the MSI doorbell. But the driver has to know whether the doorbell
> region is translated or bypassed.
Sorry, I was talking about *non*-IOMMU-specific MSI regions, typically
the regions corresponding to guest PCI host bridge windows. This is
normally computed in the iommu driver, and I didn't see that in the
existing virtio-iommu driver.
> 
>> Then why is it more the job of the device to return the guest iommu
>> specific region rather than the driver itself?
> 
> The MSI region is architectural on x86 IOMMUs, but
> implementation-defined on virtio-iommu. It depends which platform the host
> is emulating. In Linux, x86 IOMMU drivers register the bypass region
> because there always is an IOAPIC on the other end, with a fixed MSI
> address. But virtio-iommu may be either behind a GIC, an APIC or some
> other IRQ chip.
> 
> The driver *could* go over all the irqchips/platforms it knows and try to
> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
> them, but it would look horrible. I much prefer having a well-defined way
> of doing this, so a description from the device.

This means I must have target specific code in the virtio-iommu device
which is unusual, right? I was initially thinking you could handle that
on the driver side using a config set for ARM|ARM64. But on the other
hand you should communicate the info to the device ...

> 
>> Then I understand it is the responsibility of the virtio-iommu device
>> to gather information about the host resv regions in the case of a VFIO
>> EP. Typically the host PCIe host bridge windows cannot be used for IOVA.
>> Also the host MSI reserved IOVA window cannot be used. Do you agree?
> 
> Yes, all regions reported in sysfs reserved_regions in the host would be
> reported as RESV_T_RESERVED by virtio-iommu.
So to summarize: if the probe request is sent to an emulated device, we
should return the target-specific MSI window. We can't, and don't, return
the non-IOMMU-specific guest reserved windows.

For a VFIO device, we would return all reserved regions of the group the
device belongs to. Is that correct?
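
For the VFIO case, a rough userspace sketch (not QEMU code) of how the
device model could collect the host reserved regions and re-export them as
RESV_T_RESERVED properties, assuming the "<start> <end> <type>" per-line
format of /sys/kernel/iommu_groups/<N>/reserved_regions:

#include <stdio.h>
#include <inttypes.h>

static int read_host_resv_regions(const char *group_path)
{
	char path[256], type[32];
	uint64_t start, end;
	FILE *f;

	snprintf(path, sizeof(path), "%s/reserved_regions", group_path);
	f = fopen(path, "r");
	if (!f)
		return -1;

	/* Each region would become one RESV_T_RESERVED probe property */
	while (fscanf(f, "%" SCNx64 " %" SCNx64 " %31s", &start, &end, type) == 3)
		printf("RESV_T_RESERVED: 0x%" PRIx64 "-0x%" PRIx64 " (%s)\n",
		       start, end, type);

	fclose(f);
	return 0;
}
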
> 
>> I really think the spec should clarify what exact resv regions the
>> device should return in case of VFIO device and non VFIO device.
> 
> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
> example in Implementation Notes. Do you think the MSI examples at the end
> need improvement as well? I can try to explain that RESV_MSI regions in
> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
> regions from the host.

I think I would expect an explanation detailing returned reserved
regions for pure emulated devices and HW/VFIO devices.

Another unrelated remark:
- you should add a permission violation error.
- wrt the probe request ACK protocol, this looks pretty heavy, as both
the driver and the device need to parse the req_probe buffer. The device
needs to fill in the output buffer and then read the same info back from
the input buffer. Couldn't we imagine something simpler?

Thanks

Eric
> 
> Thanks,
> Jean
> 


[PATCH V2 0/1] Optimise IOVA allocations for PCI devices

2017-09-20 Thread Tomasz Nowicki
Here is my test setup where I have started performance measurements.

 ------------   PCIe    -------------     TX     -------------   PCIe    -----
| ThunderX2  |---------| Intel XL710 |---------->| Intel XL710 |---------| X86 |
| (128 cpus) |         |    40GbE    |           |    40GbE    |         |     |
 ------------           -------------             -------------           -----

As the reference, let's take a v4.13 host, SMMUv3 off, and 1-thread iperf
tasksetted to one CPU. The performance results I got:

SMMU off -> 100%
SMMU on -> 0.02%

I followed the DMA mapping path down and found the IOVA 32-bit space
full, so the kernel was flushing the rcaches for all CPUs in (1).
For 128 CPUs, this kills performance. Furthermore, in my case the rcaches
mostly contained PFNs above 32 bits, so the second round of IOVA allocation
failed as well. As a consequence, the IOVA had to be allocated outside of
the 32-bit space (2) from scratch, since all rcaches had been flushed in (1).

if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
(1)-->  iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift);

if (!iova)
(2)-->  iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift);

My fix simply introduces a parameter for alloc_iova_fast() to decide whether
the rcache flush has to be done or not. All users follow the scenario
mentioned above, so they should leave the flush as the last resort, to avoid
the time-costly iteration over all CPUs.

This brings my iperf performance back to 100% with the SMMU on.

My misgiving about this solution is that machines with relatively
small numbers of CPUs may get DAC addresses more frequently for PCI
devices. Please let me know your thoughts.

Changelog:

v1 --> v2
- add missing documentation
- fix typo

Tomasz Nowicki (1):
  iommu/iova: Make rcache flush optional on IOVA allocation failure

 drivers/iommu/amd_iommu.c   |  5 +++--
 drivers/iommu/dma-iommu.c   |  6 --
 drivers/iommu/intel-iommu.c |  5 +++--
 drivers/iommu/iova.c| 11 ++-
 include/linux/iova.h|  5 +++--
 5 files changed, 19 insertions(+), 13 deletions(-)

-- 
2.7.4



[PATCH V2 1/1] iommu/iova: Make rcache flush optional on IOVA allocation failure

2017-09-20 Thread Tomasz Nowicki
Since IOVA allocation failure is not an unusual case, we need to flush
the CPUs' rcaches in the hope that we will succeed in the next round.

However, it is useful to decide whether we need the rcache flush step, for
two reasons:
- Scalability. On a large system with ~100 CPUs, iterating over and flushing
  the rcache for each CPU becomes a serious bottleneck, so we may want to
  defer it.
- free_cpu_cached_iovas() does not care about the max PFN we are interested
  in. Thus we may flush our rcaches and still get no new IOVA, as in the
  commonly used scenario:

if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift);

if (!iova)
iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift);

   1. First alloc_iova_fast() call is limited to DMA_BIT_MASK(32) to get
  PCI devices a SAC address
   2. alloc_iova() fails due to full 32-bit space
   3. rcaches contain PFNs out of 32-bit space so free_cpu_cached_iovas()
  throws entries away for nothing and alloc_iova() fails again
   4. The next alloc_iova_fast() call cannot take advantage of the rcache
  since we have just defeated the caches. In this case we pick the
  slowest option to proceed.

This patch reworks the flushed_rcache local flag to be an additional
function argument instead, controlling the rcache flush step. It also
updates all users to do the flush only as a last resort.

Signed-off-by: Tomasz Nowicki 
Reviewed-by: Robin Murphy 
Tested-by: Nate Watterson 
---
 drivers/iommu/amd_iommu.c   |  5 +++--
 drivers/iommu/dma-iommu.c   |  6 --
 drivers/iommu/intel-iommu.c |  5 +++--
 drivers/iommu/iova.c| 11 ++-
 include/linux/iova.h|  5 +++--
 5 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 8d2ec60..ce68986 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1604,10 +1604,11 @@ static unsigned long dma_ops_alloc_iova(struct device 
*dev,
 
if (dma_mask > DMA_BIT_MASK(32))
pfn = alloc_iova_fast(&dma_dom->iovad, pages,
- IOVA_PFN(DMA_BIT_MASK(32)));
+ IOVA_PFN(DMA_BIT_MASK(32)), false);
 
if (!pfn)
-   pfn = alloc_iova_fast(&dma_dom->iovad, pages, 
IOVA_PFN(dma_mask));
+   pfn = alloc_iova_fast(&dma_dom->iovad, pages,
+ IOVA_PFN(dma_mask), true);
 
return (pfn << PAGE_SHIFT);
 }
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 191be9c..25914d3 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -370,10 +370,12 @@ static dma_addr_t iommu_dma_alloc_iova(struct 
iommu_domain *domain,
 
/* Try to get PCI devices a SAC address */
if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
-   iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> 
shift);
+   iova = alloc_iova_fast(iovad, iova_len,
+  DMA_BIT_MASK(32) >> shift, false);
 
if (!iova)
-   iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift);
+   iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
+  true);
 
return (dma_addr_t)iova << shift;
 }
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 05c0c3a..75c8320 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3460,11 +3460,12 @@ static unsigned long intel_alloc_iova(struct device 
*dev,
 * from higher range
 */
iova_pfn = alloc_iova_fast(&domain->iovad, nrpages,
-  IOVA_PFN(DMA_BIT_MASK(32)));
+  IOVA_PFN(DMA_BIT_MASK(32)), false);
if (iova_pfn)
return iova_pfn;
}
-   iova_pfn = alloc_iova_fast(&domain->iovad, nrpages, IOVA_PFN(dma_mask));
+   iova_pfn = alloc_iova_fast(&domain->iovad, nrpages,
+  IOVA_PFN(dma_mask), true);
if (unlikely(!iova_pfn)) {
pr_err("Allocating %ld-page iova for %s failed",
   nrpages, dev_name(dev));
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index f88acad..e8c140c 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -353,14 +353,15 @@ EXPORT_SYMBOL_GPL(free_iova);
  * @iovad: - iova domain in question
  * @size: - size of page frames to allocate
  * @limit_pfn: - max limit address
+ * @flush_rcache: - set to flush rcache on regular allocation failure
  * This function tries to satisfy an iova allocation from the rcache,
- * and falls back to regular allocation on failure.
+ * and falls back to regular allocation on failure. If regular allocation
+ * fails too and the flush_rcache flag is set then the rcache will be flushed.
 */
 unsigned long

Re: DMA error when sg->offset value is greater than PAGE_SIZE in Intel IOMMU

2017-09-20 Thread Herbert Xu
Harsh Jain  wrote:
> 
> While debugging a DMA mapping error in the chelsio crypto driver, we
> observed that when the scatter/gather list received by the driver has some
> entry with page->offset > 4096 (PAGE_SIZE), it starts giving DMA errors.
> Without the IOMMU it works fine.

This is not a bug.  The network stack can and will feed us such
SG lists.

> 2) It cannot be the driver's responsibility to update the received sg
> entries to adjust offset and page, because we are not the only ones who
> directly use the received sg list.

No the driver must deal with this.  Having said that, if we can
improve our driver helper interface to make this easier then we
should do that too.  What we certainly shouldn't do is to take a
whack-a-mole approach like this patch does.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt