Re: [PATCH 31/37] iommu/arm-smmu-v3: Add support for PCI ATS

2018-03-14 Thread Jean-Philippe Brucker
On 08/03/18 16:17, Jonathan Cameron wrote:
>> +arm_smmu_enable_ats(master);
> It's a bit nasty not to handle the errors that this could output (other than
> the ENOSYS for when it's not available). Seems that it would be nice to at
> least add a note to the log if people are expecting it to work and it won't
> because some condition or other isn't met.

I agree it's not ideal. Last time this came up the problem was that
checking if ATS is supported requires an ugly ifdef. A proper
implementation requires more support in the PCI core, e.g. a
pci_ats_supported() function.

https://www.spinics.net/lists/kvm/msg145932.html

>> +
>>  group = iommu_group_get_for_dev(dev);
>> -if (!IS_ERR(group)) {
>> -arm_smmu_insert_master(smmu, master);
>> -iommu_group_put(group);
>> -iommu_device_link(>iommu, dev);
>> +if (IS_ERR(group)) {
>> +ret = PTR_ERR(group);
>> +goto err_disable_ats;
>>  }
>>  
>> -return PTR_ERR_OR_ZERO(group);
>> +iommu_group_put(group);
>> +arm_smmu_insert_master(smmu, master);
>> +iommu_device_link(>iommu, dev);
>> +
>> +return 0;
>> +
>> +err_disable_ats:
>> +arm_smmu_disable_ats(master);
> master is leaked here I think...
> Possibly other things as this doesn't line up with the
> remove which I'd have mostly expected it to do.

> There are some slightly fishy bits of ordering in the original code
> anyway that I'm not seeing justification for (why is
> the iommu_device_unlink later than one might expect for
> example).

Yeah, knowing the rest of the probing code, there may exist subtle legacy
reasons for not freeing the master here and the strange orderings. I try
to keep existing behaviors where possible since I barely even have the
bandwidth to fix my own code.

Thanks,
Jean
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 31/37] iommu/arm-smmu-v3: Add support for PCI ATS

2018-03-08 Thread Jonathan Cameron
On Mon, 12 Feb 2018 18:33:46 +
Jean-Philippe Brucker  wrote:

> PCIe devices can implement their own TLB, named Address Translation Cache
> (ATC). Enable Address Translation Service (ATS) for devices that support
> it and send them invalidation requests whenever we invalidate the IOTLBs.
> 
>   Range calculation
>   -
> 
> The invalidation packet itself is a bit awkward: range must be naturally
> aligned, which means that the start address is a multiple of the range
> size. In addition, the size must be a power of two number of 4k pages. We
> have a few options to enforce this constraint:
> 
> (1) Find the smallest naturally aligned region that covers the requested
> range. This is simple to compute and only takes one ATC_INV, but it
> will spill on lots of neighbouring ATC entries.
> 
> (2) Align the start address to the region size (rounded up to a power of
> two), and send a second invalidation for the next range of the same
> size. Still not great, but reduces spilling.
> 
> (3) Cover the range exactly with the smallest number of naturally aligned
> regions. This would be interesting to implement but as for (2),
> requires multiple ATC_INV.
> 
> As I suspect ATC invalidation packets will be a very scarce resource, I'll
> go with option (1) for now, and only send one big invalidation. We can
> move to (2), which is both easier to read and more gentle with the ATC,
> once we've observed on real systems that we can send multiple smaller
> Invalidation Requests for roughly the same price as a single big one.
> 
> Note that with io-pgtable, the unmap function is called for each page, so
> this doesn't matter. The problem shows up when sharing page tables with
> the MMU.
> 
>   Timeout
>   ---
> 
> ATC invalidation is allowed to take up to 90 seconds, according to the
> PCIe spec, so it is possible to hit the SMMU command queue timeout during
> normal operations.
> 
> Some SMMU implementations will raise a CERROR_ATC_INV_SYNC when a CMD_SYNC
> fails because of an ATC invalidation. Some will just abort the CMD_SYNC.
> Others might let CMD_SYNC complete and have an asynchronous IMPDEF
> mechanism to record the error. When we receive a CERROR_ATC_INV_SYNC, we
> could retry sending all ATC_INV since last successful CMD_SYNC. When a
> CMD_SYNC fails without CERROR_ATC_INV_SYNC, we could retry sending *all*
> commands since last successful CMD_SYNC.
> 
> We cannot afford to wait 90 seconds in iommu_unmap, let alone MMU
> notifiers. So we'd have to introduce a more clever system if this timeout
> becomes a problem, like keeping hold of mappings and invalidating in the
> background. Implementing safe delayed invalidations is a very complex
> problem and deserves a series of its own. We'll assess whether more work
> is needed to properly handle ATC invalidation timeouts once this code runs
> on real hardware.
> 
>   Misc
>   
> 
> I didn't put ATC and TLB invalidations in the same functions for three
> reasons:
> 
> * TLB invalidation by range is batched and committed with a single sync.
>   Batching ATC invalidation is inconvenient, endpoints limit the number of
>   inflight invalidations. We'd have to count the number of invalidations
>   queued and send a sync periodically. In addition, I suspect we always
>   need a sync between TLB and ATC invalidation for the same page.
> 
> * Doing ATC invalidation outside tlb_inv_range also allows to send less
>   requests, since TLB invalidations are done per page or block, while ATC
>   invalidations target IOVA ranges.
> 
> * TLB invalidation by context is performed when freeing the domain, at
>   which point there isn't any device attached anymore.
> 
> Signed-off-by: Jean-Philippe Brucker 
Few minor error path related comments inline..

> ---
>  drivers/iommu/arm-smmu-v3.c | 236 
> ++--
>  1 file changed, 226 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
> index 8b9f5dd06be0..76513135310f 100644
> --- a/drivers/iommu/arm-smmu-v3.c
> +++ b/drivers/iommu/arm-smmu-v3.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -109,6 +110,7 @@
>  #define IDR5_OAS_48_BIT  (5 << IDR5_OAS_SHIFT)
>  
>  #define ARM_SMMU_CR0 0x20
> +#define CR0_ATSCHK   (1 << 4)
>  #define CR0_CMDQEN   (1 << 3)
>  #define CR0_EVTQEN   (1 << 2)
>  #define CR0_PRIQEN   (1 << 1)
> @@ -304,6 +306,7 @@
>  #define CMDQ_ERR_CERROR_NONE_IDX 0
>  #define CMDQ_ERR_CERROR_ILL_IDX  1
>  #define CMDQ_ERR_CERROR_ABT_IDX  2
> +#define CMDQ_ERR_CERROR_ATC_INV_IDX  3
>  
>  #define CMDQ_0_OP_SHIFT  0
>  #define CMDQ_0_OP_MASK   0xffUL
> @@ -327,6 +330,15 @@
>  #define CMDQ_TLBI_1_VA_MASK  

[PATCH 31/37] iommu/arm-smmu-v3: Add support for PCI ATS

2018-02-12 Thread Jean-Philippe Brucker
PCIe devices can implement their own TLB, named Address Translation Cache
(ATC). Enable Address Translation Service (ATS) for devices that support
it and send them invalidation requests whenever we invalidate the IOTLBs.

  Range calculation
  -

The invalidation packet itself is a bit awkward: range must be naturally
aligned, which means that the start address is a multiple of the range
size. In addition, the size must be a power of two number of 4k pages. We
have a few options to enforce this constraint:

(1) Find the smallest naturally aligned region that covers the requested
range. This is simple to compute and only takes one ATC_INV, but it
will spill on lots of neighbouring ATC entries.

(2) Align the start address to the region size (rounded up to a power of
two), and send a second invalidation for the next range of the same
size. Still not great, but reduces spilling.

(3) Cover the range exactly with the smallest number of naturally aligned
regions. This would be interesting to implement but as for (2),
requires multiple ATC_INV.

As I suspect ATC invalidation packets will be a very scarce resource, I'll
go with option (1) for now, and only send one big invalidation. We can
move to (2), which is both easier to read and more gentle with the ATC,
once we've observed on real systems that we can send multiple smaller
Invalidation Requests for roughly the same price as a single big one.

Note that with io-pgtable, the unmap function is called for each page, so
this doesn't matter. The problem shows up when sharing page tables with
the MMU.

  Timeout
  ---

ATC invalidation is allowed to take up to 90 seconds, according to the
PCIe spec, so it is possible to hit the SMMU command queue timeout during
normal operations.

Some SMMU implementations will raise a CERROR_ATC_INV_SYNC when a CMD_SYNC
fails because of an ATC invalidation. Some will just abort the CMD_SYNC.
Others might let CMD_SYNC complete and have an asynchronous IMPDEF
mechanism to record the error. When we receive a CERROR_ATC_INV_SYNC, we
could retry sending all ATC_INV since last successful CMD_SYNC. When a
CMD_SYNC fails without CERROR_ATC_INV_SYNC, we could retry sending *all*
commands since last successful CMD_SYNC.

We cannot afford to wait 90 seconds in iommu_unmap, let alone MMU
notifiers. So we'd have to introduce a more clever system if this timeout
becomes a problem, like keeping hold of mappings and invalidating in the
background. Implementing safe delayed invalidations is a very complex
problem and deserves a series of its own. We'll assess whether more work
is needed to properly handle ATC invalidation timeouts once this code runs
on real hardware.

  Misc
  

I didn't put ATC and TLB invalidations in the same functions for three
reasons:

* TLB invalidation by range is batched and committed with a single sync.
  Batching ATC invalidation is inconvenient, endpoints limit the number of
  inflight invalidations. We'd have to count the number of invalidations
  queued and send a sync periodically. In addition, I suspect we always
  need a sync between TLB and ATC invalidation for the same page.

* Doing ATC invalidation outside tlb_inv_range also allows to send less
  requests, since TLB invalidations are done per page or block, while ATC
  invalidations target IOVA ranges.

* TLB invalidation by context is performed when freeing the domain, at
  which point there isn't any device attached anymore.

Signed-off-by: Jean-Philippe Brucker 
---
 drivers/iommu/arm-smmu-v3.c | 236 ++--
 1 file changed, 226 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 8b9f5dd06be0..76513135310f 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -109,6 +110,7 @@
 #define IDR5_OAS_48_BIT(5 << IDR5_OAS_SHIFT)
 
 #define ARM_SMMU_CR0   0x20
+#define CR0_ATSCHK (1 << 4)
 #define CR0_CMDQEN (1 << 3)
 #define CR0_EVTQEN (1 << 2)
 #define CR0_PRIQEN (1 << 1)
@@ -304,6 +306,7 @@
 #define CMDQ_ERR_CERROR_NONE_IDX   0
 #define CMDQ_ERR_CERROR_ILL_IDX1
 #define CMDQ_ERR_CERROR_ABT_IDX2
+#define CMDQ_ERR_CERROR_ATC_INV_IDX3
 
 #define CMDQ_0_OP_SHIFT0
 #define CMDQ_0_OP_MASK 0xffUL
@@ -327,6 +330,15 @@
 #define CMDQ_TLBI_1_VA_MASK~0xfffUL
 #define CMDQ_TLBI_1_IPA_MASK   0xf000UL
 
+#define CMDQ_ATC_0_SSID_SHIFT  12
+#define CMDQ_ATC_0_SSID_MASK   0xfUL
+#define CMDQ_ATC_0_SID_SHIFT   32
+#define CMDQ_ATC_0_SID_MASK0xUL
+#define CMDQ_ATC_0_GLOBAL  (1UL << 9)
+#define CMDQ_ATC_1_SIZE_SHIFT