[PATCH v2] iommu/amd: Recover from event log overflow

2021-10-04 Thread Lennert Buytenhek
The AMD IOMMU logs I/O page faults and such to a ring buffer in
system memory, and this ring buffer can overflow.  The AMD IOMMU
spec has the following to say about the interrupt status bit that
signals this overflow condition:

EventOverflow: Event log overflow. RW1C. Reset 0b. 1 = IOMMU
event log overflow has occurred. This bit is set when a new
event is to be written to the event log and there is no usable
entry in the event log, causing the new event information to
be discarded. An interrupt is generated when EventOverflow = 1b
and MMIO Offset 0018h[EventIntEn] = 1b. No new event log
entries are written while this bit is set. Software Note: To
resume logging, clear EventOverflow (W1C), and write a 1 to
MMIO Offset 0018h[EventLogEn].

The AMD IOMMU driver doesn't currently implement this recovery
sequence, meaning that if a ring buffer overflow occurs, logging
of EVT/PPR/GA events will cease entirely.

This patch implements the spec-mandated reset sequence, with the
minor tweak that the hardware seems to want to have a 0 written to
MMIO Offset 0018h[EventLogEn] first, before writing a 1 to this
field, or the IOMMU won't actually resume logging events.
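
For reference, this is the register-level shape of that sequence, written
with the MMIO names the driver already uses (an illustrative sketch; the
actual change is in the diff below):

	/* W1C: acknowledge EventOverflow in the status register */
	writel(MMIO_STATUS_EVT_OVERFLOW_INT_MASK,
	       iommu->mmio_base + MMIO_STATUS_OFFSET);

	/* toggle EventLogEn: write 0 first, then 1, so logging resumes */
	iommu_feature_disable(iommu, CONTROL_EVT_LOG_EN);
	iommu_feature_enable(iommu, CONTROL_EVT_LOG_EN);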

Signed-off-by: Lennert Buytenhek 
---
Tested on v5.15-rc4 on a Ryzen 3700X.

Changes for v2:
- Rate limit event log overflow messages.

 drivers/iommu/amd/amd_iommu.h       |  1 +
 drivers/iommu/amd/amd_iommu_types.h |  1 +
 drivers/iommu/amd/init.c            | 10 ++
 drivers/iommu/amd/iommu.c           | 10 --
 4 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h
index 416815a525d6..bb95edf74415 100644
--- a/drivers/iommu/amd/amd_iommu.h
+++ b/drivers/iommu/amd/amd_iommu.h
@@ -14,6 +14,7 @@
 extern irqreturn_t amd_iommu_int_thread(int irq, void *data);
 extern irqreturn_t amd_iommu_int_handler(int irq, void *data);
 extern void amd_iommu_apply_erratum_63(u16 devid);
+extern void amd_iommu_restart_event_logging(struct amd_iommu *iommu);
 extern void amd_iommu_reset_cmd_buffer(struct amd_iommu *iommu);
 extern int amd_iommu_init_devices(void);
 extern void amd_iommu_uninit_devices(void);
diff --git a/drivers/iommu/amd/amd_iommu_types.h b/drivers/iommu/amd/amd_iommu_types.h
index 8dbe61e2b3c1..095076029f8d 100644
--- a/drivers/iommu/amd/amd_iommu_types.h
+++ b/drivers/iommu/amd/amd_iommu_types.h
@@ -110,6 +110,7 @@
 #define PASID_MASK 0x
 
 /* MMIO status bits */
+#define MMIO_STATUS_EVT_OVERFLOW_INT_MASK  (1 << 0)
 #define MMIO_STATUS_EVT_INT_MASK   (1 << 1)
 #define MMIO_STATUS_COM_WAIT_INT_MASK  (1 << 2)
 #define MMIO_STATUS_PPR_INT_MASK   (1 << 6)
diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 2a822b229bd0..f1c5eb023bda 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -654,6 +654,16 @@ static int __init alloc_command_buffer(struct amd_iommu *iommu)
return iommu->cmd_buf ? 0 : -ENOMEM;
 }
 
+/*
+ * This function restarts event logging in case the IOMMU experienced
+ * an event log buffer overflow.
+ */
+void amd_iommu_restart_event_logging(struct amd_iommu *iommu)
+{
+   iommu_feature_disable(iommu, CONTROL_EVT_LOG_EN);
+   iommu_feature_enable(iommu, CONTROL_EVT_LOG_EN);
+}
+
 /*
  * This function resets the command buffer if the IOMMU stopped fetching
  * commands from it.
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 1722bb161841..f46eb7397021 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -742,7 +742,8 @@ amd_iommu_set_pci_msi_domain(struct device *dev, struct amd_iommu *iommu) { }
 #endif /* !CONFIG_IRQ_REMAP */
 
 #define AMD_IOMMU_INT_MASK \
-   (MMIO_STATUS_EVT_INT_MASK | \
+   (MMIO_STATUS_EVT_OVERFLOW_INT_MASK | \
+MMIO_STATUS_EVT_INT_MASK | \
 MMIO_STATUS_PPR_INT_MASK | \
 MMIO_STATUS_GALOG_INT_MASK)
 
@@ -752,7 +753,7 @@ irqreturn_t amd_iommu_int_thread(int irq, void *data)
u32 status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET);
 
while (status & AMD_IOMMU_INT_MASK) {
-   /* Enable EVT and PPR and GA interrupts again */
+   /* Enable interrupt sources again */
writel(AMD_IOMMU_INT_MASK,
iommu->mmio_base + MMIO_STATUS_OFFSET);
 
@@ -773,6 +774,11 @@ irqreturn_t amd_iommu_int_thread(int irq, void *data)
}
 #endif
 
+   if (status & MMIO_STATUS_EVT_OVERFLOW_INT_MASK) {
+   pr_info_ratelimited("IOMMU event log overflow\n");
+   amd_iommu_restart_event_logging(iommu);
+   }
+
/*
 * Hardware bug: ERBT1312
 * When re-enabling interrupt (by writing 1
-- 
2.31.1

Re: [PATCH 1/5] iova: Move fast alloc size roundup into alloc_iova_fast()

2021-10-04 Thread Will Deacon
On Fri, Sep 24, 2021 at 06:01:53PM +0800, John Garry wrote:
> It really is a property of the IOVA rcache code that we need to alloc a
> power-of-2 size, so relocate the functionality to resize into
> alloc_iova_fast(), rather than the callsites.
> 
> Signed-off-by: John Garry 
> ---
>  drivers/iommu/dma-iommu.c| 8 
>  drivers/iommu/iova.c | 9 +
>  drivers/vdpa/vdpa_user/iova_domain.c | 8 
>  3 files changed, 9 insertions(+), 16 deletions(-)

Acked-by: Will Deacon 

Will
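
For context, a minimal sketch of the relocation being acked above, assuming
the rcache size limit IOVA_RANGE_CACHE_MAX_SIZE from iova.c (illustrative,
not the exact hunk):

	/* inside alloc_iova_fast(): the rcaches only hold power-of-two sizes,
	 * so round up once here instead of at every call site */
	if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
		size = roundup_pow_of_two(size);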


Re: [PATCH 5/5] iommu/iova: Avoid double-negatives in magazine helpers

2021-10-04 Thread Will Deacon
On Fri, Sep 24, 2021 at 06:01:57PM +0800, John Garry wrote:
> A similar crash to the following could be observed if initial CPU rcache
> magazine allocations fail in init_iova_rcaches():
> 
> Unable to handle kernel NULL pointer dereference at virtual address
> Mem abort info:
> 
>   free_iova_fast+0xfc/0x280
>   iommu_dma_free_iova+0x64/0x70
>   __iommu_dma_unmap+0x9c/0xf8
>   iommu_dma_unmap_sg+0xa8/0xc8
>   dma_unmap_sg_attrs+0x28/0x50
>   cq_thread_v3_hw+0x2dc/0x528
>   irq_thread_fn+0x2c/0xa0
>   irq_thread+0x130/0x1e0
>   kthread+0x154/0x158
>   ret_from_fork+0x10/0x34
> 
> The issue is that expression !iova_magazine_full(NULL) evaluates true; this
> falls over in __iova_rcache_insert() when we attempt to cache a mag and
> cpu_rcache->loaded == NULL:
> 
> if (!iova_magazine_full(cpu_rcache->loaded)) {
>   can_insert = true;
> ...
> 
> if (can_insert)
>   iova_magazine_push(cpu_rcache->loaded, iova_pfn);
> 
> As above, can_insert is evaluated true, which it shouldn't be, and we try
> to insert pfns in a NULL mag, which is not safe.
> 
> To avoid this, stop using double-negatives, like !iova_magazine_full() and
> !iova_magazine_empty(), and use positive tests, like
> iova_magazine_has_space() and iova_magazine_has_pfns(), respectively; these
> can safely deal with cpu_rcache->{loaded, prev} = NULL.

I don't understand why you're saying that things like !iova_magazine_empty()
are double-negatives. What about e.g. !list_empty() elsewhere in the kernel?

The crux of the fix seems to be:

> @@ -783,8 +787,9 @@ static bool __iova_rcache_insert(struct iova_caching_domain *rcached,
>   if (new_mag) {
>   spin_lock(&rcache->lock);
>   if (rcache->depot_size < MAX_GLOBAL_MAGS) {
> - rcache->depot[rcache->depot_size++] =
> - cpu_rcache->loaded;
> + if (cpu_rcache->loaded)
> + rcache->depot[rcache->depot_size++] =
> + cpu_rcache->loaded;

Which could be independent of the renaming?

Will
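
For context, a sketch of the NULL-tolerant helpers the quoted commit message
proposes, assuming the struct iova_magazine layout in iova.c (a size counter
plus a pfns[IOVA_MAG_SIZE] array); an illustration, not the patch itself:

	static bool iova_magazine_has_space(struct iova_magazine *mag)
	{
		/* a NULL magazine has no space, so insertion is refused */
		return mag && mag->size < IOVA_MAG_SIZE;
	}

	static bool iova_magazine_has_pfns(struct iova_magazine *mag)
	{
		/* a NULL magazine holds no PFNs, so lookup is refused */
		return mag && mag->size;
	}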


Re: [PATCH 0/5] iommu: Some IOVA code reorganisation

2021-10-04 Thread Will Deacon
On Fri, Sep 24, 2021 at 06:01:52PM +0800, John Garry wrote:
> The IOVA domain structure is a bit overloaded, holding:
> - IOVA tree management
> - FQ control
> - IOVA rcache memories
> 
> Indeed only a couple of IOVA users use the rcache, and only dma-iommu.c
> uses the FQ feature.
> 
> This series separates out that structure. In addition, it moves the FQ
> code into dma-iommu.c . This is not strictly necessary, but it does make
> it easier for the FQ domain to look up the rcache domain.
> 
> The rcache code stays where it is, as it may be reworked in future, so
> there is not much point in relocating and then discarding.
> 
> This topic was initially discussed and suggested (I think) by Robin here:
> https://lore.kernel.org/linux-iommu/1d06eda1-9961-d023-f5e7-fe87e768f...@arm.com/

It would be useful to have Robin's Ack on patches 2-4. The implementation
looks straightforward to me, but the thread above isn't very clear about
what is being suggested.

To play devil's advocate: there aren't many direct users of the iovad code:
either they'll die out entirely (and everybody will use the dma-iommu code)
and it's fine having the flush queue code where it is, or we'll get more
users and the likelihood of somebody else wanting flush queues increases.

Will


Re: [PATCH 1/2] dt-bindings: arm-smmu: Add compatible for SM6350 SoC

2021-10-04 Thread Will Deacon
On Fri, 20 Aug 2021 22:29:04 +0200, Konrad Dybcio wrote:
> Add the SoC specific compatible for SM6350 implementing
> arm,mmu-500.
> 
> 

Applied to will (for-joerg/arm-smmu/updates), thanks!

[1/2] dt-bindings: arm-smmu: Add compatible for SM6350 SoC
  https://git.kernel.org/will/c/e4a40f15b031
[2/2] iommu/arm-smmu-qcom: Add SM6350 SMMU compatible
  https://git.kernel.org/will/c/bc53c8b8b087

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH 1/2] iommu: arm-smmu-qcom: Add compatible for qcm2290

2021-10-04 Thread Will Deacon
On Fri, 1 Oct 2021 16:00:31 +0200, Loic Poulain wrote:
> 


Applied to will (for-joerg/arm-smmu/updates), thanks!

[1/2] iommu: arm-smmu-qcom: Add compatible for qcm2290
  https://git.kernel.org/will/c/756a622c8f06
[2/2] dt-bindings: arm-smmu: Add qcm2290 compatible strings
  https://git.kernel.org/will/c/f1edce3db543

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH v2 0/2] iommu/arm-smmu-v3: Perform some simple optimizations for arm_smmu_cmdq_build_cmd()

2021-10-04 Thread Will Deacon
On Wed, 18 Aug 2021 16:04:50 +0800, Zhen Lei wrote:
> v1 --> v2:
> 1. Add patch 1, Properly handle the return value of arm_smmu_cmdq_build_cmd()
> 2. Remove arm_smmu_cmdq_copy_cmd(). In addition, when build command fails, 
> out_cmd is not filled.
> 
> 
> Zhen Lei (2):
>   iommu/arm-smmu-v3: Properly handle the return value of
> arm_smmu_cmdq_build_cmd()
>   iommu/arm-smmu-v3: Simplify useless instructions in
> arm_smmu_cmdq_build_cmd()
> 
> [...]

Applied first patch only to will (for-joerg/arm-smmu/updates), thanks!

[1/2] iommu/arm-smmu-v3: Properly handle the return value of arm_smmu_cmdq_build_cmd()
  https://git.kernel.org/will/c/59d9bd727495

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH] iommu/arm-smmu-v3: Stop pre-zeroing batch commands in arm_smmu_atc_inv_master()

2021-10-04 Thread Will Deacon
On Tue, 17 Aug 2021 19:34:11 +0800, Zhen Lei wrote:
> Pre-zeroing the batched commands structure is inefficient, as individual
> commands are zeroed later in arm_smmu_cmdq_build_cmd(). Therefore, only
> the member 'num' needs to be initialized to 0.
> 
> 
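
A minimal sketch of what the quoted description amounts to, assuming the
arm_smmu_cmdq_batch layout in arm-smmu-v3.c (a command array plus a num
counter); illustrative only:

	struct arm_smmu_cmdq_batch cmds;

	/* no memset of the whole batch: each command is zeroed later in
	 * arm_smmu_cmdq_build_cmd(), so only the count needs resetting */
	cmds.num = 0;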

Applied to will (for-joerg/arm-smmu/updates), thanks!

[1/1] iommu/arm-smmu-v3: Stop pre-zeroing batch commands in arm_smmu_atc_inv_master()
  https://git.kernel.org/will/c/93f9f7958f12

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH v2 2/2] iommu/arm-smmu-v3: Simplify useless instructions in arm_smmu_cmdq_build_cmd()

2021-10-04 Thread Will Deacon
On Wed, Aug 18, 2021 at 04:04:52PM +0800, Zhen Lei wrote:
> Although the parameter 'cmd' is always passed by a local array variable,
> and only this function modifies it, the compiler does not know this. The
> compiler almost always reads the value of cmd[i] from memory rather than
> directly using the value cached in the register. This generates many
> useless instruction operations and affects the performance to some extent.
> 
> To guide the compiler for proper optimization, 'cmd' is defined as a local
> array variable, marked as register, and copied to the output parameter at
> a time when the function is returned.
> 
> The optimization effect can be viewed by running the "size arm-smmu-v3.o"
> command.
> 
> Before:
>    text    data     bss     dec     hex
>   26954    1348      56   28358    6ec6
> 
> After:
>    text    data     bss     dec     hex
>   26762    1348      56   28166    6e06
> 
> Signed-off-by: Zhen Lei 
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 01e95b56ffa07d1..7cec0c967f71d86 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -234,10 +234,12 @@ static int queue_remove_raw(struct arm_smmu_queue *q, u64 *ent)
>  }
>  
>  /* High-level queue accessors */
> -static int arm_smmu_cmdq_build_cmd(u64 *cmd, struct arm_smmu_cmdq_ent *ent)
> +static int arm_smmu_cmdq_build_cmd(u64 *out_cmd, struct arm_smmu_cmdq_ent *ent)
>  {
> - memset(cmd, 0, 1 << CMDQ_ENT_SZ_SHIFT);
> - cmd[0] |= FIELD_PREP(CMDQ_0_OP, ent->opcode);
> + register u64 cmd[CMDQ_ENT_DWORDS];

'register' is pretty badly specified outside of a handful of limited uses in
conjunction with inline asm, so I really don't think we should be using it
here.

> + cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);

This generates a compiler warning about taking the address of a 'register'
variable.
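
One possible alternative that keeps the single copy-out without relying on
'register', sketched on the assumption that a plain local array plus a final
memcpy() is enough of a hint for the compiler; an untested illustration:

	static int arm_smmu_cmdq_build_cmd(u64 *out_cmd, struct arm_smmu_cmdq_ent *ent)
	{
		u64 cmd[CMDQ_ENT_DWORDS] = { };

		cmd[0] = FIELD_PREP(CMDQ_0_OP, ent->opcode);
		/* ... opcode-specific fields filled in as before ... */

		/* copy the finished command out in one go */
		memcpy(out_cmd, cmd, sizeof(cmd));
		return 0;
	}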

Will


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-04 Thread Jason Gunthorpe
On Mon, Oct 04, 2021 at 08:58:35AM +0200, Christian König wrote:
> I'm not following this discussion to closely, but try to look into it from
> time to time.
> 
> On 01.10.21 at 19:45, Jason Gunthorpe wrote:
> > On Fri, Oct 01, 2021 at 11:01:49AM -0600, Logan Gunthorpe wrote:
> > 
> > > In device-dax, the refcount is only used to prevent the device, and
> > > therefore the pages, from going away on device unbind. Pages cannot be
> > > recycled, as you say, as they are mapped linearly within the device. The
> > > address space invalidation is done only when the device is unbound.
> > By address space invalidation I mean invalidation of the VMA that is
> > pointing to those pages.
> > 
> > device-dax may not have an issue with use-after-VMA-invalidation by
> > its very nature since every PFN always points to the same
> > thing. fsdax and this p2p stuff are different though.
> > 
> > > Before the invalidation, an active flag is cleared to ensure no new
> > > mappings can be created while the unmap is proceeding.
> > > unmap_mapping_range() should sequence itself with the TLB flush and
> > AFAIK unmap_mapping_range() kicks off the TLB flush and then
> > returns. It doesn't always wait for the flush to fully finish. Ie some
> > cases use RCU to lock the page table against GUP fast and so the
> > put_page() doesn't happen until the call_rcu completes - after a grace
> > period. The unmap_mapping_range() does not wait for grace periods.
> 
> Wow, wait a second. That is quite a bummer. At least in all GEM/TTM based
> graphics drivers that could potentially cause a lot of trouble.
> 
> I've just double checked and we certainly have the assumption that when
> unmap_mapping_range() returns the pte is gone and the TLB flush completed in
> quite a number of places.
> 
> Do you have more information when and why that can happen?

There are two things to keep in mind, flushing the PTEs from the HW
and serializing against gup_fast.

If you start at unmap_mapping_range() the page is eventually
discovered in zap_pte_range() and the PTE cleared. It is then passed
into __tlb_remove_page() which puts it on the batch->pages list

The page free happens in tlb_batch_pages_flush() via
free_pages_and_swap_cache()

The tlb_batch_pages_flush() happens via zap_page_range() ->
tlb_finish_mmu(), presumably after the HW has wiped the TLB's on all
CPUs. On x86 this is done with an IPI and also serializes gup fast, so
OK

The interesting case is CONFIG_MMU_GATHER_RCU_TABLE_FREE which doesn't
rely on IPIs anymore to synchronize with gup-fast.

In this configuration it means when unmap_mapping_range() returns the
TLB will have been flushed, but no serialization with GUP fast was
done.

This is OK if the GUP fast cannot return the page at all. I assume
this generally describes the DRM cases?

However, if the GUP fast can return the page then something,
somewhere, needs to serialize the page free with the RCU as the GUP
fast can be observing the old PTE before it was zap'd until the RCU
grace expires.

Relying on the page ref being !0 to protect GUP fast is not safe
because the page ref can be incr'd immediately upon page re-use.

Interestingly I looked around for this on PPC and I only found RCU
delayed freeing of the page table level, not RCU delayed freeing of
pages themselves. I wonder if it was missed?

There is a path on PPC (tlb_remove_table_sync_one) that triggers an
IPI but it looks like an exception, and we wouldn't need the RCU at
all if we used IPI to serialize GUP fast...

It makes logical sense if the RCU also frees the pages on
CONFIG_MMU_GATHER_RCU_TABLE_FREE so anything returnable by GUP fast
must be refcounted and freed by tlb_batch_pages_flush(), not by the
caller of unmap_mapping_range().

If we expect to allow the caller of unmap_mapping_range() to free then
CONFIG_MMU_GATHER_RCU_TABLE_FREE can't really exist, we always need to
trigger a serializing IPI during tlb_batch_pages_flush()

AFAICT, at least

Jason
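
A hedged sketch of the serialization described above, for a caller of
unmap_mapping_range() that wants to free the backing pages itself; the
mapping/offset/size names are placeholders and synchronize_rcu() is used
here purely to illustrate the grace-period wait, not to describe what any
existing driver does:

	/* zap the PTEs; on return the TLBs are flushed, but a GUP-fast
	 * walker may still be working on the stale PTE under RCU */
	unmap_mapping_range(mapping, offset, size, 1);

	/* wait out the grace period before the pages may be recycled */
	synchronize_rcu();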

Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-04 Thread Christian König via iommu

On 04.10.21 at 15:11, Jason Gunthorpe wrote:

On Mon, Oct 04, 2021 at 08:58:35AM +0200, Christian König wrote:

I'm not following this discussion to closely, but try to look into it from
time to time.

On 01.10.21 at 19:45, Jason Gunthorpe wrote:

On Fri, Oct 01, 2021 at 11:01:49AM -0600, Logan Gunthorpe wrote:


In device-dax, the refcount is only used to prevent the device, and
therefore the pages, from going away on device unbind. Pages cannot be
recycled, as you say, as they are mapped linearly within the device. The
address space invalidation is done only when the device is unbound.

By address space invalidation I mean invalidation of the VMA that is
pointing to those pages.

device-dax may not have an issue with use-after-VMA-invalidation by
its very nature since every PFN always points to the same
thing. fsdax and this p2p stuff are different though.


Before the invalidation, an active flag is cleared to ensure no new
mappings can be created while the unmap is proceeding.
unmap_mapping_range() should sequence itself with the TLB flush and

AFAIK unmap_mapping_range() kicks off the TLB flush and then
returns. It doesn't always wait for the flush to fully finish. Ie some
cases use RCU to lock the page table against GUP fast and so the
put_page() doesn't happen until the call_rcu completes - after a grace
period. The unmap_mapping_range() does not wait for grace periods.

Wow, wait a second. That is quite a bummer. At least in all GEM/TTM based
graphics drivers that could potentially cause a lot of trouble.

I've just double checked and we certainly have the assumption that when
unmap_mapping_range() returns the pte is gone and the TLB flush completed in
quite a number of places.

Do you have more information when and why that can happen?

There are two things to keep in mind, flushing the PTEs from the HW
and serializing against gup_fast.

If you start at unmap_mapping_range() the page is eventually
discovered in zap_pte_range() and the PTE cleared. It is then passed
into __tlb_remove_page() which puts it on the batch->pages list

The page free happens in tlb_batch_pages_flush() via
free_pages_and_swap_cache()

The tlb_batch_pages_flush() happens via zap_page_range() ->
tlb_finish_mmu(), presumably after the HW has wiped the TLB's on all
CPUs. On x86 this is done with an IPI and also serializes gup fast, so
OK

The interesting case is CONFIG_MMU_GATHER_RCU_TABLE_FREE which doesn't
rely on IPIs anymore to synchronize with gup-fast.

In this configuration it means when unmap_mapping_range() returns the
TLB will have been flushed, but no serialization with GUP fast was
done.

This is OK if the GUP fast cannot return the page at all. I assume
this generally describes the DRM cases?


Yes, exactly that. GUP is completely forbidden for such mappings.

But what about accesses by other CPUs? In other words our use case is 
like the following:


1. We found that we need exclusive access to the higher level object a 
page belongs to.


2. The lock of the higher level object is taken. The lock is also taken 
in the fault handler for the VMA which inserts the PTE in the first place.


3. unmap_mapping_range() for the range of the object is called, the 
expectation is that when that function returns only the kernel can have 
a mapping of the pages backing the object.


4. The kernel has exclusive access to the pages and we know that 
userspace can't mess with them any more.


That use case is completely unrelated to GUP and when this doesn't work 
we have quite a problem.


I should probably note that we recently switched from VM_MIXEDMAP to 
using VM_PFNMAP because the former didn't prevent GUP on all
architectures.


Christian.
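
A schematic of the four steps above; the object, lock and helper names are
hypothetical and only illustrate the ordering being relied upon:

	mutex_lock(&obj->lock);		/* also taken by the CPU fault handler */

	/* after this returns, the expectation is that only the kernel
	 * still maps the pages backing the object */
	unmap_mapping_range(obj->mapping, obj->mmap_offset, obj->size, 1);

	access_pages_exclusively(obj);	/* hypothetical exclusive-access work */

	mutex_unlock(&obj->lock);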


However, if the GUP fast can return the page then something,
somewhere, needs to serialize the page free with the RCU as the GUP
fast can be observing the old PTE before it was zap'd until the RCU
grace expires.

Relying on the page ref being !0 to protect GUP fast is not safe
because the page ref can be incr'd immediately upon page re-use.

Interestingly I looked around for this on PPC and I only found RCU
delayed freeing of the page table level, not RCU delayed freeing of
pages themselves. I wonder if it was missed?

There is a path on PPC (tlb_remove_table_sync_one) that triggers an
IPI but it looks like an exception, and we wouldn't need the RCU at
all if we used IPI to serialize GUP fast...

It makes logical sense if the RCU also frees the pages on
CONFIG_MMU_GATHER_RCU_TABLE_FREE so anything returnable by GUP fast
must be refcounted and freed by tlb_batch_pages_flush(), not by the
caller of unmap_mapping_range().

If we expect to allow the caller of unmap_mapping_range() to free then
CONFIG_MMU_GATHER_RCU_TABLE_FREE can't really exist, we always need to
trigger a serializing IPI during tlb_batch_pages_flush()

AFAICT, at least

Jason



Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-04 Thread Jason Gunthorpe
On Mon, Oct 04, 2021 at 03:22:22PM +0200, Christian König wrote:

> That use case is completely unrelated to GUP and when this doesn't work we
> have quite a problem.

My read is that unmap_mapping_range() guarantees the physical TLB
hardware is serialized across all CPUs upon return.

It also guarantees GUP slow is serialized due to the page table
spinlocks.

Jason

Re: [PATCH 5/5] iommu/iova: Avoid double-negatives in magazine helpers

2021-10-04 Thread John Garry

On 04/10/2021 12:38, Will Deacon wrote:

Hi Will,


To avoid this, stop using double-negatives, like !iova_magazine_full() and
!iova_magazine_empty(), and use positive tests, like
iova_magazine_has_space() and iova_magazine_has_pfns(), respectively; these
can safely deal with cpu_rcache->{loaded, prev} = NULL.

I don't understand why you're saying that things like !iova_magazine_empty()
are double-negatives. What about e.g. !list_empty() elsewhere in the kernel?


IMO, a check for an empty magazine is a negative check, as opposed to a 
check for availability.


But I'm not saying that patterns like !list_empty() are a bad practice. 
I'm just saying that they are not NULL safe, and that matters in this 
case as we can potentially pass a NULL pointer.




The crux of the fix seems to be:


@@ -783,8 +787,9 @@ static bool __iova_rcache_insert(struct iova_caching_domain *rcached,
if (new_mag) {
spin_lock(&rcache->lock);
if (rcache->depot_size < MAX_GLOBAL_MAGS) {
-   rcache->depot[rcache->depot_size++] =
-   cpu_rcache->loaded;
+   if (cpu_rcache->loaded)
+   rcache->depot[rcache->depot_size++] =
+   cpu_rcache->loaded;

Which could be independent of the renaming?


If cpu_rcache->loaded was NULL, then we crash before we reach this code.

Anyway, since I earlier reworked init_iova_rcaches() to properly handle 
failed mem allocations for rcache->cpu_rcaches, I can rework further to 
fail the init for failed mem allocations for cpu_rcaches->loaded, so we 
don't need this change.
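
A hedged sketch of that rework, assuming iova_magazine_alloc() as in iova.c
and leaving the unwinding of earlier allocations to the caller; illustrative
only:

	for_each_possible_cpu(cpu) {
		cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
		spin_lock_init(&cpu_rcache->lock);
		cpu_rcache->loaded = iova_magazine_alloc(GFP_KERNEL);
		cpu_rcache->prev = iova_magazine_alloc(GFP_KERNEL);
		if (!cpu_rcache->loaded || !cpu_rcache->prev)
			return -ENOMEM;	/* caller frees what was set up */
	}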


Thanks,
John


Re: [PATCH 0/5] iommu: Some IOVA code reorganisation

2021-10-04 Thread John Garry

On 04/10/2021 12:44, Will Deacon wrote:




On Fri, Sep 24, 2021 at 06:01:52PM +0800, John Garry wrote:

The IOVA domain structure is a bit overloaded, holding:
- IOVA tree management
- FQ control
- IOVA rcache memories

Indeed only a couple of IOVA users use the rcache, and only dma-iommu.c
uses the FQ feature.

This series separates out that structure. In addition, it moves the FQ
code into dma-iommu.c . This is not strictly necessary, but it does make
it easier for the FQ domain to look up the rcache domain.

The rcache code stays where it is, as it may be reworked in future, so
there is not much point in relocating and then discarding.

This topic was initially discussed and suggested (I think) by Robin here:
https://lore.kernel.org/linux-iommu/1d06eda1-9961-d023-f5e7-fe87e768f...@arm.com/




Hi Will,


It would be useful to have Robin's Ack on patches 2-4. The implementation
looks straightforward to me, but the thread above isn't very clear about
what is being suggested.


Sure, I intentionally didn't add names to patches so as to avoid possible
incorrect attribution.




To play devil's advocate: there aren't many direct users of the iovad code:
either they'll die out entirely (and everybody will use the dma-iommu code)
and it's fine having the flush queue code where it is, or we'll get more
users and the likelihood of somebody else wanting flush queues increases.



I make it 5x direct users (including vdpa).

Anyway, as I mentioned, I'm not totally determined to relocate the FQ 
code. It's just that dma-iommu is the only user today and co-locating 
makes the iova rcache domain info lookup easier from the FQ code.


Thanks,
John


Re: [PATCH 0/5] iommu: Some IOVA code reorganisation

2021-10-04 Thread Robin Murphy

On 2021-10-04 12:44, Will Deacon wrote:

On Fri, Sep 24, 2021 at 06:01:52PM +0800, John Garry wrote:

The IOVA domain structure is a bit overloaded, holding:
- IOVA tree management
- FQ control
- IOVA rcache memories

Indeed only a couple of IOVA users use the rcache, and only dma-iommu.c
uses the FQ feature.

This series separates out that structure. In addition, it moves the FQ
code into dma-iommu.c . This is not strictly necessary, but it does make
it easier for the FQ domain to look up the rcache domain.

The rcache code stays where it is, as it may be reworked in future, so
there is not much point in relocating and then discarding.

This topic was initially discussed and suggested (I think) by Robin here:
https://lore.kernel.org/linux-iommu/1d06eda1-9961-d023-f5e7-fe87e768f...@arm.com/


It would be useful to have Robin's Ack on patches 2-4. The implementation
looks straightforward to me, but the thread above isn't very clear about
what is being suggested.


FWIW I actually got about half-way through writing my own equivalent of 
patches 2-3, except tackling it from the other direction - simplifying 
the FQ code *before* moving whatever was left to iommu-dma, then I got 
side-tracked trying to make io-pgtable use that freelist properly, and 
then I've been on holiday the last 2 weeks. I've got other things to 
catch up on first but I'll try to get to this later this week.



To play devil's advocate: there aren't many direct users of the iovad code:
either they'll die out entirely (and everybody will use the dma-iommu code)
and it's fine having the flush queue code where it is, or we'll get more
users and the likelihood of somebody else wanting flush queues increases.


I think the FQ code is mostly just here as a historical artefact, since 
the IOVA allocator was the only thing common to the Intel and AMD DMA 
ops when the common FQ implementation was factored out of those, so 
although it's essentially orthogonal it was still related enough that it 
was an easy place to stick it.


Cheers,
Robin.


Re: [PATCH v3 19/20] PCI/P2PDMA: introduce pci_mmap_p2pmem()

2021-10-04 Thread Christian König via iommu

On 04.10.21 at 15:27, Jason Gunthorpe wrote:

On Mon, Oct 04, 2021 at 03:22:22PM +0200, Christian König wrote:


That use case is completely unrelated to GUP and when this doesn't work we
have quite a problem.

My read is that unmap_mapping_range() guarantees the physical TLB
hardware is serialized across all CPUs upon return.


Thanks, that's what I wanted to make sure.

Christian.



It also guarantees GUP slow is serialized due to the page table
spinlocks.

Jason



Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-10-04 Thread Jacob Pan
Hi Barry,

On Sat, 2 Oct 2021 01:45:59 +1300, Barry Song <21cn...@gmail.com> wrote:

> >  
> > > I assume KVA mode can avoid this iotlb flush as the device is using
> > > the page table of the kernel and sharing the whole kernel space. But
> > > will users be glad to accept this mode?  
> >
> > You can avoid the lock by identity mapping the physical address space
> > of the kernel and making map/unmap a NOP.
> >
> > KVA is just a different way to achieve this identity map with slightly
> > different security properties than the normal way, but it doesn't
> > reach to the same security level as proper map/unmap.
> >
> > I'm not sure anyone who cares about DMA security would see value in
> > the slight difference between KVA and a normal identity map.  
> 
> Yes. This is an important question. If users want a high security level,
> kva might not be their choice; if users don't want the security, they are
> using iommu passthrough. So when will users choose KVA?
Right, KVAs sit in the middle in terms of performance and security.
Performance is better than IOVA because the IOTLB flush is avoided, as you
mentioned. It is also not far behind pass-through.

Security-wise, KVA respects kernel mapping. So permissions are better
enforced than pass-through and identity mapping.
To balance performance and security, we are proposing KVA is only supported
on trusted devices. On an Intel platform, it would be based on ACPI SATC
(SoC Integrated Address Translation Cache (SATC) reporting structure, VT-d
spec. 8.2). I am also adding a kernel iommu parameter to allow user
override.
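
For illustration, one way such an override could be wired up; the parameter
name and variable below are invented for this sketch and are not taken from
the posted series:

	static bool iommu_force_kva __read_mostly;

	static int __init iommu_force_kva_setup(char *str)
	{
		return kstrtobool(str, &iommu_force_kva);
	}
	early_param("iommu_force_kva", iommu_force_kva_setup);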


Thanks,

Jacob


Re: [RFC 0/7] Support in-kernel DMA with PASID and SVA

2021-10-04 Thread Jason Gunthorpe via iommu
On Mon, Oct 04, 2021 at 09:40:03AM -0700, Jacob Pan wrote:
> Hi Barry,
> 
> On Sat, 2 Oct 2021 01:45:59 +1300, Barry Song <21cn...@gmail.com> wrote:
> 
> > >  
> > > > I assume KVA mode can avoid this iotlb flush as the device is using
> > > > the page table of the kernel and sharing the whole kernel space. But
> > > > will users be glad to accept this mode?  
> > >
> > > You can avoid the lock by identity mapping the physical address space
> > > of the kernel and making map/unmap a NOP.
> > >
> > > KVA is just a different way to achieve this identity map with slightly
> > > different security properties than the normal way, but it doesn't
> > > reach to the same security level as proper map/unmap.
> > >
> > > I'm not sure anyone who cares about DMA security would see value in
> > > the slight difference between KVA and a normal identity map.  
> > 
> > Yes. This is an important question. If users want a high security level,
> > kva might not be their choice; if users don't want the security, they are
> > using iommu passthrough. So when will users choose KVA?
> Right, KVAs sit in the middle in terms of performance and security.
> Performance is better than IOVA because the IOTLB flush is avoided, as you
> mentioned. It is also not far behind pass-through.

The IOTLB flush is not on a DMA path but on a vmap path, so it is very
hard to compare the two things. Maybe vmap can be made to do a lazy
IOTLB flush or something, which could bring it closer.

> Security-wise, KVA respects kernel mapping. So permissions are better
> enforced than pass-through and identity mapping.

Is this meaningful? Isn't the entire physical map still in the KVA and
isn't it entirely RW?

Jason


Re: [PATCH v2 0/5] iommu: Support identity mappings of reserved-memory regions

2021-10-04 Thread Thierry Reding
On Sun, Oct 03, 2021 at 04:09:56AM +0300, Dmitry Osipenko wrote:
> On 23.04.2021 19:32, Thierry Reding wrote:
> > I've made corresponding changes in the proprietary bootloader, added a
> > compatibility shim in U-Boot (which forwards information created by the
> > proprietary bootloader to the kernel) and the attached patches to test
> > this on Jetson TX1, Jetson TX2 and Jetson AGX Xavier.
> 
> Could you please tell what downstream kernel does for coping with the
> identity mappings in conjunction with the original proprietary bootloader?
> 
> If there is some other method of passing mappings to kernel, could it be
> supported by upstream? Putting burden on users to upgrade bootloader
> feels a bit inconvenient.

It depends on the chip generation. As far as I know there have been
several iterations. The earliest was to pass this information via a
command-line option, but more recent versions use device tree to pass
this information in a similar way as described here. However, these
use non-standard DT bindings, so I don't think we can just implement
them as-is.

Thierry



Re: [PATCH v2 0/5] iommu: Support identity mappings of reserved-memory regions

2021-10-04 Thread Dmitry Osipenko
On 04.10.2021 22:23, Thierry Reding wrote:
> On Sun, Oct 03, 2021 at 04:09:56AM +0300, Dmitry Osipenko wrote:
>> On 23.04.2021 19:32, Thierry Reding wrote:
>>> I've made corresponding changes in the proprietary bootloader, added a
>>> compatibility shim in U-Boot (which forwards information created by the
>>> proprietary bootloader to the kernel) and the attached patches to test
>>> this on Jetson TX1, Jetson TX2 and Jetson AGX Xavier.
>>
>> Could you please tell what downstream kernel does for coping with the
>> identity mappings in conjunction with the original proprietary bootloader?
>>
>> If there is some other method of passing mappings to kernel, could it be
>> supported by upstream? Putting burden on users to upgrade bootloader
>> feels a bit inconvenient.
> 
> It depends on the chip generation. As far as I know there have been
> several iterations. The earliest was to pass this information via a
> command-line option, but more recent versions use device tree to pass
> this information in a similar way as described here. However, these
> use non-standard DT bindings, so I don't think we can just implement
> them as-is.

Is it possible to boot an upstream kernel with the original bootloader?

I remember seeing other platforms, like QCOM, support downstream quirks in
the upstream kernel on the side, i.e. they are undocumented, but the
additional support code is there. That is what "normal" people want. You
should consider doing that for Tegra too, if possible.

Re: [PATCH] iommu: intel: remove flooding of non-error logs, when new-DMA-PTE is the same as old-DMA-PTE.

2021-10-04 Thread Alex Williamson
On Sat, 2 Oct 2021 22:48:24 +0530
Ajay Garg  wrote:

> Thanks Lu for the reply.
> 
> >
> > Shouldn't the domain be switched from a default domain to an
> > unmanaged domain when the device is assigned to the guest?
> >
> > Even if you want to re-setup the same mappings, you need to unmap all
> > existing mappings, right?
> >  
> 
> Hmm, I guess that's a (design) decision the KVM/QEMU/VFIO communities
> need to take.
> Maybe the patch could suppress the flooding till then?

No, this is wrong.  The pte values should not exist, it doesn't matter
that they're the same.  Is the host driver failing to remove mappings
and somehow they persist in the new vfio owned domain?  There's
definitely a bug beyond logging going on here.  Thanks,

Alex
