Re: [PATCH] swiotlb: Make SWIOTLB_NO_FORCE perform no allocation

2021-03-18 Thread Konrad Rzeszutek Wilk
On Thu, Mar 18, 2021 at 09:00:54PM -0700, Florian Fainelli wrote:
> When SWIOTLB_NO_FORCE is used, there should really be no allocation of
> io_tlb_nslabs slabs, since we are not going to use them. If a platform
> somehow set swiotlb_force to SWIOTLB_NO_FORCE and a later call to
> swiotlb_init() was made, we would still proceed with allocating the
> default SWIOTLB size (64MB), whereas if swiotlb=noforce was set on the
> kernel command line we would have allocated only 2KB.
> 
> This would be inconsistent, and the point of initializing io_tlb_nslabs
> to 1 was to avoid hitting the test for io_tlb_nslabs being 0/not
> initialized.

Could you rebase this on devel/for-linus-5.13 in

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git

please?
> 
> Signed-off-by: Florian Fainelli 
> ---
>  kernel/dma/swiotlb.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c10e855a03bc..526c8321b76f 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -121,12 +121,10 @@ setup_io_tlb_npages(char *str)
>   }
>   if (*str == ',')
>   ++str;
> - if (!strcmp(str, "force")) {
> + if (!strcmp(str, "force"))
>   swiotlb_force = SWIOTLB_FORCE;
> - } else if (!strcmp(str, "noforce")) {
> + else if (!strcmp(str, "noforce"))
>   swiotlb_force = SWIOTLB_NO_FORCE;
> - io_tlb_nslabs = 1;
> - }
>  
>   return 0;
>  }
> @@ -284,6 +282,9 @@ swiotlb_init(int verbose)
>   unsigned char *vstart;
>   unsigned long bytes;
>  
> + if (swiotlb_force == SWIOTLB_NO_FORCE)
> + goto out;
> +
>   if (!io_tlb_nslabs) {
>   io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
>   io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
> @@ -302,6 +303,7 @@ swiotlb_init(int verbose)
>   io_tlb_start = 0;
>   }
>   pr_warn("Cannot allocate buffer");
> +out:
>   no_iotlb_memory = true;
>  }
>  
> -- 
> 2.25.1
> 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] ARM: Qualify enabling of swiotlb_init()

2021-03-18 Thread Florian Fainelli
We do not need a SWIOTLB unless we have DRAM that is addressable beyond
the arm_dma_limit. Compare max_pfn with arm_dma_pfn_limit to determine
whether a SWIOTLB needs to be initialized.

Fixes: ad3c7b18c5b3 ("arm: use swiotlb for bounce buffering on LPAE configs")
Signed-off-by: Florian Fainelli 
---
 arch/arm/mm/init.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 828a2561b229..8356bf1daa28 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -301,7 +301,11 @@ static void __init free_highpages(void)
 void __init mem_init(void)
 {
 #ifdef CONFIG_ARM_LPAE
-   swiotlb_init(1);
+   if (swiotlb_force == SWIOTLB_FORCE ||
+   max_pfn > arm_dma_pfn_limit)
+   swiotlb_init(1);
+   else
+   swiotlb_force = SWIOTLB_NO_FORCE;
 #endif
 
set_max_mapnr(pfn_to_page(max_pfn) - mem_map);
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] swiotlb: Make SWIOTLB_NO_FORCE perform no allocation

2021-03-18 Thread Florian Fainelli
When SWIOTLB_NO_FORCE is used, there should really be no allocation of
io_tlb_nslabs slabs, since we are not going to use them. If a platform
somehow set swiotlb_force to SWIOTLB_NO_FORCE and a later call to
swiotlb_init() was made, we would still proceed with allocating the
default SWIOTLB size (64MB), whereas if swiotlb=noforce was set on the
kernel command line we would have allocated only 2KB.

This would be inconsistent, and the point of initializing io_tlb_nslabs
to 1 was to avoid hitting the test for io_tlb_nslabs being 0/not
initialized.

Signed-off-by: Florian Fainelli 
---
 kernel/dma/swiotlb.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c10e855a03bc..526c8321b76f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -121,12 +121,10 @@ setup_io_tlb_npages(char *str)
}
if (*str == ',')
++str;
-   if (!strcmp(str, "force")) {
+   if (!strcmp(str, "force"))
swiotlb_force = SWIOTLB_FORCE;
-   } else if (!strcmp(str, "noforce")) {
+   else if (!strcmp(str, "noforce"))
swiotlb_force = SWIOTLB_NO_FORCE;
-   io_tlb_nslabs = 1;
-   }
 
return 0;
 }
@@ -284,6 +282,9 @@ swiotlb_init(int verbose)
unsigned char *vstart;
unsigned long bytes;
 
+   if (swiotlb_force == SWIOTLB_NO_FORCE)
+   goto out;
+
if (!io_tlb_nslabs) {
io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
@@ -302,6 +303,7 @@ swiotlb_init(int verbose)
io_tlb_start = 0;
}
pr_warn("Cannot allocate buffer");
+out:
no_iotlb_memory = true;
 }
 
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Konrad Rzeszutek Wilk
> > 
> > In fact I should have looked more closely at that myself - checking
> > debugfs on my 4GB arm64 board actually shows io_tlb_nslabs = 0, and
> > indeed we are bypassing initialisation completely and (ab)using
> > SWIOTLB_NO_FORCE to cover it up, so I guess it probably *is* safe now
> > for the noforce option to do the same for itself and save even that one
> > page.
> 
> OK, I can submit a patch that does that. 5.12-rc3 works correctly for me
> here as well and only allocates SWIOTLB when needed which in our case is
> either:
> 
> - we have DRAM at PA >= 4GB
> - we have limited peripherals (Raspberry Pi 4 derivative) that can only
> address the lower 1GB
> 
> Now let's see if we can get ARM 32-bit to match :)

Whatever patch you come up with, if it is against SWIOTLB please base it
on top of devel/for-linus-5.12 in
https://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git/

Thx
> -- 
> Florian
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/vt-d: Don't set then immediately clear in prq_event_thread()

2021-03-18 Thread Lu Baolu

Hi Joerg,

On 3/18/21 6:10 PM, Joerg Roedel wrote:

Hi Baolu,

On Tue, Mar 09, 2021 at 08:46:41AM +0800, Lu Baolu wrote:

The private data field of a page group response descriptor is set and then
immediately cleared in prq_event_thread(). Fix this by moving the clearing
code up.

Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode")
Cc: Jacob Pan 
Reviewed-by: Liu Yi L 
Signed-off-by: Lu Baolu 


Does this fix an actual bug? If so, please state it in the commit


It will cause a real problem according to the VT-d spec. I haven't had a
chance to run this on real hardware yet. I'll add a commit message to
explain why this will cause a problem.


message and also fix the subject line to state what is set/cleared.



Sure!


Thanks,

Joerg



Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()

2021-03-18 Thread Lu Baolu

Hi Joerg,

On 3/18/21 6:21 PM, Joerg Roedel wrote:

On Wed, Mar 17, 2021 at 08:58:34AM +0800, Lu Baolu wrote:

The pasid_lock is used to prevent different threads from modifying the
same pasid directory entry at the same time. It causes the lockdep splat below.

[   83.296538] 
[   83.296538] WARNING: possible irq lock inversion dependency detected
[   83.296539] 5.12.0-rc3+ #25 Tainted: GW
[   83.296539] 
[   83.296540] bash/780 just changed the state of lock:
[   83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at:
iommu_flush_dev_iotlb.part.0+0x32/0x110
[   83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past:
[   83.296547]  (pasid_lock){+.+.}-{2:2}
[   83.296548]

and interrupts could create inverse lock ordering between them.

[   83.296549] other info that might help us debug this:
[   83.296549] Chain exists of:
  device_domain_lock --> &iommu->lock --> pasid_lock
[   83.296551]  Possible interrupt unsafe locking scenario:

[   83.296551]CPU0CPU1
[   83.296552]
[   83.296552]   lock(pasid_lock);
[   83.296553]local_irq_disable();
[   83.296553]lock(device_domain_lock);
[   83.296554]lock(&iommu->lock);
[   83.296554]   
[   83.296554] lock(device_domain_lock);
[   83.296555]
 *** DEADLOCK ***

Fix it by replacing the pasid_lock with an atomic exchange operation.

Reported-and-tested-by: Dave Jiang 
Signed-off-by: Lu Baolu 
---
  drivers/iommu/intel/pasid.c | 14 ++
  1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 9fb3d3e80408..1ddcb8295f72 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -24,7 +24,6 @@
  /*
   * Intel IOMMU system wide PASID name space:
   */
-static DEFINE_SPINLOCK(pasid_lock);
  u32 intel_pasid_max_id = PASID_MAX;
  
  int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid)

@@ -259,19 +258,18 @@ struct pasid_entry *intel_pasid_get_entry(struct device 
*dev, u32 pasid)
dir_index = pasid >> PASID_PDE_SHIFT;
index = pasid & PASID_PTE_MASK;
  
-	spin_lock(&pasid_lock);

entries = get_pasid_table_from_pde(&dir[dir_index]);
if (!entries) {
entries = alloc_pgtable_page(info->iommu->node);
-   if (!entries) {
-   spin_unlock(&pasid_lock);
+   if (!entries)
return NULL;
-   }
  
-		WRITE_ONCE(dir[dir_index].val,

-  (u64)virt_to_phys(entries) | PASID_PTE_PRESENT);
+   if (cmpxchg64(&dir[dir_index].val, 0ULL,
+ (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
+   free_pgtable_page(entries);
+   entries = get_pasid_table_from_pde(&dir[dir_index]);


This is racy, someone could have already cleared the pasid-entry again.


This code modifies the pasid directory entry. The pasid directory
entries are allocated on demand and will never be freed.


What you need to do here is to retry the whole path by adding a goto
to before  the first get_pasid_table_from_pde() call.


Yes. Retrying by adding a goto makes the code clearer.
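
A rough sketch of what that retry might look like (illustrative only,
derived from the hunk quoted above rather than from a posted patch):

retry:
	entries = get_pasid_table_from_pde(&dir[dir_index]);
	if (!entries) {
		entries = alloc_pgtable_page(info->iommu->node);
		if (!entries)
			return NULL;

		/*
		 * Someone else may have populated this directory entry
		 * while we were allocating; drop our page and use theirs.
		 */
		if (cmpxchg64(&dir[dir_index].val, 0ULL,
			      (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
			free_pgtable_page(entries);
			goto retry;
		}
	}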



Btw, what makes sure that the pasid_entry does not go away when it is
returned here?


As explained above, it handles the pasid directory table entry which
won't go away.



Regards,

Joerg



Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/dma: Fix a typo in a comment

2021-03-18 Thread chenxiang (M)

Hi,


On 2021/3/18 19:01, Robin Murphy wrote:

On 2021-03-18 09:55, chenxiang (M) wrote:

Hi will,


On 2021/3/18 17:34, Will Deacon wrote:

On Thu, Mar 18, 2021 at 11:21:24AM +0800, chenxiang wrote:

From: Xiang Chen 

Fix a typo "SAC" to "DAC" in the comment of the function
iommu_dma_alloc_iova().

Signed-off-by: Xiang Chen 
---
  drivers/iommu/dma-iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c8..3bc17ee 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -443,7 +443,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct 
iommu_domain *domain,

  if (domain->geometry.force_aperture)
  dma_limit = min(dma_limit, 
(u64)domain->geometry.aperture_end);

-/* Try to get PCI devices a SAC address */
+/* Try to get PCI devices a DAC address */
  if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
  iova = alloc_iova_fast(iovad, iova_len,
 DMA_BIT_MASK(32) >> shift, false);

This doesn't look like a typo to me... Please explain.


As the author of said comment, it is definitely not a typo.


What is SAC short for?



I think it means double-address cycle (DAC), and on page 452 of LDD3
there is a description of it: "PCI double-address cycle mappings:
Normally, the DMA support layer works with 32-bit bus addresses, possibly
restricted by a specific device's DMA mask. The PCI bus, however,
also supports a 64-bit addressing mode, the double-address cycle (DAC)."

Please point it out if I am wrong :)


Yes, now look at what the code above does: *if* a PCI device has a DMA 
mask greater than 32 bits (i.e. can support dual address cycles), we 
first try an IOVA allocation with an explicit 32-bit limit (i.e. that 
can be expressed in a single address cycle), in the hope that we don't 
have to fall back to allocating from the upper range and forcing the 
device to actually use DAC (and suffer the potential performance impact).
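
Condensed, the strategy described above looks roughly like this (a
sketch; the fallback branch is paraphrased rather than quoted from the
patch):

	/* Prefer an IOVA below 4GB so the device can keep using single
	 * address cycles (SAC) even though its DMA mask allows DAC... */
	if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
		iova = alloc_iova_fast(iovad, iova_len,
				       DMA_BIT_MASK(32) >> shift, false);

	/* ...and only fall back to the full range, forcing DAC, if the
	 * low range is exhausted. */
	if (!iova)
		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
				       true);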


Robin.

.




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Florian Fainelli


On 3/18/2021 4:35 PM, Robin Murphy wrote:
> On 2021-03-18 21:31, Florian Fainelli wrote:
>>
>>
>> On 3/18/2021 12:53 PM, Robin Murphy wrote:
>>> On 2021-03-18 19:43, Florian Fainelli wrote:


 On 3/18/2021 12:34 PM, Robin Murphy wrote:
> On 2021-03-18 19:22, Florian Fainelli wrote:
>>
>>
>> On 3/18/2021 12:18 PM, Florian Fainelli wrote:
>>> It may be useful to disable the SWIOTLB completely for testing or
>>> when a
>>> platform is known not to have any DRAM addressing limitations
>>> what so
>>> ever.
>
> Isn't that what "swiotlb=noforce" is for? If you're confident that
> we've
> really ironed out *all* the awkward corners that used to blow up if
> various internal bits were left uninitialised, then it would make
> sense
> to just tweak the implementation of what we already have.

 swiotlb=noforce does prevent dma_direct_map_page() from resorting to
 the
 swiotlb, however what I am also after is reclaiming these 64MB of
 default SWIOTLB bounce buffering memory because my systems run with
 large amounts of reserved memory into ZONE_MOVABLE and everything in
 ZONE_NORMAL is precious at that point.
>>>
>>> It also forces io_tlb_nslabs to the minimum, so it should be claiming
>>> considerably less than 64MB. IIRC the original proposal *did* skip
>>> initialisation completely, but that turned up the aforementioned issues.
>>
>> AFAICT in that case we will have io_tlb_nslabs set to 1, which will
>> still make us allocate io_tlb_nslabs << IO_TLB_SHIFT bytes in
>> swiotlb_init(), which still gives us 64MB.
> 
> Eh? When did 2KB become 64MB? IO_TLB_SHIFT is 11, so that's at most one
> page in anyone's money...

Yes, sorry, an incorrect shift was applied here. Still, and I believe this
is what you mean below, architecture code setting swiotlb_force =
SWIOTLB_NO_FORCE does not prevent the SWIOTLB from being allocated,
because io_tlb_nslabs is still left set to 0, so swiotlb_init() will
proceed with allocating the default size.
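
For reference, the arithmetic behind the two numbers in this exchange,
assuming IO_TLB_SHIFT = 11 and the 64MB default: with io_tlb_nslabs forced
to 1, the buffer is 1 << 11 = 2048 bytes (the 2KB, at most one page,
mentioned above); with io_tlb_nslabs left at 0, swiotlb_init() computes the
default of 64MB >> 11 = 32768 slabs, i.e. 32768 << 11 = 64MB.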

> 
> I wouldn't necessarily disagree with adding "off" as an additional
> alias
> for "noforce", though, since it does come across as a bit wacky for
> general use.
>
>>> Signed-off-by: Florian Fainelli 
>>
>> Christoph, in addition to this change, how would you feel if we
>> qualified the swiotlb_init() in arch/arm/mm/init.c with a:
>>
>>
>> if (memblock_end_of_DRAM() >= SZ_4G)
>>   swiotlb_init(1)
>
> Modulo "swiotlb=force", of course ;)

 Indeed, we would need to handle that case as well. Does it sound
 reasonable to do that to you as well?
>>>
>>> I wouldn't like it done to me personally, but for arm64, observe what
>>> mem_init() in arch/arm64/mm/init.c already does.
> 
> In fact I should have looked more closely at that myself - checking
> debugfs on my 4GB arm64 board actually shows io_tlb_nslabs = 0, and
> indeed we are bypassing initialisation completely and (ab)using
> SWIOTLB_NO_FORCE to cover it up, so I guess it probably *is* safe now
> for the noforce option to do the same for itself and save even that one
> page.

OK, I can submit a patch that does that. 5.12-rc3 works correctly for me
here as well and only allocates SWIOTLB when needed which in our case is
either:

- we have DRAM at PA >= 4GB
- we have limited peripherals (Raspberry Pi 4 derivative) that can only
address the lower 1GB

Now let's see if we can get ARM 32-bit to match :)
-- 
Florian
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries

2021-03-18 Thread Lu Baolu

Hi Joerg,

On 3/18/21 5:12 PM, Joerg Roedel wrote:

Hi,

On Mon, Mar 08, 2021 at 11:47:46AM -0800, Raj, Ashok wrote:

That is the primary motivation: given that we have moved to 1st level for
general IOVA, first level doesn't have a WO mapping. I didn't know enough
about the history to determine if a WO without a READ is very useful. I
guess the ZLR was invented to support those cases without a READ in PCIe. I


Okay, please update the commit message and re-send. I guess these
patches are 5.13 stuff. In that case, Baolu can include them into his
pull request later this cycle.


Okay! It works for me.

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Lu Baolu

On 3/18/21 4:56 PM, Tian, Kevin wrote:

From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)



-Original Message-
From: Tian, Kevin [mailto:kevin.t...@intel.com]
Sent: Thursday, March 18, 2021 4:27 PM
To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
; Nadav Amit 
Cc: chenjiashang ; David Woodhouse
; iommu@lists.linux-foundation.org; LKML
; alex.william...@redhat.com; Gonglei

(Arei)

; w...@kernel.org
Subject: RE: A problem of Intel IOMMU hardware ?


From: iommu  On Behalf Of
Longpeng (Mike, Cloud Infrastructure Service Product Dept.)


2. Consider ensuring that the problem is not somehow related to
queued invalidations. Try to use __iommu_flush_iotlb() instead of

qi_flush_iotlb().




I tried to force the use of __iommu_flush_iotlb(), but maybe something went
wrong; the system crashed, so I prefer to lower the priority of this
operation.




The VT-d spec clearly says that register-based invalidation can be used only
when queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide
an option to disable queued-invalidation though, when the hardware is
capable. If you really want to try, tweak the code in intel_iommu_init_qi.



Hi Kevin,

Thanks to point out this. Do you have any ideas about this problem ? I tried
to descript the problem much clear in my reply to Alex, hope you could have
a look if you're interested.



btw I saw you used 4.18 kernel in this test. What about latest kernel?

Also, one way to separate a software bug from a hardware bug is to trace the
low-level interface (e.g., qi_flush_iotlb) which actually sends invalidation
descriptors to the IOMMU hardware. Check the window between b) and c) and see
whether the software does the right thing as expected there.


Yes. It's better if we can reproduce this with the latest kernel, which
has debugfs files to expose the page tables, the invalidation queues, etc.

Best regards,
baolu
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

2021-03-18 Thread Jacob Pan
Hi Jean,

Slightly off topic. As we are moving to using cgroups to limit PASID
allocations, it would be much simpler if we enforced the limit on the
current task.

However, iommu_sva_alloc_pasid() takes an mm_struct pointer as an argument,
which implies it can be something other than the current task's mm. So far
all kernel callers use the current task's mm. Is there a use case for doing
PASID allocation on behalf of another mm? If not, can we remove the mm
argument?

Thanks,

Jacob

>  /**
>   * iommu_sva_alloc_pasid - Allocate a PASID for the mm
> @@ -35,11 +44,11 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm,
> ioasid_t min, ioasid_t max) mutex_lock(&iommu_sva_lock);
>   if (mm->pasid) {
>   if (mm->pasid >= min && mm->pasid <= max)
> - ioasid_get(mm->pasid);
> + ioasid_get(iommu_sva_pasid, mm->pasid);
>   else
>   ret = -EOVERFLOW;
>   } else {
> - pasid = ioasid_alloc(&iommu_sva_pasid, min, max, mm);
> + pasid = ioasid_alloc(iommu_sva_pasid, min, max, mm);
>   if (pasid == INVALID_IOASID)
>   ret = -ENOMEM;

Thanks,

Jacob
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Robin Murphy

On 2021-03-18 21:31, Florian Fainelli wrote:



On 3/18/2021 12:53 PM, Robin Murphy wrote:

On 2021-03-18 19:43, Florian Fainelli wrote:



On 3/18/2021 12:34 PM, Robin Murphy wrote:

On 2021-03-18 19:22, Florian Fainelli wrote:



On 3/18/2021 12:18 PM, Florian Fainelli wrote:

It may be useful to disable the SWIOTLB completely for testing or
when a
platform is known not to have any DRAM addressing limitations what so
ever.


Isn't that what "swiotlb=noforce" is for? If you're confident that we've
really ironed out *all* the awkward corners that used to blow up if
various internal bits were left uninitialised, then it would make sense
to just tweak the implementation of what we already have.


swiotlb=noforce does prevent dma_direct_map_page() from resorting to the
swiotlb, however what I am also after is reclaiming these 64MB of
default SWIOTLB bounce buffering memory because my systems run with
large amounts of reserved memory into ZONE_MOVABLE and everything in
ZONE_NORMAL is precious at that point.


It also forces io_tlb_nslabs to the minimum, so it should be claiming
considerably less than 64MB. IIRC the original proposal *did* skip
initialisation completely, but that turned up the aforementioned issues.


AFAICT in that case we will have io_tlb_nslabs set to 1, which will
still make us allocate io_tlb_nslabs << IO_TLB_SHIFT bytes in
swiotlb_init(), which still gives us 64MB.


Eh? When did 2KB become 64MB? IO_TLB_SHIFT is 11, so that's at most one 
page in anyone's money...



I wouldn't necessarily disagree with adding "off" as an additional alias
for "noforce", though, since it does come across as a bit wacky for
general use.


Signed-off-by: Florian Fainelli 


Christoph, in addition to this change, how would you feel if we
qualified the swiotlb_init() in arch/arm/mm/init.c with a:


if (memblock_end_of_DRAM() >= SZ_4G)
  swiotlb_init(1)


Modulo "swiotlb=force", of course ;)


Indeed, we would need to handle that case as well. Does it sound
reasonable to do that to you as well?


I wouldn't like it done to me personally, but for arm64, observe what
mem_init() in arch/arm64/mm/init.c already does.


In fact I should have looked more closely at that myself - checking 
debugfs on my 4GB arm64 board actually shows io_tlb_nslabs = 0, and 
indeed we are bypassing initialisation completely and (ab)using 
SWIOTLB_NO_FORCE to cover it up, so I guess it probably *is* safe now 
for the noforce option to do the same for itself and save even that one 
page.


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Florian Fainelli


On 3/18/2021 12:53 PM, Robin Murphy wrote:
> On 2021-03-18 19:43, Florian Fainelli wrote:
>>
>>
>> On 3/18/2021 12:34 PM, Robin Murphy wrote:
>>> On 2021-03-18 19:22, Florian Fainelli wrote:


 On 3/18/2021 12:18 PM, Florian Fainelli wrote:
> It may be useful to disable the SWIOTLB completely for testing or
> when a
> platform is known not to have any DRAM addressing limitations what so
> ever.
>>>
>>> Isn't that what "swiotlb=noforce" is for? If you're confident that we've
>>> really ironed out *all* the awkward corners that used to blow up if
>>> various internal bits were left uninitialised, then it would make sense
>>> to just tweak the implementation of what we already have.
>>
>> swiotlb=noforce does prevent dma_direct_map_page() from resorting to the
>> swiotlb, however what I am also after is reclaiming these 64MB of
>> default SWIOTLB bounce buffering memory because my systems run with
>> large amounts of reserved memory into ZONE_MOVABLE and everything in
>> ZONE_NORMAL is precious at that point.
> 
> It also forces io_tlb_nslabs to the minimum, so it should be claiming
> considerably less than 64MB. IIRC the original proposal *did* skip
> initialisation completely, but that turned up the aforementioned issues.

AFAICT in that case we will have io_tlb_nslabs set to 1, which will
still make us allocate io_tlb_nslabs << IO_TLB_SHIFT bytes in
swiotlb_init(), which still gives us 64MB.

> 
>>> I wouldn't necessarily disagree with adding "off" as an additional alias
>>> for "noforce", though, since it does come across as a bit wacky for
>>> general use.
>>>
> Signed-off-by: Florian Fainelli 

 Christoph, in addition to this change, how would you feel if we
 qualified the swiotlb_init() in arch/arm/mm/init.c with a:


 if (memblock_end_of_DRAM() >= SZ_4G)
  swiotlb_init(1)
>>>
>>> Modulo "swiotlb=force", of course ;)
>>
>> Indeed, we would need to handle that case as well. Does it sound
>> reasonable to do that to you as well?
> 
> I wouldn't like it done to me personally, but for arm64, observe what
> mem_init() in arch/arm64/mm/init.c already does.
> 
> Robin.

-- 
Florian
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Robin Murphy

On 2021-03-18 19:43, Florian Fainelli wrote:



On 3/18/2021 12:34 PM, Robin Murphy wrote:

On 2021-03-18 19:22, Florian Fainelli wrote:



On 3/18/2021 12:18 PM, Florian Fainelli wrote:

It may be useful to disable the SWIOTLB completely for testing or when a
platform is known not to have any DRAM addressing limitations what so
ever.


Isn't that what "swiotlb=noforce" is for? If you're confident that we've
really ironed out *all* the awkward corners that used to blow up if
various internal bits were left uninitialised, then it would make sense
to just tweak the implementation of what we already have.


swiotlb=noforce does prevent dma_direct_map_page() from resorting to the
swiotlb, however what I am also after is reclaiming these 64MB of
default SWIOTLB bounce buffering memory because my systems run with
large amounts of reserved memory into ZONE_MOVABLE and everything in
ZONE_NORMAL is precious at that point.


It also forces io_tlb_nslabs to the minimum, so it should be claiming 
considerably less than 64MB. IIRC the original proposal *did* skip 
initialisation completely, but that turned up the aforementioned issues.



I wouldn't necessarily disagree with adding "off" as an additional alias
for "noforce", though, since it does come across as a bit wacky for
general use.


Signed-off-by: Florian Fainelli 


Christoph, in addition to this change, how would you feel if we
qualified the swiotlb_init() in arch/arm/mm/init.c with a:


if (memblock_end_of_DRAM() >= SZ_4G)
 swiotlb_init(1)


Modulo "swiotlb=force", of course ;)


Indeed, we would need to handle that case as well. Does it sound
reasonable to do that to you as well?


I wouldn't like it done to me personally, but for arm64, observe what 
mem_init() in arch/arm64/mm/init.c already does.


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Florian Fainelli


On 3/18/2021 12:34 PM, Robin Murphy wrote:
> On 2021-03-18 19:22, Florian Fainelli wrote:
>>
>>
>> On 3/18/2021 12:18 PM, Florian Fainelli wrote:
>>> It may be useful to disable the SWIOTLB completely for testing or when a
>>> platform is known not to have any DRAM addressing limitations what so
>>> ever.
> 
> Isn't that what "swiotlb=noforce" is for? If you're confident that we've
> really ironed out *all* the awkward corners that used to blow up if
> various internal bits were left uninitialised, then it would make sense
> to just tweak the implementation of what we already have.

swiotlb=noforce does prevent dma_direct_map_page() from resorting to the
swiotlb, however what I am also after is reclaiming these 64MB of
default SWIOTLB bounce buffering memory because my systems run with
large amounts of reserved memory into ZONE_MOVABLE and everything in
ZONE_NORMAL is precious at that point.

> 
> I wouldn't necessarily disagree with adding "off" as an additional alias
> for "noforce", though, since it does come across as a bit wacky for
> general use.
> 
>>> Signed-off-by: Florian Fainelli 
>>
>> Christoph, in addition to this change, how would you feel if we
>> qualified the swiotlb_init() in arch/arm/mm/init.c with a:
>>
>>
>> if (memblock_end_of_DRAM() >= SZ_4G)
>> swiotlb_init(1)
> 
> Modulo "swiotlb=force", of course ;)

Indeed, we would need to handle that case as well. Does it sound
reasonable to do that to you as well?
-- 
Florian
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 2/3] ACPI: Add driver for the VIOT table

2021-03-18 Thread Robin Murphy

On 2021-03-16 19:16, Jean-Philippe Brucker wrote:

The ACPI Virtual I/O Translation Table describes the topology of
para-virtual platforms. For now it describes the relation between
virtio-iommu and the endpoints it manages. Supporting that requires
three steps:

(1) acpi_viot_init(): parse the VIOT table, build a list of endpoints
 and vIOMMUs.

(2) acpi_viot_set_iommu_ops(): when the vIOMMU driver is loaded and the
 device probed, register it to the VIOT driver. This step is required
 because unlike similar drivers, VIOT doesn't create the vIOMMU
 device.


Note that you're basically the same as the DT case in this regard, so 
I'd expect things to be closer to that pattern than to that of IORT.


[...]

@@ -1506,12 +1507,17 @@ int acpi_dma_configure_id(struct device *dev, enum 
dev_dma_attr attr,
  {
const struct iommu_ops *iommu;
u64 dma_addr = 0, size = 0;
+   int ret;
  
  	if (attr == DEV_DMA_NOT_SUPPORTED) {

set_dma_ops(dev, &dma_dummy_ops);
return 0;
}
  
+	ret = acpi_viot_dma_setup(dev, attr);

+   if (ret)
+   return ret > 0 ? 0 : ret;


I think things could do with a fair bit of refactoring here. Ideally we 
want to process a possible _DMA method (acpi_dma_get_range()) regardless 
of which flavour of IOMMU table might be present, and the amount of 
duplication we fork into at this point is unfortunate.



+
iort_dma_setup(dev, &dma_addr, &size);


For starters I think most of that should be dragged out to this level 
here - it's really only the {rc,nc}_dma_get_range() bit that deserves to 
be the IORT-specific call.



iommu = iort_iommu_configure_id(dev, input_id);


Similarly, it feels like it's only the table scan part in the middle of 
that that needs dispatching between IORT/VIOT, and its head and tail 
pulled out into a common path.


[...]

+static const struct iommu_ops *viot_iommu_setup(struct device *dev)
+{
+   struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
+   struct viot_iommu *viommu = NULL;
+   struct viot_endpoint *ep;
+   u32 epid;
+   int ret;
+
+   /* Already translated? */
+   if (fwspec && fwspec->ops)
+   return NULL;
+
+   mutex_lock(&viommus_lock);
+   list_for_each_entry(ep, &viot_endpoints, list) {
+   if (viot_device_match(dev, &ep->dev_id, &epid)) {
+   epid += ep->endpoint_id;
+   viommu = ep->viommu;
+   break;
+   }
+   }
+   mutex_unlock(&viommus_lock);
+   if (!viommu)
+   return NULL;
+
+   /* We're not translating ourself */
+   if (viot_device_match(dev, &viommu->dev_id, &epid))
+   return NULL;
+
+   /*
+* If we found a PCI range managed by the viommu, we're the one that has
+* to request ACS.
+*/
+   if (dev_is_pci(dev))
+   pci_request_acs();
+
+   if (!viommu->ops || WARN_ON(!viommu->dev))
+   return ERR_PTR(-EPROBE_DEFER);


Can you create (or look up) a viommu->fwnode when initially parsing the 
VIOT to represent the IOMMU devices to wait for, such that the 
viot_device_match() lookup can resolve to that and let you fall into the 
standard iommu_ops_from_fwnode() path? That's what I mean about 
following the DT pattern - I guess it might need a bit of trickery to 
rewrite things if iommu_device_register() eventually turns up with a new 
fwnode, so I doubt we can get away without *some* kind of private 
interface between virtio-iommu and VIOT, but it would be nice for the 
common(ish) DMA paths to stay as unaware of the specifics as possible.



+
+   ret = iommu_fwspec_init(dev, viommu->dev->fwnode, viommu->ops);
+   if (ret)
+   return ERR_PTR(ret);
+
+   iommu_fwspec_add_ids(dev, &epid, 1);
+
+   /*
+* If we have reason to believe the IOMMU driver missed the initial
+* add_device callback for dev, replay it to get things in order.
+*/
+   if (dev->bus && !device_iommu_mapped(dev))
+   iommu_probe_device(dev);
+
+   return viommu->ops;
+}
+
+/**
+ * acpi_viot_dma_setup - Configure DMA for an endpoint described in VIOT
+ * @dev: the endpoint
+ * @attr: coherency property of the endpoint
+ *
+ * Setup the DMA and IOMMU ops for an endpoint described by the VIOT table.
+ *
+ * Return:
+ * * 0 - @dev doesn't match any VIOT node
+ * * 1 - ops for @dev were successfully installed
+ * * -EPROBE_DEFER - ops for @dev aren't yet available
+ */
+int acpi_viot_dma_setup(struct device *dev, enum dev_dma_attr attr)
+{
+   const struct iommu_ops *iommu_ops = viot_iommu_setup(dev);
+
+   if (IS_ERR_OR_NULL(iommu_ops)) {
+   int ret = PTR_ERR(iommu_ops);
+
+   if (ret == -EPROBE_DEFER || ret == 0)
+   return ret;
+   dev_err(dev, "error %d while setting up virt IOMMU\n", ret);
+  

Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Robin Murphy

On 2021-03-18 19:22, Florian Fainelli wrote:



On 3/18/2021 12:18 PM, Florian Fainelli wrote:

It may be useful to disable the SWIOTLB completely for testing or when a
platform is known not to have any DRAM addressing limitations what so
ever.


Isn't that what "swiotlb=noforce" is for? If you're confident that we've 
really ironed out *all* the awkward corners that used to blow up if 
various internal bits were left uninitialised, then it would make sense 
to just tweak the implementation of what we already have.


I wouldn't necessarily disagree with adding "off" as an additional alias 
for "noforce", though, since it does come across as a bit wacky for 
general use.
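
For what it's worth, the alias variant would be a much smaller change than
a new enum value, something along these lines (a sketch against the
existing setup_io_tlb_npages(), not a posted patch):

	} else if (!strcmp(str, "noforce") || !strcmp(str, "off")) {
		/* Treat "off" exactly like "noforce". */
		swiotlb_force = SWIOTLB_NO_FORCE;
		io_tlb_nslabs = 1;
	}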



Signed-off-by: Florian Fainelli 


Christoph, in addition to this change, how would you feel if we
qualified the swiotlb_init() in arch/arm/mm/init.c with a:


if (memblock_end_of_DRAM() >= SZ_4G)
swiotlb_init(1)


Modulo "swiotlb=force", of course ;)

Robin.


right now this is made unconditional whenever ARM_LPAE is enabled which
is the case for the platforms I maintain (ARCH_BRCMSTB) however we do
not really need a SWIOTLB so long as the largest DRAM physical address
does not exceed 4GB AFAICT.

Thanks!


---
  Documentation/admin-guide/kernel-parameters.txt | 1 +
  include/linux/swiotlb.h | 1 +
  kernel/dma/swiotlb.c| 9 +
  3 files changed, 11 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..b0223e48921e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5278,6 +5278,7 @@
force -- force using of bounce buffers even if they
 wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
+   off -- Completely disable SWIOTLB
  
  	switches=	[HW,M68k]
  
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h

index 5857a937c637..23f86243defe 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -15,6 +15,7 @@ enum swiotlb_force {
SWIOTLB_NORMAL, /* Default - depending on HW DMA mask etc. */
SWIOTLB_FORCE,  /* swiotlb=force */
SWIOTLB_NO_FORCE,   /* swiotlb=noforce */
+   SWIOTLB_OFF,/* swiotlb=off */
  };
  
  /*

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c10e855a03bc..d7a4a789c7d3 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -126,6 +126,8 @@ setup_io_tlb_npages(char *str)
} else if (!strcmp(str, "noforce")) {
swiotlb_force = SWIOTLB_NO_FORCE;
io_tlb_nslabs = 1;
+   } else if (!strcmp(str, "off")) {
+   swiotlb_force = SWIOTLB_OFF;
}
  
  	return 0;

@@ -229,6 +231,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
unsigned long i, bytes;
size_t alloc_size;
  
+	if (swiotlb_force == SWIOTLB_OFF)

+   return 0;
+
bytes = nslabs << IO_TLB_SHIFT;
  
  	io_tlb_nslabs = nslabs;

@@ -284,6 +289,9 @@ swiotlb_init(int verbose)
unsigned char *vstart;
unsigned long bytes;
  
+	if (swiotlb_force == SWIOTLB_OFF)

+   goto out;
+
if (!io_tlb_nslabs) {
io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
@@ -302,6 +310,7 @@ swiotlb_init(int verbose)
io_tlb_start = 0;
}
pr_warn("Cannot allocate buffer");
+out:
no_iotlb_memory = true;
  }
  




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Florian Fainelli



On 3/18/2021 12:18 PM, Florian Fainelli wrote:
> It may be useful to disable the SWIOTLB completely for testing or when a
> platform is known not to have any DRAM addressing limitations what so
> ever.
> 
> Signed-off-by: Florian Fainelli 

Christoph, in addition to this change, how would you feel if we
qualified the swiotlb_init() in arch/arm/mm/init.c with a:


if (memblock_end_of_DRAM() >= SZ_4G)
swiotlb_init(1)

right now this is made unconditional whenever ARM_LPAE is enabled which
is the case for the platforms I maintain (ARCH_BRCMSTB) however we do
not really need a SWIOTLB so long as the largest DRAM physical address
does not exceed 4GB AFAICT.

Thanks!

> ---
>  Documentation/admin-guide/kernel-parameters.txt | 1 +
>  include/linux/swiotlb.h | 1 +
>  kernel/dma/swiotlb.c| 9 +
>  3 files changed, 11 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 04545725f187..b0223e48921e 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5278,6 +5278,7 @@
>   force -- force using of bounce buffers even if they
>wouldn't be automatically used by the kernel
>   noforce -- Never use bounce buffers (for debugging)
> + off -- Completely disable SWIOTLB
>  
>   switches=   [HW,M68k]
>  
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 5857a937c637..23f86243defe 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -15,6 +15,7 @@ enum swiotlb_force {
>   SWIOTLB_NORMAL, /* Default - depending on HW DMA mask etc. */
>   SWIOTLB_FORCE,  /* swiotlb=force */
>   SWIOTLB_NO_FORCE,   /* swiotlb=noforce */
> + SWIOTLB_OFF,/* swiotlb=off */
>  };
>  
>  /*
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c10e855a03bc..d7a4a789c7d3 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -126,6 +126,8 @@ setup_io_tlb_npages(char *str)
>   } else if (!strcmp(str, "noforce")) {
>   swiotlb_force = SWIOTLB_NO_FORCE;
>   io_tlb_nslabs = 1;
> + } else if (!strcmp(str, "off")) {
> + swiotlb_force = SWIOTLB_OFF;
>   }
>  
>   return 0;
> @@ -229,6 +231,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
> nslabs, int verbose)
>   unsigned long i, bytes;
>   size_t alloc_size;
>  
> + if (swiotlb_force == SWIOTLB_OFF)
> + return 0;
> +
>   bytes = nslabs << IO_TLB_SHIFT;
>  
>   io_tlb_nslabs = nslabs;
> @@ -284,6 +289,9 @@ swiotlb_init(int verbose)
>   unsigned char *vstart;
>   unsigned long bytes;
>  
> + if (swiotlb_force == SWIOTLB_OFF)
> + goto out;
> +
>   if (!io_tlb_nslabs) {
>   io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
>   io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
> @@ -302,6 +310,7 @@ swiotlb_init(int verbose)
>   io_tlb_start = 0;
>   }
>   pr_warn("Cannot allocate buffer");
> +out:
>   no_iotlb_memory = true;
>  }
>  
> 

-- 
Florian
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Florian Fainelli
It may be useful to disable the SWIOTLB completely for testing, or when a
platform is known not to have any DRAM addressing limitations whatsoever.

Signed-off-by: Florian Fainelli 
---
 Documentation/admin-guide/kernel-parameters.txt | 1 +
 include/linux/swiotlb.h | 1 +
 kernel/dma/swiotlb.c| 9 +
 3 files changed, 11 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..b0223e48921e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5278,6 +5278,7 @@
force -- force using of bounce buffers even if they
 wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
+   off -- Completely disable SWIOTLB
 
switches=   [HW,M68k]
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 5857a937c637..23f86243defe 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -15,6 +15,7 @@ enum swiotlb_force {
SWIOTLB_NORMAL, /* Default - depending on HW DMA mask etc. */
SWIOTLB_FORCE,  /* swiotlb=force */
SWIOTLB_NO_FORCE,   /* swiotlb=noforce */
+   SWIOTLB_OFF,/* swiotlb=off */
 };
 
 /*
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c10e855a03bc..d7a4a789c7d3 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -126,6 +126,8 @@ setup_io_tlb_npages(char *str)
} else if (!strcmp(str, "noforce")) {
swiotlb_force = SWIOTLB_NO_FORCE;
io_tlb_nslabs = 1;
+   } else if (!strcmp(str, "off")) {
+   swiotlb_force = SWIOTLB_OFF;
}
 
return 0;
@@ -229,6 +231,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
unsigned long i, bytes;
size_t alloc_size;
 
+   if (swiotlb_force == SWIOTLB_OFF)
+   return 0;
+
bytes = nslabs << IO_TLB_SHIFT;
 
io_tlb_nslabs = nslabs;
@@ -284,6 +289,9 @@ swiotlb_init(int verbose)
unsigned char *vstart;
unsigned long bytes;
 
+   if (swiotlb_force == SWIOTLB_OFF)
+   goto out;
+
if (!io_tlb_nslabs) {
io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
@@ -302,6 +310,7 @@ swiotlb_init(int verbose)
io_tlb_start = 0;
}
pr_warn("Cannot allocate buffer");
+out:
no_iotlb_memory = true;
 }
 
-- 
2.25.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/3] iommu/virtio: Enable x86 support

2021-03-18 Thread Michael S. Tsirkin
On Tue, Mar 16, 2021 at 08:16:54PM +0100, Jean-Philippe Brucker wrote:
> With the VIOT support in place, x86 platforms can now use the
> virtio-iommu.
> 
> The arm64 Kconfig selects IOMMU_DMA, while x86 IOMMU drivers select it
> themselves.
> 
> Signed-off-by: Jean-Philippe Brucker 

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/iommu/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 2819b5c8ec30..ccca83ef2f06 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -400,8 +400,9 @@ config HYPERV_IOMMU
>  config VIRTIO_IOMMU
>   tristate "Virtio IOMMU driver"
>   depends on VIRTIO
> - depends on ARM64
> + depends on (ARM64 || X86)
>   select IOMMU_API
> + select IOMMU_DMA if X86

Would it hurt to just select unconditionally? Seems a bit cleaner
...

>   select INTERVAL_TREE
>   select ACPI_VIOT if ACPI
>   help
> -- 
> 2.30.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/3] ACPICA: iASL: Add definitions for the VIOT table

2021-03-18 Thread Auger Eric
Hi Jean,

On 3/16/21 8:16 PM, Jean-Philippe Brucker wrote:
> Just here for reference, don't merge!
> 
> The actual commits will be pulled from the next ACPICA release.
> I squashed the three relevant commits:
> 
> ACPICA commit fc4e33319c1ee08f20f5c44853dd8426643f6dfd
> ACPICA commit 2197e354fb5dcafaddd2016ffeb0620e5bc3d5e2
> ACPICA commit 856a96fdf4b51b2b8da17529df0255e6f51f1b5b
> 
> Link: https://github.com/acpica/acpica/commit/fc4e3331
> Link: https://github.com/acpica/acpica/commit/2197e354
> Link: https://github.com/acpica/acpica/commit/856a96fd
> Signed-off-by: Bob Moore 
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  include/acpi/actbl3.h | 67 +++
>  1 file changed, 67 insertions(+)
> 
> diff --git a/include/acpi/actbl3.h b/include/acpi/actbl3.h
> index df5f4b27f3aa..09d15898e9a8 100644
> --- a/include/acpi/actbl3.h
> +++ b/include/acpi/actbl3.h
> @@ -33,6 +33,7 @@
>  #define ACPI_SIG_TCPA   "TCPA"   /* Trusted Computing Platform 
> Alliance table */
>  #define ACPI_SIG_TPM2   "TPM2"   /* Trusted Platform Module 2.0 
> H/W interface table */
>  #define ACPI_SIG_UEFI   "UEFI"   /* Uefi Boot Optimization Table 
> */
> +#define ACPI_SIG_VIOT   "VIOT"   /* Virtual I/O Translation 
> Table */
>  #define ACPI_SIG_WAET   "WAET"   /* Windows ACPI Emulated 
> devices Table */
>  #define ACPI_SIG_WDAT   "WDAT"   /* Watchdog Action Table */
>  #define ACPI_SIG_WDDT   "WDDT"   /* Watchdog Timer Description 
> Table */
> @@ -483,6 +484,72 @@ struct acpi_table_uefi {
>   u16 data_offset;/* Offset of remaining data in table */
>  };
>  
> +/***
> + *
> + * VIOT - Virtual I/O Translation Table
> + *Version 1
For other tables I see a "Conforms to ../.." line. Shouldn't we have such a
section too?
> + *
> + 
> **/
> +
> +struct acpi_table_viot {
> + struct acpi_table_header header;/* Common ACPI table header */
> + u16 node_count;
> + u16 node_offset;
> + u8 reserved[8];
> +};
> +
> +/* VIOT subtable header */
> +
> +struct acpi_viot_header {
> + u8 type;
> + u8 reserved;
> + u16 length;
> +};
> +
> +/* Values for Type field above */
> +
> +enum acpi_viot_node_type {
> + ACPI_VIOT_NODE_PCI_RANGE = 0x01,
> + ACPI_VIOT_NODE_MMIO = 0x02,
> + ACPI_VIOT_NODE_VIRTIO_IOMMU_PCI = 0x03,
> + ACPI_VIOT_NODE_VIRTIO_IOMMU_MMIO = 0x04,
> + ACPI_VIOT_RESERVED = 0x05
> +};
> +
> +/* VIOT subtables */
> +
> +struct acpi_viot_pci_range {
> + struct acpi_viot_header header;
> + u32 endpoint_start;
> + u16 segment_start;
> + u16 segment_end;
> + u16 bdf_start;
> + u16 bdf_end;
> + u16 output_node;
> + u8 reserved[6];
> +};
> +
> +struct acpi_viot_mmio {
> + struct acpi_viot_header header;
> + u32 endpoint;
> + u64 base_address;
> + u16 output_node;
> + u8 reserved[6];
> +};
> +
> +struct acpi_viot_virtio_iommu_pci {
> + struct acpi_viot_header header;
> + u16 segment;
> + u16 bdf;
> + u8 reserved[8];
> +};
> +
> +struct acpi_viot_virtio_iommu_mmio {
> + struct acpi_viot_header header;
> + u8 reserved[4];
> + u64 base_address;
> +};
> +
>  
> /***
>   *
>   * WAET - Windows ACPI Emulated devices Table
> 

Besides
Reviewed-by: Eric Auger 

Thanks

Eric

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Nadav Amit


> On Mar 18, 2021, at 2:25 AM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 
> 
> 
>> -Original Message-
>> From: Tian, Kevin [mailto:kevin.t...@intel.com]
>> Sent: Thursday, March 18, 2021 4:56 PM
>> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>> ; Nadav Amit 
>> Cc: chenjiashang ; David Woodhouse
>> ; iommu@lists.linux-foundation.org; LKML
>> ; alex.william...@redhat.com; Gonglei (Arei)
>> ; w...@kernel.org
>> Subject: RE: A problem of Intel IOMMU hardware ?
>> 
>>> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
>>> 
>>> 
 -Original Message-
 From: Tian, Kevin [mailto:kevin.t...@intel.com]
 Sent: Thursday, March 18, 2021 4:27 PM
 To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
 ; Nadav Amit 
 Cc: chenjiashang ; David Woodhouse
 ; iommu@lists.linux-foundation.org; LKML
 ; alex.william...@redhat.com; Gonglei
>>> (Arei)
 ; w...@kernel.org
 Subject: RE: A problem of Intel IOMMU hardware ?
 
> From: iommu  On Behalf
> Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
>> 2. Consider ensuring that the problem is not somehow related to
>> queued invalidations. Try to use __iommu_flush_iotlb() instead
>> of
 qi_flush_iotlb().
>> 
> 
> I tried to force to use __iommu_flush_iotlb(), but maybe something
> wrong, the system crashed, so I prefer to lower the priority of
> this
>>> operation.
> 
 
 The VT-d spec clearly says that register-based invalidation can be
 used only
>>> when
 queued-invalidations are not enabled. Intel-IOMMU driver doesn't
 provide
>>> an
 option to disable queued-invalidation though, when the hardware is
>>> capable. If you
 really want to try, tweak the code in intel_iommu_init_qi.
 
>>> 
>>> Hi Kevin,
>>> 
>>> Thanks to point out this. Do you have any ideas about this problem ? I
>>> tried to descript the problem much clear in my reply to Alex, hope you
>>> could have a look if you're interested.
>>> 
>> 
>> btw I saw you used 4.18 kernel in this test. What about latest kernel?
>> 
> 
> Not test yet. It's hard to upgrade kernel in our environment.
> 
>> Also one way to separate sw/hw bug is to trace the low level interface (e.g.,
>> qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU
>> hardware. Check the window between b) and c) and see whether the software 
>> does
>> the right thing as expected there.
>> 
> 
> We add some log in iommu driver these days, the software seems fine. But we
> didn't look inside the qi_submit_sync yet, I'll try it tonight.

So here is my guess:

Intel probably used as a basis for the IOTLB an implementation of
some other (regular) TLB design.

Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”):

"Software wishing to prevent this uncertainty should not write to
a paging-structure entry in a way that would change, for any linear
address, both the page size and either the page frame, access rights,
or other attributes.”


Now the aforementioned uncertainty is a bit different (multiple
*valid* translations of a single address). Yet, perhaps this is
yet another thing that might happen.

From a brief look at the handling of MMU (not IOMMU) hugepages
in Linux, indeed the PMD is first cleared and flushed before a
new valid PMD is set. This is possible for MMUs since they
allow the software to handle spurious page-faults gracefully.
This is not the case for the IOMMU though (without PRI).

Not sure this explains everything though. If that is the problem,
then during a mapping that changes page-sizes, a TLB flush is
needed, similarly to the one Longpeng did manually.
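
Condensed, the CPU-side ordering referred to above is roughly (an
illustrative fragment, not a quote of actual kernel code):

	old_pmd = pmdp_huge_clear_flush(vma, addr, pmdp); /* clear + TLB flush first */
	/* ... prepare the replacement mapping ... */
	set_pmd_at(mm, addr, pmdp, new_pmd);              /* install it only afterwards */

An equivalent IOTLB flush between tearing down the old page-size mapping
and installing the new one would be the analogous step on the IOMMU side.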




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

[PATCH 3/3] swiotlb: remove swiotlb_nr_tbl

2021-03-18 Thread Christoph Hellwig
All callers just use it to check if swiotlb is active at all, for which
they can just use is_swiotlb_active.  In the longer run drivers need
to stop using is_swiotlb_active as well, but let's do the simple step
first.

Signed-off-by: Christoph Hellwig 
---
 drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +-
 drivers/pci/xen-pcifront.c   | 2 +-
 include/linux/swiotlb.h  | 1 -
 kernel/dma/swiotlb.c | 7 +--
 5 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c 
b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index ad22f42541bda6..a9d65fc8aa0eab 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct 
drm_i915_gem_object *obj)
 
max_order = MAX_ORDER;
 #ifdef CONFIG_SWIOTLB
-   if (swiotlb_nr_tbl()) {
+   if (is_swiotlb_active()) {
unsigned int max_segment;
 
max_segment = swiotlb_max_segment();
diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c 
b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index a37bc3d7b38b3b..9662522aa0664a 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
}
 
 #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
-   need_swiotlb = !!swiotlb_nr_tbl();
+   need_swiotlb = is_swiotlb_active();
 #endif
 
ret = ttm_bo_device_init(&drm->ttm.bdev, &nouveau_bo_driver,
diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index 2d75026482197d..b7a8f3a1921f83 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct 
pcifront_device *pdev)
 
spin_unlock(&pcifront_dev_lock);
 
-   if (!err && !swiotlb_nr_tbl()) {
+   if (!err && !is_swiotlb_active()) {
err = pci_xen_swiotlb_init_late();
if (err)
dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 63f7a63f61d098..216854a5e5134b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -37,7 +37,6 @@ enum swiotlb_force {
 
 extern void swiotlb_init(int verbose);
 int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
-extern unsigned long swiotlb_nr_tbl(void);
 unsigned long swiotlb_size_or_default(void);
 extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
 extern int swiotlb_late_init_with_default_size(size_t default_size);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 13de669a9b4681..539c76beb52e07 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -94,12 +94,6 @@ setup_io_tlb_npages(char *str)
 }
 early_param("swiotlb", setup_io_tlb_npages);
 
-unsigned long swiotlb_nr_tbl(void)
-{
-   return io_tlb_default_mem ? io_tlb_default_mem->nslabs : 0;
-}
-EXPORT_SYMBOL_GPL(swiotlb_nr_tbl);
-
 unsigned int swiotlb_max_segment(void)
 {
return io_tlb_default_mem ? max_segment : 0;
@@ -652,6 +646,7 @@ bool is_swiotlb_active(void)
 {
return io_tlb_default_mem != NULL;
 }
+EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
 #ifdef CONFIG_DEBUG_FS
 
-- 
2.30.1

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 2/3] swiotlb: dynamically allocate io_tlb_default_mem

2021-03-18 Thread Christoph Hellwig
Instead of allocating ->list and ->orig_addr separately, just do one
dynamic allocation for the actual io_tlb_mem structure.  This simplifies
a lot of the initialization code, and also allows just checking
io_tlb_default_mem to see if swiotlb is in use.
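
The single-allocation pattern being referred to is roughly the following
(a sketch, not a quote from the patch below):

	/* One allocation covers the io_tlb_mem header plus its nslabs
	 * trailing slot entries (flexible array member). */
	alloc_size = PAGE_ALIGN(struct_size(mem, slots, nslabs));
	mem = memblock_alloc(alloc_size, PAGE_SIZE);
	if (!mem)
		panic("%s: Failed to allocate %zu bytes\n", __func__, alloc_size);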

Signed-off-by: Christoph Hellwig 
---
 drivers/xen/swiotlb-xen.c |  22 +--
 include/linux/swiotlb.h   |  18 ++-
 kernel/dma/swiotlb.c  | 306 --
 3 files changed, 117 insertions(+), 229 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 5329ad54a5f34e..4c89afc0df6289 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -158,17 +158,14 @@ static const char *xen_swiotlb_error(enum xen_swiotlb_err 
err)
 int __ref xen_swiotlb_init(void)
 {
enum xen_swiotlb_err m_ret = XEN_SWIOTLB_UNKNOWN;
-   unsigned long nslabs, bytes, order;
-   unsigned int repeat = 3;
+   unsigned long bytes = swiotlb_size_or_default();
+   unsigned long nslabs = bytes >> IO_TLB_SHIFT;
+   unsigned int order, repeat = 3;
int rc = -ENOMEM;
char *start;
 
-   nslabs = swiotlb_nr_tbl();
-   if (!nslabs)
-   nslabs = DEFAULT_NSLABS;
 retry:
m_ret = XEN_SWIOTLB_ENOMEM;
-   bytes = nslabs << IO_TLB_SHIFT;
order = get_order(bytes);
 
/*
@@ -221,19 +218,16 @@ int __ref xen_swiotlb_init(void)
 #ifdef CONFIG_X86
 void __init xen_swiotlb_init_early(void)
 {
-   unsigned long nslabs, bytes;
+   unsigned long bytes = swiotlb_size_or_default();
+   unsigned long nslabs = bytes >> IO_TLB_SHIFT;
unsigned int repeat = 3;
char *start;
int rc;
 
-   nslabs = swiotlb_nr_tbl();
-   if (!nslabs)
-   nslabs = DEFAULT_NSLABS;
 retry:
/*
 * Get IO TLB memory from any location.
 */
-   bytes = nslabs << IO_TLB_SHIFT;
start = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
if (!start)
panic("%s: Failed to allocate %lu bytes align=0x%lx\n",
@@ -248,8 +242,8 @@ void __init xen_swiotlb_init_early(void)
if (repeat--) {
/* Min is 2MB */
nslabs = max(1024UL, (nslabs >> 1));
-   pr_info("Lowering to %luMB\n",
-   (nslabs << IO_TLB_SHIFT) >> 20);
+   bytes = nslabs << IO_TLB_SHIFT;
+   pr_info("Lowering to %luMB\n", bytes >> 20);
goto retry;
}
panic("%s (rc:%d)", xen_swiotlb_error(XEN_SWIOTLB_EFIXUP), rc);
@@ -548,7 +542,7 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct 
scatterlist *sgl,
 static int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-   return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask;
+   return xen_phys_to_dma(hwdev, io_tlb_default_mem->end - 1) <= mask;
 }
 
 const struct dma_map_ops xen_swiotlb_dma_ops = {
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 5ec5378b17c333..63f7a63f61d098 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -90,28 +90,30 @@ struct io_tlb_mem {
phys_addr_t end;
unsigned long nslabs;
unsigned long used;
-   unsigned int *list;
unsigned int index;
-   phys_addr_t *orig_addr;
-   size_t *alloc_size;
spinlock_t lock;
struct dentry *debugfs;
bool late_alloc;
+   struct io_tlb_slot {
+   phys_addr_t orig_addr;
+   size_t alloc_size;
+   unsigned int list;
+   } slots[];
 };
-extern struct io_tlb_mem io_tlb_default_mem;
+extern struct io_tlb_mem *io_tlb_default_mem;
 
 static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 {
-   struct io_tlb_mem *mem = &io_tlb_default_mem;
+   struct io_tlb_mem *mem = io_tlb_default_mem;
 
-   return paddr >= mem->start && paddr < mem->end;
+   return mem && paddr >= mem->start && paddr < mem->end;
 }
 
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(void);
-void __init swiotlb_adjust_size(unsigned long new_size);
+void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
 static inline bool is_swiotlb_buffer(phys_addr_t paddr)
@@ -135,7 +137,7 @@ static inline bool is_swiotlb_active(void)
return false;
 }
 
-static inline void swiotlb_adjust_size(unsigned long new_size)
+static inline void swiotlb_adjust_size(unsigned long size)
 {
 }
 #endif /* CONFIG_SWIOTLB */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index d9c097f0f78cec..13de669a9b4681 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -63,7 +63,7 @@
 
 enum swiotlb_force swiotlb_force;
 
-struct io_tlb_mem io_tlb_default_mem;
+struct io_tlb_mem *io_tlb_default_mem;
 
 /*
  * Max segment that we can

[PATCH 1/3] swiotlb: move global variables into a new io_tlb_mem structure

2021-03-18 Thread Christoph Hellwig
From: Claire Chang 

Added a new struct, io_tlb_mem, as the IO TLB memory pool descriptor and
moved relevant global variables into that struct.
This will be useful later to allow for restricted DMA pool.

Signed-off-by: Claire Chang 
[hch: rebased]
Signed-off-by: Christoph Hellwig 
---
 drivers/xen/swiotlb-xen.c |   2 +-
 include/linux/swiotlb.h   |  43 -
 kernel/dma/swiotlb.c  | 354 ++
 3 files changed, 206 insertions(+), 193 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 4ecfce2c6f7263..5329ad54a5f34e 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -548,7 +548,7 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct 
scatterlist *sgl,
 static int
 xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
 {
-   return xen_phys_to_dma(hwdev, io_tlb_end - 1) <= mask;
+   return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask;
 }
 
 const struct dma_map_ops xen_swiotlb_dma_ops = {
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 0696bdc8072e97..5ec5378b17c333 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct device;
 struct page;
@@ -61,11 +62,49 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys,
 
 #ifdef CONFIG_SWIOTLB
 extern enum swiotlb_force swiotlb_force;
-extern phys_addr_t io_tlb_start, io_tlb_end;
+
+/**
+ * struct io_tlb_mem - IO TLB Memory Pool Descriptor
+ *
+ * @start: The start address of the swiotlb memory pool. Used to do a quick
+ * range check to see if the memory was in fact allocated by this
+ * API.
+ * @end:   The end address of the swiotlb memory pool. Used to do a quick
+ * range check to see if the memory was in fact allocated by this
+ * API.
+ * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and
+ * @end. This is command line adjustable via setup_io_tlb_npages.
+ * @used:  The number of used IO TLB block.
+ * @list:  The free list describing the number of free entries available
+ * from each index.
+ * @index: The index to start searching in the next round.
+ * @orig_addr: The original address corresponding to a mapped entry.
+ * @alloc_size:Size of the allocated buffer.
+ * @lock:  The lock to protect the above data structures in the map and
+ * unmap calls.
+ * @debugfs:   The dentry to debugfs.
+ * @late_alloc:%true if allocated using the page allocator
+ */
+struct io_tlb_mem {
+   phys_addr_t start;
+   phys_addr_t end;
+   unsigned long nslabs;
+   unsigned long used;
+   unsigned int *list;
+   unsigned int index;
+   phys_addr_t *orig_addr;
+   size_t *alloc_size;
+   spinlock_t lock;
+   struct dentry *debugfs;
+   bool late_alloc;
+};
+extern struct io_tlb_mem io_tlb_default_mem;
 
 static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 {
-   return paddr >= io_tlb_start && paddr < io_tlb_end;
+   struct io_tlb_mem *mem = &io_tlb_default_mem;
+
+   return paddr >= mem->start && paddr < mem->end;
 }
 
 void __init swiotlb_exit(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 35e24f0ff8b207..d9c097f0f78cec 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -59,32 +59,11 @@
  */
 #define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT)
 
-enum swiotlb_force swiotlb_force;
-
-/*
- * Used to do a quick range check in swiotlb_tbl_unmap_single and
- * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by 
this
- * API.
- */
-phys_addr_t io_tlb_start, io_tlb_end;
-
-/*
- * The number of IO TLB blocks (in groups of 64) between io_tlb_start and
- * io_tlb_end.  This is command line adjustable via setup_io_tlb_npages.
- */
-static unsigned long io_tlb_nslabs;
+#define INVALID_PHYS_ADDR (~(phys_addr_t)0)
 
-/*
- * The number of used IO TLB block
- */
-static unsigned long io_tlb_used;
+enum swiotlb_force swiotlb_force;
 
-/*
- * This is a free list describing the number of free entries available from
- * each index
- */
-static unsigned int *io_tlb_list;
-static unsigned int io_tlb_index;
+struct io_tlb_mem io_tlb_default_mem;
 
 /*
  * Max segment that we can provide which (if pages are contingous) will
@@ -92,32 +71,15 @@ static unsigned int io_tlb_index;
  */
 static unsigned int max_segment;
 
-/*
- * We need to save away the original address corresponding to a mapped entry
- * for the sync operations.
- */
-#define INVALID_PHYS_ADDR (~(phys_addr_t)0)
-static phys_addr_t *io_tlb_orig_addr;
-
-/*
- * The mapped buffer's size should be validated during a sync operation.
- */
-static size_t *io_tlb_alloc_size;
-
-/*
- * Protect the above data structures in the map and unmap calls
- */
-static DEFINE_SPINLOCK(io_tlb_lock);
-
-static int late_alloc;
-
 static int __init
 setup_io_tl

swiotlb cleanups v3

2021-03-18 Thread Christoph Hellwig
Hi Konrad,

this series contains a bunch of swiotlb cleanups, mostly to reduce the
amount of internals exposed to code outside of swiotlb.c, which should
help to prepare for supporting multiple different bounce buffer pools.

Changes since v2:
 - fix a bisection hazard that did not allocate the alloc_size array
 - dropped all patches already merged

Changes since v1:
 - rebased to v5.12-rc1
 - a few more cleanups
 - merge and forward port the patch from Claire to move all the global
   variables into a struct to prepare for multiple instances
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/2] iommu/iova: Improve restart logic

2021-03-18 Thread John Garry


Well yeah, in your particular case you're allocating from a heavily 
over-contended address space, so much of the time it is genuinely full. 
Plus you're primarily churning one or two sizes of IOVA, so there's a 
high chance that you will either allocate immediately from the cached 
node (after a previous free), or search the whole space and fail. In 
case it was missed, searching only some arbitrary subset of the space 
before giving up is not a good behaviour for an allocator to have in 
general.


So since the retry means that we search through the complete pfn 
range most of the time (due to poor success rate), we should be able 
to do a better job at maintaining an accurate max alloc size, by 
calculating it from the range search, and not relying on max alloc 
failed or resetting it frequently. Hopefully that would mean that 
we're smarter about not trying the allocation.


So I tried that out and we seem to be able to scrape back an 
appreciable amount of performance. Maybe 80% of original, with 
another change, below.


TBH if you really want to make allocation more efficient I think there 
are more radical changes that would be worth experimenting with, like 
using some form of augmented rbtree to also encode the amount of free 
space under each branch, or representing the free space in its own 
parallel tree, or whether some other structure entirely might be a 
better bet these days.


And if you just want to make your thing acceptably fast, now I'm going 
to say stick a quirk somewhere to force the "forcedac" option on your 
platform ;)




Easier said than done :)

But still, I'd like to just be able to cache all IOVA sizes for my DMA 
engine, so we should not have to go near the RB tree often.


I have put together a series to allow the upper limit of the rcache range to 
be increased per domain. So naturally that gives better performance than we 
originally had.


I don't want to prejudice the solution by saying what I think of it now, 
so will send it out...




[...]
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,

  if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
  high_pfn = limit_pfn;
  low_pfn = retry_pfn;
-    curr = &iovad->anchor.node;
+    curr = iova_find_limit(iovad, limit_pfn);



I see that it is now applied. However, alternatively could we just add 
a zero-length 32b boundary marker node for the 32b pfn restart point?


That would need special cases all over the place to prevent the marker 
getting merged into reservations or hit by lookups, and at worst break 
the ordering of the tree if a legitimate node straddles the boundary. I 
did consider having the insert/delete routines keep track of yet another 
cached node for whatever's currently the first thing above the 32-bit 
boundary, but I was worried that might be a bit too invasive.


Yeah, I did think of that. I don't think that it would have too much 
overhead.




FWIW I'm currently planning to come back to this again when I have a bit 
more time, since the optimum thing to do (modulo replacing the entire 
algorithm...) is actually to make the second part of the search 
*upwards* from the cached node to the limit. Furthermore, to revive my 
arch/arm conversion I think we're realistically going to need a 
compatibility option for bottom-up allocation to avoid too many nasty 
surprises, so I'd like to generalise things to tackle both concerns at 
once.




Thanks,
John

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [RFC PATCH 7/7] iommu/amd: Add support for using AMD IOMMU v2 page table for DMA-API

2021-03-18 Thread Joerg Roedel
On Fri, Mar 12, 2021 at 03:04:11AM -0600, Suravee Suthikulpanit wrote:
> Introduce init function for setting up DMA domain for DMA-API with
> the IOMMU v2 page table.
> 
> Signed-off-by: Suravee Suthikulpanit 
> ---
>  drivers/iommu/amd/iommu.c | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index e29ece6e1e68..bd26de8764bd 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -1937,6 +1937,24 @@ static int protection_domain_init_v1(struct 
> protection_domain *domain, int mode)
>   return 0;
>  }
>  
> +static int protection_domain_init_v2(struct protection_domain *domain)
> +{
> + spin_lock_init(&domain->lock);
> + domain->id = domain_id_alloc();
> + if (!domain->id)
> + return -ENOMEM;
> + INIT_LIST_HEAD(&domain->dev_list);
> +
> + domain->giov = true;
> +
> + if (amd_iommu_pgtable == AMD_IOMMU_V2 &&
> + domain_enable_v2(domain, 1, false)) {
> + return -ENOMEM;
> + }
> +
> + return 0;
> +}
> +

You also need to advertise a different aperture size and a different
pgsize-bitmap. The host page-table can only map a 48 bit address space,
not a 64 bit one like with v1 page-tables.
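
For illustration, that could look roughly like the sketch below; the v2
constant and its exact value are assumptions, only the iommu_domain
geometry/pgsize_bitmap fields are real:

	/* Sketch: v2 tables follow the host format, so cover a 48-bit IOVA space
	 * and the CPU page sizes (4K/2M/1G) instead of the v1 values. */
	#define AMD_IOMMU_PGSIZES_V2	(PAGE_SIZE | (1UL << 21) | (1UL << 30))

		domain->domain.pgsize_bitmap           = AMD_IOMMU_PGSIZES_V2;
		domain->domain.geometry.aperture_start = 0;
		domain->domain.geometry.aperture_end   = DMA_BIT_MASK(48);
		domain->domain.geometry.force_aperture = true;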

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 6/7] iommu/amd: Introduce amd_iommu_pgtable command-line option

2021-03-18 Thread Joerg Roedel
On Fri, Mar 12, 2021 at 03:04:10AM -0600, Suravee Suthikulpanit wrote:
> To allow specifying whether to use the v1 or v2 IOMMU page table for
> DMA remapping when calling the kernel DMA-API.
> 
> Signed-off-by: Suravee Suthikulpanit 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  6 ++
>  drivers/iommu/amd/init.c| 15 +++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index 04545725f187..466e807369ea 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -319,6 +319,12 @@
>This mode requires kvm-amd.avic=1.
>(Default when IOMMU HW support is present.)
>  
> + amd_iommu_pgtable= [HW,X86-64]
> + Specifies one of the following AMD IOMMU page table to
> + be used for DMA remapping for DMA-API:
> + v1 - Use v1 page table (Default)
> + v2 - Use v2 page table

Any reason v2 can not be the default when it is supported by the IOMMU?

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 5/7] iommu/amd: Add support for Guest IO protection

2021-03-18 Thread Joerg Roedel
On Fri, Mar 12, 2021 at 03:04:09AM -0600, Suravee Suthikulpanit wrote:
> @@ -519,6 +521,7 @@ struct protection_domain {
>   spinlock_t lock;/* mostly used to lock the page table*/
>   u16 id; /* the domain id written to the device table */
>   int glx;/* Number of levels for GCR3 table */
> + bool giov;  /* guest IO protection domain */

Could this be turned into a flag?
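
As a sketch of what that could look like (the mask name and bit position are
made up for illustration, assuming the existing protection_domain flags word):

	#define PD_GIOV_MASK	BIT(4)	/* bit position is an assumption */

		/* instead of "domain->giov = true;" */
		domain->flags |= PD_GIOV_MASK;
		/* ...and checks become "if (domain->flags & PD_GIOV_MASK)" */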

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [RFC PATCH 4/7] iommu/amd: Initial support for AMD IOMMU v2 page table

2021-03-18 Thread Joerg Roedel
Hi Suravee,

On Fri, Mar 12, 2021 at 03:04:08AM -0600, Suravee Suthikulpanit wrote:
> @@ -503,6 +504,7 @@ struct amd_io_pgtable {
>   int mode;
>   u64 *root;
>   atomic64_t  pt_root;/* pgtable root and pgtable mode */
> + struct mm_structv2_mm;
>  };

A whole mm_struct is a bit too much when all we really need is an 8-byte
page-table root pointer.


> +static pte_t *fetch_pte(struct amd_io_pgtable *pgtable,
> +   unsigned long iova,
> +   unsigned long *page_size)
> +{
> + int level;
> + pte_t *ptep;
> +
> + ptep = lookup_address_in_mm(&pgtable->v2_mm, iova, &level);
> + if (!ptep || pte_none(*ptep) || (level == PG_LEVEL_NONE))
> + return NULL;

So you are re-using the in-kernel page-table building code. That saves 
some lines of code, but has several problems:

1) When you boot a kernel with this code on a machine with
   5-level paging, the IOMMU code will build a 5-level
   page-table too, breaking IOMMU mappings.

2) You need a whole mm_struct per domain, which is big.

3) The existing macros for CPU page-tables require locking. For
   IOMMU page-tables this is not really necessary and might
   cause scalability issues.

Overall I think you should write your own code to build a 4-level
page-table and use cmpxchg64 to avoid the need for locking. Then things
will not break when such a kernel is suddenly booted on a machine which
has 5-level paging enabled.
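
A rough sketch of that direction (illustrative only; it borrows the v1
helper and bit names, handles a single level, and skips the huge-page and
permission details a real v2 walker in the host PTE format would need):

	static u64 *v2_pte_offset(u64 *table, unsigned long iova, int level, gfp_t gfp)
	{
		u64 *pte = &table[PM_LEVEL_INDEX(level, iova)];
		u64 old;

		if (!IOMMU_PTE_PRESENT(*pte)) {
			u64 *page = (u64 *)get_zeroed_page(gfp);

			if (!page)
				return NULL;
			/* publish the new table atomically instead of taking a lock */
			old = cmpxchg64(pte, 0ULL, virt_to_phys(page) | IOMMU_PTE_PR);
			if (old != 0ULL)	/* lost the race, use the winner's table */
				free_page((unsigned long)page);
		}
		return (u64 *)__va(*pte & PAGE_MASK);
	}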

Regards,

Joerg

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/2] iommu/iova: Improve restart logic

2021-03-18 Thread Robin Murphy

On 2021-03-18 11:38, John Garry wrote:

On 10/03/2021 17:47, John Garry wrote:

On 09/03/2021 15:55, John Garry wrote:

On 05/03/2021 16:35, Robin Murphy wrote:

Hi Robin,


When restarting after searching below the cached node fails, resetting
the start point to the anchor node is often overly pessimistic. If
allocations are made with mixed limits - particularly in the case of 
the

opportunistic 32-bit allocation for PCI devices - this could mean
significant time wasted walking through the whole populated upper range
just to reach the initial limit. 


Right


We can improve on that by implementing
a proper tree traversal to find the first node above the relevant 
limit,

and set the exact start point.


Thanks for this. However, as mentioned in the other thread, this does 
not help our performance regression Re: commit 4e89dce72521.


And mentioning this "retry" approach again, in case it was missed, 
from my experiment on the affected HW, it also has generally a 
dreadfully low success rate - less than 1% for the 32b range retry. 
Retry rate is about 20%. So I am saying from this 20%, less than 1% 
of those succeed.


Well yeah, in your particular case you're allocating from a heavily 
over-contended address space, so much of the time it is genuinely full. 
Plus you're primarily churning one or two sizes of IOVA, so there's a 
high chance that you will either allocate immediately from the cached 
node (after a previous free), or search the whole space and fail. In 
case it was missed, searching only some arbitrary subset of the space 
before giving up is not a good behaviour for an allocator to have in 
general.


So since the retry means that we search through the complete pfn range 
most of the time (due to poor success rate), we should be able to do a 
better job at maintaining an accurate max alloc size, by calculating 
it from the range search, and not relying on max alloc failed or 
resetting it frequently. Hopefully that would mean that we're smarter 
about not trying the allocation.


So I tried that out and we seem to be able to scrape back an appreciable 
amount of performance. Maybe 80% of original, with another change, 
below.


TBH if you really want to make allocation more efficient I think there 
are more radical changes that would be worth experimenting with, like 
using some form of augmented rbtree to also encode the amount of free 
space under each branch, or representing the free space in its own 
parallel tree, or whether some other structure entirely might be a 
better bet these days.
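
For instance (purely a sketch of the augmented-tree idea, not a proposal):

	/* Each node also caches the largest free gap in its subtree, so a search
	 * can skip any subtree that cannot possibly satisfy the allocation. */
	struct iova_aug {
		struct rb_node	node;
		unsigned long	pfn_lo;
		unsigned long	pfn_hi;
		unsigned long	max_gap;	/* kept up to date via RB_DECLARE_CALLBACKS_MAX */
	};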


And if you just want to make your thing acceptably fast, now I'm going 
to say stick a quirk somewhere to force the "forcedac" option on your 
platform ;)


[...]
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,

  if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
  high_pfn = limit_pfn;
  low_pfn = retry_pfn;
-    curr = &iovad->anchor.node;
+    curr = iova_find_limit(iovad, limit_pfn);



I see that it is now applied. However, alternatively could we just add a 
zero-length 32b boundary marker node for the 32b pfn restart point?


That would need special cases all over the place to prevent the marker 
getting merged into reservations or hit by lookups, and at worst break 
the ordering of the tree if a legitimate node straddles the boundary. I 
did consider having the insert/delete routines keep track of yet another 
cached node for whatever's currently the first thing above the 32-bit 
boundary, but I was worried that might be a bit too invasive.


FWIW I'm currently planning to come back to this again when I have a bit 
more time, since the optimum thing to do (modulo replacing the entire 
algorithm...) is actually to make the second part of the search 
*upwards* from the cached node to the limit. Furthermore, to revive my 
arch/arm conversion I think we're realistically going to need a 
compatibility option for bottom-up allocation to avoid too many nasty 
surprises, so I'd like to generalise things to tackle both concerns at once.


Thanks,
Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v4 0/2] VMD MSI Remapping Bypass

2021-03-18 Thread Lorenzo Pieralisi
On Thu, Mar 18, 2021 at 10:07:38AM +0100, j...@8bytes.org wrote:
> On Wed, Mar 17, 2021 at 07:14:17PM +, Derrick, Jonathan wrote:
> > Gentle reminder, for v5.13 ?
> 
> This should go through the PCI tree, Bjorn?

I will start queuing code next week, noted.

Thanks,
Lorenzo
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 3/3] iommu/virtio: Enable x86 support

2021-03-18 Thread Robin Murphy

On 2021-03-16 19:16, Jean-Philippe Brucker wrote:

With the VIOT support in place, x86 platforms can now use the
virtio-iommu.

The arm64 Kconfig selects IOMMU_DMA, while x86 IOMMU drivers select it
themselves.


Actually, now that both AMD and Intel are converted over, maybe it's 
finally time to punt that to x86 arch code to match arm64?


Robin.


Signed-off-by: Jean-Philippe Brucker 
---
  drivers/iommu/Kconfig | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 2819b5c8ec30..ccca83ef2f06 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -400,8 +400,9 @@ config HYPERV_IOMMU
  config VIRTIO_IOMMU
tristate "Virtio IOMMU driver"
depends on VIRTIO
-   depends on ARM64
+   depends on (ARM64 || X86)
select IOMMU_API
+   select IOMMU_DMA if X86
select INTERVAL_TREE
select ACPI_VIOT if ACPI
help


___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 2/2] iommu/iova: Improve restart logic

2021-03-18 Thread John Garry

On 10/03/2021 17:47, John Garry wrote:

On 09/03/2021 15:55, John Garry wrote:

On 05/03/2021 16:35, Robin Murphy wrote:

Hi Robin,


When restarting after searching below the cached node fails, resetting
the start point to the anchor node is often overly pessimistic. If
allocations are made with mixed limits - particularly in the case of the
opportunistic 32-bit allocation for PCI devices - this could mean
significant time wasted walking through the whole populated upper range
just to reach the initial limit. 


Right


We can improve on that by implementing
a proper tree traversal to find the first node above the relevant limit,
and set the exact start point.


Thanks for this. However, as mentioned in the other thread, this does 
not help our performance regression Re: commit 4e89dce72521.


And mentioning this "retry" approach again, in case it was missed, 
from my experiment on the affected HW, it also has generally a 
dreadfully low success rate - less than 1% for the 32b range retry. 
Retry rate is about 20%. So I am saying from this 20%, less than 1% of 
those succeed.




So since the retry means that we search through the complete pfn range 
most of the time (due to poor success rate), we should be able to do a 
better job at maintaining an accurate max alloc size, by calculating it 
from the range search, and not relying on max alloc failed or resetting 
it frequently. Hopefully that would mean that we're smarter about not 
trying the allocation.


So I tried that out and we seem to be able to scrape back an appreciable 
amount of performance. Maybe 80% of original, with another change, 
below.


Thanks,
John








Signed-off-by: Robin Murphy 
---
  drivers/iommu/iova.c | 39 ++-
  1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c28003e1d2ee..471c48dd71e7 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -154,6 +154,43 @@ __cached_rbnode_delete_update(struct iova_domain 
*iovad, struct iova *free)

  iovad->cached_node = rb_next(&free->node);
  }
+static struct rb_node *iova_find_limit(struct iova_domain *iovad, 
unsigned long limit_pfn)

+{
+    struct rb_node *node, *next;
+    /*
+ * Ideally what we'd like to judge here is whether limit_pfn is 
close
+ * enough to the highest-allocated IOVA that starting the 
allocation
+ * walk from the anchor node will be quicker than this initial 
work to
+ * find an exact starting point (especially if that ends up 
being the
+ * anchor node anyway). This is an incredibly crude 
approximation which
+ * only really helps the most likely case, but is at least 
trivially easy.

+ */
+    if (limit_pfn > iovad->dma_32bit_pfn)
+    return &iovad->anchor.node;
+
+    node = iovad->rbroot.rb_node;
+    while (to_iova(node)->pfn_hi < limit_pfn)
+    node = node->rb_right;
+
+search_left:
+    while (node->rb_left && to_iova(node->rb_left)->pfn_lo >= 
limit_pfn)

+    node = node->rb_left;
+
+    if (!node->rb_left)
+    return node;
+
+    next = node->rb_left;
+    while (next->rb_right) {
+    next = next->rb_right;
+    if (to_iova(next)->pfn_lo >= limit_pfn) {
+    node = next;
+    goto search_left;
+    }
+    }
+
+    return node;
+}
+
  /* Insert the iova into domain rbtree by holding writer lock */
  static void
  iova_insert_rbtree(struct rb_root *root, struct iova *iova,
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,

  if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
  high_pfn = limit_pfn;
  low_pfn = retry_pfn;
-    curr = &iovad->anchor.node;
+    curr = iova_find_limit(iovad, limit_pfn);



I see that it is now applied. However, alternatively could we just add a 
zero-length 32b boundary marker node for the 32b pfn restart point?



  curr_iova = to_iova(curr);
  goto retry;
  }



___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
.




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] iommu/dma: Fix a typo in a comment

2021-03-18 Thread Robin Murphy

On 2021-03-18 09:55, chenxiang (M) wrote:

Hi will,


On 2021/3/18 17:34, Will Deacon wrote:

On Thu, Mar 18, 2021 at 11:21:24AM +0800, chenxiang wrote:

From: Xiang Chen 

Fix a typo "SAC" to "DAC" in the comment of function
iommu_dma_alloc_iova().

Signed-off-by: Xiang Chen 
---
  drivers/iommu/dma-iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c8..3bc17ee 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -443,7 +443,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct 
iommu_domain *domain,

  if (domain->geometry.force_aperture)
  dma_limit = min(dma_limit, 
(u64)domain->geometry.aperture_end);

-    /* Try to get PCI devices a SAC address */
+    /* Try to get PCI devices a DAC address */
  if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
  iova = alloc_iova_fast(iovad, iova_len,
 DMA_BIT_MASK(32) >> shift, false);

This doesn't look like a typo to me... Please explain.


As the author of said comment, it is definitely not a typo.

I think it means double-address cycle (DAC), and in LDD3, page 452, there 
is a description of it: "PCI double-address cycle mappings Normally,
the DMA support layer works with 32-bit bus addresses, possibly 
restricted by a specific device's DMA mask. The PCI bus, however, also 
supports a 64-bit addressing mode, the double-address cycle (DAC)."

Please point it out if i am wrong :)


Yes, now look at what the code above does: *if* a PCI device has a DMA 
mask greater than 32 bits (i.e. can support dual address cycles), we 
first try an IOVA allocation with an explicit 32-bit limit (i.e. that 
can be expressed in a single address cycle), in the hope that we don't 
have to fall back to allocating from the upper range and forcing the 
device to actually use DAC (and suffer the potential performance impact).


Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 3/3] iommu/virtio: Enable x86 support

2021-03-18 Thread Joerg Roedel
On Tue, Mar 16, 2021 at 08:16:54PM +0100, Jean-Philippe Brucker wrote:
> With the VIOT support in place, x86 platforms can now use the
> virtio-iommu.
> 
> The arm64 Kconfig selects IOMMU_DMA, while x86 IOMMU drivers select it
> themselves.
> 
> Signed-off-by: Jean-Philippe Brucker 
> ---
>  drivers/iommu/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 2819b5c8ec30..ccca83ef2f06 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -400,8 +400,9 @@ config HYPERV_IOMMU
>  config VIRTIO_IOMMU
>   tristate "Virtio IOMMU driver"
>   depends on VIRTIO
> - depends on ARM64
> + depends on (ARM64 || X86)
>   select IOMMU_API
> + select IOMMU_DMA if X86
>   select INTERVAL_TREE
>   select ACPI_VIOT if ACPI
>   help

Acked-by: Joerg Roedel 
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 0/4] Misc vSVA fixes for VT-d

2021-03-18 Thread Joerg Roedel
On Tue, Mar 02, 2021 at 02:13:56AM -0800, Jacob Pan wrote:
> Hi Baolu et al,
> 
> This is a collection of SVA-related fixes.
> 
> ChangeLog:
> 
> v2:
>   - For guest SVA, call pasid_set_wpe directly w/o checking host CR0.wp
> (Review comments by Kevin T.)
>   - Added fixes tag
> 
> Thanks,
> 
> Jacob
> 
> Jacob Pan (4):
>   iommu/vt-d: Enable write protect for supervisor SVM
>   iommu/vt-d: Enable write protect propagation from guest
>   iommu/vt-d: Reject unsupported page request modes
>   iommu/vt-d: Calculate and set flags for handle_mm_fault
> 
>  drivers/iommu/intel/pasid.c | 29 +
>  drivers/iommu/intel/svm.c   | 21 +
>  include/uapi/linux/iommu.h  |  3 ++-
>  3 files changed, 48 insertions(+), 5 deletions(-)

Applied, thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v1] iommu/tegra-smmu: Make tegra_smmu_probe_device() to handle all IOMMU phandles

2021-03-18 Thread Joerg Roedel
On Fri, Mar 12, 2021 at 06:54:39PM +0300, Dmitry Osipenko wrote:
> The tegra_smmu_probe_device() handles only the first IOMMU device-tree
> phandle, skipping the rest. Devices like 3D module on Tegra30 have
> multiple IOMMU phandles, one for each h/w block, and thus, only one
> IOMMU phandle is added to fwspec for the 3D module, breaking GPU.
> Previously this problem was masked by tegra_smmu_attach_dev() which
> didn't use the fwspec, but parsed the DT by itself. The previous commit
> to tegra-smmu driver partially reverted changes that caused problems for
> T124 and now we have tegra_smmu_attach_dev() that uses the fwspec and
> the old-buggy variant of tegra_smmu_probe_device() which skips secondary
> IOMMUs.
> 
> Make tegra_smmu_probe_device() not to skip the secondary IOMMUs. This
> fixes a partially attached IOMMU of the 3D module on Tegra30 and now GPU
> works properly once again.
> 
> Fixes: 765a9d1d02b2 ("iommu/tegra-smmu: Fix mc errors on tegra124-nyan")
> Signed-off-by: Dmitry Osipenko 
> ---
>  drivers/iommu/tegra-smmu.c | 7 +++
>  1 file changed, 3 insertions(+), 4 deletions(-)

Applied for v5.12, thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/vt-d: Report more information about invalidation errors

2021-03-18 Thread Joerg Roedel
On Thu, Mar 18, 2021 at 08:53:40AM +0800, Lu Baolu wrote:
> When the invalidation queue errors are encountered, dump the information
> logged by the VT-d hardware together with the pending queue invalidation
> descriptors.
> 
> Signed-off-by: Ashok Raj 
> Tested-by: Guo Kaijie 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/dmar.c  | 68 ++---
>  include/linux/intel-iommu.h |  6 
>  2 files changed, 70 insertions(+), 4 deletions(-)

Applied, thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/vt-d: Disable SVM when ATS/PRI/PASID are not enabled in the device

2021-03-18 Thread Joerg Roedel
On Sun, Mar 14, 2021 at 01:15:34PM -0700, Kyung Min Park wrote:
> Currently, the Intel VT-d supports Shared Virtual Memory (SVM) only when
> IO page fault is supported. Otherwise, shared memory pages can not be
> swapped out and need to be pinned. The device needs the Address Translation
> Service (ATS), Page Request Interface (PRI) and Process Address Space
> Identifier (PASID) capabilities to be enabled to support IO page fault.
> 
> Disable SVM when ATS, PRI and PASID are not enabled in the device.
> 
> Signed-off-by: Kyung Min Park 

Applied, thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()

2021-03-18 Thread Joerg Roedel
On Wed, Mar 17, 2021 at 08:58:34AM +0800, Lu Baolu wrote:
> The pasid_lock is used to synchronize different threads from modifying a
> same pasid directory entry at the same time. It causes below lockdep splat.
> 
> [   83.296538] 
> [   83.296538] WARNING: possible irq lock inversion dependency detected
> [   83.296539] 5.12.0-rc3+ #25 Tainted: GW
> [   83.296539] 
> [   83.296540] bash/780 just changed the state of lock:
> [   83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at:
>iommu_flush_dev_iotlb.part.0+0x32/0x110
> [   83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past:
> [   83.296547]  (pasid_lock){+.+.}-{2:2}
> [   83.296548]
> 
>and interrupts could create inverse lock ordering between them.
> 
> [   83.296549] other info that might help us debug this:
> [   83.296549] Chain exists of:
>  device_domain_lock --> &iommu->lock --> pasid_lock
> [   83.296551]  Possible interrupt unsafe locking scenario:
> 
> [   83.296551]CPU0CPU1
> [   83.296552]
> [   83.296552]   lock(pasid_lock);
> [   83.296553]local_irq_disable();
> [   83.296553]lock(device_domain_lock);
> [   83.296554]lock(&iommu->lock);
> [   83.296554]   
> [   83.296554] lock(device_domain_lock);
> [   83.296555]
> *** DEADLOCK ***
> 
> Fix it by replacing the pasid_lock with an atomic exchange operation.
> 
> Reported-and-tested-by: Dave Jiang 
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/intel/pasid.c | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
> index 9fb3d3e80408..1ddcb8295f72 100644
> --- a/drivers/iommu/intel/pasid.c
> +++ b/drivers/iommu/intel/pasid.c
> @@ -24,7 +24,6 @@
>  /*
>   * Intel IOMMU system wide PASID name space:
>   */
> -static DEFINE_SPINLOCK(pasid_lock);
>  u32 intel_pasid_max_id = PASID_MAX;
>  
>  int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid)
> @@ -259,19 +258,18 @@ struct pasid_entry *intel_pasid_get_entry(struct device 
> *dev, u32 pasid)
>   dir_index = pasid >> PASID_PDE_SHIFT;
>   index = pasid & PASID_PTE_MASK;
>  
> - spin_lock(&pasid_lock);
>   entries = get_pasid_table_from_pde(&dir[dir_index]);
>   if (!entries) {
>   entries = alloc_pgtable_page(info->iommu->node);
> - if (!entries) {
> - spin_unlock(&pasid_lock);
> + if (!entries)
>   return NULL;
> - }
>  
> - WRITE_ONCE(dir[dir_index].val,
> -(u64)virt_to_phys(entries) | PASID_PTE_PRESENT);
> + if (cmpxchg64(&dir[dir_index].val, 0ULL,
> +   (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
> + free_pgtable_page(entries);
> + entries = get_pasid_table_from_pde(&dir[dir_index]);

This is racy, someone could have already cleared the pasid-entry again.
What you need to do here is to retry the whole path by adding a goto
to before  the first get_pasid_table_from_pde() call.
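
i.e. roughly (sketch only, reusing the names from the hunk above):

	retry:
		entries = get_pasid_table_from_pde(&dir[dir_index]);
		if (!entries) {
			entries = alloc_pgtable_page(info->iommu->node);
			if (!entries)
				return NULL;

			if (cmpxchg64(&dir[dir_index].val, 0ULL,
				      (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
				free_pgtable_page(entries);
				/* re-read: the entry may have been installed or cleared again */
				goto retry;
			}
		}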

Btw, what makes sure that the pasid_entry does not go away when it is
returned here?

Regards,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/1] iommu/vt-d: Don't set then immediately clear in prq_event_thread()

2021-03-18 Thread Joerg Roedel
Hi Baolu,

On Tue, Mar 09, 2021 at 08:46:41AM +0800, Lu Baolu wrote:
> The private data field of a page group response descriptor is set then
> immediately cleared in prq_event_thread(). Fix this by moving clearing
> code up.
> 
> Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode")
> Cc: Jacob Pan 
> Reviewed-by: Liu Yi L 
> Signed-off-by: Lu Baolu 

Does this fix an actual bug? If so, please state it in the commit
message and also fix the subject line to state what is set/cleared.

Thanks,

Joerg
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 1/2] iommu/iova: Add rbtree entry helper

2021-03-18 Thread Joerg Roedel
On Fri, Mar 05, 2021 at 04:35:22PM +, Robin Murphy wrote:
> Repeating the rb_entry() boilerplate all over the place gets old fast.
> Before adding yet more instances, add a little helper to tidy it up.
> 
> Signed-off-by: Robin Murphy 
> ---
>  drivers/iommu/iova.c | 23 ++-
>  1 file changed, 14 insertions(+), 9 deletions(-)

Applied both, thanks Robin.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH] iommu/dma: Fix a typo in a comment

2021-03-18 Thread chenxiang (M)

Hi will,


On 2021/3/18 17:34, Will Deacon wrote:

On Thu, Mar 18, 2021 at 11:21:24AM +0800, chenxiang wrote:

From: Xiang Chen 

Fix a typo "SAC" to "DAC" in the comment of function
iommu_dma_alloc_iova().

Signed-off-by: Xiang Chen 
---
  drivers/iommu/dma-iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c8..3bc17ee 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -443,7 +443,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
if (domain->geometry.force_aperture)
dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
  
-	/* Try to get PCI devices a SAC address */

+   /* Try to get PCI devices a DAC address */
if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
iova = alloc_iova_fast(iovad, iova_len,
   DMA_BIT_MASK(32) >> shift, false);

This doesn't look like a typo to me... Please explain.


I think it means double-address cycle (DAC), and in LDD3, page 452, there 
is a description of it: "PCI double-address cycle mappings Normally,
the DMA support layer works with 32-bit bus addresses, possibly 
restricted by a specific device's DMA mask. The PCI bus, however, also 
supports a 64-bit addressing mode, the double-address cycle (DAC)."

Please point it out if i am wrong :)

Best Regards,
chenxiang



Will

.




___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] iommu/dma: Resurrect the "forcedac" option

2021-03-18 Thread Joerg Roedel
On Fri, Mar 05, 2021 at 04:32:34PM +, Robin Murphy wrote:
> In converting intel-iommu over to the common IOMMU DMA ops, it quietly
> lost the functionality of its "forcedac" option. Since this is a handy
> thing both for testing and for performance optimisation on certain
> platforms, reimplement it under the common IOMMU parameter namespace.
> 
> For the sake of fixing the inadvertent breakage of the Intel-specific
> parameter, remove the dmar_forcedac remnants and hook it up as an alias
> while documenting the transition to the new common parameter.
> 
> Fixes: c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the iommu 
> ops")
> Signed-off-by: Robin Murphy 
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 15 ---
>  drivers/iommu/dma-iommu.c   | 13 -
>  drivers/iommu/intel/iommu.c |  5 ++---
>  include/linux/dma-iommu.h   |  2 ++
>  4 files changed, 24 insertions(+), 11 deletions(-)

Applied for v5.13, thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 0/2] Add Unisoc IOMMU basic driver

2021-03-18 Thread Joerg Roedel
On Fri, Mar 05, 2021 at 05:32:14PM +0800, Chunyan Zhang wrote:
>  .../devicetree/bindings/iommu/sprd,iommu.yaml |  57 ++
>  drivers/iommu/Kconfig |  12 +
>  drivers/iommu/Makefile|   1 +
>  drivers/iommu/sprd-iommu.c| 577 ++
>  4 files changed, 647 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/iommu/sprd,iommu.yaml
>  create mode 100644 drivers/iommu/sprd-iommu.c

Applied, thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3] iommu: Check dev->iommu in iommu_dev_xxx functions

2021-03-18 Thread Joerg Roedel
On Wed, Mar 03, 2021 at 05:36:11PM +, Shameer Kolothum wrote:
> The device iommu probe/attach might have failed leaving dev->iommu
> to NULL and device drivers may still invoke these functions resulting
> in a crash in iommu vendor driver code.
> 
> Hence make sure we check that.
> 
> Fixes: a3a195929d40 ("iommu: Add APIs for multiple domains per device")
> Signed-off-by: Shameer Kolothum 
> ---
> v2 --> v3
>  -Removed iommu_ops from struct dev_iommu.
> v1 --> v2:
>  -Added iommu_ops to struct dev_iommu based on the discussion with Robin.
>  -Rebased against iommu-tree core branch.
> ---
>  drivers/iommu/iommu.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)

Applied, thanks.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH 0/3] iommu/amd: Fix booting with amd_iommu=off

2021-03-18 Thread Joerg Roedel
On Wed, Mar 17, 2021 at 06:48:50PM +0800, Huang Rui wrote:
> Series are Acked-by: Huang Rui 

Thanks, series is applied for v5.12
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


RE: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap

2021-03-18 Thread Song Bao Hua (Barry Song)



> -Original Message-
> From: chenxiang (M)
> Sent: Thursday, March 18, 2021 10:30 PM
> To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; Song Bao Hua
> (Barry Song) 
> Cc: iommu@lists.linux-foundation.org; linux...@openeuler.org;
> linux-kselft...@vger.kernel.org; chenxiang (M) 
> Subject: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap
> 
> From: Xiang Chen 
> 
> Currently it only supports mapping/unmapping one page at a time for the
> dma-map benchmark, but there are other scenarios which need multi-page
> map/unmap: for multi-page interfaces such as dma_alloc_coherent() and
> dma_map_sg(), the time spent on a multi-page map/unmap is not simply the
> single-page time * npages (it is not linear), as block descriptors may be
> used instead of page descriptors when the size allows it (such as 2M/1G),
> and a single TLB invalidation command can invalidate multiple pages at
> once instead of one at a time when RIL is enabled (which shortens the
> unmap time). So it is necessary to add support for multi-page map/unmap.
> 
> Add a parameter "-g" to support multi-page map/unmap.
> 
> Signed-off-by: Xiang Chen 
> ---

Acked-by: Barry Song 

>  kernel/dma/map_benchmark.c  | 21 ++---
>  tools/testing/selftests/dma/dma_map_benchmark.c | 20 
>  2 files changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
> index e0e64f8..a5c1b01 100644
> --- a/kernel/dma/map_benchmark.c
> +++ b/kernel/dma/map_benchmark.c
> @@ -38,7 +38,8 @@ struct map_benchmark {
>   __u32 dma_bits; /* DMA addressing capability */
>   __u32 dma_dir; /* DMA data direction */
>   __u32 dma_trans_ns; /* time for DMA transmission in ns */
> - __u8 expansion[80]; /* For future use */
> + __u32 granule;  /* how many PAGE_SIZE will do map/unmap once a time */
> + __u8 expansion[76]; /* For future use */
>  };
> 
>  struct map_benchmark_data {
> @@ -58,9 +59,11 @@ static int map_benchmark_thread(void *data)
>   void *buf;
>   dma_addr_t dma_addr;
>   struct map_benchmark_data *map = data;
> + int npages = map->bparam.granule;
> + u64 size = npages * PAGE_SIZE;
>   int ret = 0;
> 
> - buf = (void *)__get_free_page(GFP_KERNEL);
> + buf = alloc_pages_exact(size, GFP_KERNEL);
>   if (!buf)
>   return -ENOMEM;
> 
> @@ -76,10 +79,10 @@ static int map_benchmark_thread(void *data)
>* 66 means evertything goes well! 66 is lucky.
>*/
>   if (map->dir != DMA_FROM_DEVICE)
> - memset(buf, 0x66, PAGE_SIZE);
> + memset(buf, 0x66, size);
> 
>   map_stime = ktime_get();
> - dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir);
> + dma_addr = dma_map_single(map->dev, buf, size, map->dir);
>   if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
>   pr_err("dma_map_single failed on %s\n",
>   dev_name(map->dev));
> @@ -93,7 +96,7 @@ static int map_benchmark_thread(void *data)
>   ndelay(map->bparam.dma_trans_ns);
> 
>   unmap_stime = ktime_get();
> - dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
> + dma_unmap_single(map->dev, dma_addr, size, map->dir);
>   unmap_etime = ktime_get();
>   unmap_delta = ktime_sub(unmap_etime, unmap_stime);
> 
> @@ -112,7 +115,7 @@ static int map_benchmark_thread(void *data)
>   }
> 
>  out:
> - free_page((unsigned long)buf);
> + free_pages_exact(buf, size);
>   return ret;
>  }
> 
> @@ -203,7 +206,6 @@ static long map_benchmark_ioctl(struct file *file, 
> unsigned
> int cmd,
>   struct map_benchmark_data *map = file->private_data;
>   void __user *argp = (void __user *)arg;
>   u64 old_dma_mask;
> -
>   int ret;
> 
>   if (copy_from_user(&map->bparam, argp, sizeof(map->bparam)))
> @@ -234,6 +236,11 @@ static long map_benchmark_ioctl(struct file *file, 
> unsigned
> int cmd,
>   return -EINVAL;
>   }
> 
> + if (map->bparam.granule < 1 || map->bparam.granule > 1024) {
> + pr_err("invalid granule size\n");
> + return -EINVAL;
> + }
> +
>   switch (map->bparam.dma_dir) {
>   case DMA_MAP_BIDIRECTIONAL:
>   map->dir = DMA_BIDIRECTIONAL;
> diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c
> b/tools/testing/selftests/dma/dma_map_benchmark.c
> index fb23ce9..6f2caa6 100644
> --- a/tools/testing/selftests/dma/dma_map_benchmark.c
> +++ b/tools/testing/selftests/dma/dma_map_benchmark.c
> @@ -40,7 +40,8 @@ struct map_benchmark {
>   __u32 dma_bits; /* DMA addressing capability */
>   __u32 dma_dir; /* DMA data direction */
>

Re: [PATCH] iommu/dma: Fix a typo in a comment

2021-03-18 Thread Will Deacon
On Thu, Mar 18, 2021 at 11:21:24AM +0800, chenxiang wrote:
> From: Xiang Chen 
> 
> Fix a typo "SAC" to "DAC" in the comment of function
> iommu_dma_alloc_iova().
> 
> Signed-off-by: Xiang Chen 
> ---
>  drivers/iommu/dma-iommu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index af765c8..3bc17ee 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -443,7 +443,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct 
> iommu_domain *domain,
>   if (domain->geometry.force_aperture)
>   dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
>  
> - /* Try to get PCI devices a SAC address */
> + /* Try to get PCI devices a DAC address */
>   if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
>   iova = alloc_iova_fast(iovad, iova_len,
>  DMA_BIT_MASK(32) >> shift, false);

This doesn't look like a typo to me... Please explain.

Will
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap

2021-03-18 Thread chenxiang
From: Xiang Chen 

Currently it only supports mapping/unmapping one page at a time for the
dma-map benchmark, but there are other scenarios which need multi-page
map/unmap: for multi-page interfaces such as dma_alloc_coherent() and
dma_map_sg(), the time spent on a multi-page map/unmap is not simply the
single-page time * npages (it is not linear), as block descriptors may be
used instead of page descriptors when the size allows it (such as 2M/1G),
and a single TLB invalidation command can invalidate multiple pages at
once instead of one at a time when RIL is enabled (which shortens the
unmap time). So it is necessary to add support for multi-page map/unmap.

Add a parameter "-g" to support multi-page map/unmap.

Signed-off-by: Xiang Chen 
---
 kernel/dma/map_benchmark.c  | 21 ++---
 tools/testing/selftests/dma/dma_map_benchmark.c | 20 
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index e0e64f8..a5c1b01 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -38,7 +38,8 @@ struct map_benchmark {
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u32 dma_trans_ns; /* time for DMA transmission in ns */
-   __u8 expansion[80]; /* For future use */
+   __u32 granule;  /* how many PAGE_SIZE will do map/unmap once a time */
+   __u8 expansion[76]; /* For future use */
 };
 
 struct map_benchmark_data {
@@ -58,9 +59,11 @@ static int map_benchmark_thread(void *data)
void *buf;
dma_addr_t dma_addr;
struct map_benchmark_data *map = data;
+   int npages = map->bparam.granule;
+   u64 size = npages * PAGE_SIZE;
int ret = 0;
 
-   buf = (void *)__get_free_page(GFP_KERNEL);
+   buf = alloc_pages_exact(size, GFP_KERNEL);
if (!buf)
return -ENOMEM;
 
@@ -76,10 +79,10 @@ static int map_benchmark_thread(void *data)
 * 66 means evertything goes well! 66 is lucky.
 */
if (map->dir != DMA_FROM_DEVICE)
-   memset(buf, 0x66, PAGE_SIZE);
+   memset(buf, 0x66, size);
 
map_stime = ktime_get();
-   dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir);
+   dma_addr = dma_map_single(map->dev, buf, size, map->dir);
if (unlikely(dma_mapping_error(map->dev, dma_addr))) {
pr_err("dma_map_single failed on %s\n",
dev_name(map->dev));
@@ -93,7 +96,7 @@ static int map_benchmark_thread(void *data)
ndelay(map->bparam.dma_trans_ns);
 
unmap_stime = ktime_get();
-   dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
+   dma_unmap_single(map->dev, dma_addr, size, map->dir);
unmap_etime = ktime_get();
unmap_delta = ktime_sub(unmap_etime, unmap_stime);
 
@@ -112,7 +115,7 @@ static int map_benchmark_thread(void *data)
}
 
 out:
-   free_page((unsigned long)buf);
+   free_pages_exact(buf, size);
return ret;
 }
 
@@ -203,7 +206,6 @@ static long map_benchmark_ioctl(struct file *file, unsigned 
int cmd,
struct map_benchmark_data *map = file->private_data;
void __user *argp = (void __user *)arg;
u64 old_dma_mask;
-
int ret;
 
if (copy_from_user(&map->bparam, argp, sizeof(map->bparam)))
@@ -234,6 +236,11 @@ static long map_benchmark_ioctl(struct file *file, 
unsigned int cmd,
return -EINVAL;
}
 
+   if (map->bparam.granule < 1 || map->bparam.granule > 1024) {
+   pr_err("invalid granule size\n");
+   return -EINVAL;
+   }
+
switch (map->bparam.dma_dir) {
case DMA_MAP_BIDIRECTIONAL:
map->dir = DMA_BIDIRECTIONAL;
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c 
b/tools/testing/selftests/dma/dma_map_benchmark.c
index fb23ce9..6f2caa6 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -40,7 +40,8 @@ struct map_benchmark {
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
__u32 dma_trans_ns; /* time for DMA transmission in ns */
-   __u8 expansion[80]; /* For future use */
+   __u32 granule; /* how many PAGE_SIZE will do map/unmap once a time */
+   __u8 expansion[76]; /* For future use */
 };
 
 int main(int argc, char **argv)
@@ -51,11 +52,13 @@ int main(int argc, char **argv)
int threads = 1, seconds = 20, node = -1;
/* default dma mask 32bit, bidirectional DMA */
int bits = 32, xdelay = 0, dir = DMA_MAP_BIDIRECTIONAL;
+   /* de

RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)


> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Thursday, March 18, 2021 4:56 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> ; Nadav Amit 
> Cc: chenjiashang ; David Woodhouse
> ; iommu@lists.linux-foundation.org; LKML
> ; alex.william...@redhat.com; Gonglei (Arei)
> ; w...@kernel.org
> Subject: RE: A problem of Intel IOMMU hardware ?
> 
> > From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > 
> >
> > > -Original Message-
> > > From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > > Sent: Thursday, March 18, 2021 4:27 PM
> > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > ; Nadav Amit 
> > > Cc: chenjiashang ; David Woodhouse
> > > ; iommu@lists.linux-foundation.org; LKML
> > > ; alex.william...@redhat.com; Gonglei
> > (Arei)
> > > ; w...@kernel.org
> > > Subject: RE: A problem of Intel IOMMU hardware ?
> > >
> > > > From: iommu  On Behalf
> > > > Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > >
> > > > > 2. Consider ensuring that the problem is not somehow related to
> > > > > queued invalidations. Try to use __iommu_flush_iotlb() instead
> > > > > of
> > > qi_flush_iotlb().
> > > > >
> > > >
> > > > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > > > wrong, the system crashed, so I prefer to lower the priority of
> > > > this
> > operation.
> > > >
> > >
> > > The VT-d spec clearly says that register-based invalidation can be
> > > used only
> > when
> > > queued-invalidations are not enabled. Intel-IOMMU driver doesn't
> > > provide
> > an
> > > option to disable queued-invalidation though, when the hardware is
> > capable. If you
> > > really want to try, tweak the code in intel_iommu_init_qi.
> > >
> >
> > Hi Kevin,
> >
> > Thanks for pointing this out. Do you have any ideas about this problem? I
> > tried to describe the problem more clearly in my reply to Alex; hope you
> > could have a look if you're interested.
> >
> 
> btw I saw you used 4.18 kernel in this test. What about latest kernel?
> 

Not tested yet. It's hard to upgrade the kernel in our environment.

> Also one way to separate sw/hw bug is to trace the low level interface (e.g.,
> qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU
> hardware. Check the window between b) and c) and see whether the software does
> the right thing as expected there.
> 

We added some logging in the iommu driver these days; the software seems fine. But we
didn't look inside qi_submit_sync yet, I'll try it tonight.

> Thanks
> Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH] Revert "iommu/amd: Fix performance counter initialization"

2021-03-18 Thread Paul Menzel

Dear Jörg, dear Suravee,


On 03.03.21 at 15:10, Alexander Monakov wrote:

On Wed, 3 Mar 2021, Suravee Suthikulpanit wrote:


Additionally, alternative proposed solutions [1] were not considered or
discussed.

[1]:https://lore.kernel.org/linux-iommu/alpine.lnx.2.20.13.2006030935570.3...@monopod.intra.ispras.ru/


This check has been introduced early on to detect a HW issue for
certain platforms in the past, where the performance counters are not
accessible and would result in silent failure when trying to use the
counters. This is considered legacy code, and can be removed if we
decide to no longer provide a sanity check for such cases.


Which platforms? There is no such information in the code or the commit
messages that introduced this.

According to AMD's documentation, presence of performance counters is
indicated by "PCSup" bit in the "EFR" register. I don't think the driver
should second-guess that. If there were platforms where the CPU or the
firmware lied to the OS (EFR[PCSup] was 1, but counters were not present),
I think that should have been handled in a more explicit manner, e.g.
via matching broken CPUs by cpuid.


Suravee, could you please answer the questions?

Jörg, I know you are probably busy, but the patch was applied to the 
stable series (v5.11.7). There are still too many questions open 
regarding the patch, and Suravee has not yet addressed the comments. 
It'd be great if you could revert it.



Kind regards,

Paul

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries

2021-03-18 Thread Joerg Roedel
Hi,

On Mon, Mar 08, 2021 at 11:47:46AM -0800, Raj, Ashok wrote:
> That is the primary motivation, given that we have moved to 1st level for
> general IOVA, first level doesn't have a WO mapping. I didn't know enough
> about the history to determine if a WO without a READ is very useful. I
> guess the ZLR was invented to support those cases without a READ in PCIe. I

Okay, please update the commit message and re-send. I guess these
patches are 5.13 stuff. In that case, Baolu can include them in his
pull request later this cycle.

Regards,

Joerg


Re: [PATCH v4 0/2] VMD MSI Remapping Bypass

2021-03-18 Thread j...@8bytes.org
On Wed, Mar 17, 2021 at 07:14:17PM +, Derrick, Jonathan wrote:
> Gentle reminder, for v5.13 ?

This should go through the PCI tree, Bjorn?


RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Tian, Kevin
> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> 
> > -Original Message-
> > From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > Sent: Thursday, March 18, 2021 4:27 PM
> > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > ; Nadav Amit 
> > Cc: chenjiashang ; David Woodhouse
> > ; iommu@lists.linux-foundation.org; LKML
> > ; alex.william...@redhat.com; Gonglei
> (Arei)
> > ; w...@kernel.org
> > Subject: RE: A problem of Intel IOMMU hardware ?
> >
> > > From: iommu  On Behalf Of
> > > Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > >
> > > > 2. Consider ensuring that the problem is not somehow related to
> > > > queued invalidations. Try to use __iommu_flush_iotlb() instead of
> > qi_flush_iotlb().
> > > >
> > >
> > > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > > wrong, the system crashed, so I prefer to lower the priority of this
> operation.
> > >
> >
> > The VT-d spec clearly says that register-based invalidation can be used only
> when
> > queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide
> an
> > option to disable queued-invalidation though, when the hardware is
> capable. If you
> > really want to try, tweak the code in intel_iommu_init_qi.
> >
> 
> Hi Kevin,
> 
> Thanks to point out this. Do you have any ideas about this problem ? I tried
> to descript the problem much clear in my reply to Alex, hope you could have
> a look if you're interested.
> 

btw I saw you used a 4.18 kernel in this test. What about the latest kernel?

Also, one way to separate a software bug from a hardware bug is to trace the
low-level interface (e.g., qi_flush_iotlb) that actually sends invalidation
descriptors to the IOMMU hardware. Check the window between b) and c) and see
whether the software does what is expected there.
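
As an illustration of that kind of tracing (a sketch only: the qi_flush_iotlb()
argument list and struct intel_iommu are assumed from the v4.18-era driver, and
this is a debugging aid rather than an upstream change):

#include <linux/intel-iommu.h>
#include <linux/kernel.h>

/* Call this at the top of qi_flush_iotlb() while debugging: it records every
 * IOTLB invalidation handed to the hardware in the ftrace ring buffer, so the
 * window between the unmap in b) and the remap in c) can be inspected. */
static void trace_iotlb_flush(struct intel_iommu *iommu, u16 did, u64 addr,
			      unsigned int size_order, u64 type)
{
	trace_printk("qi_flush_iotlb: %s did=%u addr=0x%llx order=%u type=0x%llx\n",
		     iommu->name, did, addr, size_order, type);
}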

Thanks
Kevin

RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)


> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Thursday, March 18, 2021 4:43 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> ; Nadav Amit 
> Cc: chenjiashang ; David Woodhouse
> ; iommu@lists.linux-foundation.org; LKML
> ; alex.william...@redhat.com; Gonglei (Arei)
> ; w...@kernel.org
> Subject: RE: A problem of Intel IOMMU hardware ?
> 
> > From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > 
> >
> >
> > > -Original Message-
> > > From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > > Sent: Thursday, March 18, 2021 4:27 PM
> > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > ; Nadav Amit 
> > > Cc: chenjiashang ; David Woodhouse
> > > ; iommu@lists.linux-foundation.org; LKML
> > > ; alex.william...@redhat.com; Gonglei
> > (Arei)
> > > ; w...@kernel.org
> > > Subject: RE: A problem of Intel IOMMU hardware ?
> > >
> > > > From: iommu  On Behalf
> > > > Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > > >
> > > > > 2. Consider ensuring that the problem is not somehow related to
> > > > > queued invalidations. Try to use __iommu_flush_iotlb() instead
> > > > > of
> > > qi_flush_iotlb().
> > > > >
> > > >
> > > > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > > > wrong, the system crashed, so I prefer to lower the priority of
> > > > this
> > operation.
> > > >
> > >
> > > The VT-d spec clearly says that register-based invalidation can be
> > > used only
> > when
> > > queued-invalidations are not enabled. Intel-IOMMU driver doesn't
> > > provide
> > an
> > > option to disable queued-invalidation though, when the hardware is
> > capable. If you
> > > really want to try, tweak the code in intel_iommu_init_qi.
> > >
> >
> > Hi Kevin,
> >
> > Thanks to point out this. Do you have any ideas about this problem ? I
> > tried to descript the problem much clear in my reply to Alex, hope you
> > could have a look if you're interested.
> >
> 
> I agree with Nadav. Looks this implies some stale paging structure cache 
> entry (e.g.
> PMD) is not invalidated properly. It's better if Baolu can reproduce this 
> problem in
> his local environment and then do more debug to identify whether it's a 
> software or
> hardware defect.
> 
> btw what is the device under test? Does it support ATS?
> 

The device is our offload card; it does not support the ATS capability.

> Thanks
> Kevin

RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Tian, Kevin
> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> 
> 
> > -Original Message-
> > From: Tian, Kevin [mailto:kevin.t...@intel.com]
> > Sent: Thursday, March 18, 2021 4:27 PM
> > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > ; Nadav Amit 
> > Cc: chenjiashang ; David Woodhouse
> > ; iommu@lists.linux-foundation.org; LKML
> > ; alex.william...@redhat.com; Gonglei
> (Arei)
> > ; w...@kernel.org
> > Subject: RE: A problem of Intel IOMMU hardware ?
> >
> > > From: iommu  On Behalf Of
> > > Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> > >
> > > > 2. Consider ensuring that the problem is not somehow related to
> > > > queued invalidations. Try to use __iommu_flush_iotlb() instead of
> > qi_flush_iotlb().
> > > >
> > >
> > > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > > wrong, the system crashed, so I prefer to lower the priority of this
> operation.
> > >
> >
> > The VT-d spec clearly says that register-based invalidation can be used only
> when
> > queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide
> an
> > option to disable queued-invalidation though, when the hardware is
> capable. If you
> > really want to try, tweak the code in intel_iommu_init_qi.
> >
> 
> Hi Kevin,
> 
> Thanks to point out this. Do you have any ideas about this problem ? I tried
> to descript the problem much clear in my reply to Alex, hope you could have
> a look if you're interested.
> 

I agree with Nadav. It looks as if some stale paging-structure cache entry
(e.g. a PMD) is not being invalidated properly. It would be better if Baolu can
reproduce this problem in his local environment and then do more debugging to
identify whether it's a software or hardware defect.

btw what is the device under test? Does it support ATS?

Thanks
Kevin

RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)


> -Original Message-
> From: Tian, Kevin [mailto:kevin.t...@intel.com]
> Sent: Thursday, March 18, 2021 4:27 PM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> ; Nadav Amit 
> Cc: chenjiashang ; David Woodhouse
> ; iommu@lists.linux-foundation.org; LKML
> ; alex.william...@redhat.com; Gonglei (Arei)
> ; w...@kernel.org
> Subject: RE: A problem of Intel IOMMU hardware ?
> 
> > From: iommu  On Behalf Of
> > Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> >
> > > 2. Consider ensuring that the problem is not somehow related to
> > > queued invalidations. Try to use __iommu_flush_iotlb() instead of
> qi_flush_iotlb().
> > >
> >
> > I tried to force to use __iommu_flush_iotlb(), but maybe something
> > wrong, the system crashed, so I prefer to lower the priority of this 
> > operation.
> >
> 
> The VT-d spec clearly says that register-based invalidation can be used only 
> when
> queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide an
> option to disable queued-invalidation though, when the hardware is capable. 
> If you
> really want to try, tweak the code in intel_iommu_init_qi.
> 

Hi Kevin,

Thanks for pointing this out. Do you have any ideas about this problem? I tried
to describe the problem more clearly in my reply to Alex; I hope you can have a
look if you're interested.

> Thanks
> Kevin

RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Tian, Kevin
> From: iommu  On Behalf Of
> Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> > 2. Consider ensuring that the problem is not somehow related to queued
> > invalidations. Try to use __iommu_flush_iotlb() instead of qi_flush_iotlb().
> >
> 
> I tried to force to use __iommu_flush_iotlb(), but maybe something wrong,
> the system crashed, so I prefer to lower the priority of this operation.
> 

The VT-d spec clearly says that register-based invalidation can be used
only when queued invalidation is not enabled. The Intel IOMMU driver
doesn't provide an option to disable queued invalidation when the
hardware is capable of it, though. If you really want to try, tweak the
code in intel_iommu_init_qi.
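
For what it's worth, the tweak would amount to forcing the register-based
branch in that function. A sketch against the v4.18-era code follows; the
callback names are taken from drivers/iommu/intel-iommu.c, the
force_register_based knob is hypothetical, and this is debug-only (it also
assumes QI was not already enabled by firmware or a previous kdump kernel):

static void intel_iommu_init_qi(struct intel_iommu *iommu)
{
	bool force_register_based = true;	/* hypothetical debug knob */

	/* Per the spec, register-based invalidation is only legal while QI is
	 * disabled, so the choice has to be made here, not at flush time. */
	if (force_register_based || dmar_enable_qi(iommu)) {
		iommu->flush.flush_context = __iommu_flush_context;
		iommu->flush.flush_iotlb = __iommu_flush_iotlb;
		pr_info("%s: using register-based invalidation\n", iommu->name);
	} else {
		iommu->flush.flush_context = qi_flush_context;
		iommu->flush.flush_iotlb = qi_flush_iotlb;
		pr_info("%s: using queued invalidation\n", iommu->name);
	}
}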

Thanks
Kevin


RE: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
Hi Nadav,

> -Original Message-
> From: Nadav Amit [mailto:nadav.a...@gmail.com]
> Sent: Thursday, March 18, 2021 2:13 AM
> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.)
> 
> Cc: David Woodhouse ; Lu Baolu
> ; Joerg Roedel ; w...@kernel.org;
> alex.william...@redhat.com; chenjiashang ;
> iommu@lists.linux-foundation.org; Gonglei (Arei) ;
> LKML 
> Subject: Re: A problem of Intel IOMMU hardware ?
> 
> 
> 
> > On Mar 17, 2021, at 2:35 AM, Longpeng (Mike, Cloud Infrastructure Service
> Product Dept.)  wrote:
> >
> > Hi Nadav,
> >
> >> -Original Message-
> >> From: Nadav Amit [mailto:nadav.a...@gmail.com]
> >>>  reproduce the problem with high probability (~50%).
> >>
> >> I saw Lu replied, and he is much more knowledgable than I am (I was
> >> just intrigued by your email).
> >>
> >> However, if I were you I would try also to remove some
> >> “optimizations” to look for the root-cause (e.g., use domain specific
> invalidations instead of page-specific).
> >>
> >
> > Good suggestion! But we did it these days, we tried to use global 
> > invalidations as
> follow:
> > iommu->flush.flush_iotlb(iommu, did, 0, 0,
> > DMA_TLB_DSI_FLUSH);
> > But can not resolve the problem.
> >
> >> The first thing that comes to my mind is the invalidation hint (ih)
> >> in iommu_flush_iotlb_psi(). I would remove it to see whether you get
> >> the failure without it.
> >
> > We also notice the IH, but the IH is always ZERO in our case, as the spec 
> > says:
> > '''
> > Paging-structure-cache entries caching second-level mappings
> > associated with the specified domain-id and the
> > second-level-input-address range are invalidated, if the Invalidation
> > Hint
> > (IH) field is Clear.
> > '''
> >
> > It seems the software is everything fine, so we've no choice but to suspect 
> > the
> hardware.
> 
> Ok, I am pretty much out of ideas. I have two more suggestions, but they are 
> much
> less likely to help. Yet, they can further help to rule out software bugs:
> 
> 1. dma_clear_pte() seems to be wrong IMHO. It should have used WRITE_ONCE()
> to prevent split-write, which might potentially cause “invalid” (partially
> cleared) PTE to be stored in the TLB. Having said that, the subsequent IOTLB 
> flush
> should have prevented the problem.
> 

Yes, using WRITE_ONCE is much safer; however, I was just testing the following
code, and it didn't resolve my problem.

static inline void dma_clear_pte(struct dma_pte *pte)
{
	/* single WRITE_ONCE() store so the PTE clear cannot be torn/split */
	WRITE_ONCE(pte->val, 0ULL);
}
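
For contrast, the unmodified helper is just a plain assignment (reproduced from
memory, so treat it as approximate), which is what Nadav suspects could in
principle be split into partial stores:

/* Pre-existing version: a plain store with no WRITE_ONCE() annotation. */
static inline void dma_clear_pte(struct dma_pte *pte)
{
	pte->val = 0;
}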

> 2. Consider ensuring that the problem is not somehow related to queued
> invalidations. Try to use __iommu_flush_iotlb() instead of qi_flush_iotlb().
> 

I tried to force the use of __iommu_flush_iotlb(), but maybe something went
wrong: the system crashed, so I prefer to lower the priority of this operation.

> Regards,
> Nadav

Re: A problem of Intel IOMMU hardware ?

2021-03-18 Thread Nadav Amit

> On Mar 17, 2021, at 9:46 PM, Longpeng (Mike, Cloud Infrastructure Service 
> Product Dept.)  wrote:
> 

[Snip]

> 
> NOTE, the magical thing happen...(*Operation-4*) we write the PTE
> of Operation-1 from 0 to 0x3 which means can Read/Write, and then
> we trigger DMA read again, it success and return the data of HPA 0 !!
> 
> Why we modify the older page table would make sense ? As we
> have discussed previously, the cache flush part of the driver is correct,
> it call flush_iotlb after (b) and no need to flush after (c). But the result
> of the experiment shows the older page table or older caches is effective
> actually.
> 
> Any ideas ?

Interesting. Sounds as if there is some page-walk cache that was not
invalidated properly.


