Re: [PATCH] dma-debug: fix debug_dma_assert_idle(), use rcu_read_lock()

2020-08-14 Thread Linus Torvalds
On Fri, Aug 14, 2020 at 5:26 PM Hugh Dickins  wrote:
>
> We used to rely on page count there, and on trylock_page() only; but
> there was at least one user whose app went wrong when occasionally we
> COWed the page, just because something else momentarily took a reference
> to it, or locked it.  Around 2006, bug report from 2004: I did look up
> the history a week ago, but was interrupted before taking notes.

I actually think you may be talking about the exact problem that that
debug patch from Dan was originally created for:

  0abdd7a81b7e dma-debug: introduce debug_dma_assert_idle()
  77873803363c net_dma: mark broken

and your memory sounds exactly like that net_dma case (and the timing
matches roughly too - the NET_DMA code was merged in 2006, but I think
people had been playing trial games with it before that).

IOW, net_dma was horribly broken, and just couldn't deal with COW
because it did things wrong.

The thing is, doing extra COW's really shouldn't matter in _any_
half-way correct situation. There's a few cases:

 - user space writing to it, so we COW.

   This is the "simple" case that is obvious and we've always done the
same thing. User space will get the new copy, and there's no possible
situation when that can be wrong.

 - get_user_pages() for reading.

   This is the one we actually used to get wrong, and when another
user *didn't* cow, the data that was read might not match what the
original get_user_pages() case expected.

But in this case, the bug only happened when we didn't cow
aggressively enough.

 - get_user_pages() for writing

   This is another "simple" case, because it does the COW at
get_user_pages() time and gets its own copy (which is also installed
in the thread that does the GUP, of course, so a subsequent fork and
another write can obviously cause *further* COW action).

But in no case should an extra COW matter. Except if somebody uses
get_user_pages() to write to the page, and the COW "hides" that write
by giving a new copy to whoever expected to see it, but that's exactly
the case that Dan's patch was supposed to notice.
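
Concretely, the sequence that check was designed to catch looks like
this (a hypothetical sketch; only get_user_pages() is a real kernel
interface, the rest is illustrative, not net_dma's actual code):

  get_user_pages(buf, 1, FOLL_WRITE, &page, NULL);
                      /* driver pins 'page' for an async DMA write;
                         COW is broken here, at GUP time */
  fork();             /* 'page' becomes shared-COW again */
  *buf = 0;           /* parent writes: COW gives the parent a NEW copy */
                      /* device later DMAs into the OLD 'page': the data
                         lands in memory the parent no longer maps */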

And since it never triggered outside of that invalid net_dma case, I
don't think any other case really ever existed.

Yes, I can well imagine that some people loved the concept of that TCP
receive copy offload, but it really was broken, and was removed
entirely by Dan in commit 7bced397510a ("net_dma: simple removal") a
year after being marked broken (the author date makes it look like
it's just a couple of weeks after being marked broken, but the commit
date for that removal is September 2014).

So I don't think that the trylock and checking page counts is a
correctness issue.

It had better not be, because anybody that writes to a shared-cow page
without breaking COW is simply broken.

No, I really think that the real worry about doing more aggressive
copying is that it doesn't steal back the KSM page or the swap cache
page, so it will leave those pages around, and while they should then
be really easy for the VM to reclaim, I really worry that we have a
couple of decades of VM reclaim tuning with that swap cache reuse
behavior (KSM, not so much).

And while it works fine on my machine, I currently have 40GB of RAM
free, because honestly, the stuff I do doesn't need all that much
memory, and I ridiculously overspecced my new machine RAM-wise. So
nothing I will do would show any problems.

Linus


Re: [PATCH] dma-debug: fix debug_dma_assert_idle(), use rcu_read_lock()

2020-08-14 Thread Hugh Dickins via iommu
On Fri, 14 Aug 2020, Linus Torvalds wrote:
> On Thu, Aug 13, 2020 at 10:42 PM Christoph Hellwig  wrote:
> >
> > The whole thing predates my involvement with the code, but I definitely
> > think the patch from Hugh is a major improvement.  But I would also
> > have no problem with just removing it entirely.
> 
> I decided to just do both, since neither you nor Dan seemed to really object.
> 
> I applied Hugh's RCU read locking patch as a clear improvement, and
> then I did a second patch that just removed this function entirely.
> That sounds a bit odd, perhaps, but in case people decide to resurrect
> the debugging code, I didn't want us to lose sight of Hugh's
> improvement just because I then decided that we might as well go one
> step further and just remove it entirely.

That's ideal, thanks - exactly the sequence I was hoping for.

(Another shortcoming in debug_dma_assert_idle(), that I hadn't wanted
to distract us by mentioning, is that it assumed that the mapping is
contained within one small page, whereas I believe one or more of the
DMA mapping functions take a size_t argument, that could in theory span
small pages - I guess more plausible inside a compound page; yet it
looked like only an initial entry would be put into the radix-tree.)
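
(A sketch of what spanning-aware tracking could look like, with a
hypothetical helper standing in for the removed radix-tree insertion;
this is illustrative only, not the old implementation:

  unsigned long pfn, first, last;

  first = page_to_pfn(page) + (offset >> PAGE_SHIFT);
  last  = page_to_pfn(page) + ((offset + size - 1) >> PAGE_SHIFT);
  for (pfn = first; pfn <= last; pfn++)
          debug_dma_track_page(dev, pfn);   /* hypothetical helper */

i.e. one entry per small page spanned, rather than just the first.)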

> 
> And the only real reason I care is that this whole COW and page lock
> thing has shown up lately, and I like removing code.
> 
> I'm _very_ tempted to just apply my COW simplification patch that gets
> rid of all the complex try-to-share cases entirely (and would also
> obviate the whole forced-cow patch). I suspect it would effectively
> remove almost all of the [un]lock_page() bottlenecks entirely, but
> that code has decades of history and I suspect it's a bit too drastic
> wrt KSM and the swap cache pages.

Yes, you're right to hold back.

I'd been looking there too (but backed off while speeding up the
fork was causing the "Hugh load" to "fail": it's the exit that now
wants speeding up, to please that test).  I think it could well avoid
getting into page locking when mapcount is quickly seen to be high
(> 1? > 2? > bigger? I never did the logic), but the page locking
becomes important when mapcount looks low, yet swap might be involved.
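
A rough sketch of that ordering (illustrative only; the threshold N is
exactly the undecided part mentioned above):

  /* Sketch: skip page locking when reuse is clearly impossible. */
  if (page_mapcount(page) > N)    /* N = 1? 2? bigger? */
          goto copy;              /* many mappers: just COW, no lock */
  if (!trylock_page(page))
          goto copy;              /* contended: copying is always safe */
  /* page locked: now safe to inspect page_count() and swap state
     to decide whether the page can be reused instead of copied */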

We used to rely on page count there, and on trylock_page() only; but
there was at least one user whose app went wrong when occasionally we
COWed the page, just because something else momentarily took a reference
to it, or locked it.  Around 2006, bug report from 2004: I did look up
the history a week ago, but was interrupted before taking notes.

> 
> It would be lovely if the main source of page locking would really be
> about just IO, but the page lock has also become the thing that
> serializes almost everything related to page state. Which is why you
> find it in contexts that are really not IO-related at all (not just
> COW - page migration is the other one that has shown up a lot under
> "heavy CPU loads" without really necessarily any IO component to it).
> 
>  Linus


Re: [PATCH] dma-debug: fix debug_dma_assert_idle(), use rcu_read_lock()

2020-08-14 Thread Linus Torvalds
On Thu, Aug 13, 2020 at 10:42 PM Christoph Hellwig  wrote:
>
> The whole thing predates my involvement with the code, but I definitely
> think the patch from Hugh is a major improvement.  But I would also
> have no problem with just removing it entirely.

I decided to just do both, since neither you nor Dan seemed to really object.

I applied Hugh's RCU read locking patch as a clear improvement, and
then I did a second patch that just removed this function entirely.
That sounds a bit odd, perhaps, but in case people decide to resurrect
the debugging code, I didn't want us to lose sight of Hugh's
improvement just because I then decided that we might as well go one
step further and just remove it entirely.

And the only real reason I care is that this whole COW and page lock
thing has shown up lately, and I like removing code.

I'm _very_ tempted to just apply my COW simplification patch that gets
rid of all the complex try-to-share cases entirely (and would also
obviate the whole forced-cow patch). I suspect it would effectively
remove almost all of the [un]lock_page() bottlenecks entirely, but
that code has decades of history and I suspect it's a bit too drastic
wrt KSM and the swap cache pages.

It would be lovely if the main source of page locking would really be
about just IO, but the page lock has also become the thing that
serializes almost everything related to page state. Which is why you
find it in contexts that are really not IO-related at all (not just
COW - page migration is the other one that has shown up a lot under
"heavy CPU loads" without really necessarily any IO component to it).

 Linus


[PATCH RESEND v3] iommu/tegra-smmu: Add missing locks around mapping operations

2020-08-14 Thread Dmitry Osipenko
The mapping operations of the Tegra SMMU driver are subject to race
conditions because an SMMU Address Space isn't allocated and freed
atomically, while it should be. This patch makes the mapping operations
atomic; it fixes an accidentally released Host1x Address Space problem
which happens while running multiple graphics tests in parallel on
Tegra30, i.e. by having multiple threads race with each other in the
Host1x submission and completion code paths, performing IOVA mappings
and unmappings in parallel.

Cc: 
Signed-off-by: Dmitry Osipenko 
---

Changelog:

v3: - No changes. Resending for visibility, please apply.

v2: - Now using mutex instead of spinlock.

- The _locked postfix is replaced with the underscore prefix.


 drivers/iommu/tegra-smmu.c | 38 ++
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 124c8848ab7e..4315b6381354 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -49,6 +49,7 @@ struct tegra_smmu_as {
struct iommu_domain domain;
struct tegra_smmu *smmu;
unsigned int use_count;
+   struct mutex lock;
u32 *count;
struct page **pts;
struct page *pd;
@@ -308,6 +309,8 @@ static struct iommu_domain *tegra_smmu_domain_alloc(unsigned type)
return NULL;
}
 
+   mutex_init(&as->lock);
+
/* setup aperture */
as->domain.geometry.aperture_start = 0;
as->domain.geometry.aperture_end = 0xffffffff;
@@ -655,8 +658,9 @@ static void tegra_smmu_set_pte(struct tegra_smmu_as *as, unsigned long iova,
smmu_flush(smmu);
 }
 
-static int tegra_smmu_map(struct iommu_domain *domain, unsigned long iova,
- phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+static int
+__tegra_smmu_map(struct iommu_domain *domain, unsigned long iova,
+phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
 {
struct tegra_smmu_as *as = to_smmu_as(domain);
dma_addr_t pte_dma;
@@ -685,8 +689,9 @@ static int tegra_smmu_map(struct iommu_domain *domain, unsigned long iova,
return 0;
 }
 
-static size_t tegra_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
-  size_t size, struct iommu_iotlb_gather *gather)
+static size_t
+__tegra_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
+  size_t size, struct iommu_iotlb_gather *gather)
 {
struct tegra_smmu_as *as = to_smmu_as(domain);
dma_addr_t pte_dma;
@@ -702,6 +707,31 @@ static size_t tegra_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
return size;
 }
 
+static int tegra_smmu_map(struct iommu_domain *domain, unsigned long iova,
+ phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+{
+   struct tegra_smmu_as *as = to_smmu_as(domain);
+   int ret;
+
+   mutex_lock(&as->lock);
+   ret = __tegra_smmu_map(domain, iova, paddr, size, prot, gfp);
+   mutex_unlock(&as->lock);
+
+   return ret;
+}
+
+static size_t tegra_smmu_unmap(struct iommu_domain *domain, unsigned long iova,
+  size_t size, struct iommu_iotlb_gather *gather)
+{
+   struct tegra_smmu_as *as = to_smmu_as(domain);
+
+   mutex_lock(&as->lock);
+   size = __tegra_smmu_unmap(domain, iova, size, gather);
+   mutex_unlock(&as->lock);
+
+   return size;
+}
+
 static phys_addr_t tegra_smmu_iova_to_phys(struct iommu_domain *domain,
   dma_addr_t iova)
 {
-- 
2.27.0



Re: [PATCH v4 0/2] dma-pool fixes

2020-08-14 Thread Christoph Hellwig
Thanks,

applied to the dma-mapping tree.


Re: nvme crash - Re: linux-next: Tree for Aug 13

2020-08-14 Thread John Garry

On 14/08/2020 14:07, John Garry wrote:


BTW, as for the DMA/sg scatterlist code, it so happens in this case that
we try the dma alloc for size=0 in nvme_alloc_queue() - I know an
allocation for size=0 makes no sense, but couldn't we be a bit more
robust?


it's giving ZERO_SIZE_PTR, which we dereference, so ignore me...


Re: [PATCH v4 0/2] dma-pool fixes

2020-08-14 Thread Amit Pundir
On Fri, 14 Aug 2020 at 15:56, Nicolas Saenz Julienne
 wrote:
>
> Now that we have an explanation for Amit's issue, we can re-spin this
> series.

Hi, smoke-tested (boots AOSP to UI with Touch/WiFi/BT working) on my
Poco F1 phone, with upstream commit 00e4db51259a (+ 30-odd out-of-tree
patches [1]), and I see no obvious regressions.

For both the patches in the series:

Tested-by: Amit Pundir 

[1] https://github.com/pundiramit/linux/commits/beryllium-mainline-display


>
> ---
> Changes since v3:
>  - Do not use memblock_start_of_DRAM()
>
> Changes since v2:
>  - Go back to v1's behavior for patch #2
>
> Changes since v1:
>  - Make cma_in_zone() more strict, GFP_KERNEL doesn't default to true
>now
>
>  - Check if phys_addr_ok() exists prior to calling it
>
> Christoph Hellwig (1):
>   dma-pool: fix coherent pool allocations for IOMMU mappings
>
> Nicolas Saenz Julienne (1):
>   dma-pool: Only allocate from CMA when in same memory zone
>
>  drivers/iommu/dma-iommu.c   |   4 +-
>  include/linux/dma-direct.h  |   3 -
>  include/linux/dma-mapping.h |   5 +-
>  kernel/dma/direct.c |  13 +++-
>  kernel/dma/pool.c   | 145 +++-
>  5 files changed, 92 insertions(+), 78 deletions(-)
>
> --
> 2.28.0
>


Re: nvme crash - Re: linux-next: Tree for Aug 13

2020-08-14 Thread John Garry

On 14/08/2020 13:08, Christoph Hellwig wrote:

[148.455065] __sg_alloc_table_from_pages+0xec/0x238
[148.459931] sg_alloc_table_from_pages+0x18/0x28
[148.464541] iommu_dma_alloc+0x474/0x678
[148.468455] dma_alloc_attrs+0xd8/0xf0
[148.472193] nvme_alloc_queue+0x114/0x160 [nvme]
[148.476798] nvme_reset_work+0xb34/0x14b4 [nvme]
[148.481407] process_one_work+0x1e8/0x360
[148.485405] worker_thread+0x44/0x478
[148.489055] kthread+0x150/0x158
[148.492273] ret_from_fork+0x10/0x34
[148.495838] Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
[148.501921] ---[ end trace 89bb2b72d59bf925 ]---

Anything to worry about? I guess not since we're in the merge window, but
mentioning just in case ...

I bisected, and this patch looks to fix it (note the comments below the
'---'):

 From 263891a760edc24b901085bf6e5fe2480808f86d Mon Sep 17 00:00:00 2001
From: John Garry
Date: Fri, 14 Aug 2020 12:45:18 +0100
Subject: [PATCH] nvme-pci: Use u32 for nvme_dev.q_depth

Recently nvme_dev.q_depth was changed from int to u16 type.

This falls over for the queue depth calculation in nvme_pci_enable(),
where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow, as NVME_CAP_MQES()
gives a 16b number also. That happens for me, and this is the result:
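
Spelled out with illustrative values (NVME_CAP_MQES() masks the low 16
bits of the capability register):

  u64 cap = dev->ctrl.cap;          /* MQES field reads back as 0xffff */
  u16 q_depth;

  q_depth = NVME_CAP_MQES(cap) + 1; /* 0xffff + 1 = 0x10000, which
                                       truncates to 0 in a u16 */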

Oh, interesting.  Please also switch the module option parsing to
use kstrtou32 and param_set_uint and send this as a formal patch.



I'm doing it now.

BTW, as for the DMA/sg scatterlist code, it so happens in this case that
we try the dma alloc for size=0 in nvme_alloc_queue() - I know an
allocation for size=0 makes no sense, but couldn't we be a bit more robust?
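
A guard along those lines could be as small as the following sketch
(whether dma_alloc_attrs() is the right place for it is an open
question here, not a settled design):

  /* Sketch: reject zero-size coherent allocations up front,
     before they reach the IOMMU scatterlist path. */
  if (WARN_ON_ONCE(size == 0))
          return NULL;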


Cheers,
John





Re: nvme crash - Re: linux-next: Tree for Aug 13

2020-08-14 Thread Christoph Hellwig
On Fri, Aug 14, 2020 at 01:00:30PM +0100, John Garry wrote:
> 
> > > I have experienced this crash below on linux-next for the last few
> > > days
> > > on my arm64 system. Linus' master branch today also has it.
> > Adding Robin and the iommu list as this seems to be in the dma-iommu
> > code.
> > 
> > > root@ubuntu:/home/john# insmod nvme.ko
> > > [148.254564] nvme :81:00.0: Adding to iommu group 21
> > > [148.260973] nvme nvme0: pci function :81:00.0
> > > root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
> > > dereference at virtual address 0010
> > > [148.281784] Mem abort info:
> > > [148.284584] ESR = 0x9604
> > > [148.287641] EC = 0x25: DABT (current EL), IL = 32 bits
> > > [148.292950] SET = 0, FnV = 0
> > > [148.295998] EA = 0, S1PTW = 0
> > > [148.299126] Data abort info:
> > > [148.302003] ISV = 0, ISS = 0x0004
> > > [148.305832] CM = 0, WnR = 0
> > > [148.308794] user pgtable: 4k pages, 48-bit VAs, pgdp=0a27bf3c9000
> > > [148.315229] [0010] pgd=, p4d=
> > > [148.322016] Internal error: Oops: 9604 [#1] PREEMPT SMP
> > > [148.327577] Modules linked in: nvme nvme_core
> > > [148.331927] CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
> > > 5.8.0-next-20200812 #27
> > > [148.339744] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
> > > V1.16.01 03/15/2019
> > > [148.348260] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
> > > [148.353822] pstate: 80c9 (Nzcv daif +PAN +UAO BTYPE=--)
> > > [148.359390] pc : __sg_alloc_table_from_pages+0xec/0x238
> > > [148.364604] lr : __sg_alloc_table_from_pages+0xc8/0x238
> > > [148.369815] sp : 800013ccbad0
> > > [148.373116] x29: 800013ccbad0 x28: 0a27b3d380a8
> > > [148.378417] x27:  x26: 2dc2
> > > [148.383718] x25: 0dc0 x24: 
> > > [148.389019] x23:  x22: 800013ccbbe8
> > > [148.394320] x21: 0010 x20: 
> > > [148.399621] x19: f000 x18: 
> > > [148.404922] x17: 00c0 x16: fe289eaf6380
> > > [148.410223] x15: 800011b59948 x14: 002bc8fe98f8
> > > [148.415523] x13: ff00 x12: 8000114ca000
> > > [148.420824] x11:  x10: 
> > > [148.426124] x9 : ffc0 x8 : 0a27b5f9b6a0
> > > [148.431425] x7 :  x6 : 0001
> > > [148.436726] x5 : 0a27b5f9b680 x4 : 
> > > [148.442027] x3 : 0a27b5f9b680 x2 : 
> > > [148.447328] x1 : 0001 x0 : 
> > > [148.452629] Call trace:
> > > [148.455065] __sg_alloc_table_from_pages+0xec/0x238
> > > [148.459931] sg_alloc_table_from_pages+0x18/0x28
> > > [148.464541] iommu_dma_alloc+0x474/0x678
> > > [148.468455] dma_alloc_attrs+0xd8/0xf0
> > > [148.472193] nvme_alloc_queue+0x114/0x160 [nvme]
> > > [148.476798] nvme_reset_work+0xb34/0x14b4 [nvme]
> > > [148.481407] process_one_work+0x1e8/0x360
> > > [148.485405] worker_thread+0x44/0x478
> > > [148.489055] kthread+0x150/0x158
> > > [148.492273] ret_from_fork+0x10/0x34
> > > [148.495838] Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
> > > [148.501921] ---[ end trace 89bb2b72d59bf925 ]---
> > > 
> > > Anything to worry about? I guess not since we're in the merge window, but
> > > mentioning just in case ...
> 
> I bisected, and this patch looks to fix it (note the comments below the
> '---'):
> 
> From 263891a760edc24b901085bf6e5fe2480808f86d Mon Sep 17 00:00:00 2001
> From: John Garry 
> Date: Fri, 14 Aug 2020 12:45:18 +0100
> Subject: [PATCH] nvme-pci: Use u32 for nvme_dev.q_depth
> 
> Recently nvme_dev.q_depth was changed from int to u16 type.
> 
> This falls over for the queue depth calculation in nvme_pci_enable(),
> where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow, as NVME_CAP_MQES()
> gives a 16b number also. That happens for me, and this is the result:

Oh, interesting.  Please also switch the module option parsing to
use kstrtou32 and param_set_uint and send this as a formal patch.

Thanks!


Re: nvme crash - Re: linux-next: Tree for Aug 13

2020-08-14 Thread John Garry




I have experienced this crash below on linux-next for the last few days
on my arm64 system. Linus' master branch today also has it.

Adding Robin and the iommu list as this seems to be in the dma-iommu
code.


root@ubuntu:/home/john# insmod nvme.ko
[148.254564] nvme :81:00.0: Adding to iommu group 21
[148.260973] nvme nvme0: pci function :81:00.0
root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
dereference at virtual address 0010
[148.281784] Mem abort info:
[148.284584] ESR = 0x9604
[148.287641] EC = 0x25: DABT (current EL), IL = 32 bits
[148.292950] SET = 0, FnV = 0
[148.295998] EA = 0, S1PTW = 0
[148.299126] Data abort info:
[148.302003] ISV = 0, ISS = 0x0004
[148.305832] CM = 0, WnR = 0
[148.308794] user pgtable: 4k pages, 48-bit VAs, pgdp=0a27bf3c9000
[148.315229] [0010] pgd=, p4d=
[148.322016] Internal error: Oops: 9604 [#1] PREEMPT SMP
[148.327577] Modules linked in: nvme nvme_core
[148.331927] CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
5.8.0-next-20200812 #27
[148.339744] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
V1.16.01 03/15/2019
[148.348260] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[148.353822] pstate: 80c9 (Nzcv daif +PAN +UAO BTYPE=--)
[148.359390] pc : __sg_alloc_table_from_pages+0xec/0x238
[148.364604] lr : __sg_alloc_table_from_pages+0xc8/0x238
[148.369815] sp : 800013ccbad0
[148.373116] x29: 800013ccbad0 x28: 0a27b3d380a8
[148.378417] x27:  x26: 2dc2
[148.383718] x25: 0dc0 x24: 
[148.389019] x23:  x22: 800013ccbbe8
[148.394320] x21: 0010 x20: 
[148.399621] x19: f000 x18: 
[148.404922] x17: 00c0 x16: fe289eaf6380
[148.410223] x15: 800011b59948 x14: 002bc8fe98f8
[148.415523] x13: ff00 x12: 8000114ca000
[148.420824] x11:  x10: 
[148.426124] x9 : ffc0 x8 : 0a27b5f9b6a0
[148.431425] x7 :  x6 : 0001
[148.436726] x5 : 0a27b5f9b680 x4 : 
[148.442027] x3 : 0a27b5f9b680 x2 : 
[148.447328] x1 : 0001 x0 : 
[148.452629] Call trace:
[148.455065] __sg_alloc_table_from_pages+0xec/0x238
[148.459931] sg_alloc_table_from_pages+0x18/0x28
[148.464541] iommu_dma_alloc+0x474/0x678
[148.468455] dma_alloc_attrs+0xd8/0xf0
[148.472193] nvme_alloc_queue+0x114/0x160 [nvme]
[148.476798] nvme_reset_work+0xb34/0x14b4 [nvme]
[148.481407] process_one_work+0x1e8/0x360
[148.485405] worker_thread+0x44/0x478
[148.489055] kthread+0x150/0x158
[148.492273] ret_from_fork+0x10/0x34
[148.495838] Code: f94002c3 6b01017f 540007c2 11000486 (f8645aa5)
[148.501921] ---[ end trace 89bb2b72d59bf925 ]---

Anything to worry about? I guess not since we're in the merge window, but
mentioning just in case ...


I bisected, and this patch looks to fix it (note the comments below the 
'---'):


From 263891a760edc24b901085bf6e5fe2480808f86d Mon Sep 17 00:00:00 2001
From: John Garry 
Date: Fri, 14 Aug 2020 12:45:18 +0100
Subject: [PATCH] nvme-pci: Use u32 for nvme_dev.q_depth

Recently nvme_dev.q_depth was changed from int to u16 type.

This falls over for the queue depth calculation in nvme_pci_enable(),
where NVME_CAP_MQES(dev->ctrl.cap) + 1 may overflow, as NVME_CAP_MQES()
gives a 16b number also. That happens for me, and this is the result:

root@ubuntu:/home/john# [148.272996] Unable to handle kernel NULL pointer
dereference at virtual address 0010
[148.281784] Mem abort info:
[148.284584] ESR = 0x9604
[148.287641] EC = 0x25: DABT (current EL), IL = 32 bits
[148.292950] SET = 0, FnV = 0
[148.295998] EA = 0, S1PTW = 0
[148.299126] Data abort info:
[148.302003] ISV = 0, ISS = 0x0004
[148.305832] CM = 0, WnR = 0
[148.308794] user pgtable: 4k pages, 48-bit VAs, pgdp=0a27bf3c9000
[148.315229] [0010] pgd=, p4d=
[148.322016] Internal error: Oops: 9604 [#1] PREEMPT SMP
[148.327577] Modules linked in: nvme nvme_core
[148.331927] CPU: 56 PID: 256 Comm: kworker/u195:0 Not tainted
5.8.0-next-20200812 #27
[148.339744] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
V1.16.01 03/15/2019
[148.348260] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[148.353822] pstate: 80c9 (Nzcv daif +PAN +UAO BTYPE=--)
[148.359390] pc : __sg_alloc_table_from_pages+0xec/0x238
[148.364604] lr : __sg_alloc_table_from_pages+0xc8/0x238
[148.369815] sp : 800013ccbad0
[148.373116] x29: 800013ccbad0 x28: 0a27b3d380a8
[148.378417] x27:  x26: 2dc2
[148.383718] x25: 0dc0 x24: 
[148.389019] x23:  x22: 800013ccbbe8
[148.394320] x21: 0010 x20: 
[148.399621] x19: f000 x18: 
[148.404922] x17: 

[PATCH v4 1/2] dma-pool: Only allocate from CMA when in same memory zone

2020-08-14 Thread Nicolas Saenz Julienne
There is no guarantee about CMA's placement, so allocating a zone-specific
atomic pool from CMA might return memory from a completely different
memory zone. To get around this, double-check CMA's placement before
allocating from it.

Signed-off-by: Nicolas Saenz Julienne 
[hch: rebased, added a fallback to the page allocator, allow dipping into
  lower CMA pools]
Signed-off-by: Christoph Hellwig 
---

Changes since v3:
 - Do not use memblock_start_of_DRAM()

Changes since v2:
 - Go back to v1 behavior

 kernel/dma/pool.c | 31 ++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index 6bc74a2d5127..57f4a0f32a92 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -3,7 +3,9 @@
  * Copyright (C) 2012 ARM Ltd.
  * Copyright (C) 2020 Google LLC
  */
+#include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -55,6 +57,29 @@ static void dma_atomic_pool_size_add(gfp_t gfp, size_t size)
pool_size_kernel += size;
 }
 
+static bool cma_in_zone(gfp_t gfp)
+{
+   unsigned long size;
+   phys_addr_t end;
+   struct cma *cma;
+
+   cma = dev_get_cma_area(NULL);
+   if (!cma)
+   return false;
+
+   size = cma_get_size(cma);
+   if (!size)
+   return false;
+
+   /* CMA can't cross zone boundaries, see cma_activate_area() */
+   end = cma_get_base(cma) + size - 1;
+   if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
+   return end <= DMA_BIT_MASK(zone_dma_bits);
+   if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32))
+   return end <= DMA_BIT_MASK(32);
+   return true;
+}
+
 static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
  gfp_t gfp)
 {
@@ -68,7 +93,11 @@ static int atomic_pool_expand(struct gen_pool *pool, size_t pool_size,
 
do {
pool_size = 1 << (PAGE_SHIFT + order);
-   page = alloc_pages(gfp, order);
+   if (cma_in_zone(gfp))
+   page = dma_alloc_from_contiguous(NULL, 1 << order,
+order, false);
+   if (!page)
+   page = alloc_pages(gfp, order);
} while (!page && order-- > 0);
if (!page)
goto out;
-- 
2.28.0



[PATCH v4 0/2] dma-pool fixes

2020-08-14 Thread Nicolas Saenz Julienne
Now that we have an explanation for Amit's issue, we can re-spin this
series.

---
Changes since v3:
 - Do not use memblock_start_of_DRAM()

Changes since v2:
 - Go back to v1's behavior for patch #2

Changes since v1:
 - Make cma_in_zone() more strict, GFP_KERNEL doesn't default to true
   now

 - Check if phys_addr_ok() exists prior to calling it

Christoph Hellwig (1):
  dma-pool: fix coherent pool allocations for IOMMU mappings

Nicolas Saenz Julienne (1):
  dma-pool: Only allocate from CMA when in same memory zone

 drivers/iommu/dma-iommu.c   |   4 +-
 include/linux/dma-direct.h  |   3 -
 include/linux/dma-mapping.h |   5 +-
 kernel/dma/direct.c |  13 +++-
 kernel/dma/pool.c   | 145 +++-
 5 files changed, 92 insertions(+), 78 deletions(-)

-- 
2.28.0



[PATCH v4 2/2] dma-pool: fix coherent pool allocations for IOMMU mappings

2020-08-14 Thread Nicolas Saenz Julienne
From: Christoph Hellwig 

When allocating coherent pool memory for an IOMMU mapping we don't care
about the DMA mask.  Move the guess for the initial GFP mask into the
dma_direct_alloc_pages and pass dma_coherent_ok as a function pointer
argument so that it doesn't get applied to the IOMMU case.

Signed-off-by: Christoph Hellwig 
---

Changes since v1:
  - Check if phys_addr_ok() exists prior to calling it

 drivers/iommu/dma-iommu.c   |   4 +-
 include/linux/dma-direct.h  |   3 -
 include/linux/dma-mapping.h |   5 +-
 kernel/dma/direct.c |  13 ++--
 kernel/dma/pool.c   | 114 +++-
 5 files changed, 62 insertions(+), 77 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 4959f5df21bd..5141d49a046b 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1035,8 +1035,8 @@ static void *iommu_dma_alloc(struct device *dev, size_t size,
 
if (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
!gfpflags_allow_blocking(gfp) && !coherent)
-   cpu_addr = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &page,
-  gfp);
+   page = dma_alloc_from_pool(dev, PAGE_ALIGN(size), &cpu_addr,
+  gfp, NULL);
else
cpu_addr = iommu_dma_alloc_pages(dev, size, &page, gfp, attrs);
if (!cpu_addr)
diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 5a3ce2a24794..6e87225600ae 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -73,9 +73,6 @@ static inline bool dma_capable(struct device *dev, dma_addr_t 
addr, size_t size,
 }
 
 u64 dma_direct_get_required_mask(struct device *dev);
-gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,
- u64 *phys_mask);
-bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size);
 void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
gfp_t gfp, unsigned long attrs);
 void dma_direct_free(struct device *dev, size_t size, void *cpu_addr,
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 016b96b384bd..52635e91143b 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -522,8 +522,9 @@ void *dma_common_pages_remap(struct page **pages, size_t 
size,
pgprot_t prot, const void *caller);
 void dma_common_free_remap(void *cpu_addr, size_t size);
 
-void *dma_alloc_from_pool(struct device *dev, size_t size,
- struct page **ret_page, gfp_t flags);
+struct page *dma_alloc_from_pool(struct device *dev, size_t size,
+   void **cpu_addr, gfp_t flags,
+   bool (*phys_addr_ok)(struct device *, phys_addr_t, size_t));
 bool dma_free_from_pool(struct device *dev, void *start, size_t size);
 
 int
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index bb0041e99659..db6ef07aec3b 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -43,7 +43,7 @@ u64 dma_direct_get_required_mask(struct device *dev)
return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
 }
 
-gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,
+static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,
  u64 *phys_limit)
 {
u64 dma_limit = min_not_zero(dma_mask, dev->bus_dma_limit);
@@ -68,7 +68,7 @@ gfp_t dma_direct_optimal_gfp_mask(struct device *dev, u64 dma_mask,
return 0;
 }
 
-bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
+static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
 {
return phys_to_dma_direct(dev, phys) + size - 1 <=
min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit);
@@ -161,8 +161,13 @@ void *dma_direct_alloc_pages(struct device *dev, size_t size,
size = PAGE_ALIGN(size);
 
if (dma_should_alloc_from_pool(dev, gfp, attrs)) {
-   ret = dma_alloc_from_pool(dev, size, &page, gfp);
-   if (!ret)
+   u64 phys_mask;
+
+   gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
+   &phys_mask);
+   page = dma_alloc_from_pool(dev, size, &ret, gfp,
+   dma_coherent_ok);
+   if (!page)
return NULL;
goto done;
}
diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index 57f4a0f32a92..b0aaba4197ae 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -225,93 +225,75 @@ static int __init dma_atomic_pool_init(void)
 }
 postcore_initcall(dma_atomic_pool_init);
 
-static inline struct gen_pool *dma_guess_pool_from_device(struct device *dev)
+static inline struct gen_pool *dma_guess_pool(struct gen_pool *prev, gfp_t gfp)
 {
-   u64 phys_mask;
-   gfp_t gfp;
-
-   gfp = 

Re: [PATCH v3 2/2] dma-pool: Only allocate from CMA when in same memory zone

2020-08-14 Thread Nicolas Saenz Julienne
On Fri, 2020-08-14 at 08:06 +0200, Christoph Hellwig wrote:
> On Sat, Aug 08, 2020 at 08:33:54AM +0200, Christoph Hellwig wrote:
> > On Fri, Aug 07, 2020 at 10:50:19AM +0200, Nicolas Saenz Julienne wrote:
> > > On Fri, 2020-08-07 at 07:21 +0200, Christoph Hellwig wrote:
> > > > On Thu, Aug 06, 2020 at 08:47:55PM +0200, Nicolas Saenz Julienne wrote:
> > > > > There is no guarantee to CMA's placement, so allocating a zone 
> > > > > specific
> > > > > atomic pool from CMA might return memory from a completely different
> > > > > memory zone. To get around this double check CMA's placement before
> > > > > allocating from it.
> > > > 
> > > > As the buildbot pointed out, memblock_start_of_DRAM can't be used from
> > > > non-__init code.  But looking at it I think throwing that in
> > > > is bogus anyway, as cma_get_base returns a proper physical address
> > > > already.
> > > 
> > > It does indeed, but I'm comparing CMA's base with bitmasks that don't 
> > > take into
> > > account where the memory starts. Say memory starts at 0x80000000,
> > > and CMA falls
> > > into ZONE_DMA [0x80000000 0xC0000000], if you want to compare it with
> > > DMA_BIT_MASK(zone_dma_bits) you're forced to unify the memory bases.
> > > 
> > > That said, I now realize that this doesn't work for ZONE_DMA32 which has 
> > > a hard
> > > limit on 32bit addresses regardless of the memory base.
> > > 
> > > That said I still need to call memblock_start_of_DRAM() any suggestions 
> > > WRT
> > > that? I could save the value in dma_atomic_pool_init(), which is __init 
> > > code.
> > 
> > The pool must be about a 32-bit physical address.  The offsets can be
> > different for every device..

I now see what you mean.

I was trying to blindly fit CMA with arm64's DMA zone setup, which, as it turns
out, doesn't really honor its purpose. arm64 introduced ZONE_DMA to provide a
30-bit address space, but we're creating it regardless of whether it exists or
not. This creates a mismatch between zone_dma_bits and ZONE_DMA's real
placement. I'll try to look at fixing that in arm64.

> Do you plan to resend this one without the memblock_start_of_DRAM
> thingy?

Yes, sorry for the wait, I've been on vacation and short on time, I'll send it
during the day.

Regards,
Nicolas



Re: [PATCH v4 07/17] media: mtk-mdp: Get rid of mtk_smi_larb_get/put

2020-08-14 Thread houlong wei
On Sat, 2020-05-30 at 16:10 +0800, Yong Wu wrote:

> MediaTek IOMMU has already added the device_link between the consumer
> and smi-larb device. If the mdp device call the pm_runtime_get_sync,
> the smi-larb's pm_runtime_get_sync also be called automatically.
> 
> CC: Minghsiu Tsai 
> CC: Houlong Wei 
> Signed-off-by: Yong Wu 
> Reviewed-by: Evan Green 
> ---
>  drivers/media/platform/mtk-mdp/mtk_mdp_comp.c | 44 
> +--
>  drivers/media/platform/mtk-mdp/mtk_mdp_comp.h |  2 --
>  drivers/media/platform/mtk-mdp/mtk_mdp_core.c |  1 -
>  3 files changed, 1 insertion(+), 46 deletions(-)
> 
> diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_comp.c 
> b/drivers/media/platform/mtk-mdp/mtk_mdp_comp.c
> index 228c58f..388ae67 100644
> --- a/drivers/media/platform/mtk-mdp/mtk_mdp_comp.c
> +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_comp.c
> @@ -13,7 +13,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  
> @@ -58,18 +57,6 @@ void mtk_mdp_comp_clock_on(struct mtk_mdp_comp *comp)
>  {
>   int i, err;
>  
> - if (comp->larb_dev) {
> - err = mtk_smi_larb_get(comp->larb_dev);
> - if (err) {
> - enum mtk_mdp_comp_type comp_type =
> - (enum mtk_mdp_comp_type)
> - of_device_get_match_data(comp->dev);
> - dev_err(comp->dev,
> - "failed to get larb, err %d. type:%d\n",
> - err, comp_type);
> - }
> - }
> -
>   err = pm_runtime_get_sync(comp->dev);
>   if (err < 0)
>   dev_err(comp->dev,
> @@ -97,9 +84,6 @@ void mtk_mdp_comp_clock_off(struct mtk_mdp_comp *comp)
>   clk_disable_unprepare(comp->clk[i]);
>   }
>  
> - if (comp->larb_dev)
> - mtk_smi_larb_put(comp->larb_dev);
> -
>   pm_runtime_put_sync(comp->dev);
>  }
>  
> @@ -132,12 +116,10 @@ static void mtk_mdp_comp_unbind(struct device *dev, 
> struct device *master,
>  
>  int mtk_mdp_comp_init(struct mtk_mdp_comp *comp, struct device *dev)
>  {
> - struct device_node *larb_node;
> - struct platform_device *larb_pdev;
> - int i;
>   struct device_node *node = dev->of_node;
>   enum mtk_mdp_comp_type comp_type =
>(enum mtk_mdp_comp_type)of_device_get_match_data(dev);
> + int i;
>  
> >   INIT_LIST_HEAD(&comp->node);
>   comp->dev = dev;
> @@ -156,30 +138,6 @@ int mtk_mdp_comp_init(struct mtk_mdp_comp *comp, struct 
> device *dev)
>   break;
>   }
>  
> - /* Only DMA capable components need the LARB property */
> - comp->larb_dev = NULL;
> - if (comp_type != MTK_MDP_RDMA &&
> - comp_type != MTK_MDP_WDMA &&
> - comp_type != MTK_MDP_WROT)
> - return 0;
> -
> - larb_node = of_parse_phandle(node, "mediatek,larb", 0);
> - if (!larb_node) {
> - dev_err(dev,
> - "Missing mediadek,larb phandle in %pOF node\n", node);
> - return -EINVAL;
> - }
> -
> - larb_pdev = of_find_device_by_node(larb_node);
> - if (!larb_pdev) {
> - dev_warn(dev, "Waiting for larb device %pOF\n", larb_node);
> - of_node_put(larb_node);
> - return -EPROBE_DEFER;
> - }
> - of_node_put(larb_node);
> -
> > - comp->larb_dev = &larb_pdev->dev;
> -
>   return 0;
>  }
>  
> diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_comp.h 
> b/drivers/media/platform/mtk-mdp/mtk_mdp_comp.h
> index de158d3..355e226 100644
> --- a/drivers/media/platform/mtk-mdp/mtk_mdp_comp.h
> +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_comp.h
> @@ -11,13 +11,11 @@
>   * struct mtk_mdp_comp - the MDP's function component data
>   * @node:list node to track sibing MDP components
>   * @clk: clocks required for component
> - * @larb_dev:SMI device required for component
>   * @dev: component's device
>   */
>  struct mtk_mdp_comp {
>   struct list_headnode;
>   struct clk  *clk[2];
> - struct device   *larb_dev;
>   struct device   *dev;
>  };
>  
> diff --git a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c 
> b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c
> index 133d107..bc5472d 100644
> --- a/drivers/media/platform/mtk-mdp/mtk_mdp_core.c
> +++ b/drivers/media/platform/mtk-mdp/mtk_mdp_core.c
> @@ -18,7 +18,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  
>  #include "mtk_mdp_comp.h"
>  #include "mtk_mdp_core.h"


Acked-by: Houlong Wei 


RE: [PATCH v6 06/15] iommu/vt-d: Support setting ioasid set to domain

2020-08-14 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Thursday, August 13, 2020 11:07 PM
> 
> Hi Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > From IOMMU p.o.v., PASIDs allocated and managed by external components
> > (e.g. VFIO) will be passed in for gpasid_bind/unbind operation. IOMMU
> > needs some knowledge to check the PASID ownership, hence add an
> > interface for those components to tell the PASID owner.
> >
> > In latest kernel design, PASID ownership is managed by IOASID set
> > where the PASID is allocated from. This patch adds support for setting
> > ioasid set ID to the domains used for nesting/vSVA. Subsequent SVA
> > operations will check the PASID against its IOASID set for proper ownership.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v5 -> v6:
> > *) address comments against v5 from Eric Auger.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > ---
> >  drivers/iommu/intel/iommu.c | 23 +++
> > include/linux/intel-iommu.h |  4 
> >  include/linux/iommu.h   |  1 +
> >  3 files changed, 28 insertions(+)
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index ed4b71c..b2fe54e 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -1793,6 +1793,7 @@ static struct dmar_domain *alloc_domain(int flags)
> > if (first_level_by_default())
> > domain->flags |= DOMAIN_FLAG_USE_FIRST_LEVEL;
> > domain->has_iotlb_device = false;
> > +   domain->ioasid_sid = INVALID_IOASID_SET;
> > > INIT_LIST_HEAD(&domain->devices);
> >
> > return domain;
> > @@ -6040,6 +6041,28 @@ intel_iommu_domain_set_attr(struct iommu_domain
> *domain,
> > }
> > > spin_unlock_irqrestore(&device_domain_lock, flags);
> > break;
> > +   case DOMAIN_ATTR_IOASID_SID:
> > +   {
> > +   int sid = *(int *)data;
> 
> > +
> > > +   spin_lock_irqsave(&device_domain_lock, flags);
> > +   if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE)) {
> > +   ret = -ENODEV;
> > > +   spin_unlock_irqrestore(&device_domain_lock, flags);
> > +   break;
> > +   }
> > +   if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
> > +   dmar_domain->ioasid_sid != sid) {
> > +   pr_warn_ratelimited("multi ioasid_set (%d:%d) setting",
> > +   dmar_domain->ioasid_sid, sid);
> > +   ret = -EBUSY;
> > > +   spin_unlock_irqrestore(&device_domain_lock, flags);
> > +   break;
> > +   }
> > +   dmar_domain->ioasid_sid = sid;
> > > +   spin_unlock_irqrestore(&device_domain_lock, flags);
> > +   break;
> nit: Adding a small helper
> int __set_ioasid_sid(struct dmar_domain *dmar_domain, int sid_id)
> 
> may simplify the lock handling

ok. will do.
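
For reference, a sketch of the suggested helper derived from the hunk
above (the name follows Eric's comment; the caller keeps a single
lock/unlock pair around it):

  static int __set_ioasid_sid(struct dmar_domain *dmar_domain, int sid)
  {
          if (!(dmar_domain->flags & DOMAIN_FLAG_NESTING_MODE))
                  return -ENODEV;
          if (dmar_domain->ioasid_sid != INVALID_IOASID_SET &&
              dmar_domain->ioasid_sid != sid)
                  return -EBUSY;
          dmar_domain->ioasid_sid = sid;
          return 0;
  }

  /* caller:
         spin_lock_irqsave(&device_domain_lock, flags);
         ret = __set_ioasid_sid(dmar_domain, *(int *)data);
         spin_unlock_irqrestore(&device_domain_lock, flags);  */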

> 
> > +   }
> > default:
> > ret = -EINVAL;
> > break;
> > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> > index 3f23c26..0d0ab32 100644
> > --- a/include/linux/intel-iommu.h
> > +++ b/include/linux/intel-iommu.h
> > @@ -549,6 +549,10 @@ struct dmar_domain {
> >2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
> > u64 max_addr;   /* maximum mapped address */
> >
> > +   int ioasid_sid; /*
> > +* the ioasid set which tracks all
> id of the ioasid set?

should be ioasid_set. however, ioasid_alloc_set() returns sid in Jacob's
series. but, I heard from Jacob, he will remove ioasid_sid, and return
ioasid_set. so I will modify it once his patch is sent out.

https://lore.kernel.org/linux-iommu/1585158931-1825-4-git-send-email-jacob.jun@linux.intel.com/

> > +* PASIDs used by the domain.
> > +*/
> > int default_pasid;  /*
> >  * The default pasid used for non-SVM
> >  * traffic on mediated devices.
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> > 4a02c9e..b1ff702 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -124,6 +124,7 @@ enum iommu_attr {
> > DOMAIN_ATTR_FSL_PAMUV1,
> > DOMAIN_ATTR_NESTING,/* two stages of translation */
> > DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
> > +   DOMAIN_ATTR_IOASID_SID,
> > DOMAIN_ATTR_MAX,
> >  };
> >
> >
> Besides
> Reviewed-by: Eric Auger 

thanks :-)

Regards,
Yi Liu

> 
> Eric



RE: [PATCH v6 05/15] vfio: Add PASID allocation/free support

2020-08-14 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Thursday, August 13, 2020 11:07 PM
> 
> Yi,
> 
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > Shared Virtual Addressing (a.k.a Shared Virtual Memory) allows sharing
> > multiple process virtual address spaces with the device for simplified
> > programming model. PASID is used to tag a virtual address space in
> > DMA requests and to identify the related translation structure in
> > IOMMU. When a PASID-capable device is assigned to a VM, we want the
> > same capability of using PASID to tag guest process virtual address
> > spaces to achieve virtual SVA (vSVA).
> >
> > PASID management for guest is vendor specific. Some vendors (e.g.
> > Intel
> > VT-d) requires system-wide managed PASIDs across all devices,
> > regardless of whether a device is used by host or assigned to guest.
> > Other vendors (e.g. ARM SMMU) may allow PASIDs managed per-device thus
> > could be fully delegated to the guest for assigned devices.
> >
> > For system-wide managed PASIDs, this patch introduces a vfio module to
> > handle explicit PASID alloc/free requests from guest. Allocated PASIDs
> > are associated to a process (or, mm_struct) in IOASID core. A vfio_mm
> > object is introduced to track mm_struct. Multiple VFIO containers
> > within a process share the same vfio_mm object.
> >
> > A quota mechanism is provided to prevent malicious user from
> > exhausting available PASIDs. Currently the quota is a global parameter
> > applied to all VFIO devices. In the future per-device quota might be 
> > supported too.
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Suggested-by: Alex Williamson 
> > Signed-off-by: Liu Yi L 
> > ---
> > v5 -> v6:
> > *) address comments from Eric. Add vfio_unlink_pasid() to be consistent
> >with vfio_unlink_dma(). Add a comment in vfio_pasid_exit().
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) address the comments from Alex on the pasid free range support. Added
> >per vfio_mm pasid r-b tree.
> >https://lore.kernel.org/kvm/20200709082751.32074...@x1.home/
> >
> > v3 -> v4:
> > *) fix lock leak in vfio_mm_get_from_task()
> > *) drop pasid_quota field in struct vfio_mm
> > *) vfio_mm_get_from_task() returns ERR_PTR(-ENOTTY) when
> > !CONFIG_VFIO_PASID
> >
> > v1 -> v2:
> > *) added in v2, split from the pasid alloc/free support of v1
> > ---
> >  drivers/vfio/Kconfig  |   5 +
> >  drivers/vfio/Makefile |   1 +
> >  drivers/vfio/vfio_pasid.c | 248
> ++
> >  include/linux/vfio.h  |  28 ++
> >  4 files changed, 282 insertions(+)
> >  create mode 100644 drivers/vfio/vfio_pasid.c
> >
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > fd17db9..3d8a108 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -19,6 +19,11 @@ config VFIO_VIRQFD
> > depends on VFIO && EVENTFD
> > default n
> >
> > +config VFIO_PASID
> > +   tristate
> > +   depends on IOASID && VFIO
> > +   default n
> > +
> >  menuconfig VFIO
> > tristate "VFIO Non-Privileged userspace driver framework"
> > depends on IOMMU_API
> > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
> > de67c47..bb836a3 100644
> > --- a/drivers/vfio/Makefile
> > +++ b/drivers/vfio/Makefile
> > @@ -3,6 +3,7 @@ vfio_virqfd-y := virqfd.o
> >
> >  obj-$(CONFIG_VFIO) += vfio.o
> >  obj-$(CONFIG_VFIO_VIRQFD) += vfio_virqfd.o
> > +obj-$(CONFIG_VFIO_PASID) += vfio_pasid.o
> >  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> >  obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> >  obj-$(CONFIG_VFIO_SPAPR_EEH) += vfio_spapr_eeh.o diff --git
> > a/drivers/vfio/vfio_pasid.c b/drivers/vfio/vfio_pasid.c new file mode
> > 100644 index 000..befcf29
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_pasid.c
> > @@ -0,0 +1,248 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Intel Corporation.
> > + * Author: Liu Yi L 
> > + *
> > + */
> > +
> > +#include 
> > +#include 
> not needed

oh, yes. 

> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define DRIVER_VERSION  "0.1"
> > +#define DRIVER_AUTHOR   "Liu Yi L "
> > +#define DRIVER_DESC "PASID management for VFIO bus drivers"
> > +
> > +#define VFIO_DEFAULT_PASID_QUOTA   1000
> > +static int pasid_quota = VFIO_DEFAULT_PASID_QUOTA;
> > +module_param_named(pasid_quota, pasid_quota, uint, 0444);
> > +MODULE_PARM_DESC(pasid_quota,
> > +"Set the quota for max number of PASIDs that an application is
> > +allowed to request (default 1000)");
> > +
> > +struct vfio_mm_token {
> > +   unsigned long long val;
> > +};
> > +
> > +struct vfio_mm {
> > +   struct kref kref;
> > +   int ioasid_sid;
> > +   struct mutexpasid_lock;
> > +   struct rb_root  pasid_list;
> > +   struct list_headnext;
> > +   struct 

RE: [PATCH v6 04/15] vfio/type1: Report iommu nesting info to userspace

2020-08-14 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Thursday, August 13, 2020 9:20 PM
>
> Hi Yi,
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > This patch exports iommu nesting capability info to user space through
> > VFIO. Userspace is expected to check this info for supported uAPIs (e.g.
> > PASID alloc/free, bind page table, and cache invalidation) and the
> > vendor specific format information for first level/stage page table
> > that will be bound to.
> >
> > The nesting info is available only after the container is set to NESTED type.
> > Current implementation imposes one limitation - one nesting container
> > should include at most one iommu group. The philosophy of vfio
> > container is having all groups/devices within the container share the
> > same IOMMU context. When vSVA is enabled, one IOMMU context could
> > include one 2nd- level address space and multiple 1st-level address
> > spaces. While the 2nd-level address space is reasonably sharable by
> > multiple groups, blindly sharing 1st-level address spaces across all
> > groups within the container might instead break the guest expectation.
> > In the future sub/super container concept might be introduced to allow
> > partial address space sharing within an IOMMU context. But for now
> > let's go with this restriction by requiring singleton container for
> > using nesting iommu features. Below link has the related discussion about 
> > this
> decision.
> >
> > https://lore.kernel.org/kvm/20200515115924.37e69...@w520.home/
> >
> > This patch also changes the NESTING type container behaviour.
> > Something that would have succeeded before will now fail: Before this
> > series, if user asked for a VFIO_IOMMU_TYPE1_NESTING, it would have
> > succeeded even if the SMMU didn't support stage-2, as the driver would
> > have silently fallen back on stage-1 mappings (which work exactly the
> > same as stage-2 only since there was no nesting supported). After the
> > series, we do check for DOMAIN_ATTR_NESTING so if user asks for
> > VFIO_IOMMU_TYPE1_NESTING and the SMMU doesn't support stage-2, the
> > ioctl fails. But it should be a good fix and completely harmless. Detail 
> > can be found
> in below link as well.
> >
> > https://lore.kernel.org/kvm/20200717090900.GC4850@myrica/
> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > ---
> > v5 -> v6:
> > *) address comments against v5 from Eric Auger.
> > *) don't report nesting cap to userspace if the nesting_info->format is
> >invalid.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> > *) return struct iommu_nesting_info for
> VFIO_IOMMU_TYPE1_INFO_CAP_NESTING as
> >cap is much "cheap", if needs extension in future, just define another 
> > cap.
> >https://lore.kernel.org/kvm/20200708132947.5b7ee...@x1.home/
> >
> > v3 -> v4:
> > *) address comments against v3.
> >
> > v1 -> v2:
> > *) added in v2
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 106
> +++-
> >  include/uapi/linux/vfio.h   |  19 +++
> >  2 files changed, 113 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index 3bd70ff..18ff0c3 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -62,18 +62,20 @@ MODULE_PARM_DESC(dma_entry_limit,
> >  "Maximum number of user DMA mappings per container (65535).");
> >
> >  struct vfio_iommu {
> > -   struct list_headdomain_list;
> > -   struct list_headiova_list;
> > -   struct vfio_domain  *external_domain; /* domain for external user */
> > -   struct mutexlock;
> > -   struct rb_root  dma_list;
> > -   struct blocking_notifier_head notifier;
> > -   unsigned intdma_avail;
> > -   uint64_tpgsize_bitmap;
> > -   boolv2;
> > -   boolnesting;
> > -   booldirty_page_tracking;
> > -   boolpinned_page_dirty_scope;
> > +   struct list_headdomain_list;
> > +   struct list_headiova_list;
> > +   /* domain for external user */
> > +   struct vfio_domain  *external_domain;
> > +   struct mutexlock;
> > +   struct rb_root  dma_list;
> > +   struct blocking_notifier_head   notifier;
> > +   unsigned intdma_avail;
> > +   uint64_tpgsize_bitmap;
> > +   boolv2;
> > +   boolnesting;
> > +   booldirty_page_tracking;
> > +   boolpinned_page_dirty_scope;
> > +   struct iommu_nesting_info   *nesting_info;
> >  };
> >
> >  struct vfio_domain {
> > @@ -130,6 +132,9 @@ struct vfio_regions {
> >  #define 

RE: [PATCH v6 02/15] iommu: Report domain nesting info

2020-08-14 Thread Liu, Yi L
Hi Eric,

> From: Auger Eric 
> Sent: Thursday, August 13, 2020 8:53 PM
> 
> Yi,
> On 7/28/20 8:27 AM, Liu Yi L wrote:
> > IOMMUs that support nesting translation needs report the capability info
> s/needs/need to
> > to userspace. It gives information about requirements the userspace needs
> > to implement plus other features characterizing the physical implementation.
> >
> > This patch reports nesting info by DOMAIN_ATTR_NESTING. Caller can get
> > nesting info after setting DOMAIN_ATTR_NESTING. For VFIO, it is after
> > selecting VFIO_TYPE1_NESTING_IOMMU.
> This is not what this patch does ;-) It introduces a new IOMMU UAPI
> struct that gives information about the nesting capabilities and
> features. This struct is supposed to be returned by
> iommu_domain_get_attr() with DOMAIN_ATTR_NESTING attribute parameter,
> on a domain whose type has been set to DOMAIN_ATTR_NESTING.

got it. let me apply your suggestion. thanks. :-)
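
i.e. retrieval would look roughly like this (a sketch:
iommu_domain_get_attr() is the existing in-kernel API, the surrounding
usage is assumed):

  struct iommu_nesting_info info = { .argsz = sizeof(info) };

  /* only meaningful once the domain is configured for nesting */
  if (!iommu_domain_get_attr(domain, DOMAIN_ATTR_NESTING, &info))
          pr_info("nesting: format %u addr_width %u pasid_bits %u\n",
                  info.format, info.addr_width, info.pasid_bits);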

> >
> > Cc: Kevin Tian 
> > CC: Jacob Pan 
> > Cc: Alex Williamson 
> > Cc: Eric Auger 
> > Cc: Jean-Philippe Brucker 
> > Cc: Joerg Roedel 
> > Cc: Lu Baolu 
> > Signed-off-by: Liu Yi L 
> > Signed-off-by: Jacob Pan 
> > ---
> > v5 -> v6:
> > *) rephrase the feature notes per comments from Eric Auger.
> > *) rename @size of struct iommu_nesting_info to @argsz.
> >
> > v4 -> v5:
> > *) address comments from Eric Auger.
> >
> > v3 -> v4:
> > *) split the SMMU driver changes to be a separate patch
> > *) move the @addr_width and @pasid_bits from vendor specific
> >part to generic part.
> > *) tweak the description for the @features field of struct
> >iommu_nesting_info.
> > *) add description on the @data[] field of struct iommu_nesting_info
> >
> > v2 -> v3:
> > *) remove cap/ecap_mask in iommu_nesting_info.
> > *) reuse DOMAIN_ATTR_NESTING to get nesting info.
> > *) return an empty iommu_nesting_info for SMMU drivers per Jean'
> >suggestion.
> > ---
> >  include/uapi/linux/iommu.h | 74
> ++
> >  1 file changed, 74 insertions(+)
> >
> > diff --git a/include/uapi/linux/iommu.h b/include/uapi/linux/iommu.h
> > index 7c8e075..5e4745a 100644
> > --- a/include/uapi/linux/iommu.h
> > +++ b/include/uapi/linux/iommu.h
> > @@ -332,4 +332,78 @@ struct iommu_gpasid_bind_data {
> > } vendor;
> >  };
> >
> > +/*
> > + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> > + *userspace should check it before using
> > + *nesting capability.
> > + *
> > + * @argsz: size of the whole structure.
> > + * @flags: currently reserved for future extension. must set to 0.
> > + * @format:PASID table entry format, the same definition as struct
> > + * iommu_gpasid_bind_data @format.
> > + * @features:  supported nesting features.
> > + * @addr_width:The output addr width of first level/stage translation
> > + * @pasid_bits:Maximum supported PASID bits, 0 represents no PASID
> > + * support.
> > + * @data:  vendor specific cap info. data[] structure type can be deduced
> > + * from @format field.
> > + *
> > + * +===============+======================================================+
> > + * | feature       |  Notes                                               |
> > + * +===============+======================================================+
> > + * | SYSWIDE_PASID |  IOMMU vendor driver sets it to mandate userspace    |
> > + * |               |  to allocate PASID from kernel. All PASID allocation |
> > + * |               |  free must be mediated through the TBD API.          |
> s/TBD/IOMMU

got it.

> > + * +---------------+------------------------------------------------------+
> > + * | BIND_PGTBL    |  IOMMU vendor driver sets it to mandate userspace    |
> > + * |               |  bind the first level/stage page table to associated |
> s/bind/to bind

got it.

> > + * |               |  PASID (either the one specified in bind request or  |
> > + * |               |  the default PASID of iommu domain), through IOMMU   |
> > + * |               |  UAPI.                                               |
> > + * +---------------+------------------------------------------------------+
> > + * | CACHE_INVLD   |  IOMMU vendor driver sets it to mandate userspace    |
> 
> > + * |               |  explicitly invalidate the IOMMU cache through IOMMU |
> to explicitly

I see.

> > + * |               |  UAPI according to vendor-specific requirement when  |
> > + * |               |  changing the 1st level/stage page table.            |
> > + * +---------------+------------------------------------------------------+
> > + *
> > + * @data[] types defined for @format:
> > + *
> > + * +================================+=====================================+
> > + * | @format                        | @data[]                             |
> > + * +================================+=====================================+
> > + * | 

Re: [PATCH v3 2/2] dma-pool: Only allocate from CMA when in same memory zone

2020-08-14 Thread Christoph Hellwig
On Sat, Aug 08, 2020 at 08:33:54AM +0200, Christoph Hellwig wrote:
> On Fri, Aug 07, 2020 at 10:50:19AM +0200, Nicolas Saenz Julienne wrote:
> > On Fri, 2020-08-07 at 07:21 +0200, Christoph Hellwig wrote:
> > > On Thu, Aug 06, 2020 at 08:47:55PM +0200, Nicolas Saenz Julienne wrote:
> > > > There is no guarantee to CMA's placement, so allocating a zone specific
> > > > atomic pool from CMA might return memory from a completely different
> > > > memory zone. To get around this double check CMA's placement before
> > > > allocating from it.
> > > 
> > > As the buildbot pointed out, memblock_start_of_DRAM can't be used from
> > > non-__init code.  But looking at it I think throwing that in
> > > is bogus anyway, as cma_get_base returns a proper physical address
> > > already.
> > 
> > It does indeed, but I'm comparing CMA's base with bitmasks that don't take 
> > into
> > account where the memory starts. Say memory starts at 0x80000000,
> > and CMA falls
> > into ZONE_DMA [0x80000000 0xC0000000], if you want to compare it with
> > DMA_BIT_MASK(zone_dma_bits) you're forced to unify the memory bases.
> > 
> > That said, I now realize that this doesn't work for ZONE_DMA32 which has a 
> > hard
> > limit on 32bit addresses regardless of the memory base.
> > 
> > That said I still need to call memblock_start_of_DRAM() any suggestions WRT
> > that? I could save the value in dma_atomic_pool_init(), which is __init 
> > code.
> 
> The pool must be about a 32-bit physical address.  The offsets can be
> different for every device..

Do you plan to resend this one without the memblock_start_of_DRAM
thingy?