Re: PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup

2021-06-24 Thread Bharata B Rao
On Fri, Jun 25, 2021 at 11:16:08AM +0530, Srikar Dronamraju wrote:
> * Bharata B Rao  [2021-06-24 21:25:09]:
> 
> > A PowerPC KVM guest gets the following BUG message when booting
> > linux-next-20210623:
> > 
> > smp: Bringing up secondary CPUs ...
> > BUG: scheduling while atomic: swapper/1/0/0x
> > no locks held by swapper/1/0.
> > Modules linked in:
> > CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.0-rc7-next-20210623
> > Call Trace:
> > [cae5bc20] [c0badc64] dump_stack_lvl+0x98/0xe0 (unreliable)
> > [cae5bc60] [c0210200] __schedule_bug+0xb0/0xe0
> > [cae5bcd0] [c1609e28] __schedule+0x1788/0x1c70
> > [cae5be20] [c160a8cc] schedule_idle+0x3c/0x70
> > [cae5be50] [c022984c] do_idle+0x2bc/0x420
> > [cae5bf00] [c0229d88] cpu_startup_entry+0x38/0x40
> > [cae5bf30] [c00666c0] start_secondary+0x290/0x2a0
> > [cae5bf90] [c000be54] start_secondary_prolog+0x10/0x14
> > 
> > 
> > 
> > smp: Brought up 2 nodes, 16 CPUs
> > numa: Node 0 CPUs: 0-7
> > numa: Node 1 CPUs: 8-15
> > 
> > This seems to have started from next-20210521 and isn't seen on
> > next-20210511.
> > 
> 
> Bharata,
> 
> I think the regression is due to Commit f1a0a376ca0c ("sched/core:
> Initialize the idle task with preemption disabled")
> 
> Can you please try with the above commit reverted?

Yes, reverting that commit helps.

Regards,
Bharata.
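
For context, a sketch of the suspected mechanism (illustrative only; this
is not the actual powerpc code nor the eventual fix): f1a0a376ca0c creates
the idle task with preemption already disabled, so an architecture
secondary-startup path that still does its own preempt_disable() leaves
the preempt count elevated when the idle loop first schedules.

	/* Illustrative sketch, not the real start_secondary(). */
	void start_secondary(void)
	{
		/* ... per-CPU bring-up ... */
		preempt_disable();	/* redundant once the idle task already
					 * starts non-preemptible; never paired
					 * with an enable before schedule_idle()
					 * runs, hence "scheduling while atomic".
					 */
		cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
	}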


Re: PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup

2021-06-24 Thread Srikar Dronamraju
* Bharata B Rao  [2021-06-24 21:25:09]:

> A PowerPC KVM guest gets the following BUG message when booting
> linux-next-20210623:
> 
> smp: Bringing up secondary CPUs ...
> BUG: scheduling while atomic: swapper/1/0/0x
> no locks held by swapper/1/0.
> Modules linked in:
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.0-rc7-next-20210623
> Call Trace:
> [cae5bc20] [c0badc64] dump_stack_lvl+0x98/0xe0 (unreliable)
> [cae5bc60] [c0210200] __schedule_bug+0xb0/0xe0
> [cae5bcd0] [c1609e28] __schedule+0x1788/0x1c70
> [cae5be20] [c160a8cc] schedule_idle+0x3c/0x70
> [cae5be50] [c022984c] do_idle+0x2bc/0x420
> [cae5bf00] [c0229d88] cpu_startup_entry+0x38/0x40
> [cae5bf30] [c00666c0] start_secondary+0x290/0x2a0
> [cae5bf90] [c000be54] start_secondary_prolog+0x10/0x14
> 
> 
> 
> smp: Brought up 2 nodes, 16 CPUs
> numa: Node 0 CPUs: 0-7
> numa: Node 1 CPUs: 8-15
> 
> This seems to have started from next-20210521 and isn't seen on
> next-20210511.
> 

Bharata,

I think the regression is due to Commit f1a0a376ca0c ("sched/core:
Initialize the idle task with preemption disabled")

Can you please try with the above commit reverted?

-- 
Thanks and Regards
Srikar Dronamraju


[PATCH v3] mm: pagewalk: Fix walk for hugepage tables

2021-06-24 Thread Christophe Leroy
Pagewalk ignores hugepd entries and walks down the tables
as if they were traditional entries, leading to crazy results.

Add walk_hugepd_range() and use it to walk hugepage tables.

Signed-off-by: Christophe Leroy 
Reviewed-by: Steven Price 
---
v3:
- Rebased on next-20210624 (no change since v2)
- Added Steven's Reviewed-by
- Sent as standalone for merge via mm

v2:
- Add a guard for NULL ops->pte_entry
- Take mm->page_table_lock when walking hugepage table, as suggested by follow_huge_pd()
---
 mm/pagewalk.c | 58 ++-
 1 file changed, 53 insertions(+), 5 deletions(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d9f177..9b3db11a4d1d 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,6 +58,45 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
return err;
 }
 
+#ifdef CONFIG_ARCH_HAS_HUGEPD
+static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,
+			     unsigned long end, struct mm_walk *walk, int pdshift)
+{
+   int err = 0;
+   const struct mm_walk_ops *ops = walk->ops;
+   int shift = hugepd_shift(*phpd);
+   int page_size = 1 << shift;
+
+   if (!ops->pte_entry)
+   return 0;
+
+   if (addr & (page_size - 1))
+   return 0;
+
+   for (;;) {
+   pte_t *pte;
+
+		spin_lock(&walk->mm->page_table_lock);
+   pte = hugepte_offset(*phpd, addr, pdshift);
+   err = ops->pte_entry(pte, addr, addr + page_size, walk);
+		spin_unlock(&walk->mm->page_table_lock);
+
+   if (err)
+   break;
+   if (addr >= end - page_size)
+   break;
+   addr += page_size;
+   }
+   return err;
+}
+#else
+static int walk_hugepd_range(hugepd_t *phpd, unsigned long addr,
+			     unsigned long end, struct mm_walk *walk, int pdshift)
+{
+   return 0;
+}
+#endif
+
 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
  struct mm_walk *walk)
 {
@@ -108,7 +147,10 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
goto again;
}
 
-   err = walk_pte_range(pmd, addr, next, walk);
+		if (is_hugepd(__hugepd(pmd_val(*pmd))))
+			err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT);
+   else
+   err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
} while (pmd++, addr = next, addr != end);
@@ -157,7 +199,10 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
if (pud_none(*pud))
goto again;
 
-   err = walk_pmd_range(pud, addr, next, walk);
+		if (is_hugepd(__hugepd(pud_val(*pud))))
+			err = walk_hugepd_range((hugepd_t *)pud, addr, next, walk, PUD_SHIFT);
+   else
+   err = walk_pmd_range(pud, addr, next, walk);
if (err)
break;
} while (pud++, addr = next, addr != end);
@@ -189,7 +234,9 @@ static int walk_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
if (err)
break;
}
-   if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
+	if (is_hugepd(__hugepd(p4d_val(*p4d))))
+		err = walk_hugepd_range((hugepd_t *)p4d, addr, next, walk, P4D_SHIFT);
+	else if (ops->pud_entry || ops->pmd_entry || ops->pte_entry)
err = walk_pud_range(p4d, addr, next, walk);
if (err)
break;
@@ -224,8 +271,9 @@ static int walk_pgd_range(unsigned long addr, unsigned long end,
if (err)
break;
}
-   if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry ||
-   ops->pte_entry)
+	if (is_hugepd(__hugepd(pgd_val(*pgd))))
+		err = walk_hugepd_range((hugepd_t *)pgd, addr, next, walk, PGDIR_SHIFT);
+	else if (ops->p4d_entry || ops->pud_entry || ops->pmd_entry || ops->pte_entry)
err = walk_p4d_range(pgd, addr, next, walk);
if (err)
break;
-- 
2.25.0
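
For readers less familiar with the pagewalk API, a minimal sketch of a
walker affected by this fix (hypothetical callback names, not part of the
patch): with walk_hugepd_range() in place, the pte_entry callback is
invoked once per huge page mapped through a hugepd, at the hugepage size,
instead of the walk misreading the hugepd as a regular page-table page.

	/* Hypothetical walker, for illustration only. */
	static int show_pte(pte_t *pte, unsigned long addr,
			    unsigned long next, struct mm_walk *walk)
	{
		pr_info("pte @ %lx-%lx: %lx\n", addr, next,
			(unsigned long)pte_val(*pte));
		return 0;
	}

	static const struct mm_walk_ops show_ops = {
		.pte_entry = show_pte,
	};

	/* With mmap_lock held: walk_page_range(mm, start, end, &show_ops, NULL); */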



Re: [PATCH v2 1/4] mm: pagewalk: Fix walk for hugepage tables

2021-06-24 Thread Christophe Leroy




On 25/06/2021 at 06:45, Michael Ellerman wrote:
> Christophe Leroy  writes:
>> Hi Michael,
>>
>> On 19/04/2021 at 12:47, Christophe Leroy wrote:
>>> Pagewalk ignores hugepd entries and walks down the tables as if they
>>> were traditional entries, leading to crazy results.
>>>
>>> Add walk_hugepd_range() and use it to walk hugepage tables.
>>
>> I see you took patch 2 and 3 of the series.
>
> Yeah I decided those were bug fixes so could be taken separately.
>
>> Do you expect Andrew to take patch 1 via mm tree, and then you'll take
>> patch 4 once mm tree is merged ?
>
> I didn't feel I could take patch 1 via the powerpc tree without risking
> conflicts.
>
> Andrew could take patch 1 and 4 via mm, though he might not want to pick
> them up this late.

Patch 4 needs patches 2 and 3 and doesn't apply without them, so it is
not that easy.

Maybe Andrew you can take patch 1 now, and then Michael you can take
patch 4 at any time during 5.15 preparation without any conflict risk?

> I guess step one would be to repost 1 and 4 as a new series. Either they
> can go via mm, or for 5.15 I could probably take them both as long as I
> pick them up early enough.

I'll first repost patch 1 as standalone and see what happens.

Christophe


[PATCH] powerpc/pseries/vas: Include irqdomain.h

2021-06-24 Thread Michael Ellerman
There are patches in flight to break the dependency between asm/irq.h
and linux/irqdomain.h, which would break compilation of vas.c because it
needs the declaration of irq_create_mapping() etc.

So add an explicit include of irqdomain.h to avoid that becoming a
problem in future.

Reported-by: Stephen Rothwell 
Signed-off-by: Stephen Rothwell 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/platforms/pseries/vas.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index 3385b5400cc6..b5c1cf1bc64d 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/irqdomain.h>
 #include 
 #include 
 #include 
-- 
2.25.1
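
For context, the kind of call that needs the declaration, as a sketch
only (hypothetical function, not the vas.c code):

	#include <linux/irqdomain.h>

	/*
	 * irq_create_mapping() is declared in linux/irqdomain.h, so any
	 * file calling it must include that header explicitly once the
	 * implicit pull-in via asm/irq.h goes away.
	 */
	static unsigned int example_map_irq(struct irq_domain *domain,
					    irq_hw_number_t hwirq)
	{
		return irq_create_mapping(domain, hwirq); /* 0 on failure */
	}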



Re: [PATCH v2] powerpc/kprobes: Fix Oops by passing ppc_inst as a pointer to emulate_step() on ppc32

2021-06-24 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 24/06/2021 à 12:59, Naveen N. Rao a écrit :
>> Christophe Leroy wrote:
>>> From: Naveen N. Rao 
>>>
>>> Trying to use a kprobe on ppc32 results in the below splat:
>>>     BUG: Unable to handle kernel data access on read at 0x7c0802a6
>>>     Faulting instruction address: 0xc002e9f0
>>>     Oops: Kernel access of bad area, sig: 11 [#1]
>>>     BE PAGE_SIZE=4K PowerPC 44x Platform
>>>     Modules linked in:
>>>     CPU: 0 PID: 89 Comm: sh Not tainted 5.13.0-rc1-01824-g3a81c0495fdb #7
>>>     NIP:  c002e9f0 LR: c0011858 CTR: 8a47
>>>     REGS: c292fd50 TRAP: 0300   Not tainted  (5.13.0-rc1-01824-g3a81c0495fdb)
>>>     MSR:  9000   CR: 24002002  XER: 2000
>>>     DEAR: 7c0802a6 ESR: 
>>>     
>>>     NIP [c002e9f0] emulate_step+0x28/0x324
>>>     LR [c0011858] optinsn_slot+0x128/0x1
>>>     Call Trace:
>>>  opt_pre_handler+0x7c/0xb4 (unreliable)
>>>  optinsn_slot+0x128/0x1
>>>  ret_from_syscall+0x0/0x28
>>>
>>> The offending instruction is:
>>>     81 24 00 00 lwz r9,0(r4)
>>>
>>> Here, we are trying to load the second argument to emulate_step():
>>> struct ppc_inst, which is the instruction to be emulated. On ppc64,
>>> structures are passed in registers when passed by value. However, per
>>> the ppc32 ABI, structures are always passed to functions as pointers.
>>> This isn't being adhered to when setting up the call to emulate_step()
>>> in the optprobe trampoline. Fix the same.
>>>
>>> Fixes: eacf4c0202654a ("powerpc: Enable OPTPROBES on PPC32")
>>> Cc: sta...@vger.kernel.org
>>> Signed-off-by: Naveen N. Rao 
>>> ---
>>> v2: Rebased on powerpc/merge 7f030e9d57b8
>>> Signed-off-by: Christophe Leroy 
>> 
>> Thanks for rebasing this!
>> 
>> I think git am drops everything after three dashes, so applying this
>> patch will drop your SOB. The recommended style (*) for adding a
>> changelog is to include it within [] before the second SOB.
>
> Yes, I saw that after sending the mail. Usually I add a signed-off-by
> with 'git --amend -s' when I add the history, so it goes before the
> '---' I'm adding.
>
> This time I must have forgotten the '-s' so it was added by the 'git
> format-patch -s' which is in my submit script, and so it was added at
> the end.
>
> It's not really a big deal, up to Michael to either move it to the
> right place or discard it, I don't really mind. Do the easiest for you.

I just added Christophe's SoB.

cheers
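
A side note on the ABI point above, as a sketch with hypothetical names
(not the kprobes code): on ppc64 ELFv2 a small aggregate passed by value
can travel directly in GPRs, whereas the ppc32 SysV ABI passes the
address of a caller-made copy, which is why the optprobe trampoline must
load a pointer, not the instruction word, into the argument register.

	/* Hypothetical illustration of the calling-convention difference. */
	struct insn { unsigned int word; };

	int eat(struct insn i);		/* by value at the C level */

	int feed(struct insn *p)
	{
		/*
		 * ppc64 ELFv2: the 4-byte struct can be passed in a GPR.
		 * ppc32 SysV: the compiler copies *p and passes the copy's
		 * address in the argument GPR, so the callee reads it with
		 * a load such as lwz rN,0(rA).
		 */
		return eat(*p);
	}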


Re: [PATCH v2 1/4] mm: pagewalk: Fix walk for hugepage tables

2021-06-24 Thread Michael Ellerman
Christophe Leroy  writes:
> Hi Michael,
>
> Le 19/04/2021 à 12:47, Christophe Leroy a écrit :
> >> Pagewalk ignores hugepd entries and walks down the tables
> >> as if they were traditional entries, leading to crazy results.
>> 
>> Add walk_hugepd_range() and use it to walk hugepage tables.
>
> I see you took patch 2 and 3 of the series.

Yeah I decided those were bug fixes so could be taken separately.

> Do you expect Andrew to take patch 1 via mm tree, and then you'll take
> patch 4 once mm tree is merged ?

I didn't feel I could take patch 1 via the powerpc tree without risking
conflicts.

Andrew could take patch 1 and 4 via mm, though he might not want to pick
them up this late.

I guess step one would be to repost 1 and 4 as a new series. Either they
can go via mm, or for 5.15 I could probably take them both as long as I
pick them up early enough.

cheers


[powerpc:merge] BUILD SUCCESS dc4225b1afe8424b3fdf56599cdac283be683c69

2021-06-24 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge
branch HEAD: dc4225b1afe8424b3fdf56599cdac283be683c69  Automatic merge of 'next' into merge (2021-06-25 00:45)

elapsed time: 734m

configs tested: 105
configs skipped: 3

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
arm        defconfig
arm64      allyesconfig
arm64      defconfig
arm        allyesconfig
arm        allmodconfig
sh         kfr2r09-romimage_defconfig
sh         ul2_defconfig
arm        spear3xx_defconfig
alpha      alldefconfig
m68k       amiga_defconfig
powerpc    ep8248e_defconfig
mips       omega2p_defconfig
arm        s5pv210_defconfig
powerpc    akebono_defconfig
xtensa     generic_kc705_defconfig
openrisc   or1ksim_defconfig
ia64       allmodconfig
sh         edosk7760_defconfig
powerpc    katmai_defconfig
powerpc    mpc834x_mds_defconfig
arc        nsimosci_defconfig
powerpc    mpc8540_ads_defconfig
mips       maltaup_xpa_defconfig
arm        multi_v4t_defconfig
um         defconfig
powerpc    mpc834x_itxgp_defconfig
powerpc    mpc8313_rdb_defconfig
arm        milbeaut_m10v_defconfig
arm        tegra_defconfig
s390       debug_defconfig
powerpc    ebony_defconfig
powerpc    holly_defconfig
x86_64     allnoconfig
ia64       defconfig
ia64       allyesconfig
m68k       allmodconfig
m68k       defconfig
m68k       allyesconfig
nios2      defconfig
arc        allyesconfig
nds32      allnoconfig
nds32      defconfig
nios2      allyesconfig
csky       defconfig
alpha      defconfig
alpha      allyesconfig
xtensa     allyesconfig
h8300      allyesconfig
arc        defconfig
sh         allmodconfig
parisc     defconfig
s390       allyesconfig
s390       allmodconfig
parisc     allyesconfig
s390       defconfig
i386       allyesconfig
sparc      allyesconfig
sparc      defconfig
i386       defconfig
mips       allyesconfig
mips       allmodconfig
powerpc    allyesconfig
powerpc    allmodconfig
powerpc    allnoconfig
i386       randconfig-a001-20210622
i386       randconfig-a002-20210622
i386       randconfig-a003-20210622
i386       randconfig-a006-20210622
i386       randconfig-a005-20210622
i386       randconfig-a004-20210622
x86_64     randconfig-a012-20210622
x86_64     randconfig-a016-20210622
x86_64     randconfig-a015-20210622
x86_64     randconfig-a014-20210622
x86_64     randconfig-a013-20210622
x86_64     randconfig-a011-20210622
i386       randconfig-a011-20210622
i386       randconfig-a014-20210622
i386       randconfig-a013-20210622
i386       randconfig-a015-20210622
i386       randconfig-a016-20210622
i386       randconfig-a012-20210622
riscv      nommu_k210_defconfig
riscv      allyesconfig
riscv      nommu_virt_defconfig
riscv      allnoconfig
riscv      defconfig
riscv      rv32_defconfig
riscv      allmodconfig
x86_64     rhel-8.3-kselftests
um         x86_64_defconfig
um         i386_defconfig
um         kunit_defconfig
x86_64     allyesconfig
x86_64     defconfig
x86_64     rhel-8.3
x86_64     rhel-8.3-kbuiltin
x86_64     kexec

clang tested configs:
x86_64   randconfig-b001-20210622
x86_64   

Re: [PATCH v15 00/12] Restricted DMA

2021-06-24 Thread Claire Chang
On Fri, Jun 25, 2021 at 3:20 AM Konrad Rzeszutek Wilk wrote:
>
> On Thu, Jun 24, 2021 at 11:55:14PM +0800, Claire Chang wrote:
> > This series implements mitigations for lack of DMA access control on
> > systems without an IOMMU, which could result in the DMA accessing the
> > system memory at unexpected times and/or unexpected addresses, possibly
> > leading to data leakage or corruption.
> >
> > For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> > not behind an IOMMU. As PCI-e, by design, gives the device full access to
> > system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> > to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> > full chain of exploits; [2], [3]).
> >
> > To mitigate the security concerns, we introduce restricted DMA. Restricted
> > DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> > specially allocated region and does memory allocation from the same region.
> > The feature on its own provides a basic level of protection against the DMA
> > overwriting buffer contents at unexpected times. However, to protect
> > against general data leakage and system memory corruption, the system needs
> > to provide a way to restrict the DMA to a predefined memory region (this is
> > usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
> >
> > [1a] 
> > https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
> > [1b] 
> > https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
> > [2] https://blade.tencent.com/en/advisories/qualpwn/
> > [3] 
> > https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
> > [4] 
> > https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
> >
> > v15:
> > - Apply Will's diff 
> > (https://lore.kernel.org/patchwork/patch/1448957/#1647521)
> >   to fix the crash reported by Qian.
> > - Add Stefano's Acked-by tag for patch 01/12 from v14
>
> That all should be now be on
>
> https://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git/
> devel/for-linus-5.14 (and linux-next)
>

devel/for-linus-5.14 looks good. Thanks!


Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Nicholas Piggin
Excerpts from Paolo Bonzini's message of June 25, 2021 1:35 am:
> On 24/06/21 14:57, Nicholas Piggin wrote:
>> KVM: Fix page ref underflow for regions with valid but non-refcounted pages
> 
> It doesn't really fix the underflow, it disallows mapping them in the 
> first place.  Since in principle things can break, I'd rather be 
> explicit, so let's go with "KVM: do not allow mapping valid but 
> non-reference-counted pages".
> 
>> It's possible to create a region which maps valid but non-refcounted
>> pages (e.g., tail pages of non-compound higher order allocations). These
>> host pages can then be returned by gfn_to_page, gfn_to_pfn, etc., family
>> of APIs, which take a reference to the page, which takes it from 0 to 1.
>> When the reference is dropped, this will free the page incorrectly.
>> 
>> Fix this by only taking a reference on the page if it was non-zero,
> 
> s/on the page/on valid pages/ (makes clear that invalid pages are fine 
> without refcounting).

That seems okay, you can adjust the title or changelog as you like.

> Thank you *so* much, I'm awful at Linux mm.

Glad to help. It's easy to see why you were taking this approach: the
API really does need to be improved, and even a subsystem as entwined
with mm as KVM shouldn't _really_ be doing this kind of trick (and it
should go away when the old API is removed).

Thanks,
Nick
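
The shape of the fix, as a hedged sketch (the error path below is
hypothetical, not the exact KVM patch): only pin pages whose refcount is
already non-zero, since elevating a valid but zero-ref page (e.g. a tail
page of a non-compound higher-order allocation) from 0 to 1 means the
eventual put_page() frees memory that is still in use.

	/* Sketch of the idea, not the exact patch. */
	if (page && !get_page_unless_zero(page))
		return KVM_PFN_ERR_FAULT;	/* hypothetical error path */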


Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns

2021-06-24 Thread Sean Christopherson
On Thu, Jun 24, 2021, Paolo Bonzini wrote:
> On 24/06/21 10:43, Nicholas Piggin wrote:
> > Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> > > From: David Stevens 
> > 
> > Changelog? This looks like a bug, should it have a Fixes: tag?
> 
> Probably has been there forever... The best way to fix the bug would be to
> nuke mmu_audit.c, which I've threatened to do many times but never followed
> up on.

Yar.  It has only survived because it hasn't required any maintenance.


Re: [PATCH v2 1/4] mm: pagewalk: Fix walk for hugepage tables

2021-06-24 Thread Christophe Leroy

Hi Michael,

On 19/04/2021 at 12:47, Christophe Leroy wrote:
> Pagewalk ignores hugepd entries and walks down the tables as if they
> were traditional entries, leading to crazy results.
>
> Add walk_hugepd_range() and use it to walk hugepage tables.

I see you took patch 2 and 3 of the series.

Do you expect Andrew to take patch 1 via mm tree, and then you'll take
patch 4 once mm tree is merged ?

Christophe





Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Konrad Rzeszutek Wilk
On Thu, Jun 24, 2021 at 11:58:57PM +0800, Claire Chang wrote:
> On Thu, Jun 24, 2021 at 11:56 PM Konrad Rzeszutek Wilk wrote:
> >
> > On Thu, Jun 24, 2021 at 10:10:51AM -0400, Qian Cai wrote:
> > >
> > >
> > > On 6/24/2021 7:48 AM, Will Deacon wrote:
> > > > Ok, diff below which attempts to tackle the offset issue I mentioned as
> > > > well. Qian Cai -- please can you try with these changes?
> > >
> > > This works fine.
> >
> > Cool. Let me squash this patch in #6 and rebase the rest of them.
> >
> > Claire, could you check the devel/for-linus-5.14 say by end of today to
> > double check that I didn't mess anything up please?
> 
> I just submitted v15 here
> (https://lore.kernel.org/patchwork/cover/1451322/) in case it's
> helpful.

Oh! Nice!
> I'll double check of course. Thanks for the efforts!

I ended up using your patch #6 and #7. Please double-check.


Re: [PATCH v15 00/12] Restricted DMA

2021-06-24 Thread Konrad Rzeszutek Wilk
On Thu, Jun 24, 2021 at 11:55:14PM +0800, Claire Chang wrote:
> This series implements mitigations for lack of DMA access control on
> systems without an IOMMU, which could result in the DMA accessing the
> system memory at unexpected times and/or unexpected addresses, possibly
> leading to data leakage or corruption.
> 
> For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> not behind an IOMMU. As PCI-e, by design, gives the device full access to
> system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> full chain of exploits; [2], [3]).
> 
> To mitigate the security concerns, we introduce restricted DMA. Restricted
> DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> specially allocated region and does memory allocation from the same region.
> The feature on its own provides a basic level of protection against the DMA
> overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system needs
> to provide a way to restrict the DMA to a predefined memory region (this is
> usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
> 
> [1a] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
> [1b] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
> [2] https://blade.tencent.com/en/advisories/qualpwn/
> [3] 
> https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
> [4] 
> https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
> 
> v15:
> - Apply Will's diff (https://lore.kernel.org/patchwork/patch/1448957/#1647521)
>   to fix the crash reported by Qian.
> - Add Stefano's Acked-by tag for patch 01/12 from v14

That all should be now be on

https://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git/
devel/for-linus-5.14 (and linux-next)
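
Worth spelling out from the cover letter: the consumer side is
unchanged. A sketch (hypothetical driver helper, assuming the device's
DT node references a restricted-dma-pool region):

	/*
	 * The streaming DMA API is unmodified; with this series the
	 * mapping transparently bounces through the device's restricted
	 * pool because is_swiotlb_force_bounce(dev) is true for it.
	 */
	static dma_addr_t wifi_map_tx_buf(struct device *dev, void *buf,
					  size_t len)
	{
		return dma_map_single(dev, buf, len, DMA_TO_DEVICE);
	}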



Re: [PATCH printk v3 3/6] printk: remove safe buffers

2021-06-24 Thread Petr Mladek
On Thu 2021-06-24 13:17:45, John Ogness wrote:
> With @logbuf_lock removed, the high level printk functions for
> storing messages are lockless. Messages can be stored from any
> context, so there is no need for the NMI and safe buffers anymore.
> Remove the NMI and safe buffers.
> 
> Although the safe buffers are removed, the NMI and safe context
> tracking is still in place. In these contexts, store the message
> immediately but still use irq_work to defer the console printing.
> 
> Since printk recursion tracking is in place, safe context tracking
> for most of printk is not needed. Remove it. Only safe context
> tracking relating to the console lock is left in place. This is
> because the console lock is needed for the actual printing.

Feel free to use:

Reviewed-by: Petr Mladek 

There are some comments below.

> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -1852,7 +1839,7 @@ static int console_trylock_spinning(void)
>   if (console_trylock())
>   return 1;
>  
> - printk_safe_enter_irqsave(flags);
> + local_irq_save(flags);
>  
>   raw_spin_lock(&console_owner_lock);

This spin_lock is in the printk() path. We must make sure that
it does not cause a deadlock.

printk_safe_enter_irqsave(flags) prevented the recursion because
it deferred the console handling.

One danger might be a lockdep report triggered by
raw_spin_lock(&console_owner_lock) itself. But it should be safe.
lockdep is checked before the lock is actually taken
and lockdep should disable itself before printing anything.

Another danger might be any printk() called under the lock.
The code just compares and assigns values to some variables
(static, on stack) so we should be on the safe side.

Well, I would feel more comfortable if we added printk_safe_enter_irqsave()
back around the sections guarded by this lock. It can be done
in a separate patch. The code looks safe at the moment.
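
Concretely, the suggestion amounts to something like this (a sketch
only, not a tested patch):

	printk_safe_enter_irqsave(flags);
	raw_spin_lock(&console_owner_lock);
	/* ... compare/assign console_owner / console_waiter ... */
	raw_spin_unlock(&console_owner_lock);
	printk_safe_exit_irqrestore(flags);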


> @@ -2664,9 +2648,9 @@ void console_unlock(void)
>  
>   for (;;) {
>   size_t ext_len = 0;
> + int handover;
>   size_t len;
>  
> - printk_safe_enter_irqsave(flags);
>  skip:
>   if (!prb_read_valid(prb, console_seq, &r))
>   break;
> @@ -2716,19 +2700,22 @@ void console_unlock(void)
>* were to occur on another CPU, it may wait for this one to
>* finish. This task can not be preempted if there is a
>* waiter waiting to take over.
> +  *
> +  * Interrupts are disabled because the hand over to a waiter
> +  * must not be interrupted until the hand over is completed
> +  * (@console_waiter is cleared).
>*/
> + local_irq_save(flags);
>   console_lock_spinning_enable();

Same here. console_lock_spinning_enable() takes console_owner_lock.
I would feel more comfortable if we added printk_safe_enter_irqsave(flags)
inside console_lock_spinning_enable() around the locked code. The code
is safe at the moment but...

>  
>   stop_critical_timings();/* don't trace print latency */
>   call_console_drivers(ext_text, ext_len, text, len);
>   start_critical_timings();
>  
> - if (console_lock_spinning_disable_and_check()) {
> - printk_safe_exit_irqrestore(flags);
> + handover = console_lock_spinning_disable_and_check();

Same here. Also console_lock_spinning_disable_and_check() takes
console_owner_lock. It looks safe at the moment but...


> + local_irq_restore(flags);
> + if (handover)
>   return;
> - }
> -
> - printk_safe_exit_irqrestore(flags);
>  
>   if (do_cond_resched)
>   cond_resched();

> --- a/kernel/printk/printk_safe.c
> +++ b/kernel/printk/printk_safe.c
> @@ -369,7 +70,10 @@ asmlinkage int vprintk(const char *fmt, va_list args)
>* Use the main logbuf even in NMI. But avoid calling console
>* drivers that might have their own locks.
>*/
> - if ((this_cpu_read(printk_context) & PRINTK_NMI_DIRECT_CONTEXT_MASK)) {
> + if (this_cpu_read(printk_context) &
> + (PRINTK_NMI_DIRECT_CONTEXT_MASK |
> +  PRINTK_NMI_CONTEXT_MASK |
> +  PRINTK_SAFE_CONTEXT_MASK)) {
>   unsigned long flags;
>   int len;
>  

There is the following code right below:

printk_safe_enter_irqsave(flags);
len = vprintk_store(0, LOGLEVEL_DEFAULT, NULL, fmt, args);
printk_safe_exit_irqrestore(flags);
defer_console_output();
return len;

printk_safe_enter_irqsave(flags) is not needed here. Any nested
printk() ends here as well.

Again, this can be done in a separate patch. Well, the commit message
mentions that the printk_safe context is removed everywhere except
for the code manipulating the console lock. But is it 

Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Claire Chang
On Thu, Jun 24, 2021 at 11:56 PM Konrad Rzeszutek Wilk wrote:
>
> On Thu, Jun 24, 2021 at 10:10:51AM -0400, Qian Cai wrote:
> >
> >
> > On 6/24/2021 7:48 AM, Will Deacon wrote:
> > > Ok, diff below which attempts to tackle the offset issue I mentioned as
> > > well. Qian Cai -- please can you try with these changes?
> >
> > This works fine.
>
> Cool. Let me squash this patch in #6 and rebase the rest of them.
>
> Claire, could you check the devel/for-linus-5.14 say by end of today to
> double check that I didn't mess anything up please?

I just submitted v15 here
(https://lore.kernel.org/patchwork/cover/1451322/) in case it's
helpful.
I'll double check of course. Thanks for the efforts!

>
> Will,
>
> Thank you for generating the fix! I am going to run it on x86 and Xen
> to make sure all is good (granted last time I ran devel/for-linus-5.14
> on that setup I didn't see any errors so I need to double check
> I didn't do something silly like run a wrong kernel).
>
>
> >
> > >
> > > Will
> > >
> > > --->8
> > >
> > > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > > index 175b6c113ed8..39284ff2a6cd 100644
> > > --- a/include/linux/swiotlb.h
> > > +++ b/include/linux/swiotlb.h
> > > @@ -116,7 +116,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
> > >
> > >  static inline bool is_swiotlb_force_bounce(struct device *dev)
> > >  {
> > > -   return dev->dma_io_tlb_mem->force_bounce;
> > > +   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> > > +
> > > +   return mem && mem->force_bounce;
> > >  }
> > >
> > >  void __init swiotlb_exit(void);
> > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > > index 44be8258e27b..0ffbaae9fba2 100644
> > > --- a/kernel/dma/swiotlb.c
> > > +++ b/kernel/dma/swiotlb.c
> > > @@ -449,6 +449,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> > > dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> > > unsigned int nslots = nr_slots(alloc_size), stride;
> > > unsigned int index, wrap, count = 0, i;
> > > +   unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> > > unsigned long flags;
> > >
> > > BUG_ON(!nslots);
> > > @@ -497,7 +498,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> > > for (i = index; i < index + nslots; i++) {
> > > mem->slots[i].list = 0;
> > > mem->slots[i].alloc_size =
> > > -   alloc_size - ((i - index) << IO_TLB_SHIFT);
> > > +   alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
> > > }
> > > for (i = index - 1;
> > >  io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
> > >


[PATCH v15 12/12] of: Add plumbing for restricted DMA pool

2021-06-24 Thread Claire Chang
If a device is not behind an IOMMU, we look up the device node and set
up the restricted DMA when the restricted-dma-pool is presented.

Signed-off-by: Claire Chang 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
---
 drivers/of/address.c| 33 +
 drivers/of/device.c |  3 +++
 drivers/of/of_private.h |  6 ++
 3 files changed, 42 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 73ddf2540f3f..cdf700fba5c4 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <linux/of_reserved_mem.h>
 #include 
 #include 
 #include 
@@ -1022,6 +1023,38 @@ int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map)
of_node_put(node);
return ret;
 }
+
+int of_dma_set_restricted_buffer(struct device *dev, struct device_node *np)
+{
+   struct device_node *node, *of_node = dev->of_node;
+   int count, i;
+
+   count = of_property_count_elems_of_size(of_node, "memory-region",
+   sizeof(u32));
+   /*
+* If dev->of_node doesn't exist or doesn't contain memory-region, try
+* the OF node having DMA configuration.
+*/
+   if (count <= 0) {
+   of_node = np;
+   count = of_property_count_elems_of_size(
+   of_node, "memory-region", sizeof(u32));
+   }
+
+   for (i = 0; i < count; i++) {
+   node = of_parse_phandle(of_node, "memory-region", i);
+   /*
+* There might be multiple memory regions, but only one
+* restricted-dma-pool region is allowed.
+*/
+   if (of_device_is_compatible(node, "restricted-dma-pool") &&
+   of_device_is_available(node))
+			return of_reserved_mem_device_init_by_idx(dev, of_node, i);
+   }
+
+   return 0;
+}
 #endif /* CONFIG_HAS_DMA */
 
 /**
diff --git a/drivers/of/device.c b/drivers/of/device.c
index 6cb86de404f1..e68316836a7a 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct device_node *np,
 
arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
 
+   if (!iommu)
+   return of_dma_set_restricted_buffer(dev, np);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(of_dma_configure_id);
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index d9e6a324de0a..25cebbed5f02 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -161,12 +161,18 @@ struct bus_dma_region;
 #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
 int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map);
+int of_dma_set_restricted_buffer(struct device *dev, struct device_node *np);
 #else
 static inline int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map)
 {
return -ENODEV;
 }
+static inline int of_dma_set_restricted_buffer(struct device *dev,
+  struct device_node *np)
+{
+   return -ENODEV;
+}
 #endif
 
 #endif /* _LINUX_OF_PRIVATE_H */
-- 
2.32.0.288.g62a8d224e6-goog



[PATCH v15 11/12] dt-bindings: of: Add restricted DMA pool

2021-06-24 Thread Claire Chang
Introduce the new compatible string, restricted-dma-pool, for restricted
DMA. One can specify the address and length of the restricted DMA memory
region by restricted-dma-pool in the reserved-memory node.

Signed-off-by: Claire Chang 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
---
 .../reserved-memory/reserved-memory.txt   | 36 +--
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
index e8d3096d922c..39b5f4c5a511 100644
--- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
@@ -51,6 +51,23 @@ compatible (optional) - standard definition
   used as a shared pool of DMA buffers for a set of devices. It can
   be used by an operating system to instantiate the necessary pool
   management subsystem if necessary.
+- restricted-dma-pool: This indicates a region of memory meant to be
+  used as a pool of restricted DMA buffers for a set of devices. The
+  memory region would be the only region accessible to those devices.
+  When using this, the no-map and reusable properties must not be set,
+  so the operating system can create a virtual mapping that will be used
+  for synchronization. The main purpose for restricted DMA is to
+  mitigate the lack of DMA access control on systems without an IOMMU,
+  which could result in the DMA accessing the system memory at
+  unexpected times and/or unexpected addresses, possibly leading to data
+  leakage or corruption. The feature on its own provides a basic level
+  of protection against the DMA overwriting buffer contents at
+  unexpected times. However, to protect against general data leakage and
+  system memory corruption, the system needs to provide a way to lock
+  down the memory access, e.g., MPU. Note that since coherent allocation
+  needs remapping, one must set up another device coherent pool by
+  shared-dma-pool and use dma_alloc_from_dev_coherent instead for atomic
+  coherent allocation.
 - vendor specific string in the form ,[-]
 no-map (optional) - empty property
 - Indicates the operating system must not create a virtual mapping
@@ -85,10 +102,11 @@ memory-region-names (optional) - a list of names, one for each corresponding
 
 Example
 ---
-This example defines 3 contiguous regions are defined for Linux kernel:
+This example defines 4 contiguous regions for Linux kernel:
 one default of all device drivers (named linux,cma@7200 and 64MiB in size),
-one dedicated to the framebuffer device (named framebuffer@7800, 8MiB), and
-one for multimedia processing (named multimedia-memory@7700, 64MiB).
+one dedicated to the framebuffer device (named framebuffer@7800, 8MiB),
+one for multimedia processing (named multimedia-memory@7700, 64MiB), and
+one for restricted dma pool (named restricted_dma_reserved@0x5000, 64MiB).
 
 / {
#address-cells = <1>;
@@ -120,6 +138,11 @@ one for multimedia processing (named multimedia-memory@7700, 64MiB).
compatible = "acme,multimedia-memory";
reg = <0x7700 0x400>;
};
+
+   restricted_dma_reserved: restricted_dma_reserved {
+   compatible = "restricted-dma-pool";
+   reg = <0x5000 0x400>;
+   };
};
 
/* ... */
@@ -138,4 +161,11 @@ one for multimedia processing (named multimedia-memory@7700, 64MiB).
memory-region = <_reserved>;
/* ... */
};
+
+   pcie_device: pcie_device@0,0 {
+   reg = <0x8301 0x0 0x 0x0 0x0010
+  0x8301 0x0 0x0010 0x0 0x0010>;
+   memory-region = <_dma_reserved>;
+   /* ... */
+   };
 };
-- 
2.32.0.288.g62a8d224e6-goog
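
A consumer-side note (editorial sketch, hypothetical driver fragment):
with both a restricted-dma-pool and a shared-dma-pool attached as the
binding text describes, driver code stays unchanged. dma_alloc_coherent()
is satisfied from the device's coherent pool (tried first via the
dev-coherent path), while streaming mappings bounce through the
restricted pool.

	/* Hypothetical driver fragment, illustration only. */
	dma_addr_t ring_dma;
	void *ring = dma_alloc_coherent(dev, PAGE_SIZE, &ring_dma, GFP_ATOMIC);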



[PATCH v15 10/12] swiotlb: Add restricted DMA pool initialization

2021-06-24 Thread Claire Chang
Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes.

Regardless of swiotlb setting, the restricted DMA pool is preferred if
available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
---
 include/linux/swiotlb.h |  3 +-
 kernel/dma/Kconfig  | 14 
 kernel/dma/swiotlb.c| 76 +
 3 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 3b9454d1e498..39284ff2a6cd 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -73,7 +73,8 @@ extern enum swiotlb_force swiotlb_force;
  * range check to see if the memory was in fact allocated by this
  * API.
  * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and
- * @end. This is command line adjustable via setup_io_tlb_npages.
+ * @end. For default swiotlb, this is command line adjustable via
+ * setup_io_tlb_npages.
  * @used:  The number of used IO TLB block.
  * @list:  The free list describing the number of free entries available
  * from each index.
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 77b405508743..3e961dc39634 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -80,6 +80,20 @@ config SWIOTLB
bool
select NEED_DMA_MAP_STATE
 
+config DMA_RESTRICTED_POOL
+   bool "DMA Restricted Pool"
+   depends on OF && OF_RESERVED_MEM
+   select SWIOTLB
+   help
+ This enables support for restricted DMA pools which provide a level of
+ DMA memory protection on systems with limited hardware protection
+ capabilities, such as those lacking an IOMMU.
+
+	  For more information see
+	  <Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt>
+	  and <kernel/dma/swiotlb.c>.
+ If unsure, say "n".
+
 #
 # Should be selected if we can mmap non-coherent mappings to userspace.
 # The only thing that is really required is a way to set an uncached bit
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 6a7c6e30eb4b..3baf49c9b766 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -39,6 +39,13 @@
 #ifdef CONFIG_DEBUG_FS
 #include 
 #endif
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+#include 
+#include 
+#include 
+#include 
+#include 
+#endif
 
 #include 
 #include 
@@ -737,4 +744,73 @@ bool swiotlb_free(struct device *dev, struct page *page, size_t size)
return true;
 }
 
+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   struct io_tlb_mem *mem = rmem->priv;
+   unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;
+
+   /*
+* Since multiple devices can share the same pool, the private data,
+* io_tlb_mem struct, will be initialized by the first device attached
+* to it.
+*/
+   if (!mem) {
+   mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL);
+   if (!mem)
+   return -ENOMEM;
+
+   set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
+rmem->size >> PAGE_SHIFT);
+   swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false);
+   mem->force_bounce = true;
+   mem->for_alloc = true;
+
+   rmem->priv = mem;
+
+   if (IS_ENABLED(CONFIG_DEBUG_FS)) {
+   mem->debugfs =
+   debugfs_create_dir(rmem->name, debugfs_dir);
+   swiotlb_create_debugfs_files(mem);
+   }
+   }
+
+   dev->dma_io_tlb_mem = mem;
+
+   return 0;
+}
+
+static void rmem_swiotlb_device_release(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   dev->dma_io_tlb_mem = io_tlb_default_mem;
+}
+
+static const struct reserved_mem_ops rmem_swiotlb_ops = {
+   .device_init = rmem_swiotlb_device_init,
+   .device_release = rmem_swiotlb_device_release,
+};
+
+static int __init rmem_swiotlb_setup(struct reserved_mem *rmem)
+{
+   unsigned long node = rmem->fdt_node;
+
+   if (of_get_flat_dt_prop(node, "reusable", NULL) ||
+   of_get_flat_dt_prop(node, "linux,cma-default", NULL) ||
+   of_get_flat_dt_prop(node, "linux,dma-default", NULL) ||
+   of_get_flat_dt_prop(node, "no-map", NULL))
+   return -EINVAL;
+
+	if (PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) {
+		pr_err("Restricted DMA pool must be accessible within the linear mapping.");
+		return -EINVAL;

[PATCH v15 09/12] swiotlb: Add restricted DMA alloc/free support

2021-06-24 Thread Claire Chang
Add the functions, swiotlb_{alloc,free} and is_swiotlb_for_alloc to
support the memory allocation from restricted DMA pool.

The restricted DMA pool is preferred if available.

Note that since coherent allocation needs remapping, one must set up
another device coherent pool by shared-dma-pool and use
dma_alloc_from_dev_coherent instead for atomic coherent allocation.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
Acked-by: Stefano Stabellini 
---
 include/linux/swiotlb.h | 26 ++
 kernel/dma/direct.c | 49 +++--
 kernel/dma/swiotlb.c| 38 ++--
 3 files changed, 99 insertions(+), 14 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index da348671b0d5..3b9454d1e498 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -85,6 +85,7 @@ extern enum swiotlb_force swiotlb_force;
  * @debugfs:   The dentry to debugfs.
  * @late_alloc:%true if allocated using the page allocator
  * @force_bounce: %true if swiotlb bouncing is forced
+ * @for_alloc:  %true if the pool is used for memory allocation
  */
 struct io_tlb_mem {
phys_addr_t start;
@@ -96,6 +97,7 @@ struct io_tlb_mem {
struct dentry *debugfs;
bool late_alloc;
bool force_bounce;
+   bool for_alloc;
struct io_tlb_slot {
phys_addr_t orig_addr;
size_t alloc_size;
@@ -158,4 +160,28 @@ static inline void swiotlb_adjust_size(unsigned long size)
 extern void swiotlb_print_info(void);
 extern void swiotlb_set_max_segment(unsigned int);
 
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+struct page *swiotlb_alloc(struct device *dev, size_t size);
+bool swiotlb_free(struct device *dev, struct page *page, size_t size);
+
+static inline bool is_swiotlb_for_alloc(struct device *dev)
+{
+   return dev->dma_io_tlb_mem->for_alloc;
+}
+#else
+static inline struct page *swiotlb_alloc(struct device *dev, size_t size)
+{
+   return NULL;
+}
+static inline bool swiotlb_free(struct device *dev, struct page *page,
+   size_t size)
+{
+   return false;
+}
+static inline bool is_swiotlb_for_alloc(struct device *dev)
+{
+   return false;
+}
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
+
 #endif /* __LINUX_SWIOTLB_H */
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index a92465b4eb12..2de33e5d302b 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -75,6 +75,15 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit);
 }
 
+static void __dma_direct_free_pages(struct device *dev, struct page *page,
+   size_t size)
+{
+   if (IS_ENABLED(CONFIG_DMA_RESTRICTED_POOL) &&
+   swiotlb_free(dev, page, size))
+   return;
+   dma_free_contiguous(dev, page, size);
+}
+
 static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
gfp_t gfp)
 {
@@ -86,6 +95,16 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
 
gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
   _limit);
+   if (IS_ENABLED(CONFIG_DMA_RESTRICTED_POOL) &&
+   is_swiotlb_for_alloc(dev)) {
+   page = swiotlb_alloc(dev, size);
+   if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
+   __dma_direct_free_pages(dev, page, size);
+   return NULL;
+   }
+   return page;
+   }
+
page = dma_alloc_contiguous(dev, size, gfp);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
dma_free_contiguous(dev, page, size);
@@ -142,7 +161,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
gfp |= __GFP_NOWARN;
 
if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_swiotlb_for_alloc(dev)) {
page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO);
if (!page)
return NULL;
@@ -155,18 +174,23 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
 
if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev))
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_swiotlb_for_alloc(dev))
return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);
 
/*
 * Remapping or decrypting memory may block. If either is required and
 * we can't block, allocate the memory from the atomic pools.
+* If restricted DMA (i.e., is_swiotlb_for_alloc) is 

[PATCH v15 08/12] swiotlb: Refactor swiotlb_tbl_unmap_single

2021-06-24 Thread Claire Chang
Add a new function, swiotlb_release_slots, to make the code reusable for
supporting different bounce buffer pools.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
---
 kernel/dma/swiotlb.c | 35 ---
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b41d16e92cf6..93752e752e76 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -557,27 +557,15 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
return tlb_addr;
 }
 
-/*
- * tlb_addr is the physical address of the bounce buffer to unmap.
- */
-void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
- size_t mapping_size, enum dma_data_direction dir,
- unsigned long attrs)
+static void swiotlb_release_slots(struct device *dev, phys_addr_t tlb_addr)
 {
-   struct io_tlb_mem *mem = hwdev->dma_io_tlb_mem;
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
unsigned long flags;
-   unsigned int offset = swiotlb_align_offset(hwdev, tlb_addr);
+   unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
int nslots = nr_slots(mem->slots[index].alloc_size + offset);
int count, i;
 
-   /*
-* First, sync the memory before unmapping the entry
-*/
-   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-   (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(hwdev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
-
/*
 * Return the buffer to the free list by setting the corresponding
 * entries to indicate the number of contiguous entries available.
@@ -612,6 +600,23 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
spin_unlock_irqrestore(>lock, flags);
 }
 
+/*
+ * tlb_addr is the physical address of the bounce buffer to unmap.
+ */
+void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
+ size_t mapping_size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+   /*
+* First, sync the memory before unmapping the entry
+*/
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
+   (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
+   swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
+
+   swiotlb_release_slots(dev, tlb_addr);
+}
+
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir)
 {
-- 
2.32.0.288.g62a8d224e6-goog



[PATCH v15 07/12] swiotlb: Move alloc_size to swiotlb_find_slots

2021-06-24 Thread Claire Chang
Rename find_slots to swiotlb_find_slots and move the maintenance of
alloc_size to it for better code reusability later.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
---
 kernel/dma/swiotlb.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 0d294bbf274c..b41d16e92cf6 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -432,8 +432,8 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index)
  * Find a suitable number of IO TLB entries size that will fit this request and
  * allocate a buffer from that IO TLB pool.
  */
-static int find_slots(struct device *dev, phys_addr_t orig_addr,
-   size_t alloc_size)
+static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
+ size_t alloc_size)
 {
struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
unsigned long boundary_mask = dma_get_seg_boundary(dev);
@@ -444,6 +444,7 @@ static int find_slots(struct device *dev, phys_addr_t orig_addr,
dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
unsigned int nslots = nr_slots(alloc_size), stride;
unsigned int index, wrap, count = 0, i;
+   unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned long flags;
 
BUG_ON(!nslots);
@@ -488,8 +489,11 @@ static int find_slots(struct device *dev, phys_addr_t orig_addr,
return -1;
 
 found:
-   for (i = index; i < index + nslots; i++)
+   for (i = index; i < index + nslots; i++) {
mem->slots[i].list = 0;
+   mem->slots[i].alloc_size =
+   alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
+   }
for (i = index - 1;
 io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
 mem->slots[i].list; i--)
@@ -530,7 +534,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
return (phys_addr_t)DMA_MAPPING_ERROR;
}
 
-   index = find_slots(dev, orig_addr, alloc_size + offset);
+   index = swiotlb_find_slots(dev, orig_addr, alloc_size + offset);
if (index == -1) {
if (!(attrs & DMA_ATTR_NO_WARN))
dev_warn_ratelimited(dev,
@@ -544,11 +548,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
 * This is needed when we sync the memory.  Then we sync the buffer if
 * needed.
 */
-   for (i = 0; i < nr_slots(alloc_size + offset); i++) {
+   for (i = 0; i < nr_slots(alloc_size + offset); i++)
mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
-   mem->slots[index + i].alloc_size =
-   alloc_size - (i << IO_TLB_SHIFT);
-   }
tlb_addr = slot_addr(mem->start, index) + offset;
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-- 
2.32.0.288.g62a8d224e6-goog



Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Konrad Rzeszutek Wilk
On Thu, Jun 24, 2021 at 10:10:51AM -0400, Qian Cai wrote:
> 
> 
> On 6/24/2021 7:48 AM, Will Deacon wrote:
> > Ok, diff below which attempts to tackle the offset issue I mentioned as
> > well. Qian Cai -- please can you try with these changes?
> 
> This works fine.

Cool. Let me squash this patch in #6 and rebase the rest of them.

Claire, could you check the devel/for-linus-5.14 say by end of today to
double check that I didn't mess anything up please?

Will,

Thank you for generating the fix! I am going to run it on x86 and Xen
to make sure all is good (granted last time I ran devel/for-linus-5.14
on that setup I didn't see any errors so I need to double check
I didn't do something silly like run a wrong kernel).


> 
> > 
> > Will
> > 
> > --->8
> > 
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index 175b6c113ed8..39284ff2a6cd 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -116,7 +116,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
> >  
> >  static inline bool is_swiotlb_force_bounce(struct device *dev)
> >  {
> > -   return dev->dma_io_tlb_mem->force_bounce;
> > +   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> > +
> > +   return mem && mem->force_bounce;
> >  }
> >  
> >  void __init swiotlb_exit(void);
> > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> > index 44be8258e27b..0ffbaae9fba2 100644
> > --- a/kernel/dma/swiotlb.c
> > +++ b/kernel/dma/swiotlb.c
> > @@ -449,6 +449,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> > dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> > unsigned int nslots = nr_slots(alloc_size), stride;
> > unsigned int index, wrap, count = 0, i;
> > +   unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> > unsigned long flags;
> >  
> > BUG_ON(!nslots);
> > @@ -497,7 +498,7 @@ static int swiotlb_find_slots(struct device *dev, phys_addr_t orig_addr,
> > for (i = index; i < index + nslots; i++) {
> > mem->slots[i].list = 0;
> > mem->slots[i].alloc_size =
> > -   alloc_size - ((i - index) << IO_TLB_SHIFT);
> > +   alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
> > }
> > for (i = index - 1;
> >  io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
> > 
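
To make the offset issue concrete, a worked example (assuming
IO_TLB_SIZE = 2048 bytes and a min-align offset of 0x400): a 3000-byte
mapping arrives padded to alloc_size = 0x400 + 3000 = 4024 bytes,
spanning two slots. With the fix, slot 0 records
4024 - (0x400 + 0) = 3000 and slot 1 records 4024 - (0x400 + 2048) = 952,
i.e. the bytes remaining from each slot's offset-adjusted start. Without
the offset term, slot 1 would record 4024 - 2048 = 1976, overstating the
remaining length by the offset and corrupting later sync/unmap size
calculations.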


[PATCH v15 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Claire Chang
Propagate the swiotlb_force into io_tlb_default_mem->force_bounce and
use it to determine whether to bounce the data or not. This will be
useful later to allow for different pools.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
Acked-by: Stefano Stabellini 
---
 drivers/xen/swiotlb-xen.c |  2 +-
 include/linux/swiotlb.h   | 13 +
 kernel/dma/direct.c   |  2 +-
 kernel/dma/direct.h   |  2 +-
 kernel/dma/swiotlb.c  |  4 
 5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 0c6ed09f8513..4730a146fa35 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -369,7 +369,7 @@ static dma_addr_t xen_swiotlb_map_page(struct device *dev, 
struct page *page,
if (dma_capable(dev, dev_addr, size, true) &&
!range_straddles_page_boundary(phys, size) &&
!xen_arch_need_swiotlb(dev, phys, dev_addr) &&
-   swiotlb_force != SWIOTLB_FORCE)
+   !is_swiotlb_force_bounce(dev))
goto done;
 
/*
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index dd1c30a83058..da348671b0d5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -84,6 +84,7 @@ extern enum swiotlb_force swiotlb_force;
  * unmap calls.
  * @debugfs:   The dentry to debugfs.
  * @late_alloc:%true if allocated using the page allocator
+ * @force_bounce: %true if swiotlb bouncing is forced
  */
 struct io_tlb_mem {
phys_addr_t start;
@@ -94,6 +95,7 @@ struct io_tlb_mem {
spinlock_t lock;
struct dentry *debugfs;
bool late_alloc;
+   bool force_bounce;
struct io_tlb_slot {
phys_addr_t orig_addr;
size_t alloc_size;
@@ -109,6 +111,13 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
return mem && paddr >= mem->start && paddr < mem->end;
 }
 
+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+
+   return mem && mem->force_bounce;
+}
+
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
@@ -120,6 +129,10 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
 {
return false;
 }
+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+   return false;
+}
 static inline void swiotlb_exit(void)
 {
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 7a88c34d0867..a92465b4eb12 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -496,7 +496,7 @@ size_t dma_direct_max_mapping_size(struct device *dev)
 {
/* If SWIOTLB is active, use its maximum mapping size */
if (is_swiotlb_active(dev) &&
-   (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))
+   (dma_addressing_limited(dev) || is_swiotlb_force_bounce(dev)))
return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
 }
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 13e9e7158d94..4632b0f4f72e 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -87,7 +87,7 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);
 
-   if (unlikely(swiotlb_force == SWIOTLB_FORCE))
+   if (is_swiotlb_force_bounce(dev))
return swiotlb_map(dev, phys, size, dir, attrs);
 
if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8a120f42340b..0d294bbf274c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -179,6 +179,10 @@ static void swiotlb_init_io_tlb_mem(struct io_tlb_mem 
*mem, phys_addr_t start,
mem->end = mem->start + bytes;
mem->index = 0;
mem->late_alloc = late_alloc;
+
+   if (swiotlb_force == SWIOTLB_FORCE)
+   mem->force_bounce = true;
+
	spin_lock_init(&mem->lock);
for (i = 0; i < mem->nslabs; i++) {
mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
-- 
2.32.0.288.g62a8d224e6-goog



[PATCH v15 05/12] swiotlb: Update is_swiotlb_active to add a struct device argument

2021-06-24 Thread Claire Chang
Update is_swiotlb_active to add a struct device argument. This will be
useful later to allow for different pools.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
Acked-by: Stefano Stabellini 
---
 drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +-
 drivers/pci/xen-pcifront.c   | 2 +-
 include/linux/swiotlb.h  | 4 ++--
 kernel/dma/direct.c  | 2 +-
 kernel/dma/swiotlb.c | 4 ++--
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c 
b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index a9d65fc8aa0e..4b7afa0fc85d 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct 
drm_i915_gem_object *obj)
 
max_order = MAX_ORDER;
 #ifdef CONFIG_SWIOTLB
-   if (is_swiotlb_active()) {
+   if (is_swiotlb_active(obj->base.dev->dev)) {
unsigned int max_segment;
 
max_segment = swiotlb_max_segment();
diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c 
b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index 9662522aa066..be15bfd9e0ee 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
}
 
 #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
-   need_swiotlb = is_swiotlb_active();
+   need_swiotlb = is_swiotlb_active(dev->dev);
 #endif
 
	ret = ttm_bo_device_init(&drm->ttm.bdev, &nouveau_bo_driver,
diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index b7a8f3a1921f..0d56985bfe81 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct 
pcifront_device *pdev)
 
	spin_unlock(&pcifront_dev_lock);
 
-   if (!err && !is_swiotlb_active()) {
+   if (!err && !is_swiotlb_active(&pdev->xdev->dev)) {
err = pci_xen_swiotlb_init_late();
if (err)
			dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index d1f3d95881cd..dd1c30a83058 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -112,7 +112,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
-bool is_swiotlb_active(void);
+bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
@@ -132,7 +132,7 @@ static inline size_t swiotlb_max_mapping_size(struct device 
*dev)
return SIZE_MAX;
 }
 
-static inline bool is_swiotlb_active(void)
+static inline bool is_swiotlb_active(struct device *dev)
 {
return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 84c9feb5474a..7a88c34d0867 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
 size_t dma_direct_max_mapping_size(struct device *dev)
 {
/* If SWIOTLB is active, use its maximum mapping size */
-   if (is_swiotlb_active() &&
+   if (is_swiotlb_active(dev) &&
(dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))
return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 72a4289faed1..8a120f42340b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -664,9 +664,9 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
 }
 
-bool is_swiotlb_active(void)
+bool is_swiotlb_active(struct device *dev)
 {
-   return io_tlb_default_mem != NULL;
+   return dev->dma_io_tlb_mem != NULL;
 }
 EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
-- 
2.32.0.288.g62a8d224e6-goog



[PATCH v15 04/12] swiotlb: Update is_swiotlb_buffer to add a struct device argument

2021-06-24 Thread Claire Chang
Update is_swiotlb_buffer to add a struct device argument. This will be
useful later to allow for different pools.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
Acked-by: Stefano Stabellini 
---
 drivers/iommu/dma-iommu.c | 12 ++--
 drivers/xen/swiotlb-xen.c |  2 +-
 include/linux/swiotlb.h   |  7 ---
 kernel/dma/direct.c   |  6 +++---
 kernel/dma/direct.h   |  6 +++---
 5 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 3087d9fa6065..10997ef541f8 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -507,7 +507,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
dma_addr_t dma_addr,
 
__iommu_dma_unmap(dev, dma_addr, size);
 
-   if (unlikely(is_swiotlb_buffer(phys)))
+   if (unlikely(is_swiotlb_buffer(dev, phys)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
 }
 
@@ -578,7 +578,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device 
*dev, phys_addr_t phys,
}
 
iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask);
-   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys))
+   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs);
return iova;
 }
@@ -749,7 +749,7 @@ static void iommu_dma_sync_single_for_cpu(struct device 
*dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(phys, size, dir);
 
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(dev, phys))
swiotlb_sync_single_for_cpu(dev, phys, size, dir);
 }
 
@@ -762,7 +762,7 @@ static void iommu_dma_sync_single_for_device(struct device 
*dev,
return;
 
phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(dev, phys))
swiotlb_sync_single_for_device(dev, phys, size, dir);
 
if (!dev_is_dma_coherent(dev))
@@ -783,7 +783,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
 
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(dev, sg_phys(sg)))
swiotlb_sync_single_for_cpu(dev, sg_phys(sg),
sg->length, dir);
}
@@ -800,7 +800,7 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
return;
 
for_each_sg(sgl, sg, nelems, i) {
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(dev, sg_phys(sg)))
swiotlb_sync_single_for_device(dev, sg_phys(sg),
   sg->length, dir);
 
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 4c89afc0df62..0c6ed09f8513 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -100,7 +100,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, 
dma_addr_t dma_addr)
 * in our domain. Therefore _only_ check address within our domain.
 */
if (pfn_valid(PFN_DOWN(paddr)))
-   return is_swiotlb_buffer(paddr);
+   return is_swiotlb_buffer(dev, paddr);
return 0;
 }
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 216854a5e513..d1f3d95881cd 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX_SWIOTLB_H
 #define __LINUX_SWIOTLB_H
 
+#include 
 #include 
 #include 
 #include 
@@ -101,9 +102,9 @@ struct io_tlb_mem {
 };
 extern struct io_tlb_mem *io_tlb_default_mem;
 
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
 
return mem && paddr >= mem->start && paddr < mem->end;
 }
@@ -115,7 +116,7 @@ bool is_swiotlb_active(void);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index f737e3347059..84c9feb5474a 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -343,7 +343,7 @@ void dma_direct_sync_sg_for_device(struct device *dev,
for_each_sg(sgl, sg, nents, i) {
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-   if (unlikely(is_swiotlb_buffer(paddr)))
+   if (unlikely(is_swiotlb_buffer(dev, paddr)))
  

[PATCH v15 03/12] swiotlb: Set dev->dma_io_tlb_mem to the swiotlb pool used

2021-06-24 Thread Claire Chang
Always have the pointer to the swiotlb pool used in struct device. This
could help simplify the code for other pools.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
Acked-by: Stefano Stabellini 
---
 drivers/base/core.c| 4 
 include/linux/device.h | 4 
 kernel/dma/swiotlb.c   | 8 
 3 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index f29839382f81..cb3123e3954d 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include  /* for dma_default_coherent */
 
@@ -2736,6 +2737,9 @@ void device_initialize(struct device *dev)
 defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL)
dev->dma_coherent = dma_default_coherent;
 #endif
+#ifdef CONFIG_SWIOTLB
+   dev->dma_io_tlb_mem = io_tlb_default_mem;
+#endif
 }
 EXPORT_SYMBOL_GPL(device_initialize);
 
diff --git a/include/linux/device.h b/include/linux/device.h
index ba660731bd25..240d652a0696 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -416,6 +416,7 @@ struct dev_links_info {
  * @dma_pools: Dma pools (if dma'ble device).
  * @dma_mem:   Internal for coherent mem override.
  * @cma_area:  Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Pointer to the swiotlb pool used.  Not for driver use.
  * @archdata:  For arch-specific additions.
  * @of_node:   Associated device tree node.
  * @fwnode:Associated device node supplied by platform firmware.
@@ -518,6 +519,9 @@ struct device {
 #ifdef CONFIG_DMA_CMA
struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
+#endif
+#ifdef CONFIG_SWIOTLB
+   struct io_tlb_mem *dma_io_tlb_mem;
 #endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ede66df6835b..72a4289faed1 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -340,7 +340,7 @@ void __init swiotlb_exit(void)
 static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t 
size,
   enum dma_data_direction dir)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
phys_addr_t orig_addr = mem->slots[index].orig_addr;
@@ -431,7 +431,7 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, 
unsigned int index)
 static int find_slots(struct device *dev, phys_addr_t orig_addr,
size_t alloc_size)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -508,7 +508,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
size_t mapping_size, size_t alloc_size,
enum dma_data_direction dir, unsigned long attrs)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned int i;
int index;
@@ -559,7 +559,7 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, 
phys_addr_t tlb_addr,
  size_t mapping_size, enum dma_data_direction dir,
  unsigned long attrs)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = hwdev->dma_io_tlb_mem;
unsigned long flags;
unsigned int offset = swiotlb_align_offset(hwdev, tlb_addr);
int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
-- 
2.32.0.288.g62a8d224e6-goog



[PATCH v15 02/12] swiotlb: Refactor swiotlb_create_debugfs

2021-06-24 Thread Claire Chang
Split the debugfs creation to make the code reusable for supporting
different bounce buffer pools.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
---
 kernel/dma/swiotlb.c | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1f9b2b9e7490..ede66df6835b 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -671,19 +671,26 @@ bool is_swiotlb_active(void)
 EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
 #ifdef CONFIG_DEBUG_FS
+static struct dentry *debugfs_dir;
 
-static int __init swiotlb_create_debugfs(void)
+static void swiotlb_create_debugfs_files(struct io_tlb_mem *mem)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
-
-   if (!mem)
-   return 0;
-   mem->debugfs = debugfs_create_dir("swiotlb", NULL);
debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, >nslabs);
debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, >used);
+}
+
+static int __init swiotlb_create_default_debugfs(void)
+{
+   struct io_tlb_mem *mem = io_tlb_default_mem;
+
+   debugfs_dir = debugfs_create_dir("swiotlb", NULL);
+   if (mem) {
+   mem->debugfs = debugfs_dir;
+   swiotlb_create_debugfs_files(mem);
+   }
return 0;
 }
 
-late_initcall(swiotlb_create_debugfs);
+late_initcall(swiotlb_create_default_debugfs);
 
 #endif
-- 
2.32.0.288.g62a8d224e6-goog



[PATCH v15 01/12] swiotlb: Refactor swiotlb init functions

2021-06-24 Thread Claire Chang
Add a new function, swiotlb_init_io_tlb_mem, for the io_tlb_mem struct
initialization to make the code reusable.

Signed-off-by: Claire Chang 
Reviewed-by: Christoph Hellwig 
Tested-by: Stefano Stabellini 
Tested-by: Will Deacon 
Acked-by: Stefano Stabellini 
---
 kernel/dma/swiotlb.c | 50 ++--
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 52e2ac526757..1f9b2b9e7490 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -168,9 +168,28 @@ void __init swiotlb_update_mem_attributes(void)
memset(vaddr, 0, bytes);
 }
 
-int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
+   unsigned long nslabs, bool late_alloc)
 {
+   void *vaddr = phys_to_virt(start);
unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
+
+   mem->nslabs = nslabs;
+   mem->start = start;
+   mem->end = mem->start + bytes;
+   mem->index = 0;
+   mem->late_alloc = late_alloc;
+   spin_lock_init(&mem->lock);
+   for (i = 0; i < mem->nslabs; i++) {
+   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
+   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
+   mem->slots[i].alloc_size = 0;
+   }
+   memset(vaddr, 0, bytes);
+}
+
+int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+{
struct io_tlb_mem *mem;
size_t alloc_size;
 
@@ -186,16 +205,8 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
if (!mem)
panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
  __func__, alloc_size, PAGE_SIZE);
-   mem->nslabs = nslabs;
-   mem->start = __pa(tlb);
-   mem->end = mem->start + bytes;
-   mem->index = 0;
-   spin_lock_init(&mem->lock);
-   for (i = 0; i < mem->nslabs; i++) {
-   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
-   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
-   mem->slots[i].alloc_size = 0;
-   }
+
+   swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, false);
 
io_tlb_default_mem = mem;
if (verbose)
@@ -282,8 +293,8 @@ swiotlb_late_init_with_default_size(size_t default_size)
 int
 swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 {
-   unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
struct io_tlb_mem *mem;
+   unsigned long bytes = nslabs << IO_TLB_SHIFT;
 
if (swiotlb_force == SWIOTLB_NO_FORCE)
return 0;
@@ -297,20 +308,9 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
if (!mem)
return -ENOMEM;
 
-   mem->nslabs = nslabs;
-   mem->start = virt_to_phys(tlb);
-   mem->end = mem->start + bytes;
-   mem->index = 0;
-   mem->late_alloc = 1;
-   spin_lock_init(&mem->lock);
-   for (i = 0; i < mem->nslabs; i++) {
-   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
-   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
-   mem->slots[i].alloc_size = 0;
-   }
-
+   memset(mem, 0, sizeof(*mem));
set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
-   memset(tlb, 0, bytes);
+   swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
 
io_tlb_default_mem = mem;
swiotlb_print_info();
-- 
2.32.0.288.g62a8d224e6-goog



[PATCH v15 00/12] Restricted DMA

2021-06-24 Thread Claire Chang
This series implements mitigations for lack of DMA access control on
systems without an IOMMU, which could result in the DMA accessing the
system memory at unexpected times and/or unexpected addresses, possibly
leading to data leakage or corruption.

For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
not behind an IOMMU. As PCI-e, by design, gives the device full access to
system memory, a vulnerability in the Wi-Fi firmware could easily escalate
to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
full chain of exploits; [2], [3]).

To mitigate the security concerns, we introduce restricted DMA. Restricted
DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
specially allocated region and does memory allocation from the same region.
The feature on its own provides a basic level of protection against the DMA
overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system needs
to provide a way to restrict the DMA to a predefined memory region (this is
usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).

[1a] 
https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
[1b] 
https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
[2] https://blade.tencent.com/en/advisories/qualpwn/
[3] 
https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
[4] 
https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
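
As a rough illustration of how a device gets tied to such a pool (the
node names, addresses and sizes below are invented for this example;
the authoritative syntax is in the reserved-memory binding update that
is part of this series):

	reserved-memory {
		#address-cells = <2>;
		#size-cells = <2>;
		ranges;

		/* hypothetical 4MB restricted pool for the Wi-Fi device */
		wifi_restricted_dma: restricted-dma@50000000 {
			compatible = "restricted-dma-pool";
			reg = <0x0 0x50000000 0x0 0x400000>;
		};
	};

	wifi@0,0 {
		...
		memory-region = <&wifi_restricted_dma>;
	};

With this, streaming DMA for the device is bounced through the pool and
coherent allocations come from the same region, so the device never
touches other system memory.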

v15:
- Apply Will's diff (https://lore.kernel.org/patchwork/patch/1448957/#1647521)
  to fix the crash reported by Qian.
- Add Stefano's Acked-by tag for patch 01/12 from v14

v14:
- Move set_memory_decrypted before swiotlb_init_io_tlb_mem (patch 01/12, 10,12)
- Add Stefano's Acked-by tag from v13
https://lore.kernel.org/patchwork/cover/1448954/

v13:
- Fix xen-swiotlb issues
  - memset in patch 01/12
  - is_swiotlb_force_bounce in patch 06/12
- Fix the dts example typo in reserved-memory.txt
- Add Stefano and Will's Tested-by tag from v12
https://lore.kernel.org/patchwork/cover/1448001/

v12:
Split is_dev_swiotlb_force into is_swiotlb_force_bounce (patch 06/12) and
is_swiotlb_for_alloc (patch 09/12)
https://lore.kernel.org/patchwork/cover/1447254/

v11:
- Rebase against swiotlb devel/for-linus-5.14
- s/mempry/memory/g
- exchange the order of patch 09/12 and 10/12
https://lore.kernel.org/patchwork/cover/1447216/

v10:
Address the comments in v9 to
  - fix the dev->dma_io_tlb_mem assignment
  - propagate swiotlb_force setting into io_tlb_default_mem->force
  - move set_memory_decrypted out of swiotlb_init_io_tlb_mem
  - move debugfs_dir declaration into the main CONFIG_DEBUG_FS block
  - add swiotlb_ prefix to find_slots and release_slots
  - merge the 3 alloc/free related patches
  - move the CONFIG_DMA_RESTRICTED_POOL later
https://lore.kernel.org/patchwork/cover/1446882/

v9:
Address the comments in v7 to
  - set swiotlb active pool to dev->dma_io_tlb_mem
  - get rid of get_io_tlb_mem
  - dig out the device struct for is_swiotlb_active
  - move debugfs_create_dir out of swiotlb_create_debugfs
  - do set_memory_decrypted conditionally in swiotlb_init_io_tlb_mem
  - use IS_ENABLED in kernel/dma/direct.c
  - fix redefinition of 'of_dma_set_restricted_buffer'
https://lore.kernel.org/patchwork/cover/1445081/

v8:
- Fix reserved-memory.txt and add the reg property in example.
- Fix sizeof for of_property_count_elems_of_size in
  drivers/of/address.c#of_dma_set_restricted_buffer.
- Apply Will's suggestion to try the OF node having DMA configuration in
  drivers/of/address.c#of_dma_set_restricted_buffer.
- Fix typo in the comment of drivers/of/address.c#of_dma_set_restricted_buffer.
- Add error message for PageHighMem in
  kernel/dma/swiotlb.c#rmem_swiotlb_device_init and move it to
  rmem_swiotlb_setup.
- Fix the message string in rmem_swiotlb_setup.
https://lore.kernel.org/patchwork/cover/1437112/

v7:
Fix debugfs, PageHighMem and comment style in rmem_swiotlb_device_init
https://lore.kernel.org/patchwork/cover/1431031/

v6:
Address the comments in v5
https://lore.kernel.org/patchwork/cover/1423201/

v5:
Rebase on latest linux-next
https://lore.kernel.org/patchwork/cover/1416899/

v4:
- Fix spinlock bad magic
- Use rmem->name for debugfs entry
- Address the comments in v3
https://lore.kernel.org/patchwork/cover/1378113/

v3:
Using only one reserved memory region for both streaming DMA and memory
allocation.
https://lore.kernel.org/patchwork/cover/1360992/

v2:
Building on top of swiotlb.
https://lore.kernel.org/patchwork/cover/1280705/

v1:
Using dma_map_ops.
https://lore.kernel.org/patchwork/cover/1271660/

Claire Chang (12):
  swiotlb: Refactor swiotlb init functions
  swiotlb: Refactor swiotlb_create_debugfs
  swiotlb: Set dev->dma_io_tlb_mem to the swiotlb pool used
  

PowerPC guest getting "BUG: scheduling while atomic" on linux-next-20210623 during secondary CPUs bringup

2021-06-24 Thread Bharata B Rao
Hi,

A PowerPC KVM guest gets the following BUG message when booting
linux-next-20210623:

smp: Bringing up secondary CPUs ...
BUG: scheduling while atomic: swapper/1/0/0x
no locks held by swapper/1/0.
Modules linked in:
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.13.0-rc7-next-20210623
Call Trace:
[cae5bc20] [c0badc64] dump_stack_lvl+0x98/0xe0 (unreliable)
[cae5bc60] [c0210200] __schedule_bug+0xb0/0xe0
[cae5bcd0] [c1609e28] __schedule+0x1788/0x1c70
[cae5be20] [c160a8cc] schedule_idle+0x3c/0x70
[cae5be50] [c022984c] do_idle+0x2bc/0x420
[cae5bf00] [c0229d88] cpu_startup_entry+0x38/0x40
[cae5bf30] [c00666c0] start_secondary+0x290/0x2a0
[cae5bf90] [c000be54] start_secondary_prolog+0x10/0x14



smp: Brought up 2 nodes, 16 CPUs
numa: Node 0 CPUs: 0-7
numa: Node 1 CPUs: 8-15

This seems to have started from next-20210521 and isn't seen on
next-20210511.

Regards,
Bharata.


Re: [PATCH printk v3 3/6] printk: remove safe buffers

2021-06-24 Thread John Ogness
On 2021-06-24, Petr Mladek  wrote:
>> --- a/kernel/printk/printk.c
>> +++ b/kernel/printk/printk.c
>> @@ -1852,7 +1839,7 @@ static int console_trylock_spinning(void)
>>  if (console_trylock())
>>  return 1;
>>  
>> -printk_safe_enter_irqsave(flags);
>> +local_irq_save(flags);
>>  
>>  raw_spin_lock(&console_owner_lock);
>
> This spin_lock is in the printk() path. We must make sure that
> it does not cause deadlock.
>
> printk_safe_enter_irqsave(flags) prevented the recursion because
> it deferred the console handling.
>
> One danger might be a lockdep report triggered by
> raw_spin_lock(&console_owner_lock) itself. But it should be safe.
> lockdep is checked before the lock is actually taken
> and lockdep should disable itself before printing anything.
>
> Another danger might be any printk() called under the lock.
> The code just compares and assigns values to some variables
> (static, on stack) so we should be on the safe side.
>
> Well, I would feel more comfortable if we add printk_safe_enter_irqsave()
> back around the sections guarded by this lock. It can be done
> in a separate patch. The code looks safe at the moment.

You are correct. printk_safe should also be wrapping @console_owner_lock
locking.
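
Roughly, the v4 wrapping would look like this (a sketch only, not the
final patch; shown for the console_trylock_spinning() side):

	printk_safe_enter_irqsave(flags);

	raw_spin_lock(&console_owner_lock);
	/* ... inspect/assign @console_owner and @console_waiter ... */
	raw_spin_unlock(&console_owner_lock);

	printk_safe_exit_irqrestore(flags);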

>> @@ -2716,19 +2700,22 @@ void console_unlock(void)
>>   * were to occur on another CPU, it may wait for this one to
>>   * finish. This task can not be preempted if there is a
>>   * waiter waiting to take over.
>> + *
>> + * Interrupts are disabled because the hand over to a waiter
>> + * must not be interrupted until the hand over is completed
>> + * (@console_waiter is cleared).
>>   */
>> +local_irq_save(flags);
>>  console_lock_spinning_enable();
>
> Same here. console_lock_spinning_enable() takes console_owner_lock.
> I would feel more comfortable if we added printk_safe_enter_irqsave(flags)
> inside console_lock_spinning_enable() around the locked code. The code
> is safe at the moment but...

Agreed.

>>  stop_critical_timings();/* don't trace print latency */
>>  call_console_drivers(ext_text, ext_len, text, len);
>>  start_critical_timings();
>>  
>> -if (console_lock_spinning_disable_and_check()) {
>> -printk_safe_exit_irqrestore(flags);
>> +handover = console_lock_spinning_disable_and_check();
>
> Same here. Also console_lock_spinning_disable_and_check() takes
> console_owner_lock. It looks safe at the moment but...

Agreed.

>> --- a/kernel/printk/printk_safe.c
>> +++ b/kernel/printk/printk_safe.c
>> @@ -369,7 +70,10 @@ asmlinkage int vprintk(const char *fmt, va_list args)
>>   * Use the main logbuf even in NMI. But avoid calling console
>>   * drivers that might have their own locks.
>>   */
>> -if ((this_cpu_read(printk_context) & PRINTK_NMI_DIRECT_CONTEXT_MASK)) {
>> +if (this_cpu_read(printk_context) &
>> +(PRINTK_NMI_DIRECT_CONTEXT_MASK |
>> + PRINTK_NMI_CONTEXT_MASK |
>> + PRINTK_SAFE_CONTEXT_MASK)) {
>>  unsigned long flags;
>>  int len;
>>  
>
> There is the following code right below:
>
>   printk_safe_enter_irqsave(flags);
>   len = vprintk_store(0, LOGLEVEL_DEFAULT, NULL, fmt, args);
>   printk_safe_exit_irqrestore(flags);
>   defer_console_output();
>   return len;
>
> printk_safe_enter_irqsave(flags) is not needed here. Any nested
> printk() ends here as well.

Ah, I missed that one. Good eye!

> Again, this can be done in a separate patch. Well, the commit message
> mentions that the printk_safe context is removed everywhere except
> for the code manipulating the console lock. But it is just a detail.

I would prefer a v4 with these fixes:

- wrap @console_owner_lock with printk_safe usage

- remove unnecessary printk_safe usage from printk_safe.c

- update commit message to say that safe context tracking is left in
  place for both the console and console_owner locks

John Ogness


Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 14:57, Nicholas Piggin wrote:

KVM: Fix page ref underflow for regions with valid but non-refcounted pages


It doesn't really fix the underflow, it disallows mapping them in the 
first place.  Since in principle things can break, I'd rather be 
explicit, so let's go with "KVM: do not allow mapping valid but 
non-reference-counted pages".



It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc.
family of APIs, which take a reference to the page, raising its refcount
from 0 to 1. When the reference is dropped, this will free the page
incorrectly.

Fix this by only taking a reference on the page if its refcount was non-zero,


s/on the page/on valid pages/ (makes clear that invalid pages are fine 
without refcounting).


Thank you *so* much, I'm awful at Linux mm.

Paolo


which indicates it is participating in normal refcounting (and can be
released with put_page).

Signed-off-by: Nicholas Piggin




Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Qian Cai



On 6/24/2021 7:48 AM, Will Deacon wrote:
> Ok, diff below which attempts to tackle the offset issue I mentioned as
> well. Qian Cai -- please can you try with these changes?

This works fine.

> 
> Will
> 
> --->8
> 
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> index 175b6c113ed8..39284ff2a6cd 100644
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -116,7 +116,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
> phys_addr_t paddr)
>  
>  static inline bool is_swiotlb_force_bounce(struct device *dev)
>  {
> -   return dev->dma_io_tlb_mem->force_bounce;
> +   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
> +
> +   return mem && mem->force_bounce;
>  }
>  
>  void __init swiotlb_exit(void);
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 44be8258e27b..0ffbaae9fba2 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -449,6 +449,7 @@ static int swiotlb_find_slots(struct device *dev, 
> phys_addr_t orig_addr,
> dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
> unsigned int nslots = nr_slots(alloc_size), stride;
> unsigned int index, wrap, count = 0, i;
> +   unsigned int offset = swiotlb_align_offset(dev, orig_addr);
> unsigned long flags;
>  
> BUG_ON(!nslots);
> @@ -497,7 +498,7 @@ static int swiotlb_find_slots(struct device *dev, 
> phys_addr_t orig_addr,
> for (i = index; i < index + nslots; i++) {
> mem->slots[i].list = 0;
> mem->slots[i].alloc_size =
> -   alloc_size - ((i - index) << IO_TLB_SHIFT);
> +   alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
> }
> for (i = index - 1;
>  io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
> 


Re: [PATCH v2 0/9] powerpc: Add support for Microwatt soft-core

2021-06-24 Thread Michael Ellerman
On Fri, 18 Jun 2021 13:42:53 +1000, Paul Mackerras wrote:
> This series of patches adds support for the Microwatt soft-core.
> Microwatt is an open-source 64-bit Power ISA processor written in VHDL
> which targets medium-sized FPGAs such as the Xilinx Artix-7 or the
> Lattice ECP5.  Microwatt currently implements the scalar fixed plus
> floating-point subset of Power ISA v3.0B plus the radix MMU, but not
> logical partitioning (i.e. it does not have hypervisor mode or nested
> radix translation).
> 
> [...]

Applied to powerpc/next.

[1/9] powerpc: Add Microwatt platform
  https://git.kernel.org/powerpc/c/53d143fe08c24c2ce44ee329e41c2a6aad57ebb5
[2/9] powerpc: Add Microwatt device tree
  https://git.kernel.org/powerpc/c/151b88e8482167f6eb3117d82e4905efb5e72662
[3/9] powerpc/microwatt: Populate platform bus from device-tree
  https://git.kernel.org/powerpc/c/0d0f9e5f2fa7aacf22892078a1065fa5d0ce941b
[4/9] powerpc/xics: Add a native ICS backend for microwatt
  https://git.kernel.org/powerpc/c/aa9c5adf2f61da39c92280d9336e091852e292ff
[5/9] powerpc/microwatt: Use standard 16550 UART for console
  https://git.kernel.org/powerpc/c/48b545b8018db61ab4978d29c73c16b9fbfad12c
[6/9] powerpc/microwatt: Add support for hardware random number generator
  https://git.kernel.org/powerpc/c/c25769fddaec13509b6cdc7ad17458f239c4cee7
[7/9] powerpc/microwatt: Add microwatt_defconfig
  https://git.kernel.org/powerpc/c/4a1511eb342bd80c6ea0e8a7ce0bbe68aac96ac5
[8/9] powerpc/boot: Fixup device-tree on little endian
  https://git.kernel.org/powerpc/c/c93f80849bdd9b45d834053ae1336e28f0026c84
[9/9] powerpc/boot: Add a boot wrapper for Microwatt
  https://git.kernel.org/powerpc/c/4a21192e2796c3338c4b0083b494a84a61311aaf

cheers


Re: [PATCH] powerpc/powernv: Fix machine check reporting of async store errors

2021-06-24 Thread Michael Ellerman
On Tue, 18 May 2021 00:03:55 +1000, Nicholas Piggin wrote:
> POWER9 and POWER10 asynchronous machine checks due to stores have their
> cause reported in SRR1 but SRR1[42] is set, which in other cases
> indicates DSISR cause.
> 
> Check for these cases and clear SRR1[42], so the cause matching uses
> the i-side (SRR1) table.

Applied to powerpc/next.

[1/1] powerpc/powernv: Fix machine check reporting of async store errors
  https://git.kernel.org/powerpc/c/3729e0ec59a20825bd4c8c70996b2df63915e1dd

cheers


Re: [PATCH] powerpc/boot: add zImage.lds to targets

2021-06-24 Thread Michael Ellerman
On Fri, 11 Jun 2021 21:11:04 +1000, Nicholas Piggin wrote:
> This prevents spurious rebuilds of the lds and then wrappers.

Applied to powerpc/next.

[1/1] powerpc/boot: add zImage.lds to targets
  https://git.kernel.org/powerpc/c/710e682286784b50b882fc4befdf83c587059211

cheers


Re: [PATCH 0/4] powerpc/security mitigation updates

2021-06-24 Thread Michael Ellerman
On Mon, 3 May 2021 23:02:39 +1000, Nicholas Piggin wrote:
> This series adds a few missing bits added to recent pseries
> H_GET_CPU_CHARACTERISTICS and implements them, also removes
> a restriction from powernv for some of the flushes.
> 
> This is tested mainly in qemu where I just submitted a patch
> that adds support for these bits (not upstream yet).
> 
> [...]

Patches 1-3 applied to powerpc/next.

[1/4] powerpc/pseries: Get entry and uaccess flush required bits from 
H_GET_CPU_CHARACTERISTICS
  https://git.kernel.org/powerpc/c/65c7d070850e109a8a75a431f5a7f6eb4c007b77
[2/4] powerpc/security: Add a security feature for STF barrier
  https://git.kernel.org/powerpc/c/84ed26fd00c514da57cd46aa3728a48f1f9b35cd
[3/4] powerpc/pesries: Get STF barrier requirement from 
H_GET_CPU_CHARACTERISTICS
  https://git.kernel.org/powerpc/c/393eff5a7b357a23db3e786e24b5ba8762cc6820

cheers


Re: [PATCH v15 0/9] powerpc: Further Strict RWX support

2021-06-24 Thread Michael Ellerman
On Wed, 9 Jun 2021 11:34:22 +1000, Jordan Niethe wrote:
> Adding more Strict RWX support on powerpc, in particular Strict Module RWX.
> It is now rebased on ppc next.
> 
> For reference the previous revision is available here:
> https://lore.kernel.org/linuxppc-dev/20210517032810.129949-1-jniet...@gmail.com/
> 
> Changes for v15:
> 
> [...]

Applied to powerpc/next.

[1/9] powerpc/mm: Implement set_memory() routines
  https://git.kernel.org/powerpc/c/1f9ad21c3b384a8f16d8c46845a48a01d281a603
[2/9] powerpc/lib/code-patching: Set up Strict RWX patching earlier
  https://git.kernel.org/powerpc/c/71a5b3db9f209ea5d1e07371017e65398d3c6fbc
[3/9] powerpc/modules: Make module_alloc() Strict Module RWX aware
  https://git.kernel.org/powerpc/c/4fcc636615b1a309b39cab101a2b433cbf1f63f1
[4/9] powerpc/kprobes: Mark newly allocated probes as ROX
  https://git.kernel.org/powerpc/c/6a3a58e6230dc5b646ce3511436d7e74fc7f764b
[5/9] powerpc/bpf: Remove bpf_jit_free()
  https://git.kernel.org/powerpc/c/bc33cfdb0bb84d9e4b125a617a437c29ddcac4d9
[6/9] powerpc/bpf: Write protect JIT code
  https://git.kernel.org/powerpc/c/62e3d4210ac9c35783d0e8fc306df4239c540a79
[7/9] powerpc: Set ARCH_HAS_STRICT_MODULE_RWX
  https://git.kernel.org/powerpc/c/c35717c71e983ed55d61e523cbd11a798429bc82
[8/9] powerpc/mm: implement set_memory_attr()
  https://git.kernel.org/powerpc/c/4d1755b6a762149ae022a32fb2bbeefb6680baa6
[9/9] powerpc/32: use set_memory_attr()
  https://git.kernel.org/powerpc/c/c988cfd38e489d9390d253d4392590daf451d87a

cheers


Re: [PATCH v6 00/17] Enable VAS and NX-GZIP support on PowerVM

2021-06-24 Thread Michael Ellerman
On Thu, 17 Jun 2021 13:24:31 -0700, Haren Myneni wrote:
> Virtual Accelerator Switchboard (VAS) allows kernel subsystems
> and user space processes to directly access the Nest Accelerator
> (NX) engines, which provide HW compression. The true user mode
> VAS/NX support on PowerNV is already included in Linux. Whereas
> PowerVM support is available from P10 onwards.
> 
> This patch series enables VAS / NX-GZIP on PowerVM which allows
> the user space to do copy/paste with the same existing interface
> that is available on PowerNV.
> 
> [...]

Applied to powerpc/next.

[01/17] powerpc/powernv/vas: Release reference to tgid during window close

https://git.kernel.org/powerpc/c/91cdbb955aa94ee0841af4685be40937345d29b8
[02/17] powerpc/vas: Move VAS API to book3s common platform

https://git.kernel.org/powerpc/c/413d6ed3eac387a2876893c337174f0c5b99d01d
[03/17] powerpc/powernv/vas: Rename register/unregister functions

https://git.kernel.org/powerpc/c/06c6fad9bfe0b6439e18ea1f1cf0d178405ccf25
[04/17] powerpc/vas: Add platform specific user window operations

https://git.kernel.org/powerpc/c/1a0d0d5ed5e3cd9e3fc1ad4459f1db2f3618fce0
[05/17] powerpc/vas: Create take/drop pid and mm reference functions

https://git.kernel.org/powerpc/c/3856aa542d90ed79cd5ed4cfd828b1fb04017131
[06/17] powerpc/vas: Move update_csb/dump_crb to common book3s platform

https://git.kernel.org/powerpc/c/3b26797350352479f37216d674c8e5d126faab66
[07/17] powerpc/vas: Define and use common vas_window struct

https://git.kernel.org/powerpc/c/7bc6f71bdff5f8921e324da0a8fad6f4e2e63a85
[08/17] powerpc/pseries/vas: Define VAS/NXGZIP hcalls and structs

https://git.kernel.org/powerpc/c/8f3a6c92802b7c48043954ba3b507e9b33d8c898
[09/17] powerpc/vas: Define QoS credit flag to allocate window

https://git.kernel.org/powerpc/c/540761b7f51067d76b301c64abc50328ded89b1c
[10/17] powerpc/pseries/vas: Add hcall wrappers for VAS handling

https://git.kernel.org/powerpc/c/f33ecfde30ce6909fff41339285e0274bb403fb8
[11/17] powerpc/pseries/vas: Implement getting capabilities from hypervisor

https://git.kernel.org/powerpc/c/ca77d48854177bb9749aef7329201f03b2382fbb
[12/17] powerpc/pseries/vas: Integrate API with open/close windows

https://git.kernel.org/powerpc/c/b22f2d88e435cdada32581ca1f11b9806adf459a
[13/17] powerpc/pseries/vas: Setup IRQ and fault handling

https://git.kernel.org/powerpc/c/6d0aaf5e0de00491de136f387ebed55604cedebe
[14/17] crypto/nx: Rename nx-842-pseries file name to nx-common-pseries

https://git.kernel.org/powerpc/c/7da00b0e71334aa1e3d8db1cc1f40eb47cb1e188
[15/17] crypto/nx: Get NX capabilities for GZIP coprocessor type

https://git.kernel.org/powerpc/c/b4ba22114c78de48fda3818f569f75e97d58c719
[16/17] crypto/nx: Add sysfs interface to export NX capabilities

https://git.kernel.org/powerpc/c/8c099490fd2bd3b012b3b6d0babbba3b90e69b55
[17/17] crypto/nx: Register and unregister VAS interface on PowerVM

https://git.kernel.org/powerpc/c/99cd49bb39516d1beb1c38ae629b15ccb923198c

cheers


Re: [PATCH v2] KVM: PPC: Book3S HV: Save host FSCR in the P7/8 path

2021-06-24 Thread Michael Ellerman
On Wed, 26 May 2021 22:58:51 +1000, Nicholas Piggin wrote:
> Similar to commit 25edcc50d76c ("KVM: PPC: Book3S HV: Save and restore
> FSCR in the P9 path"), ensure the P7/8 path saves and restores the host
> FSCR. The logic explained in that patch actually applies there to the
> old path well: a context switch can be made before kvmppc_vcpu_run_hv
> restores the host FSCR and returns.
> 
> Now both the p9 and the p7/8 paths now save and restore their FSCR, it
> no longer needs to be restored at the end of kvmppc_vcpu_run_hv

Applied to powerpc/topic/ppc-kvm.

[1/1] KVM: PPC: Book3S HV: Save host FSCR in the P7/8 path
  https://git.kernel.org/powerpc/c/6ba53317d497dec029bfb040b1daf38328fa00ab

cheers


Re: [PATCH] KVM: PPC: Book3S HV: Fix TLB management on SMT8 POWER9 and POWER10 processors

2021-06-24 Thread Michael Ellerman
On Wed, 2 Jun 2021 14:04:41 +1000, Nicholas Piggin wrote:
> The POWER9 vCPU TLB management code assumes all threads in a core share
> a TLB, and that TLBIEL execued by one thread will invalidate TLBs for
> all threads. This is not the case for SMT8 capable POWER9 and POWER10
> (big core) processors, where the TLB is split between groups of threads.
> This results in TLB multi-hits, random data corruption, etc.
> 
> Fix this by introducing cpu_first_tlb_thread_sibling etc., to determine
> which siblings share TLBs, and use that in the guest TLB flushing code.
> 
> [...]
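
For reference, the flushing loop described above takes roughly this
shape (cpu_first_tlb_thread_sibling comes from the commit message; the
other helper names and the flush body are assumed for illustration):

	for (i = cpu_first_tlb_thread_sibling(pcpu);
	     i <= cpu_last_tlb_thread_sibling(pcpu);
	     i += cpu_tlb_thread_sibling_step())
		flush_guest_tlb_for_thread(i);	/* hypothetical per-thread flush */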

Applied to powerpc/topic/ppc-kvm.

[1/1] KVM: PPC: Book3S HV: Fix TLB management on SMT8 POWER9 and POWER10 
processors
  https://git.kernel.org/powerpc/c/77bbbc0cf84834ed130838f7ac1988567f4d0288

cheers


Re: [PATCH v7 00/32] KVM: PPC: Book3S: C-ify the P9 entry/exit code

2021-06-24 Thread Michael Ellerman
On Fri, 28 May 2021 19:07:20 +1000, Nicholas Piggin wrote:
> Git tree here
> 
> https://github.com/npiggin/linux/tree/kvm-in-c-5.14-1
> 
> This series applies to upstream plus a couple of KVM regression fixes
> not yet in powerpc tree, which are included in the above git tree.
> 
> [...]

Applied to powerpc/topic/ppc-kvm.

[01/32] KVM: PPC: Book3S 64: move KVM interrupt entry to a common entry point

https://git.kernel.org/powerpc/c/f36011569b90b3973f07cea00c5872c4dc0c707f
[02/32] KVM: PPC: Book3S 64: Move GUEST_MODE_SKIP test into KVM

https://git.kernel.org/powerpc/c/f33e0702d98cc5ff21f44833525b07581862eb57
[03/32] KVM: PPC: Book3S 64: add hcall interrupt handler

https://git.kernel.org/powerpc/c/31c67cfe2a6a5a7364dc1552b877c6b7820dd556
[04/32] KVM: PPC: Book3S 64: Move hcall early register setup to KVM

https://git.kernel.org/powerpc/c/04ece7b60b689e1de38b9b0f597f8f94951e4367
[05/32] KVM: PPC: Book3S 64: Move interrupt early register setup to KVM

https://git.kernel.org/powerpc/c/69fdd67499716efca861f7cecabdfeee5e5d7b51
[06/32] KVM: PPC: Book3S 64: move bad_host_intr check to HV handler

https://git.kernel.org/powerpc/c/1b5821c630c219e3c6f643ebbefcf08c9fa714d8
[07/32] KVM: PPC: Book3S 64: Minimise hcall handler calling convention 
differences

https://git.kernel.org/powerpc/c/e2762743c6328dde14290cd58ddf2175b068ad80
[08/32] KVM: PPC: Book3S HV P9: implement kvmppc_xive_pull_vcpu in C

https://git.kernel.org/powerpc/c/023c3c96ca4d196c09d554d5a98900406e4d7ecb
[09/32] KVM: PPC: Book3S HV P9: Move setting HDEC after switching to guest LPCR

https://git.kernel.org/powerpc/c/413679e73bdfc2720dc2fa2172b65b7411185fa7
[10/32] KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races

https://git.kernel.org/powerpc/c/6ffe2c6e6dcefb971e4046f02086c4adadd0b310
[11/32] KVM: PPC: Book3S HV P9: Move xive vcpu context management into 
kvmhv_p9_guest_entry

https://git.kernel.org/powerpc/c/09512c29167bd3792820caf83bcca4d4e5ac2266
[12/32] KVM: PPC: Book3S HV P9: Move radix MMU switching instructions together

https://git.kernel.org/powerpc/c/48013cbc504e064d2318f24482cfbe3c53e0a812
[13/32] KVM: PPC: Book3S HV P9: Stop handling hcalls in real-mode in the P9 path

https://git.kernel.org/powerpc/c/9dc2babc185e0a24fbb48098daafd552cac157fa
[14/32] KVM: PPC: Book3S HV P9: Implement the rest of the P9 path in C

https://git.kernel.org/powerpc/c/89d35b23910158a9add33a206e973f4227906d3c
[15/32] KVM: PPC: Book3S HV P9: inline kvmhv_load_hv_regs_and_go into 
__kvmhv_vcpu_entry_p9

https://git.kernel.org/powerpc/c/c00366e2375408e43370cd7981af3354f7c83ed3
[16/32] KVM: PPC: Book3S HV P9: Read machine check registers while MSR[RI] is 0

https://git.kernel.org/powerpc/c/6d770e3fe9a120560cda66331ce5faa363400e97
[17/32] KVM: PPC: Book3S HV P9: Improve exit timing accounting coverage

https://git.kernel.org/powerpc/c/a32ed1bb70723ec7a6c888b6c7071d516cca0e8f
[18/32] KVM: PPC: Book3S HV P9: Move SPR loading after expiry time check

https://git.kernel.org/powerpc/c/68e3baaca8c56bbb336d2215f201f4047ce736e5
[19/32] KVM: PPC: Book3S HV P9: Add helpers for OS SPR handling

https://git.kernel.org/powerpc/c/edba6aff4f2c3893e168df6a2e9a20f3c39b0b30
[20/32] KVM: PPC: Book3S HV P9: Switch to guest MMU context as late as possible

https://git.kernel.org/powerpc/c/41f779917669fcc28a7f5646d1f7a85043c9d152
[21/32] KVM: PPC: Book3S HV: Implement radix prefetch workaround by disabling 
MMU

https://git.kernel.org/powerpc/c/2e1ae9cd56f8616a707185f3c6cb7ee2a20809e1
[22/32] KVM: PPC: Book3S HV: Remove support for dependent threads mode on P9

https://git.kernel.org/powerpc/c/aaae8c79005846eeafc7a0e5d3eda4e34ea8ca2e
[23/32] KVM: PPC: Book3S HV: Remove radix guest support from P7/8 path

https://git.kernel.org/powerpc/c/9769a7fd79b65a6a6f8362154ab59c36d0defbf3
[24/32] KVM: PPC: Book3S HV: Remove virt mode checks from real mode handlers

https://git.kernel.org/powerpc/c/dcbac73a5b374873bd6dfd8a0ee5d0b7fc844420
[25/32] KVM: PPC: Book3S HV: Remove unused nested HV tests in XICS emulation

https://git.kernel.org/powerpc/c/2ce008c8b25467ceacf45bcf0e183d660edb82f2
[26/32] KVM: PPC: Book3S HV P9: Allow all P9 processors to enable nested HV

https://git.kernel.org/powerpc/c/cbcff8b1c53e458ed4e23877048d7268fd13ab8a
[27/32] KVM: PPC: Book3S HV: small pseries_do_hcall cleanup

https://git.kernel.org/powerpc/c/a9aa86e08b3a0b2c273cdb772283c872e55f14bf
[28/32] KVM: PPC: Book3S HV: add virtual mode handlers for HPT hcalls and page 
faults

https://git.kernel.org/powerpc/c/6165d5dd99dbaec7a309491c3951bd81fc89978d
[29/32] KVM: PPC: Book3S HV P9: Reflect userspace hcalls to hash guests to 
support PR KVM

https://git.kernel.org/powerpc/c/ac3c8b41c27ea112daed031f852a4b361c11a03e
[30/32] KVM: PPC: Book3S HV P9: implement hash guest support
  

Re: [PATCH] KVM: PPC: Book3S HV: Workaround high stack usage with clang

2021-06-24 Thread Michael Ellerman
On Mon, 21 Jun 2021 11:24:40 -0700, Nathan Chancellor wrote:
> LLVM does not emit optimal byteswap assembly, which results in high
> stack usage in kvmhv_enter_nested_guest() due to the inlining of
> byteswap_pt_regs(). With LLVM 12.0.0:
> 
> arch/powerpc/kvm/book3s_hv_nested.c:289:6: error: stack frame size of
> 2512 bytes in function 'kvmhv_enter_nested_guest' 
> [-Werror,-Wframe-larger-than=]
> long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>  ^
> 1 error generated.
> 
> [...]
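
In effect the workaround is just to keep the byteswap helper out of the
caller's stack frame, along these lines (a sketch; the exact annotation
is in the commit):

	/* prevent inlining so each call reuses one small frame */
	static noinline_for_stack void byteswap_pt_regs(struct pt_regs *regs)
	{
		unsigned long *addr = (unsigned long *)regs;

		for (; addr < ((unsigned long *)(regs + 1)); addr++)
			*addr = swab64(*addr);
	}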

Applied to powerpc/topic/ppc-kvm.

[1/1] KVM: PPC: Book3S HV: Workaround high stack usage with clang
  https://git.kernel.org/powerpc/c/51696f39cbee5bb684e7959c0c98b5f54548aa34

cheers


Re: [PATCH v8 0/6] Support for H_RPT_INVALIDATE in PowerPC KVM

2021-06-24 Thread Michael Ellerman
On Mon, 21 Jun 2021 14:19:57 +0530, Bharata B Rao wrote:
> This patchset adds support for the new hcall H_RPT_INVALIDATE
> and replaces the nested tlb flush calls with this new hcall
> if support for the same exists.
> 
> Changes in v8:
> -
> - Used tlb_single_page_flush_ceiling in the process-scoped range
>   flush routine to switch to full PID invalidation if
>   the number of pages is above the threshold
> - Moved iterating over page sizes into the actual routine that
>   handles the eventual flushing thereby limiting the page size
>   iteration only to range based flushing
> - Converted #if 0 section into a comment section to avoid
>   checkpatch from complaining.
> - Used a threshold in the partition-scoped range flushing
>   to switch to full LPID invalidation
> 
> [...]

Applied to powerpc/topic/ppc-kvm.

[1/6] KVM: PPC: Book3S HV: Fix comments of H_RPT_INVALIDATE arguments
  https://git.kernel.org/powerpc/c/f09216a190a4c2f62e1725f9d92e7c122b4ee423
[2/6] powerpc/book3s64/radix: Add H_RPT_INVALIDATE pgsize encodings to 
mmu_psize_def
  https://git.kernel.org/powerpc/c/d6265cb33b710789cbc390316eba50a883d6dcc8
[3/6] KVM: PPC: Book3S HV: Add support for H_RPT_INVALIDATE
  https://git.kernel.org/powerpc/c/f0c6fbbb90504fb7e9dbf0865463d3c2b4de49e5
[4/6] KVM: PPC: Book3S HV: Nested support in H_RPT_INVALIDATE
  https://git.kernel.org/powerpc/c/53324b51c5eee22d420a2df68b1820d929fa90f3
[5/6] KVM: PPC: Book3S HV: Add KVM_CAP_PPC_RPT_INVALIDATE capability
  https://git.kernel.org/powerpc/c/b87cc116c7e1bc62a84d8c46acd401db179edb11
[6/6] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM
  https://git.kernel.org/powerpc/c/81468083f3c76a08183813e3af63a7c3cea3f537

cheers


Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Nicholas Piggin
Excerpts from Paolo Bonzini's message of June 24, 2021 10:41 pm:
> On 24/06/21 13:42, Nicholas Piggin wrote:
>> Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
>>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
 KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
 follow_pte in gfn_to_pfn. However, the resolved pfns may not have
associated struct pages, so they should not be passed to pfn_to_page.
 This series removes such calls from the x86 and arm64 secondary MMU. To
 do this, this series modifies gfn_to_pfn to return a struct page in
 addition to a pfn, if the hva was resolved by gup. This allows the
caller to call put_page only when necessitated by gup.

 This series provides a helper function that unwraps the new return type
 of gfn_to_pfn to provide behavior identical to the old behavior. As I
 have no hardware to test powerpc/mips changes, the function is used
 there for minimally invasive changes. Additionally, as gfn_to_page and
 gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
 easily changed over to only use pfns.

 This addresses CVE-2021-22543 on x86 and arm64.
>>>
>>> Does this fix the problem? (untested I don't have a POC setup at hand,
>>> but at least in concept)
>> 
>> This one actually compiles at least. Unfortunately I don't have much
>> time in the near future to test, and I only just found out about this
>> CVE a few hours ago.
> 
> And it also works (the reproducer gets an infinite stream of userspace 
> exits and especially does not crash).  We can still go for David's 
> solution later since MMU notifiers are able to deal with these pages, but 
> it's a very nice patch for stable kernels.

Oh nice, thanks for testing. How's this?

Thanks,
Nick

---

KVM: Fix page ref underflow for regions with valid but non-refcounted pages

It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc.
family of APIs, which take a reference to the page, raising its refcount
from 0 to 1. When the reference is dropped, this will free the page
incorrectly.

Fix this by only taking a reference on the page if its refcount was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

Signed-off-by: Nicholas Piggin 
---
 virt/kvm/kvm_main.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, 
bool write_fault)
return true;
 }
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+   if (kvm_is_reserved_pfn(pfn))
+   return 1;
+   return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
   unsigned long addr, bool *async,
   bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct 
*vma,
 * Whoever called remap_pfn_range is also going to call e.g.
 * unmap_mapping_range before the underlying pages are freed,
 * causing a call to our MMU notifier.
+*
+* Certain IO or PFNMAP mappings can be backed with valid
+* struct pages, but be allocated without refcounting e.g.,
+* tail pages of non-compound higher order allocations, which
+* would then underflow the refcount when the caller does the
+* required put_page. Don't allow those pages here.
 */ 
-   kvm_get_pfn(pfn);
+   if (!kvm_try_get_pfn(pfn))
+   r = -EFAULT;
 
 out:
pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
-   return 0;
+
+   return r;
 }
 
 /*
-- 
2.23.0


Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 13:42, Nicholas Piggin wrote:

Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:

KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
follow_pte in gfn_to_pfn. However, the resolved pfns may not have
associated struct pages, so they should not be passed to pfn_to_page.
This series removes such calls from the x86 and arm64 secondary MMU. To
do this, this series modifies gfn_to_pfn to return a struct page in
addition to a pfn, if the hva was resolved by gup. This allows the
caller to call put_page only when necessitated by gup.

This series provides a helper function that unwraps the new return type
of gfn_to_pfn to provide behavior identical to the old behavior. As I
have no hardware to test powerpc/mips changes, the function is used
there for minimally invasive changes. Additionally, as gfn_to_page and
gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
easily changed over to only use pfns.

This addresses CVE-2021-22543 on x86 and arm64.


Does this fix the problem? (untested I don't have a POC setup at hand,
but at least in concept)


This one actually compiles at least. Unfortunately I don't have much
time in the near future to test, and I only just found out about this
CVE a few hours ago.


And it also works (the reproducer gets an infinite stream of userspace 
exits and especially does not crash).  We can still go for David's 
solution later since MMU notifiers are able to deal with these pages, but 
it's a very nice patch for stable kernels.


If you provide a Signed-off-by, I can integrate it.

Paolo


---


It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc. family
of APIs, which take a reference to the page, raising its refcount from 0 to 1.
When that reference is dropped, the page is freed incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

---
  virt/kvm/kvm_main.c | 19 +--
  1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, 
bool write_fault)
return true;
  }
  
+static int kvm_try_get_pfn(kvm_pfn_t pfn)

+{
+   if (kvm_is_reserved_pfn(pfn))
+   return 1;
+   return get_page_unless_zero(pfn_to_page(pfn));
+}
+
  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
   unsigned long addr, bool *async,
   bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct 
*vma,
 * Whoever called remap_pfn_range is also going to call e.g.
 * unmap_mapping_range before the underlying pages are freed,
 * causing a call to our MMU notifier.
+*
+* Certain IO or PFNMAP mappings can be backed with valid
+* struct pages, but be allocated without refcounting e.g.,
+* tail pages of non-compound higher order allocations, which
+* would then underflow the refcount when the caller does the
+* required put_page. Don't allow those pages here.
 */
-   kvm_get_pfn(pfn);
+   if (!kvm_try_get_pfn(pfn))
+   r = -EFAULT;
  
  out:

pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
-   return 0;
+
+   return r;
  }
  
  /*






[PATCH] powerpc/64s: Fix boot failure with 4K Radix

2021-06-24 Thread Michael Ellerman
When using the Radix MMU our PGD is always 64K, and must be naturally
aligned.

For a 4K page size kernel that means page alignment of swapper_pg_dir is
not sufficient, leading to failure to boot.

Use the existing MAX_PTRS_PER_PGD which has the correct value, and
avoids us hard-coding 64K here.

Fixes: e72421a085a8 ("powerpc: Define swapper_pg_dir[] in C")
Reported-by: Daniel Axtens 
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/mm/pgtable.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 1707ab580ee2..cd16b407f47e 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -28,7 +28,13 @@
 #include 
 #include 
 
-pgd_t swapper_pg_dir[MAX_PTRS_PER_PGD] __page_aligned_bss;
+#ifdef CONFIG_PPC64
+#define PGD_ALIGN (sizeof(pgd_t) * MAX_PTRS_PER_PGD)
+#else
+#define PGD_ALIGN PAGE_SIZE
+#endif
+
+pgd_t swapper_pg_dir[MAX_PTRS_PER_PGD] __section(".bss..page_aligned") 
__aligned(PGD_ALIGN);
 
 static inline int is_exec_fault(void)
 {
-- 
2.25.1
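
For context, the 64K figure falls straight out of the table geometry; a
quick sanity check of the new PGD_ALIGN (entry count assumed from the
Radix MMU configuration, which uses 2^13 PGD entries of 8 bytes each):

/*
 * sizeof(pgd_t) * MAX_PTRS_PER_PGD = 8 * 8192 = 65536 = 64K
 *
 * i.e. the natural alignment of the Radix PGD, independent of whether
 * the kernel's base page size is 4K or 64K. __page_aligned_bss only
 * guaranteed PAGE_SIZE alignment, which is 4K on a 4K kernel, hence
 * the boot failure.
 */
#define PGD_ALIGN (sizeof(pgd_t) * MAX_PTRS_PER_PGD)	/* 64K on Radix */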



Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 13:42, Nicholas Piggin wrote:

+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+   if (kvm_is_reserved_pfn(pfn))
+   return 1;


So !pfn_valid would always return true.  Yeah, this should work and is 
certainly appealing!


Paolo



+   return get_page_unless_zero(pfn_to_page(pfn));
+}
+
  static int hva_to_pfn_remapped(struct vm_area_struct *vma,
   unsigned long addr, bool *async,
   bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct 
*vma,
 * Whoever called remap_pfn_range is also going to call e.g.
 * unmap_mapping_range before the underlying pages are freed,
 * causing a call to our MMU notifier.
+*
+* Certain IO or PFNMAP mappings can be backed with valid
+* struct pages, but be allocated without refcounting e.g.,
+* tail pages of non-compound higher order allocations, which
+* would then underflow the refcount when the caller does the
+* required put_page. Don't allow those pages here.
 */
-   kvm_get_pfn(pfn);
+   if (!kvm_try_get_pfn(pfn))
+   r = -EFAULT;
  
  out:

pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
-   return 0;
+
+   return r;
  }
  
  /*






Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Will Deacon
On Thu, Jun 24, 2021 at 12:34:09PM +0100, Robin Murphy wrote:
> On 2021-06-24 12:18, Will Deacon wrote:
> > On Thu, Jun 24, 2021 at 12:14:39PM +0100, Robin Murphy wrote:
> > > On 2021-06-24 07:05, Claire Chang wrote:
> > > > On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:
> > > > > 
> > > > > On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
> > > > > > is_swiotlb_force_bounce at 
> > > > > > /usr/src/linux-next/./include/linux/swiotlb.h:119
> > > > > > 
> > > > > > is_swiotlb_force_bounce() was the new function introduced in this 
> > > > > > patch here.
> > > > > > 
> > > > > > +static inline bool is_swiotlb_force_bounce(struct device *dev)
> > > > > > +{
> > > > > > + return dev->dma_io_tlb_mem->force_bounce;
> > > > > > +}
> > > > > 
> > > > > To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
> > > > > turn this into :
> > > > > 
> > > > >   return dev->dma_io_tlb_mem && 
> > > > > dev->dma_io_tlb_mem->force_bounce;
> > > > > 
> > > > > for a quick debug check?
> > > > 
> > > > I just realized that dma_io_tlb_mem might be NULL like Christoph
> > > > pointed out since swiotlb might not get initialized.
> > > > However,  `Unable to handle kernel paging request at virtual address
> > > > dfff800e` looks more like the address is garbage rather than
> > > > NULL?
> > > > I wonder if that's because dev->dma_io_tlb_mem is not assigned
> > > > properly (which means device_initialize is not called?).
> > > 
> > > What also looks odd is that the base "address" 0xdfff8000 is held 
> > > in
> > > a couple of registers, but the offset 0xe looks too small to match up to 
> > > any
> > > relevant structure member in that dereference chain :/
> > 
> > FWIW, I've managed to trigger a NULL dereference locally when swiotlb hasn't
> > been initialised but we dereference 'dev->dma_io_tlb_mem', so I think
> > Christoph's suggestion is needed regardless.
> 
> Ack to that - for SWIOTLB_NO_FORCE, io_tlb_default_mem will remain NULL. The
> massive jump in KernelCI baseline failures as of yesterday looks like every
> arm64 machine with less than 4GB of RAM blowing up...

Ok, diff below which attempts to tackle the offset issue I mentioned as
well. Qian Cai -- please can you try with these changes?

Will

--->8

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 175b6c113ed8..39284ff2a6cd 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -116,7 +116,9 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
 
 static inline bool is_swiotlb_force_bounce(struct device *dev)
 {
-   return dev->dma_io_tlb_mem->force_bounce;
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+
+   return mem && mem->force_bounce;
 }
 
 void __init swiotlb_exit(void);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 44be8258e27b..0ffbaae9fba2 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -449,6 +449,7 @@ static int swiotlb_find_slots(struct device *dev, 
phys_addr_t orig_addr,
dma_get_min_align_mask(dev) & ~(IO_TLB_SIZE - 1);
unsigned int nslots = nr_slots(alloc_size), stride;
unsigned int index, wrap, count = 0, i;
+   unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned long flags;
 
BUG_ON(!nslots);
@@ -497,7 +498,7 @@ static int swiotlb_find_slots(struct device *dev, 
phys_addr_t orig_addr,
for (i = index; i < index + nslots; i++) {
mem->slots[i].list = 0;
mem->slots[i].alloc_size =
-   alloc_size - ((i - index) << IO_TLB_SHIFT);
+   alloc_size - (offset + ((i - index) << IO_TLB_SHIFT));
}
for (i = index - 1;
 io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
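
A worked example of what the second hunk changes (numbers invented, and
assuming the mapping path folds the device's min-align offset into the
alloc_size it passes down):

/*
 * IO_TLB_SIZE = 2K slots, min-align offset within the first slot = 1K,
 * alloc_size (offset folded in) = 6K, i.e. three slots.
 *
 * Payload bytes reachable from slot i onwards:
 *
 *   old: slots[index + 0].alloc_size = 6K        (overstates by 1K)
 *        slots[index + 1].alloc_size = 6K - 2K = 4K
 *   new: slots[index + 0].alloc_size = 6K - 1K = 5K
 *        slots[index + 1].alloc_size = 6K - (1K + 2K) = 3K
 *
 * The per-slot sizes now describe the payload that actually sits above
 * (slot start + offset), keeping later bounce-size checks honest.
 */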


Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of June 24, 2021 8:34 pm:
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
>> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
>> associated struct pages, so they should not be passed to pfn_to_page.
>> This series removes such calls from the x86 and arm64 secondary MMU. To
>> do this, this series modifies gfn_to_pfn to return a struct page in
>> addition to a pfn, if the hva was resolved by gup. This allows the
>> caller to call put_page only when necessitated by gup.
>> 
>> This series provides a helper function that unwraps the new return type
>> of gfn_to_pfn to provide behavior identical to the old behavior. As I
>> have no hardware to test powerpc/mips changes, the function is used
>> there for minimally invasive changes. Additionally, as gfn_to_page and
>> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
>> easily changed over to only use pfns.
>> 
>> This addresses CVE-2021-22543 on x86 and arm64.
> 
> Does this fix the problem? (untested, as I don't have a POC setup at hand,
> but at least in concept)

This one actually compiles at least. Unfortunately I don't have much 
time in the near future to test, and I only just found out about this
CVE a few hours ago.

---


It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc. family
of APIs, which take a reference to the page, raising its refcount from 0 to 1.
When that reference is dropped, the page is freed incorrectly.

Fix this by only taking a reference on the page if it was non-zero,
which indicates it is participating in normal refcounting (and can be
released with put_page).

---
 virt/kvm/kvm_main.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..46fb042837d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2055,6 +2055,13 @@ static bool vma_is_valid(struct vm_area_struct *vma, 
bool write_fault)
return true;
 }
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+   if (kvm_is_reserved_pfn(pfn))
+   return 1;
+   return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 static int hva_to_pfn_remapped(struct vm_area_struct *vma,
   unsigned long addr, bool *async,
   bool write_fault, bool *writable,
@@ -2104,13 +2111,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct 
*vma,
 * Whoever called remap_pfn_range is also going to call e.g.
 * unmap_mapping_range before the underlying pages are freed,
 * causing a call to our MMU notifier.
+*
+* Certain IO or PFNMAP mappings can be backed with valid
+* struct pages, but be allocated without refcounting e.g.,
+* tail pages of non-compound higher order allocations, which
+* would then underflow the refcount when the caller does the
+* required put_page. Don't allow those pages here.
 */ 
-   kvm_get_pfn(pfn);
+   if (!kvm_try_get_pfn(pfn))
+   r = -EFAULT;
 
 out:
pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
-   return 0;
+
+   return r;
 }
 
 /*
-- 
2.23.0



Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Robin Murphy

On 2021-06-24 12:18, Will Deacon wrote:

On Thu, Jun 24, 2021 at 12:14:39PM +0100, Robin Murphy wrote:

On 2021-06-24 07:05, Claire Chang wrote:

On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:


On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:

is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119

is_swiotlb_force_bounce() was the new function introduced in this patch here.

+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+ return dev->dma_io_tlb_mem->force_bounce;
+}


To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
turn this into :

  return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;

for a quick debug check?


I just realized that dma_io_tlb_mem might be NULL like Christoph
pointed out since swiotlb might not get initialized.
However,  `Unable to handle kernel paging request at virtual address
dfff800e` looks more like the address is garbage rather than
NULL?
I wonder if that's because dev->dma_io_tlb_mem is not assigned
properly (which means device_initialize is not called?).


What also looks odd is that the base "address" 0xdfff8000 is held in
a couple of registers, but the offset 0xe looks too small to match up to any
relevant structure member in that dereference chain :/


FWIW, I've managed to trigger a NULL dereference locally when swiotlb hasn't
been initialised but we dereference 'dev->dma_io_tlb_mem', so I think
Christoph's suggestion is needed regardless.


Ack to that - for SWIOTLB_NO_FORCE, io_tlb_default_mem will remain NULL. 
The massive jump in KernelCI baseline failures as of yesterday looks 
like every arm64 machine with less than 4GB of RAM blowing up...


Robin.


But I agree that it won't help
with the issue reported by Qian Cai.

Qian Cai: please can you share your .config and your command line?

Thanks,

Will



[PATCH printk v3 4/6] printk: remove NMI tracking

2021-06-24 Thread John Ogness
All NMI contexts are handled the same as the safe context: store the
message and defer printing. There is no need to have special NMI
context tracking for this. Using in_nmi() is enough.

Signed-off-by: John Ogness 
Reviewed-by: Petr Mladek 
---
 arch/arm/kernel/smp.c   |  2 --
 arch/powerpc/kexec/crash.c  |  3 ---
 include/linux/hardirq.h |  2 --
 include/linux/printk.h  | 12 
 init/Kconfig|  5 -
 kernel/printk/internal.h|  6 --
 kernel/printk/printk_safe.c | 37 +
 kernel/trace/trace.c|  2 --
 8 files changed, 1 insertion(+), 68 deletions(-)

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index 74679240a9d8..0dd2d733ad62 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -668,9 +668,7 @@ static void do_handle_IPI(int ipinr)
break;
 
case IPI_CPU_BACKTRACE:
-   printk_nmi_enter();
nmi_cpu_backtrace(get_irq_regs());
-   printk_nmi_exit();
break;
 
default:
diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index c9a889880214..d488311efab1 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -311,9 +311,6 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
unsigned int i;
int (*old_handler)(struct pt_regs *regs);
 
-   /* Avoid hardlocking with irresponsive CPU holding logbuf_lock */
-   printk_nmi_enter();
-
/*
 * This function is only called after the system
 * has panicked or is otherwise in a critical state.
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 69bc86ea382c..76878b357ffa 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -116,7 +116,6 @@ extern void rcu_nmi_exit(void);
do {\
lockdep_off();  \
arch_nmi_enter();   \
-   printk_nmi_enter(); \
BUG_ON(in_nmi() == NMI_MASK);   \
__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);   \
} while (0)
@@ -135,7 +134,6 @@ extern void rcu_nmi_exit(void);
do {\
BUG_ON(!in_nmi());  \
__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);   \
-   printk_nmi_exit();  \
arch_nmi_exit();\
lockdep_on();   \
} while (0)
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 664612f75dac..14a18fa15c92 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -149,18 +149,6 @@ static inline __printf(1, 2) __cold
 void early_printk(const char *s, ...) { }
 #endif
 
-#ifdef CONFIG_PRINTK_NMI
-extern void printk_nmi_enter(void);
-extern void printk_nmi_exit(void);
-extern void printk_nmi_direct_enter(void);
-extern void printk_nmi_direct_exit(void);
-#else
-static inline void printk_nmi_enter(void) { }
-static inline void printk_nmi_exit(void) { }
-static inline void printk_nmi_direct_enter(void) { }
-static inline void printk_nmi_direct_exit(void) { }
-#endif /* PRINTK_NMI */
-
 struct dev_printk_info;
 
 #ifdef CONFIG_PRINTK
diff --git a/init/Kconfig b/init/Kconfig
index 5babea38e346..4053588db7a0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1488,11 +1488,6 @@ config PRINTK
  very difficult to diagnose system problems, saying N here is
  strongly discouraged.
 
-config PRINTK_NMI
-   def_bool y
-   depends on PRINTK
-   depends on HAVE_NMI
-
 config BUG
bool "BUG() support" if EXPERT
default y
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 6cc35c5de890..ff890ae3ee6a 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -6,12 +6,6 @@
 
 #ifdef CONFIG_PRINTK
 
-#define PRINTK_SAFE_CONTEXT_MASK   0x007ff
-#define PRINTK_NMI_DIRECT_CONTEXT_MASK 0x00800
-#define PRINTK_NMI_CONTEXT_MASK0xff000
-
-#define PRINTK_NMI_CONTEXT_OFFSET  0x01000
-
 __printf(4, 0)
 int vprintk_store(int facility, int level,
  const struct dev_printk_info *dev_info,
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index 0456cd48d01c..47d961149a06 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -4,12 +4,9 @@
  */
 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -17,35 +14,6 @@
 
 static DEFINE_PER_CPU(int, printk_context);
 
-#ifdef CONFIG_PRINTK_NMI
-void noinstr printk_nmi_enter(void)
-{
-   this_cpu_add(printk_context, PRINTK_NMI_CONTEXT_OFFSET);
-}
-
-void noinstr 
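
The flow this series ends up with, sketched (an outline only; the
else-branch helper name below is made up, not the kernel's):

asmlinkage int vprintk(const char *fmt, va_list args)
{
	/* storage is lockless, so any context can store immediately */
	int len = vprintk_store(0, LOGLEVEL_DEFAULT, NULL, fmt, args);

	if (in_nmi())
		defer_console_output();		/* irq_work prints later */
	else
		console_flush_if_possible();	/* made-up name for the
						   usual trylock path */
	return len;
}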

[PATCH printk v3 3/6] printk: remove safe buffers

2021-06-24 Thread John Ogness
With @logbuf_lock removed, the high level printk functions for
storing messages are lockless. Messages can be stored from any
context, so there is no need for the NMI and safe buffers anymore.
Remove the NMI and safe buffers.

Although the safe buffers are removed, the NMI and safe context
tracking is still in place. In these contexts, store the message
immediately but still use irq_work to defer the console printing.

Since printk recursion tracking is in place, safe context tracking
for most of printk is not needed. Remove it. Only safe context
tracking relating to the console lock is left in place. This is
because the console lock is needed for the actual printing.

Signed-off-by: John Ogness 
---
 arch/powerpc/kernel/traps.c|   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 include/linux/printk.h |  10 -
 kernel/kexec_core.c|   1 -
 kernel/panic.c |   3 -
 kernel/printk/internal.h   |  17 --
 kernel/printk/printk.c | 126 +
 kernel/printk/printk_safe.c| 332 +
 lib/nmi_backtrace.c|   6 -
 9 files changed, 51 insertions(+), 450 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index a44a30b0688c..5828c83eaca6 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -171,7 +171,6 @@ extern void panic_flush_kmsg_start(void)
 
 extern void panic_flush_kmsg_end(void)
 {
-   printk_safe_flush_on_panic();
kmsg_dump(KMSG_DUMP_PANIC);
bust_spinlocks(0);
debug_locks_off();
diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index c9a8f4781a10..dc17d8903d4f 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -183,11 +183,6 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 
wd_smp_unlock(&flags);
 
-   printk_safe_flush();
-   /*
-* printk_safe_flush() seems to require another print
-* before anything actually goes out to console.
-*/
if (sysctl_hardlockup_all_cpu_backtrace)
trigger_allbutself_cpu_backtrace();
 
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 1790a5521fd9..664612f75dac 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -207,8 +207,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char 
*fmt, ...);
 void dump_stack_print_info(const char *log_lvl);
 void show_regs_print_info(const char *log_lvl);
 extern asmlinkage void dump_stack(void) __cold;
-extern void printk_safe_flush(void);
-extern void printk_safe_flush_on_panic(void);
 #else
 static inline __printf(1, 0)
 int vprintk(const char *s, va_list args)
@@ -272,14 +270,6 @@ static inline void show_regs_print_info(const char 
*log_lvl)
 static inline void dump_stack(void)
 {
 }
-
-static inline void printk_safe_flush(void)
-{
-}
-
-static inline void printk_safe_flush_on_panic(void)
-{
-}
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index a0b6780740c8..480d5f77ef4f 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -977,7 +977,6 @@ void crash_kexec(struct pt_regs *regs)
old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
if (old_cpu == PANIC_CPU_INVALID) {
/* This is the 1st CPU which comes here, so go ahead. */
-   printk_safe_flush_on_panic();
__crash_kexec(regs);
 
/*
diff --git a/kernel/panic.c b/kernel/panic.c
index 332736a72a58..1f0df42f8d0c 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -247,7 +247,6 @@ void panic(const char *fmt, ...)
 * Bypass the panic_cpu check and call __crash_kexec directly.
 */
if (!_crash_kexec_post_notifiers) {
-   printk_safe_flush_on_panic();
__crash_kexec(NULL);
 
/*
@@ -271,8 +270,6 @@ void panic(const char *fmt, ...)
 */
atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
 
-   /* Call flush even twice. It tries harder with a single online CPU */
-   printk_safe_flush_on_panic();
kmsg_dump(KMSG_DUMP_PANIC);
 
/*
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 51615c909b2f..6cc35c5de890 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -22,7 +22,6 @@ __printf(1, 0) int vprintk_deferred(const char *fmt, va_list 
args);
 void __printk_safe_enter(void);
 void __printk_safe_exit(void);
 
-void printk_safe_init(void);
 bool printk_percpu_data_ready(void);
 
 #define printk_safe_enter_irqsave(flags)   \
@@ -37,18 +36,6 @@ bool printk_percpu_data_ready(void);
local_irq_restore(flags);   \
} while (0)
 
-#define printk_safe_enter_irq()\
-   do {\
-   local_irq_disable();\
-   __printk_safe_enter();  \
-   } while (0)
-
-#define printk_safe_exit_irq() 

[PATCH printk v3 0/6] printk: remove safe buffers

2021-06-24 Thread John Ogness
Hi,

Here is v3 of a series to remove the safe buffers. v2 can be
found here [0]. The safe buffers are no longer needed because
messages can be stored directly into the log buffer from any
context.

However, the safe buffers also provided a form of recursion
protection. For that reason, explicit recursion protection is
implemented for this series.
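
The shape of that protection, sketched (an outline of the approach;
shown as functions for brevity, while the series itself uses macros):

#define PRINTK_MAX_RECURSION 3

static DEFINE_PER_CPU(char, printk_count);

static bool printk_enter_irqsave(unsigned long *flags)
{
	local_irq_save(*flags);
	if (this_cpu_read(printk_count) > PRINTK_MAX_RECURSION) {
		local_irq_restore(*flags);
		return false;		/* recursion too deep: drop it */
	}
	this_cpu_inc(printk_count);
	return true;
}

static void printk_exit_irqrestore(unsigned long flags)
{
	this_cpu_dec(printk_count);
	local_irq_restore(flags);
}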

The safe buffers also implicitly provided serialization
between multiple CPUs executing in NMI context. This was
particularly necessary for the nmi_backtrace() output. This
serialization is now preserved by using the printk_cpu_lock.

And finally, with the removal of the safe buffers, there is no
need for extra NMI enter/exit tracking. So this is also removed
(which includes removing config option CONFIG_PRINTK_NMI).

Changes since v2:

- Move irq disabling/enabling out of the
  console_lock_spinning_*() functions to simplify the patches and
  keep the function prototypes simple.

- Change printk_enter_irqsave()/printk_exit_irqrestore() to
  macros to allow a more common calling convention for irq
  flags.

- Use the counter pointer from printk_enter_irqsave() in
  printk_exit_irqrestore() rather than fetching it again. This
  avoids any possible race conditions when printk's percpu
  flag is set.

- Use the printk_cpu_lock to serialize banner and regs with
  the stack dump in nmi_cpu_backtrace().

John Ogness

[0] https://lore.kernel.org/lkml/20210330153512.1182-1-john.ogn...@linutronix.de

John Ogness (6):
  lib/nmi_backtrace: explicitly serialize banner and regs
  printk: track/limit recursion
  printk: remove safe buffers
  printk: remove NMI tracking
  printk: convert @syslog_lock to mutex
  printk: syslog: close window between wait and read

 arch/arm/kernel/smp.c  |   2 -
 arch/powerpc/kernel/traps.c|   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 arch/powerpc/kexec/crash.c |   3 -
 include/linux/hardirq.h|   2 -
 include/linux/printk.h |  22 --
 init/Kconfig   |   5 -
 kernel/kexec_core.c|   1 -
 kernel/panic.c |   3 -
 kernel/printk/internal.h   |  23 ---
 kernel/printk/printk.c | 273 +++--
 kernel/printk/printk_safe.c| 361 +
 kernel/trace/trace.c   |   2 -
 lib/nmi_backtrace.c|  13 +-
 14 files changed, 176 insertions(+), 540 deletions(-)


base-commit: 48e72544d6f06daedbf1d9b14610be89dba67526
-- 
2.20.1



Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Will Deacon
On Thu, Jun 24, 2021 at 12:14:39PM +0100, Robin Murphy wrote:
> On 2021-06-24 07:05, Claire Chang wrote:
> > On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:
> > > 
> > > On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
> > > > is_swiotlb_force_bounce at 
> > > > /usr/src/linux-next/./include/linux/swiotlb.h:119
> > > > 
> > > > is_swiotlb_force_bounce() was the new function introduced in this patch 
> > > > here.
> > > > 
> > > > +static inline bool is_swiotlb_force_bounce(struct device *dev)
> > > > +{
> > > > + return dev->dma_io_tlb_mem->force_bounce;
> > > > +}
> > > 
> > > To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
> > > turn this into :
> > > 
> > >  return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;
> > > 
> > > for a quick debug check?
> > 
> > I just realized that dma_io_tlb_mem might be NULL like Christoph
> > pointed out since swiotlb might not get initialized.
> > However,  `Unable to handle kernel paging request at virtual address
> > dfff800e` looks more like the address is garbage rather than
> > NULL?
> > I wonder if that's because dev->dma_io_tlb_mem is not assigned
> > properly (which means device_initialize is not called?).
> 
> What also looks odd is that the base "address" 0xdfff8000 is held in
> a couple of registers, but the offset 0xe looks too small to match up to any
> relevant structure member in that dereference chain :/

FWIW, I've managed to trigger a NULL dereference locally when swiotlb hasn't
been initialised but we dereference 'dev->dma_io_tlb_mem', so I think
Christoph's suggestion is needed regardless. But I agree that it won't help
with the issue reported by Qian Cai.

Qian Cai: please can you share your .config and your command line?

Thanks,

Will


Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Robin Murphy

On 2021-06-24 07:05, Claire Chang wrote:

On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:


On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:

is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119

is_swiotlb_force_bounce() was the new function introduced in this patch here.

+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+ return dev->dma_io_tlb_mem->force_bounce;
+}


To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
turn this into :

 return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;

for a quick debug check?


I just realized that dma_io_tlb_mem might be NULL like Christoph
pointed out since swiotlb might not get initialized.
However,  `Unable to handle kernel paging request at virtual address
dfff800e` looks more like the address is garbage rather than
NULL?
I wonder if that's because dev->dma_io_tlb_mem is not assigned
properly (which means device_initialize is not called?).


What also looks odd is that the base "address" 0xdfff8000 is 
held in a couple of registers, but the offset 0xe looks too small to 
match up to any relevant structure member in that dereference chain :/


Robin.


Re: [PATCH v2] powerpc/kprobes: Fix Oops by passing ppc_inst as a pointer to emulate_step() on ppc32

2021-06-24 Thread Christophe Leroy




Le 24/06/2021 à 12:59, Naveen N. Rao a écrit :

Christophe Leroy wrote:

From: Naveen N. Rao 

Trying to use a kprobe on ppc32 results in the below splat:
    BUG: Unable to handle kernel data access on read at 0x7c0802a6
    Faulting instruction address: 0xc002e9f0
    Oops: Kernel access of bad area, sig: 11 [#1]
    BE PAGE_SIZE=4K PowerPC 44x Platform
    Modules linked in:
    CPU: 0 PID: 89 Comm: sh Not tainted 5.13.0-rc1-01824-g3a81c0495fdb #7
    NIP:  c002e9f0 LR: c0011858 CTR: 8a47
    REGS: c292fd50 TRAP: 0300   Not tainted  (5.13.0-rc1-01824-g3a81c0495fdb)
    MSR:  9000   CR: 24002002  XER: 2000
    DEAR: 7c0802a6 ESR: 
    
    NIP [c002e9f0] emulate_step+0x28/0x324
    LR [c0011858] optinsn_slot+0x128/0x1
    Call Trace:
 opt_pre_handler+0x7c/0xb4 (unreliable)
 optinsn_slot+0x128/0x1
 ret_from_syscall+0x0/0x28

The offending instruction is:
    81 24 00 00 lwz r9,0(r4)

Here, we are trying to load the second argument to emulate_step():
struct ppc_inst, which is the instruction to be emulated. On ppc64,
structures are passed in registers when passed by value. However, per
the ppc32 ABI, structures are always passed to functions as pointers.
This isn't being adhered to when setting up the call to emulate_step()
in the optprobe trampoline. Fix the same.

Fixes: eacf4c0202654a ("powerpc: Enable OPTPROBES on PPC32")
Cc: sta...@vger.kernel.org
Signed-off-by: Naveen N. Rao 
---
v2: Rebased on powerpc/merge 7f030e9d57b8
Signed-off-by: Christophe Leroy 


Thanks for rebasing this!

I think git am drops everything after three dashes, so applying this patch will drop your SOB. The 
recommended style (*) for adding a changelog is to include it within [] before the second SOB.




Yes, I saw that after sending the mail. Usually I add a signed-off-by with 'git commit --amend -s' when I 
add the history, so it goes before the '---' I'm adding.


This time I must have forgotten the '-s' so it was added by the 'git format-patch -s' which is in my 
submit script, and so it was added at the end.


It's not really a big deal, up to Michael to either move it to the right place or discard it, I 
don't really mind. Do the easiest for you.


Thanks
Christophe
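
For anyone following the ABI detail in the commit message, a compact
illustration (simplified struct layout, assumed for the example):

struct ppc_inst { u32 val; };	/* simplified for illustration */

int emulate_step(struct pt_regs *regs, struct ppc_inst instr);

/*
 * The C call site is identical on both ABIs:
 *
 *	emulate_step(regs, insn);
 *
 * but the conventions differ:
 *  - ppc64: a small aggregate passed by value travels in a GPR, so r4
 *    holds the instruction word itself.
 *  - ppc32 (SVR4 ABI): aggregates are passed by reference to a copy,
 *    so r4 holds a pointer and the callee begins with lwz r9,0(r4).
 *
 * A hand-rolled trampoline that loads the raw word into r4 therefore
 * works on ppc64 but dereferences that word as an address on ppc32 --
 * which is exactly the 0x7c0802a6 fault in the splat above.
 */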


Re: [PATCH v2] powerpc/kprobes: Fix Oops by passing ppc_inst as a pointer to emulate_step() on ppc32

2021-06-24 Thread Naveen N. Rao

Christophe Leroy wrote:

From: Naveen N. Rao 

Trying to use a kprobe on ppc32 results in the below splat:
BUG: Unable to handle kernel data access on read at 0x7c0802a6
Faulting instruction address: 0xc002e9f0
Oops: Kernel access of bad area, sig: 11 [#1]
BE PAGE_SIZE=4K PowerPC 44x Platform
Modules linked in:
CPU: 0 PID: 89 Comm: sh Not tainted 5.13.0-rc1-01824-g3a81c0495fdb #7
NIP:  c002e9f0 LR: c0011858 CTR: 8a47
REGS: c292fd50 TRAP: 0300   Not tainted  (5.13.0-rc1-01824-g3a81c0495fdb)
MSR:  9000   CR: 24002002  XER: 2000
DEAR: 7c0802a6 ESR: 

NIP [c002e9f0] emulate_step+0x28/0x324
LR [c0011858] optinsn_slot+0x128/0x1
Call Trace:
 opt_pre_handler+0x7c/0xb4 (unreliable)
 optinsn_slot+0x128/0x1
 ret_from_syscall+0x0/0x28

The offending instruction is:
81 24 00 00 lwz r9,0(r4)

Here, we are trying to load the second argument to emulate_step():
struct ppc_inst, which is the instruction to be emulated. On ppc64,
structures are passed in registers when passed by value. However, per
the ppc32 ABI, structures are always passed to functions as pointers.
This isn't being adhered to when setting up the call to emulate_step()
in the optprobe trampoline. Fix the same.

Fixes: eacf4c0202654a ("powerpc: Enable OPTPROBES on PPC32")
Cc: sta...@vger.kernel.org
Signed-off-by: Naveen N. Rao 
---
v2: Rebased on powerpc/merge 7f030e9d57b8
Signed-off-by: Christophe Leroy 


Thanks for rebasing this!

I think git am drops everything after three dashes, so applying this 
patch will drop your SOB. The recommended style (*) for adding a 
changelog is to include it within [] before the second SOB.



- Naveen

(*) https://www.kernel.org/doc/html/latest/maintainer/modifying-patches.html



Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-24 Thread Aneesh Kumar K.V

On 6/24/21 4:03 PM, Laurent Dufour wrote:

Hi Aneesh,

A little bit of wordsmithing below...

Le 17/06/2021 à 18:51, Aneesh Kumar K.V a écrit :
PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
  Documentation/powerpc/associativity.rst   | 135 
  arch/powerpc/include/asm/firmware.h   |   3 +-
  arch/powerpc/include/asm/prom.h   |   1 +
  arch/powerpc/kernel/prom_init.c   |   3 +-
  arch/powerpc/mm/numa.c    | 149 +-
  arch/powerpc/platforms/pseries/firmware.c |   1 +
  6 files changed, 286 insertions(+), 6 deletions(-)
  create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst

new file mode 100644
index ..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+
+NUMA resource associativity
+=
+
+Associativity represents the groupings of the various platform 
resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources 
subsets
+are represented as being members of a sub-grouping domain. This 
performance
+characteristic is presented in terms of NUMA node distance within the Linux kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating 
these resource
+grouping details to the OS. These are referred to as Form 0, Form 1 
and Form2
+associativity grouping. Form 0 is the older format and is now 
considered deprecated.

+
+Hypervisor indicates the type/form of associativity used via 
"ibm,arcitecture-vec-5 property".

    architecture ^



fixed

+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates 
usage of Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+-
+With Form 1 a combination of ibm,associativity-reference-points and 
ibm,associativity
+device tree properties are used to determine the NUMA distance 
between resource groups/domains.

+
+The “ibm,associativity” property contains one or more lists of numbers (domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains one or 
more list of numbers
+(domainID index) that represents the 1 based ordinal in the 
associativity lists.
+The list of domainID index represnets increasing hierachy of resource 
grouping.

     represents ^



fixed


+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID 
index.. }

+
+Linux kernel uses the domainID at the primary domainID index as the 
NUMA node id.
+Linux kernel computes NUMA distance between two domains by 
recursively comparing
+if they belong to the same higher-level domains. For mismatch at 
every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+---
+Form 2 associativity format adds separate device tree properties 
representing NUMA node distance
+thereby making the node distance computation flexible. Form 2 also 
allows flexible primary
+domain numbering. With numa distance computation now detached from 
the index value of
+"ibm,associativity" property, Form 2 allows a large number of primary 
domain ids at the
+same domainID index representing resource groups of different 
performance/latency characteristics.

+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains one or more list 
numbers representing
+the domainIDs present in the system. The offset of the domainID in this property is considered
+the domainID index.
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
+N domainID encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index 
for domainID 8 is 1.

+
+"ibm,numa-distance-table" property contains one or more list of 
numbers representing the NUMA

+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the 
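
Applying the Form 1 doubling rule quoted earlier to a small example may
help (values invented):

/*
 * ibm,associativity (A)              = { 4, 6, 7 }
 * ibm,associativity (B)              = { 4, 9, 16 }
 * ibm,associativity-reference-points = { 3, 2 }
 *   (primary domainID index 3: A is node 7, B is node 16)
 *
 * Walking the reference points: the node ids differ (7 vs 16), so
 * LOCAL_DISTANCE (10) doubles to 20; the index-2 domains differ too
 * (6 vs 9), so 20 doubles to 40, giving node_distance(7, 16) = 40.
 * Had B been { 4, 6, 16 }, the walk would stop at the matching
 * index-2 domain and yield 20.
 */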

[PATCH] ASoC: fsl_xcvr: Omit superfluous error message in fsl_xcvr_probe()

2021-06-24 Thread Tang Bin
In the function fsl_xcvr_probe(), when getting the IRQ fails,
the function platform_get_irq() already logs an error message, so remove
the redundant message here.

Signed-off-by: Tang Bin 
---
 sound/soc/fsl/fsl_xcvr.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/sound/soc/fsl/fsl_xcvr.c b/sound/soc/fsl/fsl_xcvr.c
index 5e8284db857b..711d738f8de1 100644
--- a/sound/soc/fsl/fsl_xcvr.c
+++ b/sound/soc/fsl/fsl_xcvr.c
@@ -1190,10 +1190,8 @@ static int fsl_xcvr_probe(struct platform_device *pdev)
 
/* get IRQs */
irq = platform_get_irq(pdev, 0);
-   if (irq < 0) {
-   dev_err(dev, "no irq[0]: %d\n", irq);
+   if (irq < 0)
return irq;
-   }
 
ret = devm_request_irq(dev, irq, irq0_isr, 0, pdev->name, xcvr);
if (ret) {
-- 
2.18.2





Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

2021-06-24 Thread Nicholas Piggin
Excerpts from Marc Zyngier's message of June 24, 2021 8:06 pm:
> On Thu, 24 Jun 2021 09:58:00 +0100,
> Nicholas Piggin  wrote:
>> 
>> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
>> > From: David Stevens 
>> >  out_unlock:
>> >if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
>> >read_unlock(&vcpu->kvm->mmu_lock);
>> >else
>> >write_unlock(&vcpu->kvm->mmu_lock);
>> > -  kvm_release_pfn_clean(pfn);
>> > +  if (pfnpg.page)
>> > +  put_page(pfnpg.page);
>> >return r;
>> >  }
>> 
>> How about
>> 
>>   kvm_release_pfn_page_clean(pfnpg);
> 
> I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
> doesn't mark the page 'clean'. I find put_page() more correct.
> 
> Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
> bad at naming things that I could just as well call it 'bob()'.

That seems like a fine name to me. A little better than bob.

Thanks,
Nick


Re: [PATCH 4/6] KVM: arm64/mmu: avoid struct page in MMU

2021-06-24 Thread Marc Zyngier
On Thu, 24 Jun 2021 04:57:47 +0100,
David Stevens  wrote:
> 
> From: David Stevens 
> 
> Avoid converting pfns returned by follow_fault_pfn to struct pages to
> transiently take a reference. The reference was originally taken to
> match the reference taken by gup. However, pfns returned by
> follow_fault_pfn may not have a struct page set up for reference
> counting.
> 
> Signed-off-by: David Stevens 
> ---
>  arch/arm64/kvm/mmu.c | 43 +++
>  1 file changed, 23 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 896b3644b36f..a741972cb75f 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c

[...]

> @@ -968,16 +968,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>*/
>   if (vma_pagesize == PAGE_SIZE && !force_pte)
>   vma_pagesize = transparent_hugepage_adjust(memslot, hva,
> -&pfn, &fault_ipa);
> +&pfnpg.pfn, &fault_ipa);
>   if (writable)
>   prot |= KVM_PGTABLE_PROT_W;
>  
>   if (fault_status != FSC_PERM && !device)
> - clean_dcache_guest_page(pfn, vma_pagesize);
> + clean_dcache_guest_page(pfnpg.pfn, vma_pagesize);
>  
>   if (exec_fault) {
>   prot |= KVM_PGTABLE_PROT_X;
> - invalidate_icache_guest_page(pfn, vma_pagesize);
> + invalidate_icache_guest_page(pfnpg.pfn, vma_pagesize);

This is going to clash with what is currently in -next, specially with
MTE.

Paolo, if you really want to take this in 5.13, can you please push a
stable branch based on -rc4 or older so that I can merge it in and
test it?

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Nicholas Piggin
Excerpts from Paolo Bonzini's message of June 24, 2021 8:21 pm:
> On 24/06/21 12:17, Nicholas Piggin wrote:
>>> If all callers were updated that is one thing, but from the changelog
>>> it sounds like that would not happen and there would be some gfn_to_pfn
>>> users left over.
>>>
>>> But yes in the end you would either need to make gfn_to_pfn never return
>>> a page found via follow_pte, or change all callers to the new way. If
>>> the plan is for the latter then I guess that's fine.
>>
>> Actually in that case anyway I don't see the need -- the existence of
>> gfn_to_pfn is enough to know it might be buggy. It can just as easily
>> be grepped for as kvm_pfn_page_unwrap.
> 
> Sure, but that would leave us with longer function names 
> (gfn_to_pfn_page* instead of gfn_to_pfn*).  So the "safe" use is the one 
> that looks worse and the unsafe use is the one that looks safe.

The churn isn't justified because of function name length. Choose g2pp() 
if you want a non-descriptive but short name.

The existing name isn't good anyway because it not only looks up a pfn 
but also a page, and more importantly it gets a ref on the page. The
name should be changed if you introduce a new API.

>> And are gfn_to_page cases also
>> vulnerable to the same issue?
> 
> No, they're just broken for the VM_IO|VM_PFNMAP case.

No they aren't vulnerable, or they are vulnerable but also broken in 
other cases?

Thanks,
Nick


Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Nicholas Piggin
Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
> follow_pte in gfn_to_pfn. However, the resolved pfns may not have
> associated struct pages, so they should not be passed to pfn_to_page.
> This series removes such calls from the x86 and arm64 secondary MMU. To
> do this, this series modifies gfn_to_pfn to return a struct page in
> addition to a pfn, if the hva was resolved by gup. This allows the
> caller to call put_page only when necessitated by gup.
> 
> This series provides a helper function that unwraps the new return type
> of gfn_to_pfn to provide behavior identical to the old behavior. As I
> have no hardware to test powerpc/mips changes, the function is used
> there for minimally invasive changes. Additionally, as gfn_to_page and
> gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
> easily changed over to only use pfns.
> 
> This addresses CVE-2021-22543 on x86 and arm64.

Does this fix the problem? (untested, as I don't have a POC setup at hand,
but at least in concept)

I have no problem with improving the API, and the direction of your 
series is probably good. But there seems to be a lot of unfixed arch code 
and broken APIs remaining to fix after your series too. This might be 
the most suitable patch to backport, and a base for your series, which 
can take more time to convert to the new APIs.

Thanks,
Nick

---

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6a6bc7af0e28..e208c279d903 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2104,13 +2104,21 @@ static int hva_to_pfn_remapped(struct vm_area_struct 
*vma,
 * Whoever called remap_pfn_range is also going to call e.g.
 * unmap_mapping_range before the underlying pages are freed,
 * causing a call to our MMU notifier.
+*
+* Certain IO or PFNMAP mappings can be backed with valid
+* struct pages, but be allocated without refcounting e.g.,
+* tail pages of non-compound higher order allocations, which
+* would then underflow the refcount when the caller does the
+* required put_page. Don't allow those pages here.
 */ 
-   kvm_get_pfn(pfn);
+   if (!kvm_try_get_pfn(pfn))
+   r = -EFAULT;
 
 out:
pte_unmap_unlock(ptep, ptl);
*p_pfn = pfn;
-   return 0;
+
+   return r;
 }
 
 /*
@@ -2487,6 +2495,13 @@ void kvm_set_pfn_accessed(kvm_pfn_t pfn)
 }
 EXPORT_SYMBOL_GPL(kvm_set_pfn_accessed);
 
+static int kvm_try_get_pfn(kvm_pfn_t pfn)
+{
+   if (kvm_is_reserved_pfn(pfn))
+   return 1;
+   return get_page_unless_zero(pfn_to_page(pfn));
+}
+
 void kvm_get_pfn(kvm_pfn_t pfn)
 {
if (!kvm_is_reserved_pfn(pfn))


Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-24 Thread Laurent Dufour

Hi Aneesh,

A little bit of wordsmithing below...

Le 17/06/2021 à 18:51, Aneesh Kumar K.V a écrit :

PAPR interface currently supports two different ways of communicating resource
grouping details to the OS. These are referred to as Form 0 and Form 1
associativity grouping. Form 0 is the older format and is now considered
deprecated. This patch adds another resource grouping named FORM2.

Signed-off-by: Daniel Henrique Barboza 
Signed-off-by: Aneesh Kumar K.V 
---
  Documentation/powerpc/associativity.rst   | 135 
  arch/powerpc/include/asm/firmware.h   |   3 +-
  arch/powerpc/include/asm/prom.h   |   1 +
  arch/powerpc/kernel/prom_init.c   |   3 +-
  arch/powerpc/mm/numa.c| 149 +-
  arch/powerpc/platforms/pseries/firmware.c |   1 +
  6 files changed, 286 insertions(+), 6 deletions(-)
  create mode 100644 Documentation/powerpc/associativity.rst

diff --git a/Documentation/powerpc/associativity.rst 
b/Documentation/powerpc/associativity.rst
new file mode 100644
index ..93be604ac54d
--- /dev/null
+++ b/Documentation/powerpc/associativity.rst
@@ -0,0 +1,135 @@
+
+NUMA resource associativity
+=
+
+Associativity represents the groupings of the various platform resources into
+domains of substantially similar mean performance relative to resources outside
+of that domain. Resources subsets of a given domain that exhibit better
+performance relative to each other than relative to other resources subsets
+are represented as being members of a sub-grouping domain. This performance
+characteristic is presented in terms of NUMA node distance within the Linux 
kernel.
+From the platform view, these groups are also referred to as domains.
+
+PAPR interface currently supports different ways of communicating these 
resource
+grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
+associativity grouping. Form 0 is the older format and is now considered 
deprecated.
+
+Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 
property".

   architecture ^


+Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
Form 0 or Form 1.
+A value of 1 indicates the usage of Form 1 associativity. For Form 2 
associativity
+bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
+
+Form 0
+-
+Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
+
+Form 1
+-
+With Form 1 a combination of ibm,associativity-reference-points and 
ibm,associativity
+device tree properties are used to determine the NUMA distance between 
resource groups/domains.
+
+The “ibm,associativity” property contains one or more lists of numbers 
(domainID)
+representing the resource’s platform grouping domains.
+
+The “ibm,associativity-reference-points” property contains one or more list of 
numbers
+(domainID index) that represents the 1 based ordinal in the associativity 
lists.
+The list of domainID index represnets increasing hierachy of resource grouping.

represents ^


+
+ex:
+{ primary domainID index, secondary domainID index, tertiary domainID index.. }
+
+Linux kernel uses the domainID at the primary domainID index as the NUMA node 
id.
+Linux kernel computes NUMA distance between two domains by recursively 
comparing
+if they belong to the same higher-level domains. For mismatch at every higher
+level of the resource group, the kernel doubles the NUMA distance between the
+comparing domains.
+
+Form 2
+---
+Form 2 associativity format adds separate device tree properties representing 
NUMA node distance
+thereby making the node distance computation flexible. Form 2 also allows 
flexible primary
+domain numbering. With numa distance computation now detached from the index 
value of
+"ibm,associativity" property, Form 2 allows a large number of primary domain 
ids at the
+same domainID index representing resource groups of different 
performance/latency characteristics.
+
+Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in 
the
+"ibm,architecture-vec-5" property.
+
+"ibm,numa-lookup-index-table" property contains one or more list numbers 
representing
+the domainIDs present in the system. The offset of the domainID in this 
property is considered
+the domainID index.
+
+prop-encoded-array: The number N of the domainIDs encoded as with encode-int, 
followed by
+N domainID encoded as with encode-int
+
+For ex:
+ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for 
domainID 8 is 1.
+
+"ibm,numa-distance-table" property contains one or more list of numbers 
representing the NUMA
+distance between resource groups/domains present in the system.
+
+prop-encoded-array: The number N of the distance values encoded as with 
encode-int, followed by
+N distance values encoded as with 

Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Paolo Bonzini

On 24/06/21 12:17, Nicholas Piggin wrote:

If all callers were updated that is one thing, but from the changelog
it sounds like that would not happen and there would be some gfn_to_pfn
users left over.

But yes in the end you would either need to make gfn_to_pfn never return
a page found via follow_pte, or change all callers to the new way. If
the plan is for the latter then I guess that's fine.


Actually in that case anyway I don't see the need -- the existence of
gfn_to_pfn is enough to know it might be buggy. It can just as easily
be grepped for as kvm_pfn_page_unwrap.


Sure, but that would leave us with longer function names 
(gfn_to_pfn_page* instead of gfn_to_pfn*).  So the "safe" use is the one 
that looks worse and the unsafe use is the one that looks safe.



And are gfn_to_page cases also
vulnerable to the same issue?


No, they're just broken for the VM_IO|VM_PFNMAP case.

Paolo


So I think it could be marked deprecated or something if not everything
will be converted in the one series, and don't need to touch all that
arch code with this patch.




Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Nicholas Piggin
Excerpts from Nicholas Piggin's message of June 24, 2021 7:57 pm:
> Excerpts from Paolo Bonzini's message of June 24, 2021 7:42 pm:
>> On 24/06/21 10:52, Nicholas Piggin wrote:
 For now, wrap all calls to gfn_to_pfn functions in the new helper
 function. Callers which don't need the page struct will be updated in
 follow-up patches.
>>> Hmm. You mean callers that do need the page will be updated? Normally
>>> if there will be leftover users that don't need the struct page then
>>> you would go the other way and keep the old call the same, and add a new
>>> one (gfn_to_pfn_page) just for those that need it.
>> 
>> Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so 
>> it's a good idea to move the short name to the common case and the ugly 
>> kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not 
>> sure there should be any kvm_pfn_page_unwrap in the end.
> 
> If all callers were updated that is one thing, but from the changelog
> it sounds like that would not happen and there would be some gfn_to_pfn
> users left over.
> 
> But yes in the end you would either need to make gfn_to_pfn never return
> a page found via follow_pte, or change all callers to the new way. If 
> the plan is for the latter then I guess that's fine.

Actually in that case anyway I don't see the need -- the existence of
gfn_to_pfn is enough to know it might be buggy. It can just as easily
be grepped for as kvm_pfn_page_unwrap. And are gfn_to_page cases also
vulnerable to the same issue?

So I think it could be marked deprecated or something if not everything 
will be converted in the one series, and don't need to touch all that 
arch code with this patch.

Thanks,
Nick


Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 12:06, Marc Zyngier wrote:

On Thu, 24 Jun 2021 09:58:00 +0100,
Nicholas Piggin  wrote:


Excerpts from David Stevens's message of June 24, 2021 1:57 pm:

From: David Stevens 
  out_unlock:
if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
read_unlock(&vcpu->kvm->mmu_lock);
else
write_unlock(&vcpu->kvm->mmu_lock);
-   kvm_release_pfn_clean(pfn);
+   if (pfnpg.page)
+   put_page(pfnpg.page);
return r;
  }


How about

   kvm_release_pfn_page_clean(pfnpg);


I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
doesn't mark the page 'clean'. I find put_page() more correct.

Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
bad at naming things that I could just as well call it 'bob()'.


The best way to go would be to get rid of kvm_release_pfn_clean() and 
always go through a pfn_page.  Then we could or could not introduce 
wrappers kvm_put_pfn_page{,_dirty}.


I think for now it's best to limit the churn since these patches will go 
in the stable releases too, and clean up the resulting API once we have 
a clear idea of how all architectures are using kvm_pfn_page.


Paolo



Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Paolo Bonzini

On 24/06/21 11:57, Nicholas Piggin wrote:

Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so
it's a good idea to move the short name to the common case and the ugly
kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not
sure there should be any kvm_pfn_page_unwrap in the end.


If all callers were updated that is one thing, but from the changelog
it sounds like that would not happen and there would be some gfn_to_pfn
users left over.


In these patches there are, so yeah the plan is to always change the 
callers to the new way.



But yes in the end you would either need to make gfn_to_pfn never return
a page found via follow_pte, or change all callers to the new way. If
the plan is for the latter then I guess that's fine.




Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

2021-06-24 Thread Marc Zyngier
On Thu, 24 Jun 2021 09:58:00 +0100,
Nicholas Piggin  wrote:
> 
> Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> > From: David Stevens 
> >  out_unlock:
> > if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
> > read_unlock(&vcpu->kvm->mmu_lock);
> > else
> > write_unlock(&vcpu->kvm->mmu_lock);
> > -   kvm_release_pfn_clean(pfn);
> > +   if (pfnpg.page)
> > +   put_page(pfnpg.page);
> > return r;
> >  }
> 
> How about
> 
>   kvm_release_pfn_page_clean(pfnpg);

I'm not sure. I always found kvm_release_pfn_clean() ugly, because it
doesn't mark the page 'clean'. I find put_page() more correct.

Something like 'kvm_put_pfn_page()' would make more sense, but I'm so
bad at naming things that I could just as well call it 'bob()'.

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Nicholas Piggin
Excerpts from Paolo Bonzini's message of June 24, 2021 7:42 pm:
> On 24/06/21 10:52, Nicholas Piggin wrote:
>>> For now, wrap all calls to gfn_to_pfn functions in the new helper
>>> function. Callers which don't need the page struct will be updated in
>>> follow-up patches.
>> Hmm. You mean callers that do need the page will be updated? Normally
>> if there will be leftover users that don't need the struct page then
>> you would go the other way and keep the old call the same, and add a new
>> one (gfn_to_pfn_page) just for those that need it.
> 
> Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so 
> it's a good idea to move the short name to the common case and the ugly 
> kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not 
> sure there should be any kvm_pfn_page_unwrap in the end.

If all callers were updated that is one thing, but from the changelog
it sounds like that would not happen and there would be some gfn_to_pfn
users left over.

But yes in the end you would either need to make gfn_to_pfn never return
a page found via follow_pte, or change all callers to the new way. If 
the plan is for the latter then I guess that's fine.

Thanks,
Nick


Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns

2021-06-24 Thread Paolo Bonzini

On 24/06/21 10:43, Nicholas Piggin wrote:

Excerpts from David Stevens's message of June 24, 2021 1:57 pm:

From: David Stevens 


Changelog? This looks like a bug, should it have a Fixes: tag?


Probably has been there forever... The best way to fix the bug would be 
to nuke mmu_audit.c, which I've threatened to do many times but never 
followed up on.


Paolo


Thanks,
Nick



Signed-off-by: David Stevens 
---
  arch/x86/kvm/mmu/mmu_audit.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index cedc17b2f60e..97ff184084b4 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -121,6 +121,8 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 
*sptep, int level)
audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
 "ent %llxn", vcpu->arch.mmu->root_level, pfn,
 hpa, *sptep);
+
+   kvm_release_pfn_clean(pfn);
  }
  
  static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)

--
2.32.0.93.g670b81a890-goog








Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Paolo Bonzini

On 24/06/21 10:52, Nicholas Piggin wrote:

For now, wrap all calls to gfn_to_pfn functions in the new helper
function. Callers which don't need the page struct will be updated in
follow-up patches.

Hmm. You mean callers that do need the page will be updated? Normally
if there will be leftover users that don't need the struct page then
you would go the other way and keep the old call the same, and add a new
one (gfn_to_pfn_page) just for those that need it.


Needing kvm_pfn_page_unwrap is a sign that something might be buggy, so 
it's a good idea to move the short name to the common case and the ugly 
kvm_pfn_page_unwrap(gfn_to_pfn(...)) for the weird one.  In fact I'm not 
sure there should be any kvm_pfn_page_unwrap in the end.


Paolo



Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Marc Zyngier
Hi David,

On Thu, 24 Jun 2021 04:57:45 +0100,
David Stevens  wrote:
> 
> From: David Stevens 
> 
> Return a struct kvm_pfn_page containing both a pfn and an optional
> struct page from the gfn_to_pfn family of functions. This differentiates
> the gup and follow_fault_pfn cases, which allows callers that only need
> a pfn to avoid touching the page struct in the latter case. For callers
> that need a struct page, introduce a helper function that unwraps a
> struct kvm_pfn_page into a struct page. This helper makes the call to
> kvm_get_pfn which had previously been in hva_to_pfn_remapped.
> 
> For now, wrap all calls to gfn_to_pfn functions in the new helper
> function. Callers which don't need the page struct will be updated in
> follow-up patches.
> 
> Signed-off-by: David Stevens 
> ---
>  arch/arm64/kvm/mmu.c   |   5 +-
>  arch/mips/kvm/mmu.c|   3 +-
>  arch/powerpc/kvm/book3s.c  |   3 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c|   5 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |   5 +-
>  arch/powerpc/kvm/book3s_hv_uvmem.c |   4 +-
>  arch/powerpc/kvm/e500_mmu_host.c   |   2 +-
>  arch/x86/kvm/mmu/mmu.c |  11 ++-
>  arch/x86/kvm/mmu/mmu_audit.c   |   2 +-
>  arch/x86/kvm/x86.c |   2 +-
>  drivers/gpu/drm/i915/gvt/kvmgt.c   |   2 +-
>  include/linux/kvm_host.h   |  27 --
>  include/linux/kvm_types.h  |   5 +
>  virt/kvm/kvm_main.c| 121 +
>  14 files changed, 109 insertions(+), 88 deletions(-)
> 

[...]

> +kvm_pfn_t kvm_pfn_page_unwrap(struct kvm_pfn_page pfnpg)
> +{
> + if (pfnpg.page)
> + return pfnpg.pfn;
> +
> + kvm_get_pfn(pfnpg.pfn);
> + return pfnpg.pfn;
> +}
> +EXPORT_SYMBOL_GPL(kvm_pfn_page_unwrap);

I'd really like to see a tiny bit of documentation explaining that
calls to kvm_pfn_page_unwrap() are not idempotent.
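
For instance, a comment along these lines (wording only a suggestion):

	/*
	 * kvm_pfn_page_unwrap() - return the pfn, taking a reference on it
	 * when the lookup did not come from gup (pfnpg.page == NULL).
	 *
	 * Not idempotent: every call on a page-less kvm_pfn_page elevates
	 * the pfn's refcount, so call it exactly once per lookup.
	 */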

Otherwise, looks good to me.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

2021-06-24 Thread Nicholas Piggin
Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> From: David Stevens 
>  out_unlock:
>   if (is_tdp_mmu_root(vcpu->kvm, vcpu->arch.mmu->root_hpa))
>   read_unlock(&vcpu->kvm->mmu_lock);
>   else
>   write_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(pfn);
> + if (pfnpg.page)
> + put_page(pfnpg.page);
>   return r;
>  }

How about

  kvm_release_pfn_page_clean(pfnpg);

Thanks,
Nick


Re: [PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-24 Thread Nicholas Piggin
Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> From: David Stevens 
> 
> Return a struct kvm_pfn_page containing both a pfn and an optional
> struct page from the gfn_to_pfn family of functions. This differentiates
> the gup and follow_fault_pfn cases, which allows callers that only need
> a pfn to avoid touching the page struct in the latter case. For callers
> that need a struct page, introduce a helper function that unwraps a
> struct kvm_pfn_page into a struct page. This helper makes the call to
> kvm_get_pfn which had previously been in hva_to_pfn_remapped.
> 
> For now, wrap all calls to gfn_to_pfn functions in the new helper
> function. Callers which don't need the page struct will be updated in
> follow-up patches.

Hmm. You mean callers that do need the page will be updated? Normally 
if there will be leftover users that don't need the struct page then
you would go the other way and keep the old call the same, and add a new
one (gfn_to_pfn_page) just for those that need it.

Most kernel code I look at passes back multiple values by updating 
pointers to struct or variables rather than returning a struct; I
suppose that's not really a big deal and a matter of taste.
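
For comparison, the two styles look roughly like this (signatures
illustrative only, not from the series):

	/* Style used by this series: return a small struct by value. */
	struct kvm_pfn_page gfn_to_pfn_page(struct kvm *kvm, gfn_t gfn);

	/* More common kernel style: pfn returned, page via out-pointer. */
	kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn, struct page **page);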

Thanks,
Nick



Re: [PATCH 3/3] powerpc/pseries: fail quicker in dlpar_memory_add_by_ic()

2021-06-24 Thread Laurent Dufour

On 22/06/2021 at 15:39, Daniel Henrique Barboza wrote:

The validation done at the start of dlpar_memory_add_by_ic() is an all
or nothing scenario - if any LMB in the range is marked as RESERVED we
can fail right away.

We then can remove the 'lmbs_available' var and its check with
'lmbs_to_add' since the whole LMB range was already validated in the
previous step.


Reviewed-by: Laurent Dufour 


Signed-off-by: Daniel Henrique Barboza 
---
  arch/powerpc/platforms/pseries/hotplug-memory.c | 14 ++
  1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c0a03e1537cb..377d852f5a9a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -796,7 +796,6 @@ static int dlpar_memory_add_by_index(u32 drc_index)
  static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 drc_index)
  {
struct drmem_lmb *lmb, *start_lmb, *end_lmb;
-   int lmbs_available = 0;
int rc;
  
  	pr_info("Attempting to hot-add %u LMB(s) at index %x\n",

@@ -811,15 +810,14 @@ static int dlpar_memory_add_by_ic(u32 lmbs_to_add, u32 
drc_index)
  
  	/* Validate that the LMBs in this range are not reserved */

for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
-   if (lmb->flags & DRCONF_MEM_RESERVED)
-   break;
-
-   lmbs_available++;
+   /* Fail immediately if the whole range can't be hot-added */
+   if (lmb->flags & DRCONF_MEM_RESERVED) {
+   pr_err("Memory at %llx (drc index %x) is reserved\n",
+   lmb->base_addr, lmb->drc_index);
+   return -EINVAL;
+   }
}
  
-	if (lmbs_available < lmbs_to_add)

-   return -EINVAL;
-
for_each_drmem_lmb_in_range(lmb, start_lmb, end_lmb) {
if (lmb->flags & DRCONF_MEM_ASSIGNED)
continue;





Re: [PATCH 2/3] powerpc/pseries: break early in dlpar_memory_add_by_count() loops

2021-06-24 Thread Laurent Dufour

On 22/06/2021 at 15:39, Daniel Henrique Barboza wrote:

After a successful dlpar_add_lmb() call the LMB is marked as reserved.
Later on, depending on whether we added enough LMBs or not, we rely on
the marked LMBs to see which ones might need to be removed, and we
remove the reservation of all of them.

These are done in for_each_drmem_lmb() loops without any break
condition. This means that we're going to check all LMBs of the partition
even after going through all the reserved ones.

This patch adds break conditions in both loops to avoid this. The
'lmbs_added' variable was renamed to 'lmbs_reserved', and it's now
being decremented each time an LMB reservation is removed, indicating
if there are still marked LMBs to be processed.


Reviewed-by: Laurent Dufour 


Signed-off-by: Daniel Henrique Barboza 
---
  arch/powerpc/platforms/pseries/hotplug-memory.c | 17 -
  1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 28a7fd90232f..c0a03e1537cb 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -673,7 +673,7 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
  {
struct drmem_lmb *lmb;
int lmbs_available = 0;
-   int lmbs_added = 0;
+   int lmbs_reserved = 0;
int rc;
  
  	pr_info("Attempting to hot-add %d LMB(s)\n", lmbs_to_add);

@@ -714,13 +714,12 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
 * requested LMBs cannot be added.
 */
drmem_mark_lmb_reserved(lmb);
-
-   lmbs_added++;
-   if (lmbs_added == lmbs_to_add)
+   lmbs_reserved++;
+   if (lmbs_reserved == lmbs_to_add)
break;
}
  
-	if (lmbs_added != lmbs_to_add) {

+   if (lmbs_reserved != lmbs_to_add) {
pr_err("Memory hot-add failed, removing any added LMBs\n");
  
  		for_each_drmem_lmb(lmb) {

@@ -735,6 +734,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
dlpar_release_drc(lmb->drc_index);
  
  			drmem_remove_lmb_reservation(lmb);

+   lmbs_reserved--;
+
+   if (lmbs_reserved == 0)
+   break;
}
rc = -EINVAL;
} else {
@@ -745,6 +748,10 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
pr_debug("Memory at %llx (drc index %x) was 
hot-added\n",
 lmb->base_addr, lmb->drc_index);
drmem_remove_lmb_reservation(lmb);
+   lmbs_reserved--;
+
+   if (lmbs_reserved == 0)
+   break;
}
rc = 0;
}





Re: [PATCH 1/6] KVM: x86/mmu: release audited pfns

2021-06-24 Thread Nicholas Piggin
Excerpts from David Stevens's message of June 24, 2021 1:57 pm:
> From: David Stevens 

Changelog? This looks like a bug, should it have a Fixes: tag?

Thanks,
Nick

> 
> Signed-off-by: David Stevens 
> ---
>  arch/x86/kvm/mmu/mmu_audit.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
> index cedc17b2f60e..97ff184084b4 100644
> --- a/arch/x86/kvm/mmu/mmu_audit.c
> +++ b/arch/x86/kvm/mmu/mmu_audit.c
> @@ -121,6 +121,8 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 
> *sptep, int level)
>   audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
>    "ent %llx\n", vcpu->arch.mmu->root_level, pfn,
>hpa, *sptep);
> +
> + kvm_release_pfn_clean(pfn);
>  }
>  
>  static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
> -- 
> 2.32.0.93.g670b81a890-goog
> 
> 


Re: [PATCH 1/3] powerpc/pseries: skip reserved LMBs in dlpar_memory_add_by_count()

2021-06-24 Thread Laurent Dufour

On 22/06/2021 at 15:39, Daniel Henrique Barboza wrote:

The function is counting reserved LMBs as available to be added, but
they aren't. This will cause the function to miscalculate the available
LMBs and can trigger errors later on when executing dlpar_add_lmb().


Indeed, I'm wondering if dlpar_add_lmb() would fail in that case, so it's
even better to check for that flag earlier.


Reviewed-by: Laurent Dufour 


Signed-off-by: Daniel Henrique Barboza 
---
  arch/powerpc/platforms/pseries/hotplug-memory.c | 3 +++
  1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 36f66556a7c6..28a7fd90232f 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -683,6 +683,9 @@ static int dlpar_memory_add_by_count(u32 lmbs_to_add)
  
  	/* Validate that there are enough LMBs to satisfy the request */

for_each_drmem_lmb(lmb) {
+   if (lmb->flags & DRCONF_MEM_RESERVED)
+   continue;
+
if (!(lmb->flags & DRCONF_MEM_ASSIGNED))
lmbs_available++;
  





Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-24 Thread Aneesh Kumar K.V
David Gibson  writes:

> On Thu, Jun 17, 2021 at 10:21:05PM +0530, Aneesh Kumar K.V wrote:
>> PAPR interface currently supports two different ways of communicating 
>> resource
>> grouping details to the OS. These are referred to as Form 0 and Form 1
>> associativity grouping. Form 0 is the older format and is now considered
>> deprecated. This patch adds another resource grouping named FORM2.
>> 
>> Signed-off-by: Daniel Henrique Barboza 
>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  Documentation/powerpc/associativity.rst   | 135 
>>  arch/powerpc/include/asm/firmware.h   |   3 +-
>>  arch/powerpc/include/asm/prom.h   |   1 +
>>  arch/powerpc/kernel/prom_init.c   |   3 +-
>>  arch/powerpc/mm/numa.c| 149 +-
>>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>>  6 files changed, 286 insertions(+), 6 deletions(-)
>>  create mode 100644 Documentation/powerpc/associativity.rst
>> 
>> diff --git a/Documentation/powerpc/associativity.rst 
>> b/Documentation/powerpc/associativity.rst
>> new file mode 100644
>> index ..93be604ac54d
>> --- /dev/null
>> +++ b/Documentation/powerpc/associativity.rst
>> @@ -0,0 +1,135 @@
>> +
>> +NUMA resource associativity
>> +=
>> +
>> +Associativity represents the groupings of the various platform resources 
>> into
>> +domains of substantially similar mean performance relative to resources 
>> outside
>> +of that domain. Resource subsets of a given domain that exhibit better
>> +performance relative to each other than relative to other resource subsets
>> +are represented as being members of a sub-grouping domain. This performance
>> +characteristic is presented in terms of NUMA node distance within the Linux 
>> kernel.
>> +From the platform view, these groups are also referred to as domains.
>> +
>> +PAPR interface currently supports different ways of communicating these 
>> resource
>> +grouping details to the OS. These are referred to as Form 0, Form 1 and 
>> Form2
>> +associativity grouping. Form 0 is the older format and is now considered 
>> deprecated.
>> +
>> +Hypervisor indicates the type/form of associativity used via 
>> "ibm,arcitecture-vec-5 property".
>> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
>> Form 0 or Form 1.
>> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 
>> associativity
>> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>> +
>> +Form 0
>> +-
>> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
>> +
>> +Form 1
>> +-
>> +With Form 1 a combination of ibm,associativity-reference-points and 
>> ibm,associativity
>> +device tree properties are used to determine the NUMA distance between 
>> resource groups/domains.
>> +
>> +The “ibm,associativity” property contains one or more lists of numbers 
>> (domainID)
>> +representing the resource’s platform grouping domains.
>> +
>> +The “ibm,associativity-reference-points” property contains one or more list 
>> of numbers
>> +(domainID index) that represents the 1 based ordinal in the associativity 
>> lists.
>> +The list of domainID index represnets increasing hierachy of
>> resource grouping.
>
> Typo "represnets".  Also s/hierachy/hierarchy/
>
>> +
>> +ex:
>> +{ primary domainID index, secondary domainID index, tertiary domainID 
>> index.. }
>
>> +Linux kernel uses the domainID at the primary domainID index as the NUMA 
>> node id.
>> +Linux kernel computes NUMA distance between two domains by recursively 
>> comparing
>> +if they belong to the same higher-level domains. For mismatch at every 
>> higher
>> +level of the resource group, the kernel doubles the NUMA distance between 
>> the
>> +comparing domains.
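
(For illustration, the doubling rule described above amounts to roughly
the following sketch; this is not the kernel's exact code:)

	/* Form 1: start from LOCAL_DISTANCE and double for each
	 * associativity level, from the top, at which the two
	 * resources' domainIDs still differ. */
	static int form1_distance(const u32 *a, const u32 *b, int depth)
	{
		int distance = LOCAL_DISTANCE;
		int i;

		for (i = 0; i < depth; i++) {
			if (a[i] == b[i])
				break;
			distance *= 2;
		}
		return distance;
	}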
>
> The Form1 description is still kinda confusing, but I don't really
> care.  Form1 *is* confusing, it's Form2 that I hope will be clearer.
>
>> +
>> +Form 2
>> +---
>> +Form 2 associativity format adds separate device tree properties 
>> representing NUMA node distance
>> +thereby making the node distance computation flexible. Form 2 also allows 
>> flexible primary
>> +domain numbering. With numa distance computation now detached from the 
>> index value of
>> +"ibm,associativity" property, Form 2 allows a large number of primary 
>> domain ids at the
>> +same domainID index representing resource groups of different 
>> performance/latency characteristics.
>
> So, I see you've removed the special handling of secondary IDs for pmem
> - big improvement, thanks.  IIUC, in this revised version, for Form2
> there's really no reason for ibm,associativity-reference-points to
> have >1 entry.  Is that right?
>
> In Form2 everything revolves around the primary domain ID - so much
> that I suggest we come up with a short name for it.  How about just
> "node id" since that's how Linux uses it.

We don't really refer to the primary domain ID in the rest of FORM2 

[RESEND-PATCH v2] powerpc/papr_scm: Add support for reporting dirty-shutdown-count

2021-06-24 Thread Vaibhav Jain
Persistent memory devices like NVDIMMs can lose cached writes in case
something prevents a flush on power failure. Such situations are termed
dirty shutdowns and are exposed to applications as a
last-shutdown-state (LSS) flag and a dirty-shutdown-counter (DSC), as
described in [1]. The latter is useful in conditions where multiple
applications want to detect a dirty shutdown event without racing with
one another.
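
For example (a sketch; read_dsc() and saved_dsc are hypothetical), each
application can persist its own snapshot of the counter and compare it
on startup:

	uint64_t cur = read_dsc(dimm);	/* read the current DSC value */
	if (cur != saved_dsc)		/* snapshot saved on a prior run */
		handle_persistence_loss();
	saved_dsc = cur;		/* persist for the next run */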

PAPR-NVDIMMs have so far only exposed LSS style flags to indicate a
dirty-shutdown-state. This patch further adds support for DSC via the
"ibm,persistence-failed-count" device tree property of an NVDIMM. This
property is a monotonically increasing 64-bit counter that indicates
the number of times an NVDIMM has encountered a dirty-shutdown event
causing persistence loss.

Since this value is not expected to change after system boot,
papr_scm reads & caches its value during NVDIMM probe and exposes it
as a PAPR sysfs attribute named 'dirty_shutdown', to match the
similarly named NFIT sysfs attribute. Also this value is available to
libnvdimm via PAPR_PDSM_HEALTH payload. 'struct nd_papr_pdsm_health'
has been extended to add a new member called 'dimm_dsc', presence of
which is indicated by the newly introduced PDSM_DIMM_DSC_VALID flag.

References:
[1] https://pmem.io/documents/Dirty_Shutdown_Handling-V1.0.pdf

Signed-off-by: Vaibhav Jain 
Reviewed-by: Aneesh Kumar K.V 
---
Changelog:

Resend:
* Added ack by Aneesh.
* Minor fix to changelog of v2 patch

v2:
* Rebased the patch on latest ppc-next tree
* s/psdm/pdsm/g   [ Santosh ]
---
 arch/powerpc/include/uapi/asm/papr_pdsm.h |  6 +
 arch/powerpc/platforms/pseries/papr_scm.c | 30 +++
 2 files changed, 36 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
b/arch/powerpc/include/uapi/asm/papr_pdsm.h
index 50ef95e2f5b1..82488b1e7276 100644
--- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
+++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
@@ -77,6 +77,9 @@
 /* Indicate that the 'dimm_fuel_gauge' field is valid */
 #define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
 
+/* Indicate that the 'dimm_dsc' field is valid */
+#define PDSM_DIMM_DSC_VALID 2
+
 /*
  * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
  * Various flags indicate the health status of the dimm.
@@ -105,6 +108,9 @@ struct nd_papr_pdsm_health {
 
/* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
__u16 dimm_fuel_gauge;
+
+   /* Extension flag PDSM_DIMM_DSC_VALID */
+   __u64 dimm_dsc;
};
__u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
};
diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 11e7b90a3360..eb8be0eb6119 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -114,6 +114,9 @@ struct papr_scm_priv {
/* Health information for the dimm */
u64 health_bitmap;
 
+   /* Holds the last known dirty shutdown counter value */
+   u64 dirty_shutdown_counter;
+
/* length of the stat buffer as expected by phyp */
size_t stat_buffer_len;
 };
@@ -603,6 +606,16 @@ static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
return rc;
 }
 
+/* Add the dirty-shutdown-counter value to the pdsm */
+static int papr_pdsm_dsc(struct papr_scm_priv *p,
+union nd_pdsm_payload *payload)
+{
+   payload->health.extension_flags |= PDSM_DIMM_DSC_VALID;
+   payload->health.dimm_dsc = p->dirty_shutdown_counter;
+
+   return sizeof(struct nd_papr_pdsm_health);
+}
+
 /* Fetch the DIMM health info and populate it in provided package. */
 static int papr_pdsm_health(struct papr_scm_priv *p,
union nd_pdsm_payload *payload)
@@ -646,6 +659,8 @@ static int papr_pdsm_health(struct papr_scm_priv *p,
 
/* Populate the fuel gauge meter in the payload */
papr_pdsm_fuel_gauge(p, payload);
+   /* Populate the dirty-shutdown-counter field */
+   papr_pdsm_dsc(p, payload);
 
rc = sizeof(struct nd_papr_pdsm_health);
 
@@ -907,6 +922,16 @@ static ssize_t flags_show(struct device *dev,
 }
 DEVICE_ATTR_RO(flags);
 
+static ssize_t dirty_shutdown_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct nvdimm *dimm = to_nvdimm(dev);
+   struct papr_scm_priv *p = nvdimm_provider_data(dimm);
+
+   return sysfs_emit(buf, "%llu\n", p->dirty_shutdown_counter);
+}
+DEVICE_ATTR_RO(dirty_shutdown);
+
 static umode_t papr_nd_attribute_visible(struct kobject *kobj,
 struct attribute *attr, int n)
 {
@@ -925,6 +950,7 @@ static umode_t papr_nd_attribute_visible(struct kobject 
*kobj,
 static struct attribute *papr_nd_attributes[] = {
&dev_attr_flags.attr,
&dev_attr_perf_stats.attr,
+   &dev_attr_dirty_shutdown.attr,

Re: linux-next: Signed-off-by missing for commit in the powerpc tree

2021-06-24 Thread Michael Ellerman
Stephen Rothwell  writes:
> Hi all,
>
> Commit
>
>   77bbbc0cf848 ("KVM: PPC: Book3S HV: Fix TLB management on SMT8 POWER9 and 
> POWER10 processors")
>
> is missing a Signed-off-by from its author.

That was actually deliberate.

Suraj wrote the patch when he was at IBM, but never sent it.

Paul & Nick then resurrected it from some internal tree, and are now
submitting it on behalf of IBM.

So they are asserting option b) of the DCO AIUI:

  (b) The contribution is based upon previous work that, to the best
  of my knowledge, is covered under an appropriate open source
  license and I have the right under that license to submit that
  work with modifications, whether created in whole or in part
  by me, under the same open source license (unless I am
  permitted to submit under a different license), as indicated
  in the file


cheers


Re: [PATCH] crypto: nx: Fix memcpy() over-reading in nonce

2021-06-24 Thread Herbert Xu
On Wed, Jun 16, 2021 at 01:34:59PM -0700, Kees Cook wrote:
> Fix typo in memcpy() where size should be CTR_RFC3686_NONCE_SIZE.
> 
> Fixes: 030f4e968741 ("crypto: nx - Fix reentrancy bugs")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Kees Cook 
> ---
>  drivers/crypto/nx/nx-aes-ctr.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 05:57, David Stevens wrote:

  static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-   int map_writable, int max_level, kvm_pfn_t pfn,
+   int map_writable, int max_level,
+   const struct kvm_pfn_page *pfnpg,
bool prefault, bool is_tdp)
  {
	bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();


So the main difference with my tentative patches is that here I was 
passing the struct by value.  I'll try to clean this up for 5.15, but 
for now it's all good like this.  Thanks again!


Paolo



Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-24 Thread Paolo Bonzini

On 24/06/21 05:57, David Stevens wrote:

KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
follow_pte in gfn_to_pfn. However, the resolved pfns may not have
associated struct pages, so they should not be passed to pfn_to_page.
This series removes such calls from the x86 and arm64 secondary MMU. To
do this, this series modifies gfn_to_pfn to return a struct page in
addition to a pfn, if the hva was resolved by gup. This allows the
caller to call put_page only when necessitated by gup.
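
A minimal sketch of the resulting calling convention (assuming the
series' struct kvm_pfn_page, carrying a pfn plus an optional page):

	struct kvm_pfn_page pfnpg = gfn_to_pfn(kvm, gfn);

	/* ... map pfnpg.pfn into the guest ... */

	if (pfnpg.page)		/* only gup-resolved pfns hold a reference */
		put_page(pfnpg.page);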

This series provides a helper function that unwraps the new return type
of gfn_to_pfn to provide behavior identical to the old behavior. As I
have no hardware to test powerpc/mips changes, the function is used
there for minimally invasive changes. Additionally, as gfn_to_page and
gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
easily changed over to only use pfns.

This addresses CVE-2021-22543 on x86 and arm64.


Thank you very much for this.  I agree that it makes sense to have a 
minimal change; I had similar changes almost ready, but was stuck with 
deadlocks in the gfn_to_pfn_cache case.  In retrospect I should have 
posted something similar to your patches.


I have started reviewing the patches, and they look good.  I will try to 
include them in 5.13.


Paolo



[PATCH] powerpc/sysfs: Replace sizeof(arr)/sizeof(arr[0]) with ARRAY_SIZE

2021-06-24 Thread Jason Wang
The ARRAY_SIZE macro is more compact and is the idiomatic form in the
Linux source tree.
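
For reference, the kernel's definition also adds a build-time check that
the argument really is an array, which the open-coded division cannot
provide:

	#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]) + __must_be_array(arr))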

Signed-off-by: Jason Wang 
---
 arch/powerpc/kernel/sysfs.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 2e08640bb3b4..5ff0e55d0db1 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -843,14 +843,14 @@ static int register_cpu_online(unsigned int cpu)
 #ifdef HAS_PPC_PMC_IBM
case PPC_PMC_IBM:
attrs = ibm_common_attrs;
-   nattrs = sizeof(ibm_common_attrs) / sizeof(struct 
device_attribute);
+   nattrs = ARRAY_SIZE(ibm_common_attrs);
pmc_attrs = classic_pmc_attrs;
break;
 #endif /* HAS_PPC_PMC_IBM */
 #ifdef HAS_PPC_PMC_G4
case PPC_PMC_G4:
attrs = g4_common_attrs;
-   nattrs = sizeof(g4_common_attrs) / sizeof(struct 
device_attribute);
+   nattrs = ARRAY_SIZE(g4_common_attrs);
pmc_attrs = classic_pmc_attrs;
break;
 #endif /* HAS_PPC_PMC_G4 */
@@ -858,7 +858,7 @@ static int register_cpu_online(unsigned int cpu)
case PPC_PMC_PA6T:
/* PA Semi starts counting at PMC0 */
attrs = pa6t_attrs;
-   nattrs = sizeof(pa6t_attrs) / sizeof(struct device_attribute);
+   nattrs = ARRAY_SIZE(pa6t_attrs);
pmc_attrs = NULL;
break;
 #endif
@@ -940,14 +940,14 @@ static int unregister_cpu_online(unsigned int cpu)
 #ifdef HAS_PPC_PMC_IBM
case PPC_PMC_IBM:
attrs = ibm_common_attrs;
-   nattrs = sizeof(ibm_common_attrs) / sizeof(struct 
device_attribute);
+   nattrs = ARRAY_SIZE(ibm_common_attrs);
pmc_attrs = classic_pmc_attrs;
break;
 #endif /* HAS_PPC_PMC_IBM */
 #ifdef HAS_PPC_PMC_G4
case PPC_PMC_G4:
attrs = g4_common_attrs;
-   nattrs = sizeof(g4_common_attrs) / sizeof(struct 
device_attribute);
+   nattrs = ARRAY_SIZE(g4_common_attrs);
pmc_attrs = classic_pmc_attrs;
break;
 #endif /* HAS_PPC_PMC_G4 */
@@ -955,7 +955,7 @@ static int unregister_cpu_online(unsigned int cpu)
case PPC_PMC_PA6T:
/* PA Semi starts counting at PMC0 */
attrs = pa6t_attrs;
-   nattrs = sizeof(pa6t_attrs) / sizeof(struct device_attribute);
+   nattrs = ARRAY_SIZE(pa6t_attrs);
pmc_attrs = NULL;
break;
 #endif
-- 
2.31.1





[PATCH] powerpc: ps3: remove unneeded semicolon

2021-06-24 Thread 13145886936
From: gushengxian 

Remove unneeded semicolons.

Signed-off-by: gushengxian 
---
 arch/powerpc/platforms/ps3/system-bus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/system-bus.c 
b/arch/powerpc/platforms/ps3/system-bus.c
index 1a5665875165..f57f37fe038c 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -604,7 +604,7 @@ static dma_addr_t ps3_ioc0_map_page(struct device *_dev, 
struct page *page,
default:
/* not happned */
BUG();
-   };
+   }
result = ps3_dma_map(dev->d_region, (unsigned long)ptr, size,
 &bus_addr, iopte_flag);
 
@@ -763,7 +763,7 @@ int ps3_system_bus_device_register(struct 
ps3_system_bus_device *dev)
break;
default:
BUG();
-   };
+   }
 
dev->core.of_node = NULL;
set_dev_node(&dev->core, 0);
-- 
2.25.1




[PATCH] crypto: scatterwalk - Remove obsolete PageSlab check

2021-06-24 Thread Herbert Xu
On Fri, Jun 18, 2021 at 11:12:58AM -0700, Ira Weiny wrote:
>
> Interesting!  Thanks!
> 
> Digging around a bit more I found:
> 
> https://lore.kernel.org/patchwork/patch/439637/

Nice find.  So we can at least get rid of the PageSlab call from
the Crypto API.

---8<---
As it is now legal to call flush_dcache_page on slab pages we
no longer need to do the check in the Crypto API.

Reported-by: Ira Weiny 
Signed-off-by: Herbert Xu 

diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index c837d0775474..7af08174a721 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -81,12 +81,7 @@ static inline void scatterwalk_pagedone(struct scatter_walk 
*walk, int out,
struct page *page;
 
page = sg_page(walk->sg) + ((walk->offset - 1) >> PAGE_SHIFT);
-   /* Test ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE first as
-* PageSlab cannot be optimised away per se due to
-* use of volatile pointer.
-*/
-   if (ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE && !PageSlab(page))
-   flush_dcache_page(page);
+   flush_dcache_page(page);
}
 
if (more && walk->offset >= walk->sg->offset + walk->sg->length)
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-24 Thread Claire Chang
On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:
>
> On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
> > is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
> >
> > is_swiotlb_force_bounce() was the new function introduced in this patch 
> > here.
> >
> > +static inline bool is_swiotlb_force_bounce(struct device *dev)
> > +{
> > + return dev->dma_io_tlb_mem->force_bounce;
> > +}
>
> To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
> turn this into :
>
> return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;
>
> for a quick debug check?

I just realized that dma_io_tlb_mem might be NULL, as Christoph
pointed out, since swiotlb might not get initialized.
However,  `Unable to handle kernel paging request at virtual address
dfff800e` looks more like the address is garbage rather than
NULL?
I wonder if that's because dev->dma_io_tlb_mem is not assigned
properly (which means device_initialize is not called?).