Re: kernel since 5.6 do not boot anymore on Apple PowerBook
Hi Giuseppe,

On 08/07/2020 at 20:44, Christophe Leroy wrote:
> On 08/07/2020 at 19:36, Giuseppe Sacco wrote:
>> Hi Christophe,
>>
>> On Wed, 08/07/2020 at 19:09 +0200, Christophe Leroy wrote:
>>> Hi,
>>>
>>> On 08/07/2020 at 19:00, Giuseppe Sacco wrote:
>>>> Hello,
>>>> while trying to debug a problem using git bisect, I am now at a point
>>>> where I cannot build the kernel at all. This is the error message I get:
>>>>
>>>> $ LANG=C make ARCH=powerpc \
>>>>       CROSS_COMPILE=powerpc-linux- \
>>>>       CONFIG_MODULE_COMPRESS_GZIP=true \
>>>>       INSTALL_MOD_STRIP=1 CONFIG_MODULE_COMPRESS=1 \
>>>>       -j4 INSTALL_MOD_PATH=$BOOT INSTALL_PATH=$BOOT \
>>>>       CONFIG_DEBUG_INFO_COMPRESSED=1 \
>>>>       install modules_install
>>>> make[2]: *** No rule to make target 'vmlinux', needed by
>>>
>>> Surprising. Did you make any change to Makefiles?
>>
>> No
>>
>>> Are you in the middle of a bisect? If so, if the previous builds
>>> worked, I'd do 'git bisect skip'
>>
>> Yes, the previous one worked.
>>
>>> What's the result with:
>>> LANG=C make ARCH=powerpc CROSS_COMPILE=powerpc-linux- vmlinux
>>
>> $ LANG=C make ARCH=powerpc CROSS_COMPILE=powerpc-linux- vmlinux
>>   CALL    scripts/checksyscalls.sh
>>   CALL    scripts/atomic/check-atomics.sh
>>   CHK     include/generated/compile.h
>>   CC      kernel/module.o
>> kernel/module.c: In function 'do_init_module':
>> kernel/module.c:3593:2: error: implicit declaration of function
>> 'module_enable_ro'; did you mean 'module_enable_x'?
>> [-Werror=implicit-function-declaration]
>>  3593 |  module_enable_ro(mod, true);
>>       |  ^~~~
>>       |  module_enable_x
>> cc1: some warnings being treated as errors
>> make[1]: *** [scripts/Makefile.build:267: kernel/module.o] Error 1
>> make: *** [Makefile:1735: kernel] Error 2
>>
>> So, should I 'git bisect skip'?
>
> Ah yes, I had the exact same problem last time I bisected. So yes, do
> 'git bisect skip'. You'll probably hit this problem half a dozen times,
> but at the end you should get a useful bisect anyway.

Were you able to progress?

Christophe
Re: [PATCH v2 00/13] mm/debug_vm_pgtable fixes
On 8/21/20 9:03 AM, Anshuman Khandual wrote:
> On 08/19/2020 07:15 PM, Aneesh Kumar K.V wrote:
>> "Aneesh Kumar K.V" writes:
>>> This patch series includes fixes for debug_vm_pgtable test code so
>>> that they follow page table update rules correctly. The first two
>>> patches introduce changes w.r.t. ppc64. The patches are included in
>>> this series for completeness. We can merge them via the ppc64 tree if
>>> required.
>>>
>>> The hugetlb test is disabled on ppc64 because that needs a larger
>>> change to satisfy page table update rules.
>>>
>>> Changes from V1:
>>> * Address review feedback
>>> * Drop test-specific pfn_pte and pfn_pmd.
>>> * Update ppc64 page table helper to add _PAGE_PTE
>>>
>>> Aneesh Kumar K.V (13):
>>>   powerpc/mm: Add DEBUG_VM WARN for pmd_clear
>>>   powerpc/mm: Move setting pte specific flags to pfn_pte
>>>   mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value
>>>   mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge
>>>     vmap support.
>>>   mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with
>>>     CONFIG_NUMA_BALANCING
>>>   mm/debug_vm_pgtable/THP: Mark the pte entry huge before using
>>>     set_pmd/pud_at
>>>   mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an
>>>     existing pte entry
>>>   mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP
>>>   mm/debug_vm_pgtable/locks: Move non page table modifying test together
>>>   mm/debug_vm_pgtable/locks: Take correct page table lock
>>>   mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries
>>>   mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
>>>   mm/debug_vm_pgtable: populate a pte entry before fetching it
>>>
>>>  arch/powerpc/include/asm/book3s/64/pgtable.h |  29 +++-
>>>  arch/powerpc/include/asm/nohash/pgtable.h    |   5 -
>>>  arch/powerpc/mm/book3s64/pgtable.c           |   2 +-
>>>  arch/powerpc/mm/pgtable.c                    |   5 -
>>>  include/linux/io.h                           |  12 ++
>>>  mm/debug_vm_pgtable.c                        | 151 +++
>>>  6 files changed, 127 insertions(+), 77 deletions(-)
>>
>> BTW I picked a wrong branch when sending this. Attaching the diff
>> against what I want to send. pfn_pmd() no longer updates _PAGE_PTE
>> because that is handled by pmd_mkhuge().
>>
>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
>> index 3b4da7c63e28..e18ae50a275c 100644
>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>> @@ -141,7 +141,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
>>  	unsigned long pmdv;
>>
>>  	pmdv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK;
>> -	return __pmd(pmdv | pgprot_val(pgprot) | _PAGE_PTE);
>> +	return pmd_set_protbits(__pmd(pmdv), pgprot);
>>  }
>>
>>  pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 7d9f8e1d790f..cad61d22f33a 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -229,7 +229,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
>>  static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
>>  {
>> -	pmd_t pmd = pfn_pmd(pfn, prot);
>> +	pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>>
>>  	if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
>>  		return;
>
> The cover letter does not mention which branch or tag this series
> applies on. I just assumed it to be 5.9-rc1. Should the above changes be
> captured as a pre-requisite patch? Anyway, the series fails to build on
> arm64.
>
> A) Without CONFIG_TRANSPARENT_HUGEPAGE
>
> mm/debug_vm_pgtable.c: In function 'debug_vm_pgtable':
> mm/debug_vm_pgtable.c:1045:2: error: too many arguments to function 'pmd_advanced_tests'
>   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
>   ^~
> mm/debug_vm_pgtable.c:366:20: note: declared here
>  static void __init pmd_advanced_tests(struct mm_struct *mm,
>                     ^~
>
> B) As mentioned previously, this should be solved by including
>
> mm/debug_vm_pgtable.c: In function 'pmd_huge_tests':
> mm/debug_vm_pgtable.c:215:7: error: implicit declaration of function
> 'arch_ioremap_pmd_supported'; did you mean 'arch_disable_smp_support'?
> [-Werror=implicit-function-declaration]
>   if (!arch_ioremap_pmd_supported())
>       ^~
>
> Please make sure that the series builds on all enabled platforms, i.e.
> x86, arm64, ppc32, ppc64, arc, s390, along with selectively
> enabling/disabling all the features that gate the various #ifdefs in
> the test.

I was hoping to get a kernel test robot build report to verify that. But
if you can help with that, I have pushed a branch to github with the
reported build failure fixes.

https://github.com/kvaneesh/linux/tree/debug_vm_pgtable

I still haven't looked at the PMD_FOLDED feedback from Christophe,
because I am not sure I follow why we are checking for PMD folded there.

-aneesh
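Build failure (A) above is the usual hazard with test code split by
#ifdef: when the CONFIG_TRANSPARENT_HUGEPAGE variant of a helper gains a
parameter, the stub in the #else branch must change in lockstep or !THP
builds fail exactly as reported. A minimal sketch of the pattern (the
argument list here is abbreviated and hypothetical, not the file's real
signature):

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	static void __init pmd_advanced_tests(struct mm_struct *mm, pmd_t *pmdp,
					      unsigned long pfn, pgprot_t prot,
					      pgtable_t pgtable)
	{
		/* the real tests exercise set_pmd_at(), pmd_mkhuge(), etc. */
	}
	#else
	/* stub for !THP builds: must keep the same signature as above */
	static void __init pmd_advanced_tests(struct mm_struct *mm, pmd_t *pmdp,
					      unsigned long pfn, pgprot_t prot,
					      pgtable_t pgtable)
	{
	}
	#endif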
Re: [PATCH v2 3/6] powerpc/32s: Only leave NX unset on segments used for modules
On 08/21/2020 05:11 AM, Christophe Leroy wrote:
> On 21/08/2020 at 00:00, Andreas Schwab wrote:
>> On Jun 29 2020, Christophe Leroy wrote:
>>> Instead of leaving NX unset on all segments above the start of vmalloc
>>> space, only leave NX unset on segments used for modules.
>>
>> I'm getting this crash:
>>
>> kernel tried to execute exec-protected page (f294b000) - exploit attempt (uid: 0)
>> BUG: Unable to handle kernel instruction fetch
>> Faulting instruction address: 0xf294b000
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> BE PAGE_SIZE=4K MMU=Hash PowerMac
>> Modules linked in: pata_macio(+)
>> CPU: 0 PID: 87 Comm: udevd Not tainted 5.8.0-rc2-test #49
>> NIP: f294b000 LR: c0005c60 CTR: f294b000
>> REGS: f18d9cc0 TRAP: 0400 Not tainted (5.8.0-rc2-test)
>> MSR: 10009032 CR: 84222422 XER: 2000
>> GPR00: c0005c14 f18d9d78 ef30ca20 efe0 c00993d0 ef6da038 005e
>> GPR08: c09050b8 c08b f18d9d78 44222422 10072070 0fefaca4
>> GPR16: 1006a00c f294d50b 0120 0124 c0096ea8 000e ef2776c0 ef2776e4
>> GPR24: f18fd6e8 0001 c086fe64 c086fe04 c08b f294b000
>> NIP [f294b000] pata_macio_init+0x0/0xc0 [pata_macio]
>> LR [c0005c60] do_one_initcall+0x6c/0x160
>> Call Trace:
>> [f18d9d78] [c0005c14] do_one_initcall+0x20/0x160 (unreliable)
>> [f18d9dd8] [c009a22c] do_init_module+0x60/0x1c0
>> [f18d9df8] [c00993d8] load_module+0x16a8/0x1c14
>> [f18d9ea8] [c0099aa4] sys_finit_module+0x8c/0x94
>> [f18d9f38] [c0012174] ret_from_syscall+0x0/0x34
>> --- interrupt: c01 at 0xfdb4318
>>     LR = 0xfeee9c0
>> Instruction dump:
>> <3d20c08b> 3d40c086 9421ffe0 8129106c
>> ---[ end trace 85a98cc836109871 ]---
>
> Please try the patch at
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/07884ed033c31e074747b7eb8eaa329d15db07ec.1596641219.git.christophe.le...@csgroup.eu/
>
> And if you are using KAsan, also take
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/6eddca2d5611fd57312a88eae31278c87a8fc99d.1596641224.git.christophe.le...@csgroup.eu/
>
> Although I have some doubt that it will fix it, because the faulting
> instruction address is at 0xf294b000, which is within the vmalloc area.
>
> In the likely case the patch doesn't fix the issue, can you provide your
> .config and a dump of /sys/kernel/debug/powerpc/segment_registers (you
> have to have CONFIG_PPC_PTDUMP enabled for that), and also the below
> part from the boot log.
>
> [    0.000000] Memory: 509556K/524288K available (7088K kernel code, 592K rwdata, 1304K rodata, 356K init, 803K bss, 14732K reserved, 0K cma-reserved)
> [    0.000000] Kernel virtual memory layout:
> [    0.000000]   * 0xff7ff000..0xfffff000  : fixmap
> [    0.000000]   * 0xff7fd000..0xff7ff000  : early ioremap
> [    0.000000]   * 0xe1000000..0xff7fd000  : vmalloc & ioremap

I found the issue: when VMALLOC_END is above 0xf0000000,
ALIGN(VMALLOC_END, SZ_256M) is 0, so the test is always false. The below
change should fix it:

diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index 82ae9e06a773..d426eaf76bb0 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -194,12 +194,12 @@ static bool is_module_segment(unsigned long addr)
 #ifdef MODULES_VADDR
 	if (addr < ALIGN_DOWN(MODULES_VADDR, SZ_256M))
 		return false;
-	if (addr >= ALIGN(MODULES_END, SZ_256M))
+	if (addr > ALIGN(MODULES_END, SZ_256M) - 1)
 		return false;
 #else
 	if (addr < ALIGN_DOWN(VMALLOC_START, SZ_256M))
 		return false;
-	if (addr >= ALIGN(VMALLOC_END, SZ_256M))
+	if (addr > ALIGN(VMALLOC_END, SZ_256M) - 1)
 		return false;
 #endif
 	return true;

Christophe
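The wraparound Christophe describes is easy to reproduce by replaying
the kernel's ALIGN() arithmetic in a standalone program (a userspace
sketch, not kernel code; the VMALLOC_END value is taken from the boot
log above). With ALIGN() evaluating to 0, "addr >= 0" is always true and
is_module_segment() always returns false; "addr > 0xffffffff" never is:

	#include <stdint.h>
	#include <stdio.h>

	#define SZ_256M		0x10000000u
	/* the kernel's ALIGN(), evaluated in 32-bit arithmetic as on ppc32 */
	#define ALIGN32(x, a)	(((uint32_t)(x) + ((a) - 1)) & ~((uint32_t)(a) - 1))

	int main(void)
	{
		uint32_t vmalloc_end = 0xff7fd000u;	/* VMALLOC_END per the boot log */

		/* 0xff7fd000 + 0x0fffffff overflows 32 bits and wraps past zero */
		printf("ALIGN(VMALLOC_END, SZ_256M)     = %#010x\n",
		       ALIGN32(vmalloc_end, SZ_256M));		/* prints 0x00000000 */
		printf("ALIGN(VMALLOC_END, SZ_256M) - 1 = %#010x\n",
		       ALIGN32(vmalloc_end, SZ_256M) - 1u);	/* prints 0xffffffff */
		return 0;
	}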
Re: [PATCH v5 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
On 21/08/2020 at 06:44, Nicholas Piggin wrote:
> This is a generic kernel virtual memory mapper, not specific to ioremap.
>
> Signed-off-by: Nicholas Piggin
> ---
>  include/linux/vmalloc.h |   2 +
>  mm/ioremap.c            | 192
>  mm/vmalloc.c            | 191 +++
>  3 files changed, 193 insertions(+), 192 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 787d77ad7536..e3590e93bfff 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr);
>  extern struct vm_struct *find_vm_area(const void *addr);
>
>  #ifdef CONFIG_MMU
> +extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> +			unsigned int max_page_shift);

The extern keyword is useless on function prototypes and is deprecated.
Please don't add new function prototypes with that keyword.

>  extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
>  				    pgprot_t prot, struct page **pages);
>  int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,

Christophe
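To make the style point concrete (an illustrative pair of declarations,
not a hunk from the patch): the two prototypes below mean exactly the
same thing to the compiler, which is why the storage-class specifier is
just noise on a prototype:

	extern int vmap_range(unsigned long addr, unsigned long end,
			      phys_addr_t phys_addr, pgprot_t prot,
			      unsigned int max_page_shift);

	/* identical meaning; preferred form for new prototypes */
	int vmap_range(unsigned long addr, unsigned long end,
		       phys_addr_t phys_addr, pgprot_t prot,
		       unsigned int max_page_shift);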
Re: [PATCH v5 0/8] huge vmalloc mappings
On 21/08/2020 at 06:44, Nicholas Piggin wrote:
> I made this powerpc-only for the time being. It shouldn't be too hard to
> add support for other archs that define HUGE_VMAP. I have booted x86
> with it enabled, just may not have audited everything.

I like this series, but if I understand correctly it enables huge
vmalloc mappings only for hugepage sizes matching a page directory
level, i.e. on PPC32 it would work only for 4M hugepages. On the 8xx, we
only have 8M and 512k hugepages. Any chance that it can support these as
well one day?

Christophe

> Hi Andrew, would you care to put this in your tree?
>
> Thanks,
> Nick
>
> Since v4:
> - Fixed an off-by-page-order bug in v4
> - Several minor cleanups.
> - Added page order to /proc/vmallocinfo
> - Added hugepage to alloc_large_system_hash output.
> - Made an architecture config option, powerpc only for now.
>
> Since v3:
> - Fixed an off-by-one bug in a loop
> - Fix !CONFIG_HAVE_ARCH_HUGE_VMAP build fail
> - Hopefully this time fix the arm64 vmap stack bug, thanks Jonathan
>   Cameron for debugging the cause of this (hopefully).
>
> Since v2:
> - Rebased on vmalloc cleanups, split series into simpler pieces.
> - Fixed several compile errors and warnings
> - Keep the page array and accounting in small page units because
>   struct vm_struct is an interface (this should fix x86 vmap stack
>   debug assert). [Thanks Zefan]
>
> Nicholas Piggin (8):
>   mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
>   mm: apply_to_pte_range warn and fail if a large pte is encountered
>   mm/vmalloc: rename vmap_*_range vmap_pages_*_range
>   lib/ioremap: rename ioremap_*_range to vmap_*_range
>   mm: HUGE_VMAP arch support cleanup
>   mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
>   mm/vmalloc: add vmap_range_noflush variant
>   mm/vmalloc: Hugepage vmalloc mappings
>
>  .../admin-guide/kernel-parameters.txt    |   2 +
>  arch/Kconfig                             |   4 +
>  arch/arm64/mm/mmu.c                      |  12 +-
>  arch/powerpc/Kconfig                     |   1 +
>  arch/powerpc/mm/book3s64/radix_pgtable.c |  10 +-
>  arch/x86/mm/ioremap.c                    |  12 +-
>  include/linux/io.h                       |   9 -
>  include/linux/vmalloc.h                  |  13 +
>  init/main.c                              |   1 -
>  mm/ioremap.c                             | 231 +
>  mm/memory.c                              |  60 ++-
>  mm/page_alloc.c                          |   4 +-
>  mm/vmalloc.c                             | 456 +++---
>  13 files changed, 476 insertions(+), 339 deletions(-)
Re: [PATCH v5 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
On Fri, Aug 21, 2020 at 02:44:25PM +1000, Nicholas Piggin wrote:
> This is a generic kernel virtual memory mapper, not specific to ioremap.

lib/ioremap doesn't exist any more.

> Signed-off-by: Nicholas Piggin
> ---
>  include/linux/vmalloc.h |   2 +
>  mm/ioremap.c            | 192
>  mm/vmalloc.c            | 191 +++
>  3 files changed, 193 insertions(+), 192 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 787d77ad7536..e3590e93bfff 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr);
>  extern struct vm_struct *find_vm_area(const void *addr);
>
>  #ifdef CONFIG_MMU
> +extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> +			unsigned int max_page_shift);

Please avoid the pointlessly long line. And don't add the pointless
extern.
Re: [PATCH v5 5/8] mm: HUGE_VMAP arch support cleanup
> static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
> -		phys_addr_t phys_addr, pgprot_t prot)
> +		phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
> {

... and here.
Re: [PATCH v5 4/8] lib/ioremap: rename ioremap_*_range to vmap_*_range
On Fri, Aug 21, 2020 at 02:44:23PM +1000, Nicholas Piggin wrote:
> This will be moved to mm/ and used as a generic kernel virtual mapping
> function, so re-name it in preparation.
>
> Signed-off-by: Nicholas Piggin
> ---
>  mm/ioremap.c | 55 ++--
>  1 file changed, 23 insertions(+), 32 deletions(-)
>
> diff --git a/mm/ioremap.c b/mm/ioremap.c
> index 5fa1ab41d152..6016ae3227ad 100644
> --- a/mm/ioremap.c
> +++ b/mm/ioremap.c
> @@ -61,9 +61,8 @@ static inline int ioremap_pud_enabled(void) { return 0; }
>  static inline int ioremap_pmd_enabled(void) { return 0; }
>  #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
>
> -static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
> -		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> -		pgtbl_mod_mask *mask)
> +static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> +		phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)

Same here.
Re: [PATCH v5 3/8] mm/vmalloc: rename vmap_*_range vmap_pages_*_range
On Fri, Aug 21, 2020 at 02:44:22PM +1000, Nicholas Piggin wrote:
> The vmalloc mapper operates on a struct page * array rather than a
> linear physical address, re-name it to make this distinction clear.
>
> Signed-off-by: Nicholas Piggin
> ---
>  mm/vmalloc.c | 28
>  1 file changed, 12 insertions(+), 16 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 49f225b0f855..3a1e45fd1626 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -190,9 +190,8 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
>  	arch_sync_kernel_mappings(start, end);
>  }
>
> -static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
> -		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> +		pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)

Please don't add lines longer than 80 columns without a good reason.
Re: [PATCH v5 5/8] mm: HUGE_VMAP arch support cleanup
Le 21/08/2020 à 06:44, Nicholas Piggin a écrit : This changes the awkward approach where architectures provide init functions to determine which levels they can provide large mappings for, to one where the arch is queried for each call. This removes code and indirection, and allows constant-folding of dead code for unsupported levels. I think that in order to allow constant-folding of dead code for unsupported levels, you must define arch_vmap_xxx_supported() as static inline in a .h If you have them in .c files, you'll get calls to tiny functions that will always return false, but will still be called and dead code won't be eliminated. And performance wise, that's probably not optimal either. Christophe This also adds a prot argument to the arch query. This is unused currently but could help with some architectures (e.g., some powerpc processors can't map uncacheable memory with large pages). Signed-off-by: Nicholas Piggin --- arch/arm64/mm/mmu.c | 12 +-- arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ++- arch/x86/mm/ioremap.c| 12 +-- include/linux/io.h | 9 --- include/linux/vmalloc.h | 10 +++ init/main.c | 1 - mm/ioremap.c | 96 +++- 7 files changed, 73 insertions(+), 77 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62fea1b6..bbb3ccf6a7ce 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1304,12 +1304,13 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot) return dt_virt; } -int __init arch_ioremap_p4d_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } -int __init arch_ioremap_pud_supported(void) +bool arch_vmap_pud_supported(pgprot_t prot) { /* * Only 4k granule supports level 1 block mappings. @@ -1319,11 +1320,12 @@ int __init arch_ioremap_pud_supported(void) !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { - /* See arch_ioremap_pud_supported() */ + /* See arch_vmap_pud_supported() */ return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); } +#endif int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot) { diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index ae823bba29f2..7d3a620c5adf 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1182,13 +1182,14 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma, set_pte_at(mm, addr, ptep, pte); } -int __init arch_ioremap_pud_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_pud_supported(pgprot_t prot) { /* HPT does not cope with large pages in the vmalloc area */ return radix_enabled(); } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { return radix_enabled(); } @@ -1197,6 +1198,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) { return 0; } +#endif int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) { @@ -1282,7 +1284,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) return 1; } -int __init arch_ioremap_p4d_supported(void) +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 84d85dbd1dad..5b8b495ab4ed 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -481,24 +481,26 @@ void iounmap(volatile void __iomem *addr) } EXPORT_SYMBOL(iounmap); -int __init arch_ioremap_p4d_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 
+bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } -int __init arch_ioremap_pud_supported(void) +bool arch_vmap_pud_supported(pgprot_t prot) { #ifdef CONFIG_X86_64 return boot_cpu_has(X86_FEATURE_GBPAGES); #else - return 0; + return false; #endif } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { return boot_cpu_has(X86_FEATURE_PSE); } +#endif /* * Convert a physical pointer to a virtual kernel pointer for /dev/mem diff --git a/include/linux/io.h b/include/linux/io.h index 8394c56babc2..f1effd4d7a3c 100644 --- a/include/linux/io.h +++ b/include/linux/io.h @@ -31,15 +31,6 @@ static inline int ioremap_page_range(unsigned long addr, unsigned long end, } #endif -#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP -void __init ioremap_huge_init(void); -int arch_ioremap_p4d_supported(void); -int arch_ioremap_pud_supported(void); -int arch_ioremap_pmd_supported(void); -#else -static inline voi
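On the constant-folding point above: moving the helpers into an arch
header as static inline makes the result visible at the call site, so
the compiler can discard the unsupported-level paths outright. A sketch
of the idea for x86 (the file placement is an assumption, not part of
the posted series):

	/* e.g. in arch/x86/include/asm/vmalloc.h -- hypothetical placement */
	#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
	static inline bool arch_vmap_p4d_supported(pgprot_t prot)
	{
		return false;	/* compile-time constant: huge-p4d paths fold away */
	}

	static inline bool arch_vmap_pud_supported(pgprot_t prot)
	{
	#ifdef CONFIG_X86_64
		return boot_cpu_has(X86_FEATURE_GBPAGES);
	#else
		return false;
	#endif
	}

	static inline bool arch_vmap_pmd_supported(pgprot_t prot)
	{
		return boot_cpu_has(X86_FEATURE_PSE);
	}
	#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */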
[PATCH 2/2] powerpc/64s: Disallow PROT_SAO in LPARs by default
Since migration of guests using SAO to ISA 3.1 hosts may cause issues,
disable PROT_SAO in LPARs by default and introduce a new Kconfig option
PPC_PROT_SAO_LPAR to allow users to enable it if desired.

Signed-off-by: Shawn Anastasio
---
 arch/powerpc/Kconfig            | 12 ++++++++++++
 arch/powerpc/include/asm/mman.h |  9 +++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..65bed1fdeaad 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -860,6 +860,18 @@ config PPC_SUBPAGE_PROT

 	  If unsure, say N here.

+config PPC_PROT_SAO_LPAR
+	bool "Support PROT_SAO mappings in LPARs"
+	depends on PPC_BOOK3S_64
+	help
+	  This option adds support for PROT_SAO mappings from userspace
+	  inside LPARs on supported CPUs.
+
+	  This may cause issues when performing guest migration from
+	  a CPU that supports SAO to one that does not.
+
+	  If unsure, say N here.
+
 config PPC_COPRO_BASE
 	bool

diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 4ba303ea27f5..7cb6d18f5cd6 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -40,8 +40,13 @@ static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
 {
 	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
 		return false;
-	if ((prot & PROT_SAO) && !cpu_has_feature(CPU_FTR_SAO))
-		return false;
+	if (prot & PROT_SAO) {
+		if (!cpu_has_feature(CPU_FTR_SAO))
+			return false;
+		if (firmware_has_feature(FW_FEATURE_LPAR) &&
+		    !IS_ENABLED(CONFIG_PPC_PROT_SAO_LPAR))
+			return false;
+	}
 	return true;
 }
 #define arch_validate_prot arch_validate_prot
-- 
2.28.0
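For reference, userspace requests SAO semantics by passing the flag
straight to mmap(), which is exactly what arch_validate_prot() above is
vetting. A minimal userspace sketch (PROT_SAO is powerpc-specific; the
0x10 value matches the powerpc uapi header, and the call is simply
refused with EINVAL where SAO is unavailable or disabled by this patch):

	#include <sys/mman.h>
	#include <stdio.h>

	#ifndef PROT_SAO
	#define PROT_SAO	0x10	/* powerpc: strong access ordering */
	#endif

	int main(void)
	{
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_SAO,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap(PROT_SAO)");	/* EINVAL when SAO is unavailable */
			return 1;
		}
		/* loads/stores through p are strongly ordered on CPU_FTR_SAO CPUs */
		munmap(p, 4096);
		return 0;
	}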
[PATCH 1/2] Revert "powerpc/64s: Remove PROT_SAO support"
This reverts commit 5c9fa16e8abd342ce04dc830c1ebb2a03abf6c05. Since PROT_SAO can still be useful for certain classes of software, reintroduce it. Concerns about guest migration for LPARs using SAO will be addressed next. Signed-off-by: Shawn Anastasio --- arch/powerpc/include/asm/book3s/64/pgtable.h | 8 ++-- arch/powerpc/include/asm/cputable.h | 10 ++--- arch/powerpc/include/asm/mman.h | 26 ++-- arch/powerpc/include/asm/nohash/64/pgtable.h | 2 + arch/powerpc/include/uapi/asm/mman.h | 2 +- arch/powerpc/kernel/dt_cpu_ftrs.c | 2 +- arch/powerpc/mm/book3s64/hash_utils.c | 2 + include/linux/mm.h| 2 + include/trace/events/mmflags.h| 2 + mm/ksm.c | 4 ++ tools/testing/selftests/powerpc/mm/.gitignore | 1 + tools/testing/selftests/powerpc/mm/Makefile | 4 +- tools/testing/selftests/powerpc/mm/prot_sao.c | 42 +++ 13 files changed, 90 insertions(+), 17 deletions(-) create mode 100644 tools/testing/selftests/powerpc/mm/prot_sao.c diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 6de56c3b33c4..495fc0ccb453 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -20,13 +20,9 @@ #define _PAGE_RW (_PAGE_READ | _PAGE_WRITE) #define _PAGE_RWX (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC) #define _PAGE_PRIVILEGED 0x8 /* kernel access only */ - -#define _PAGE_CACHE_CTL0x00030 /* Bits for the folowing cache modes */ - /* No bits set is normal cacheable memory */ - /* 0x00010 unused, is SAO bit on radix POWER9 */ +#define _PAGE_SAO 0x00010 /* Strong access order */ #define _PAGE_NON_IDEMPOTENT 0x00020 /* non idempotent memory */ #define _PAGE_TOLERANT 0x00030 /* tolerant memory, cache inhibited */ - #define _PAGE_DIRTY0x00080 /* C: page changed */ #define _PAGE_ACCESSED 0x00100 /* R: page referenced */ /* @@ -828,6 +824,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, return hash__set_pte_at(mm, addr, ptep, pte, percpu); } +#define _PAGE_CACHE_CTL(_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT) + #define pgprot_noncached pgprot_noncached static inline pgprot_t pgprot_noncached(pgprot_t prot) { diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h index fdddb822d564..f89205eff691 100644 --- a/arch/powerpc/include/asm/cputable.h +++ b/arch/powerpc/include/asm/cputable.h @@ -191,7 +191,7 @@ static inline void cpu_feature_keys_init(void) { } #define CPU_FTR_SPURR LONG_ASM_CONST(0x0100) #define CPU_FTR_DSCR LONG_ASM_CONST(0x0200) #define CPU_FTR_VSXLONG_ASM_CONST(0x0400) -// Free LONG_ASM_CONST(0x0800) +#define CPU_FTR_SAOLONG_ASM_CONST(0x0800) #define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x1000) #define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x2000) #define CPU_FTR_ASYM_SMT LONG_ASM_CONST(0x4000) @@ -436,7 +436,7 @@ static inline void cpu_feature_keys_init(void) { } CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | CPU_FTR_ASYM_SMT | \ + CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT | \ CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ CPU_FTR_CFAR | CPU_FTR_HVMODE | \ CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX ) @@ -445,7 +445,7 @@ static inline void cpu_feature_keys_init(void) { } CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | \ + CPU_FTR_DSCR | CPU_FTR_SAO | \ CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ CPU_FTR_CFAR | CPU_FTR_HVMODE | 
CPU_FTR_VMX_COPY | \ CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \ @@ -456,7 +456,7 @@ static inline void cpu_feature_keys_init(void) { } CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | \ + CPU_FTR_DSCR | CPU_FTR_SAO | \ CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \ CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \ @@ -474,7 +474,7 @@ static inline void cpu_feature_keys_init
[PATCH 0/2] Reintroduce PROT_SAO
This set re-introduces the PROT_SAO prot flag removed in commit
5c9fa16e8abd ("powerpc/64s: Remove PROT_SAO support").

To address concerns regarding live migration of guests using SAO to P10
hosts without SAO support, the flag is disabled by default in LPARs. A
new config option, PPC_PROT_SAO_LPAR, was added to allow users to
explicitly enable it if they will not be running in an environment where
this is a concern.

Shawn Anastasio (2):
  Revert "powerpc/64s: Remove PROT_SAO support"
  powerpc/64s: Disallow PROT_SAO in LPARs by default

 arch/powerpc/Kconfig                          | 12 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  8 ++--
 arch/powerpc/include/asm/cputable.h           | 10 ++---
 arch/powerpc/include/asm/mman.h               | 31 --
 arch/powerpc/include/asm/nohash/64/pgtable.h  |  2 +
 arch/powerpc/include/uapi/asm/mman.h          |  2 +-
 arch/powerpc/kernel/dt_cpu_ftrs.c             |  2 +-
 arch/powerpc/mm/book3s64/hash_utils.c         |  2 +
 include/linux/mm.h                            |  2 +
 include/trace/events/mmflags.h                |  2 +
 mm/ksm.c                                      |  4 ++
 tools/testing/selftests/powerpc/mm/.gitignore |  1 +
 tools/testing/selftests/powerpc/mm/Makefile   |  4 +-
 tools/testing/selftests/powerpc/mm/prot_sao.c | 42 +++++++
 14 files changed, 107 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/mm/prot_sao.c

-- 
2.28.0
[powerpc:fixes-test] BUILD SUCCESS 90a9b102eddf6a3f987d15f4454e26a2532c1c98
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test branch HEAD: 90a9b102eddf6a3f987d15f4454e26a2532c1c98 powerpc/pseries: Do not initiate shutdown when system is running on UPS elapsed time: 927m configs tested: 75 configs skipped: 75 The following configs have been built successfully. More configs may be tested in the coming days. arm defconfig arm64allyesconfig arm64 defconfig arm allyesconfig arm allmodconfig m68k m5275evb_defconfig armkeystone_defconfig s390 alldefconfig ia64 allmodconfig ia64defconfig ia64 allyesconfig m68k allmodconfig m68kdefconfig m68k allyesconfig nios2 defconfig arc allyesconfig nds32 allnoconfig c6x allyesconfig nds32 defconfig nios2allyesconfig cskydefconfig alpha defconfig alphaallyesconfig xtensa allyesconfig h8300allyesconfig arc defconfig sh allmodconfig parisc defconfig s390 allyesconfig parisc allyesconfig s390defconfig i386 allyesconfig sparcallyesconfig sparc defconfig i386defconfig mips allyesconfig mips allmodconfig powerpc allyesconfig powerpc allmodconfig powerpc allnoconfig powerpc defconfig i386 randconfig-a002-20200820 i386 randconfig-a004-20200820 i386 randconfig-a005-20200820 i386 randconfig-a003-20200820 i386 randconfig-a006-20200820 i386 randconfig-a001-20200820 x86_64 randconfig-a015-20200820 x86_64 randconfig-a012-20200820 x86_64 randconfig-a016-20200820 x86_64 randconfig-a014-20200820 x86_64 randconfig-a011-20200820 x86_64 randconfig-a013-20200820 i386 randconfig-a013-20200820 i386 randconfig-a012-20200820 i386 randconfig-a011-20200820 i386 randconfig-a016-20200820 i386 randconfig-a014-20200820 i386 randconfig-a015-20200820 i386 randconfig-a013-20200821 i386 randconfig-a012-20200821 i386 randconfig-a011-20200821 i386 randconfig-a016-20200821 i386 randconfig-a014-20200821 i386 randconfig-a015-20200821 riscvallyesconfig riscv allnoconfig riscv defconfig riscvallmodconfig x86_64 rhel x86_64 allyesconfig x86_64rhel-7.6-kselftests x86_64 defconfig x86_64 rhel-8.3 x86_64 kexec --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
[powerpc:merge] BUILD SUCCESS 7c25bda14d66718f9fa428808dae289dd84f1da3
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge branch HEAD: 7c25bda14d66718f9fa428808dae289dd84f1da3 Automatic merge of 'master', 'next' and 'fixes' (2020-08-20 23:20) elapsed time: 926m configs tested: 69 configs skipped: 2 The following configs have been built successfully. More configs may be tested in the coming days. arm defconfig arm64allyesconfig arm64 defconfig arm allyesconfig arm allmodconfig m68k m5275evb_defconfig armkeystone_defconfig s390 alldefconfig ia64 allmodconfig ia64defconfig ia64 allyesconfig m68k allmodconfig m68kdefconfig m68k allyesconfig nios2 defconfig arc allyesconfig nds32 allnoconfig c6x allyesconfig nds32 defconfig nios2allyesconfig cskydefconfig alpha defconfig alphaallyesconfig xtensa allyesconfig h8300allyesconfig arc defconfig sh allmodconfig parisc defconfig s390 allyesconfig parisc allyesconfig s390defconfig i386 allyesconfig sparcallyesconfig sparc defconfig i386defconfig mips allyesconfig mips allmodconfig powerpc allyesconfig powerpc allmodconfig powerpc allnoconfig powerpc defconfig i386 randconfig-a002-20200820 i386 randconfig-a004-20200820 i386 randconfig-a005-20200820 i386 randconfig-a003-20200820 i386 randconfig-a006-20200820 i386 randconfig-a001-20200820 x86_64 randconfig-a015-20200820 x86_64 randconfig-a012-20200820 x86_64 randconfig-a016-20200820 x86_64 randconfig-a014-20200820 x86_64 randconfig-a011-20200820 x86_64 randconfig-a013-20200820 i386 randconfig-a013-20200820 i386 randconfig-a012-20200820 i386 randconfig-a011-20200820 i386 randconfig-a016-20200820 i386 randconfig-a014-20200820 i386 randconfig-a015-20200820 riscvallyesconfig riscv allnoconfig riscv defconfig riscvallmodconfig x86_64 rhel x86_64 allyesconfig x86_64rhel-7.6-kselftests x86_64 defconfig x86_64 rhel-8.3 x86_64 kexec --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
[PATCH v5 8/8] mm/vmalloc: Hugepage vmalloc mappings
On platforms that define HAVE_ARCH_HUGE_VMAP and support PMD vmaps, vmalloc will attempt to allocate PMD-sized pages first, before falling back to small pages. Allocations which use something other than PAGE_KERNEL protections are not permitted to use huge pages yet, not all callers expect this (e.g., module allocations vs strict module rwx). This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%. This can result in more internal fragmentation and memory overhead for a given allocation, an option nohugevmalloc is added to disable at boot. Signed-off-by: Nicholas Piggin --- .../admin-guide/kernel-parameters.txt | 2 + arch/Kconfig | 4 + arch/powerpc/Kconfig | 1 + include/linux/vmalloc.h | 1 + mm/page_alloc.c | 4 +- mm/vmalloc.c | 188 +- 6 files changed, 152 insertions(+), 48 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bdc1f33fd3d1..6f0b41289a90 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3190,6 +3190,8 @@ nohugeiomap [KNL,X86,PPC] Disable kernel huge I/O mappings. + nohugevmalloc [PPC] Disable kernel huge vmalloc mappings. + nosmt [KNL,S390] Disable symmetric multithreading (SMT). Equivalent to smt=1. diff --git a/arch/Kconfig b/arch/Kconfig index af14a567b493..b2b89d629317 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -616,6 +616,10 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD config HAVE_ARCH_HUGE_VMAP bool +config HAVE_ARCH_HUGE_VMALLOC + depends on HAVE_ARCH_HUGE_VMAP + bool + config ARCH_WANT_HUGE_PMD_SHARE bool diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 95dfd8ef3d4b..044e5a94967a 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -175,6 +175,7 @@ config PPC select GENERIC_TIME_VSYSCALL select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_HUGE_VMAP if PPC_BOOK3S_64 && PPC_RADIX_MMU + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP select HAVE_ARCH_JUMP_LABEL select HAVE_ARCH_KASAN if PPC32 && PPC_PAGE_SHIFT <= 14 select HAVE_ARCH_KASAN_VMALLOC if PPC32 && PPC_PAGE_SHIFT <= 14 diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index e3590e93bfff..8f25dbaca0a1 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -58,6 +58,7 @@ struct vm_struct { unsigned long size; unsigned long flags; struct page **pages; + unsigned intpage_order; unsigned intnr_pages; phys_addr_t phys_addr; const void *caller; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e2bab486fea..d785e5335529 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8102,6 +8102,7 @@ void *__init alloc_large_system_hash(const char *tablename, void *table = NULL; gfp_t gfp_flags; bool virt; + bool huge; /* allow the kernel cmdline to have a say */ if (!numentries) { @@ -8169,6 +8170,7 @@ void *__init alloc_large_system_hash(const char *tablename, } else if (get_order(size) >= MAX_ORDER || hashdist) { table = __vmalloc(size, gfp_flags); virt = true; + huge = (find_vm_area(table)->page_order > 0); } else { /* * If bucketsize is not a power-of-two, we may free @@ -8185,7 +8187,7 @@ void *__init alloc_large_system_hash(const char *tablename, pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n", tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size, - virt ? "vmalloc" : "linear"); + virt ? (huge ? 
"vmalloc hugepage" : "vmalloc") : "linear"); if (_hash_shift) *_hash_shift = log2qty; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 4e5cb7c7f780..564d7497e551 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -45,6 +45,19 @@ #include "internal.h" #include "pgalloc-track.h" +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC +static bool __ro_after_init vmap_allow_huge = true; + +static int __init set_nohugevmalloc(char *str) +{ + vmap_allow_huge = false; + return 0; +} +early_param("nohugevmalloc", set_nohugevmalloc); +#else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ +static const bool vmap_allow_huge = false; +#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ + bool is_vmalloc_addr(const void *x) { unsigned long addr = (unsigned long)x; @@ -468,31
[PATCH v5 7/8] mm/vmalloc: add vmap_range_noflush variant
As a side-effect, the order of the flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches the
other callers in this file.

Signed-off-by: Nicholas Piggin
---
 mm/vmalloc.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 129f10545bb1..4e5cb7c7f780 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -234,8 +234,8 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	return 0;
 }

-int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-			unsigned int max_page_shift)
+static int vmap_range_noflush(unsigned long addr, unsigned long end, phys_addr_t phys_addr,
+			pgprot_t prot, unsigned int max_page_shift)
 {
 	pgd_t *pgd;
 	unsigned long start;
@@ -255,14 +255,23 @@ int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgp
 		break;
 	} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);

-	flush_cache_vmap(start, end);
-
 	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
 		arch_sync_kernel_mappings(start, end);

 	return err;
 }

+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+		unsigned int max_page_shift)
+{
+	int err;
+
+	err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+	flush_cache_vmap(addr, end);
+
+	return err;
+}
+
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     pgtbl_mod_mask *mask)
 {
-- 
2.23.0
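The *_noflush split follows the existing convention in this file: a
caller that maps several sub-ranges can defer the cache flush and issue
it once over the whole span. A hedged sketch of the calling pattern
(hypothetical helper, usable only inside mm/vmalloc.c where the static
vmap_range_noflush() is visible):

	/* hypothetical: map two adjacent physical extents, flush the cache once */
	static int vmap_two_extents(unsigned long addr, unsigned long mid,
				    unsigned long end, phys_addr_t phys0,
				    phys_addr_t phys1, pgprot_t prot,
				    unsigned int max_page_shift)
	{
		int err;

		err = vmap_range_noflush(addr, mid, phys0, prot, max_page_shift);
		if (!err)
			err = vmap_range_noflush(mid, end, phys1, prot, max_page_shift);
		flush_cache_vmap(addr, end);	/* single flush covering both ranges */
		return err;
	}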
[PATCH v5 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
This is a generic kernel virtual memory mapper, not specific to ioremap. Signed-off-by: Nicholas Piggin --- include/linux/vmalloc.h | 2 + mm/ioremap.c| 192 mm/vmalloc.c| 191 +++ 3 files changed, 193 insertions(+), 192 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 787d77ad7536..e3590e93bfff 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr); extern struct vm_struct *find_vm_area(const void *addr); #ifdef CONFIG_MMU +extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot, + unsigned int max_page_shift); extern int map_kernel_range_noflush(unsigned long start, unsigned long size, pgprot_t prot, struct page **pages); int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot, diff --git a/mm/ioremap.c b/mm/ioremap.c index b0032dbadaf7..cdda0e022740 100644 --- a/mm/ioremap.c +++ b/mm/ioremap.c @@ -28,198 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap); static const bool iomap_allow_huge = false; #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ -static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask) -{ - pte_t *pte; - u64 pfn; - - pfn = phys_addr >> PAGE_SHIFT; - pte = pte_alloc_kernel_track(pmd, addr, mask); - if (!pte) - return -ENOMEM; - do { - BUG_ON(!pte_none(*pte)); - set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot)); - pfn++; - } while (pte++, addr += PAGE_SIZE, addr != end); - *mask |= PGTBL_PTE_MODIFIED; - return 0; -} - -static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift) -{ - if (max_page_shift < PMD_SHIFT) - return 0; - - if (!arch_vmap_pmd_supported(prot)) - return 0; - - if ((end - addr) != PMD_SIZE) - return 0; - - if (!IS_ALIGNED(addr, PMD_SIZE)) - return 0; - - if (!IS_ALIGNED(phys_addr, PMD_SIZE)) - return 0; - - if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr)) - return 0; - - return pmd_set_huge(pmd, phys_addr, prot); -} - -static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift, - pgtbl_mod_mask *mask) -{ - pmd_t *pmd; - unsigned long next; - - pmd = pmd_alloc_track(&init_mm, pud, addr, mask); - if (!pmd) - return -ENOMEM; - do { - next = pmd_addr_end(addr, end); - - if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) { - *mask |= PGTBL_PMD_MODIFIED; - continue; - } - - if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask)) - return -ENOMEM; - } while (pmd++, phys_addr += (next - addr), addr = next, addr != end); - return 0; -} - -static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift) -{ - if (max_page_shift < PUD_SHIFT) - return 0; - - if (!arch_vmap_pud_supported(prot)) - return 0; - - if ((end - addr) != PUD_SIZE) - return 0; - - if (!IS_ALIGNED(addr, PUD_SIZE)) - return 0; - - if (!IS_ALIGNED(phys_addr, PUD_SIZE)) - return 0; - - if (pud_present(*pud) && !pud_free_pmd_page(pud, addr)) - return 0; - - return pud_set_huge(pud, phys_addr, prot); -} - -static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift, - pgtbl_mod_mask *mask) -{ - pud_t *pud; - unsigned long next; - - pud = pud_alloc_track(&init_mm, p4d, 
addr, mask); - if (!pud) - return -ENOMEM; - do { - next = pud_addr_end(addr, end); - - if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) { - *mask |= PGTBL_PUD_MODIFIED; - continue; - } - - if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask)) - return -ENOMEM; - } while (pud++, phys_addr += (next - addr), addr = next, addr != end); - return
[PATCH v5 5/8] mm: HUGE_VMAP arch support cleanup
This changes the awkward approach where architectures provide init functions to determine which levels they can provide large mappings for, to one where the arch is queried for each call. This removes code and indirection, and allows constant-folding of dead code for unsupported levels. This also adds a prot argument to the arch query. This is unused currently but could help with some architectures (e.g., some powerpc processors can't map uncacheable memory with large pages). Signed-off-by: Nicholas Piggin --- arch/arm64/mm/mmu.c | 12 +-- arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ++- arch/x86/mm/ioremap.c| 12 +-- include/linux/io.h | 9 --- include/linux/vmalloc.h | 10 +++ init/main.c | 1 - mm/ioremap.c | 96 +++- 7 files changed, 73 insertions(+), 77 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62fea1b6..bbb3ccf6a7ce 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1304,12 +1304,13 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot) return dt_virt; } -int __init arch_ioremap_p4d_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } -int __init arch_ioremap_pud_supported(void) +bool arch_vmap_pud_supported(pgprot_t prot) { /* * Only 4k granule supports level 1 block mappings. @@ -1319,11 +1320,12 @@ int __init arch_ioremap_pud_supported(void) !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { - /* See arch_ioremap_pud_supported() */ + /* See arch_vmap_pud_supported() */ return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); } +#endif int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot) { diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index ae823bba29f2..7d3a620c5adf 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1182,13 +1182,14 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma, set_pte_at(mm, addr, ptep, pte); } -int __init arch_ioremap_pud_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_pud_supported(pgprot_t prot) { /* HPT does not cope with large pages in the vmalloc area */ return radix_enabled(); } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { return radix_enabled(); } @@ -1197,6 +1198,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) { return 0; } +#endif int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) { @@ -1282,7 +1284,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) return 1; } -int __init arch_ioremap_p4d_supported(void) +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 84d85dbd1dad..5b8b495ab4ed 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -481,24 +481,26 @@ void iounmap(volatile void __iomem *addr) } EXPORT_SYMBOL(iounmap); -int __init arch_ioremap_p4d_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } -int __init arch_ioremap_pud_supported(void) +bool arch_vmap_pud_supported(pgprot_t prot) { #ifdef CONFIG_X86_64 return boot_cpu_has(X86_FEATURE_GBPAGES); #else - return 0; + return false; #endif } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { return boot_cpu_has(X86_FEATURE_PSE); } +#endif /* * Convert a 
physical pointer to a virtual kernel pointer for /dev/mem diff --git a/include/linux/io.h b/include/linux/io.h index 8394c56babc2..f1effd4d7a3c 100644 --- a/include/linux/io.h +++ b/include/linux/io.h @@ -31,15 +31,6 @@ static inline int ioremap_page_range(unsigned long addr, unsigned long end, } #endif -#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP -void __init ioremap_huge_init(void); -int arch_ioremap_p4d_supported(void); -int arch_ioremap_pud_supported(void); -int arch_ioremap_pmd_supported(void); -#else -static inline void ioremap_huge_init(void) { } -#endif - /* * Managed iomap interface */ diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 0221f852a7e1..787d77ad7536 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -84,6 +84,16 @@ struct vmap_area { }; }; +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_p4d_supported(pgprot_t prot); +bool arch_vmap_pud_supported(pgprot_t prot); +bool arch_vmap_pmd_supported(pgprot_t prot); +#else +static inline bool arch_vmap_p4d_suppo
[PATCH v5 4/8] lib/ioremap: rename ioremap_*_range to vmap_*_range
This will be moved to mm/ and used as a generic kernel virtual mapping function, so re-name it in preparation. Signed-off-by: Nicholas Piggin --- mm/ioremap.c | 55 ++-- 1 file changed, 23 insertions(+), 32 deletions(-) diff --git a/mm/ioremap.c b/mm/ioremap.c index 5fa1ab41d152..6016ae3227ad 100644 --- a/mm/ioremap.c +++ b/mm/ioremap.c @@ -61,9 +61,8 @@ static inline int ioremap_pud_enabled(void) { return 0; } static inline int ioremap_pmd_enabled(void) { return 0; } #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ -static int ioremap_pte_range(pmd_t *pmd, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, pgprot_t prot, - pgtbl_mod_mask *mask) +static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask) { pte_t *pte; u64 pfn; @@ -81,9 +80,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr, return 0; } -static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, - pgprot_t prot) +static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot) { if (!ioremap_pmd_enabled()) return 0; @@ -103,9 +101,8 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr, return pmd_set_huge(pmd, phys_addr, prot); } -static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, pgprot_t prot, - pgtbl_mod_mask *mask) +static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask) { pmd_t *pmd; unsigned long next; @@ -116,20 +113,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr, do { next = pmd_addr_end(addr, end); - if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) { + if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) { *mask |= PGTBL_PMD_MODIFIED; continue; } - if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask)) + if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask)) return -ENOMEM; } while (pmd++, phys_addr += (next - addr), addr = next, addr != end); return 0; } -static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, - pgprot_t prot) +static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot) { if (!ioremap_pud_enabled()) return 0; @@ -149,9 +145,8 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr, return pud_set_huge(pud, phys_addr, prot); } -static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, pgprot_t prot, - pgtbl_mod_mask *mask) +static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask) { pud_t *pud; unsigned long next; @@ -162,20 +157,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr, do { next = pud_addr_end(addr, end); - if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) { + if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) { *mask |= PGTBL_PUD_MODIFIED; continue; } - if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask)) + if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask)) return -ENOMEM; } while (pud++, phys_addr += (next - addr), addr = next, addr != end); return 0; } -static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, - pgprot_t 
prot) +static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot) { if (!ioremap_p4d_enabled()) return 0; @@ -195,9 +189,8 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr, return p4d_set_huge(p4d, phys_addr, prot); } -static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, pgprot_t p
[PATCH v5 3/8] mm/vmalloc: rename vmap_*_range vmap_pages_*_range
The vmalloc mapper operates on a struct page * array rather than a linear physical address, re-name it to make this distinction clear. Signed-off-by: Nicholas Piggin --- mm/vmalloc.c | 28 1 file changed, 12 insertions(+), 16 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 49f225b0f855..3a1e45fd1626 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -190,9 +190,8 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size) arch_sync_kernel_mappings(start, end); } -static int vmap_pte_range(pmd_t *pmd, unsigned long addr, - unsigned long end, pgprot_t prot, struct page **pages, int *nr, - pgtbl_mod_mask *mask) +static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask) { pte_t *pte; @@ -218,9 +217,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, return 0; } -static int vmap_pmd_range(pud_t *pud, unsigned long addr, - unsigned long end, pgprot_t prot, struct page **pages, int *nr, - pgtbl_mod_mask *mask) +static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask) { pmd_t *pmd; unsigned long next; @@ -230,15 +228,14 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, return -ENOMEM; do { next = pmd_addr_end(addr, end); - if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask)) + if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask)) return -ENOMEM; } while (pmd++, addr = next, addr != end); return 0; } -static int vmap_pud_range(p4d_t *p4d, unsigned long addr, - unsigned long end, pgprot_t prot, struct page **pages, int *nr, - pgtbl_mod_mask *mask) +static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask) { pud_t *pud; unsigned long next; @@ -248,15 +245,14 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr, return -ENOMEM; do { next = pud_addr_end(addr, end); - if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask)) + if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask)) return -ENOMEM; } while (pud++, addr = next, addr != end); return 0; } -static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, - unsigned long end, pgprot_t prot, struct page **pages, int *nr, - pgtbl_mod_mask *mask) +static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask) { p4d_t *p4d; unsigned long next; @@ -266,7 +262,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, return -ENOMEM; do { next = p4d_addr_end(addr, end); - if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask)) + if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask)) return -ENOMEM; } while (p4d++, addr = next, addr != end); return 0; @@ -307,7 +303,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size, next = pgd_addr_end(addr, end); if (pgd_bad(*pgd)) mask |= PGTBL_PGD_MODIFIED; - err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask); + err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask); if (err) return err; } while (pgd++, addr = next, addr != end); -- 2.23.0
[PATCH v5 2/8] mm: apply_to_pte_range warn and fail if a large pte is encountered
Signed-off-by: Nicholas Piggin --- mm/memory.c | 60 +++-- 1 file changed, 44 insertions(+), 16 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index f95edbb77326..19986af291e0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2261,13 +2261,20 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, } do { next = pmd_addr_end(addr, end); - if (create || !pmd_none_or_clear_bad(pmd)) { - err = apply_to_pte_range(mm, pmd, addr, next, fn, data, -create); - if (err) - break; + if (pmd_none(*pmd) && !create) + continue; + if (WARN_ON_ONCE(pmd_leaf(*pmd))) + return -EINVAL; + if (WARN_ON_ONCE(pmd_bad(*pmd))) { + if (!create) + continue; + pmd_clear_bad(pmd); } + err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create); + if (err) + break; } while (pmd++, addr = next, addr != end); + return err; } @@ -2288,13 +2295,20 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d, } do { next = pud_addr_end(addr, end); - if (create || !pud_none_or_clear_bad(pud)) { - err = apply_to_pmd_range(mm, pud, addr, next, fn, data, -create); - if (err) - break; + if (pud_none(*pud) && !create) + continue; + if (WARN_ON_ONCE(pud_leaf(*pud))) + return -EINVAL; + if (WARN_ON_ONCE(pud_bad(*pud))) { + if (!create) + continue; + pud_clear_bad(pud); } + err = apply_to_pmd_range(mm, pud, addr, next, fn, data, create); + if (err) + break; } while (pud++, addr = next, addr != end); + return err; } @@ -2315,13 +2329,20 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd, } do { next = p4d_addr_end(addr, end); - if (create || !p4d_none_or_clear_bad(p4d)) { - err = apply_to_pud_range(mm, p4d, addr, next, fn, data, -create); - if (err) - break; + if (p4d_none(*p4d) && !create) + continue; + if (WARN_ON_ONCE(p4d_leaf(*p4d))) + return -EINVAL; + if (WARN_ON_ONCE(p4d_bad(*p4d))) { + if (!create) + continue; + p4d_clear_bad(p4d); } + err = apply_to_pud_range(mm, p4d, addr, next, fn, data, create); + if (err) + break; } while (p4d++, addr = next, addr != end); + return err; } @@ -2340,8 +2361,15 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr, pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); - if (!create && pgd_none_or_clear_bad(pgd)) + if (pgd_none(*pgd) && !create) continue; + if (WARN_ON_ONCE(pgd_leaf(*pgd))) + return -EINVAL; + if (WARN_ON_ONCE(pgd_bad(*pgd))) { + if (!create) + continue; + pgd_clear_bad(pgd); + } err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create); if (err) break; -- 2.23.0
[PATCH v5 1/8] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
vmalloc_to_page returns NULL for addresses mapped by larger pages[*]. Whether or not a vmap is huge depends on the architecture details, alignments, boot options, etc., which the caller can not be expected to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page. This change teaches vmalloc_to_page about larger pages, and returns the struct page that corresponds to the offset within the large page. This makes the API agnostic to mapping implementation details. [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap: fail gracefully on unexpected huge vmap mappings") Signed-off-by: Nicholas Piggin --- mm/vmalloc.c | 40 ++-- 1 file changed, 26 insertions(+), 14 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index b482d240f9a2..49f225b0f855 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -38,6 +38,7 @@ #include #include +#include #include #include @@ -343,7 +344,9 @@ int is_vmalloc_or_module_addr(const void *x) } /* - * Walk a vmap address to the struct page it maps. + * Walk a vmap address to the struct page it maps. Huge vmap mappings will + * return the tail page that corresponds to the base page address, which + * matches small vmap mappings. */ struct page *vmalloc_to_page(const void *vmalloc_addr) { @@ -363,25 +366,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr) if (pgd_none(*pgd)) return NULL; + if (WARN_ON_ONCE(pgd_leaf(*pgd))) + return NULL; /* XXX: no allowance for huge pgd */ + if (WARN_ON_ONCE(pgd_bad(*pgd))) + return NULL; + p4d = p4d_offset(pgd, addr); if (p4d_none(*p4d)) return NULL; - pud = pud_offset(p4d, addr); + if (p4d_leaf(*p4d)) + return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT); + if (WARN_ON_ONCE(p4d_bad(*p4d))) + return NULL; - /* -* Don't dereference bad PUD or PMD (below) entries. This will also -* identify huge mappings, which we may encounter on architectures -* that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be -* identified as vmalloc addresses by is_vmalloc_addr(), but are -* not [unambiguously] associated with a struct page, so there is -* no correct value to return for them. -*/ - WARN_ON_ONCE(pud_bad(*pud)); - if (pud_none(*pud) || pud_bad(*pud)) + pud = pud_offset(p4d, addr); + if (pud_none(*pud)) + return NULL; + if (pud_leaf(*pud)) + return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); + if (WARN_ON_ONCE(pud_bad(*pud))) return NULL; + pmd = pmd_offset(pud, addr); - WARN_ON_ONCE(pmd_bad(*pmd)); - if (pmd_none(*pmd) || pmd_bad(*pmd)) + if (pmd_none(*pmd)) + return NULL; + if (pmd_leaf(*pmd)) + return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); + if (WARN_ON_ONCE(pmd_bad(*pmd))) return NULL; ptep = pte_offset_map(pmd, addr); @@ -389,6 +400,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr) if (pte_present(pte)) page = pte_page(pte); pte_unmap(ptep); + return page; } EXPORT_SYMBOL(vmalloc_to_page); -- 2.23.0
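The offset arithmetic in the new leaf cases is worth a worked example. A standalone sketch with invented numbers, assuming 4K base pages (PAGE_SHIFT = 12) and a 2M PMD leaf (PMD_SHIFT = 21):

	#include <stdio.h>

	int main(void)
	{
		unsigned long pmd_mask = ~((1UL << 21) - 1);	/* PMD_MASK for a 2M leaf */
		unsigned long addr = 0xc000000000305000UL;	/* hypothetical vmap address */
		unsigned long idx = (addr & ~pmd_mask) >> 12;	/* offset in small pages */

		/* prints 0x105: pmd_page(*pmd) + 0x105 is the same struct page
		 * that a small-page mapping of this address would have yielded */
		printf("page index within the 2M mapping: %#lx\n", idx);
		return 0;
	}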
[PATCH v5 0/8] huge vmalloc mappings
I made this powerpc-only for the time being. It shouldn't be too hard to add support for other archs that define HUGE_VMAP. I have booted x86 with it enabled, though I may not have audited everything. Hi Andrew, would you care to put this in your tree? Thanks, Nick Since v4: - Fixed an off-by-page-order bug in v4 - Several minor cleanups. - Added page order to /proc/vmallocinfo - Added hugepage to alloc_large_system_hash output. - Made it an architecture config option, powerpc only for now. Since v3: - Fixed an off-by-one bug in a loop - Fixed a !CONFIG_HAVE_ARCH_HUGE_VMAP build failure - Hopefully this time fix the arm64 vmap stack bug, thanks Jonathan Cameron for debugging the cause of this (hopefully). Since v2: - Rebased on vmalloc cleanups, split series into simpler pieces. - Fixed several compile errors and warnings - Keep the page array and accounting in small page units because struct vm_struct is an interface (this should fix the x86 vmap stack debug assert). [Thanks Zefan] Nicholas Piggin (8): mm/vmalloc: fix vmalloc_to_page for huge vmap mappings mm: apply_to_pte_range warn and fail if a large pte is encountered mm/vmalloc: rename vmap_*_range vmap_pages_*_range lib/ioremap: rename ioremap_*_range to vmap_*_range mm: HUGE_VMAP arch support cleanup mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c mm/vmalloc: add vmap_range_noflush variant mm/vmalloc: Hugepage vmalloc mappings .../admin-guide/kernel-parameters.txt | 2 + arch/Kconfig | 4 + arch/arm64/mm/mmu.c | 12 +- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/book3s64/radix_pgtable.c | 10 +- arch/x86/mm/ioremap.c | 12 +- include/linux/io.h| 9 - include/linux/vmalloc.h | 13 + init/main.c | 1 - mm/ioremap.c | 231 + mm/memory.c | 60 ++- mm/page_alloc.c | 4 +- mm/vmalloc.c | 456 +++--- 13 files changed, 476 insertions(+), 339 deletions(-) -- 2.23.0
Re: [PATCH v2 00/13] mm/debug_vm_pgtable fixes
On 08/21/2020 09:03 AM, Anshuman Khandual wrote: > > > On 08/19/2020 07:15 PM, Aneesh Kumar K.V wrote: >> "Aneesh Kumar K.V" writes: >> >>> This patch series includes fixes for debug_vm_pgtable test code so that >>> they follow page table updates rules correctly. The first two patches >>> introduce >>> changes w.r.t ppc64. The patches are included in this series for >>> completeness. We can >>> merge them via ppc64 tree if required. >>> >>> Hugetlb test is disabled on ppc64 because that needs larger change to >>> satisfy >>> page table update rules. >>> >>> Changes from V1: >>> * Address review feedback >>> * drop test specific pfn_pte and pfn_pmd. >>> * Update ppc64 page table helper to add _PAGE_PTE >>> >>> Aneesh Kumar K.V (13): >>> powerpc/mm: Add DEBUG_VM WARN for pmd_clear >>> powerpc/mm: Move setting pte specific flags to pfn_pte >>> mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value >>> mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge >>> vmap support. >>> mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with >>> CONFIG_NUMA_BALANCING >>> mm/debug_vm_pgtable/THP: Mark the pte entry huge before using >>> set_pmd/pud_at >>> mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an >>> existing pte entry >>> mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP >>> mm/debug_vm_pgtable/locks: Move non page table modifying test together >>> mm/debug_vm_pgtable/locks: Take correct page table lock >>> mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries >>> mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64 >>> mm/debug_vm_pgtable: populate a pte entry before fetching it >>> >>> arch/powerpc/include/asm/book3s/64/pgtable.h | 29 +++- >>> arch/powerpc/include/asm/nohash/pgtable.h| 5 - >>> arch/powerpc/mm/book3s64/pgtable.c | 2 +- >>> arch/powerpc/mm/pgtable.c| 5 - >>> include/linux/io.h | 12 ++ >>> mm/debug_vm_pgtable.c| 151 +++ >>> 6 files changed, 127 insertions(+), 77 deletions(-) >>> >> >> BTW I picked a wrong branch when sending this. Attaching the diff >> against what I want to send. pfn_pmd() no more updates _PAGE_PTE >> because that is handled by pmd_mkhuge(). >> >> diff --git a/arch/powerpc/mm/book3s64/pgtable.c >> b/arch/powerpc/mm/book3s64/pgtable.c >> index 3b4da7c63e28..e18ae50a275c 100644 >> --- a/arch/powerpc/mm/book3s64/pgtable.c >> +++ b/arch/powerpc/mm/book3s64/pgtable.c >> @@ -141,7 +141,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot) >> unsigned long pmdv; >> >> pmdv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK; >> -return __pmd(pmdv | pgprot_val(pgprot) | _PAGE_PTE); >> +return pmd_set_protbits(__pmd(pmdv), pgprot); >> } >> >> pmd_t mk_pmd(struct page *page, pgprot_t pgprot) >> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c >> index 7d9f8e1d790f..cad61d22f33a 100644 >> --- a/mm/debug_vm_pgtable.c >> +++ b/mm/debug_vm_pgtable.c >> @@ -229,7 +229,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned >> long pfn, pgprot_t prot) >> >> static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) >> { >> -pmd_t pmd = pfn_pmd(pfn, prot); >> +pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); >> >> if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) >> return; >> > > Cover letter does not mention which branch or tag this series applies on. > Just assumed it to be 5.9-rc1. Should the above changes be captured as a > pre-requisite patch ? > > Anyways, the series fails to be build on arm64. 
> > A) Without CONFIG_TRANSPARENT_HUGEPAGE > > mm/debug_vm_pgtable.c: In function 'debug_vm_pgtable': > mm/debug_vm_pgtable.c:1045:2: error: too many arguments to function > 'pmd_advanced_tests' > pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep); > ^~ > mm/debug_vm_pgtable.c:366:20: note: declared here > static void __init pmd_advanced_tests(struct mm_struct *mm, > ^~ > > B) As mentioned previously, this should be solved by including <linux/io.h> > > mm/debug_vm_pgtable.c: In function 'pmd_huge_tests': > mm/debug_vm_pgtable.c:215:7: error: implicit declaration of function > 'arch_ioremap_pmd_supported'; did you mean 'arch_disable_smp_support'? > [-Werror=implicit-function-declaration] > if (!arch_ioremap_pmd_supported()) >^~ > > Please make sure that the series builds on all enabled platforms i.e. x86, > arm64, ppc32, ppc64, arc, s390 along with selectively enabling/disabling > all the features that make various #ifdefs in the test. > > - Anshuman Here is another build failure on x86. mm/debug_vm_pgtable.c: In function 'pud_advanced_tests': mm/debug_vm_pgtable.c:306:31: error: passing argument 1 of 'pudp_huge_get_and_clear_full' from incompatible pointer type [-Werror=incompatible-pointer-types]
Re: [PATCH v2 00/13] mm/debug_vm_pgtable fixes
On 08/19/2020 07:15 PM, Aneesh Kumar K.V wrote: > "Aneesh Kumar K.V" writes: > >> This patch series includes fixes for debug_vm_pgtable test code so that >> they follow page table updates rules correctly. The first two patches >> introduce >> changes w.r.t ppc64. The patches are included in this series for >> completeness. We can >> merge them via ppc64 tree if required. >> >> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy >> page table update rules. >> >> Changes from V1: >> * Address review feedback >> * drop test specific pfn_pte and pfn_pmd. >> * Update ppc64 page table helper to add _PAGE_PTE >> >> Aneesh Kumar K.V (13): >> powerpc/mm: Add DEBUG_VM WARN for pmd_clear >> powerpc/mm: Move setting pte specific flags to pfn_pte >> mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value >> mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge >> vmap support. >> mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with >> CONFIG_NUMA_BALANCING >> mm/debug_vm_pgtable/THP: Mark the pte entry huge before using >> set_pmd/pud_at >> mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an >> existing pte entry >> mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP >> mm/debug_vm_pgtable/locks: Move non page table modifying test together >> mm/debug_vm_pgtable/locks: Take correct page table lock >> mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries >> mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64 >> mm/debug_vm_pgtable: populate a pte entry before fetching it >> >> arch/powerpc/include/asm/book3s/64/pgtable.h | 29 +++- >> arch/powerpc/include/asm/nohash/pgtable.h| 5 - >> arch/powerpc/mm/book3s64/pgtable.c | 2 +- >> arch/powerpc/mm/pgtable.c| 5 - >> include/linux/io.h | 12 ++ >> mm/debug_vm_pgtable.c| 151 +++ >> 6 files changed, 127 insertions(+), 77 deletions(-) >> > > BTW I picked a wrong branch when sending this. Attaching the diff > against what I want to send. pfn_pmd() no more updates _PAGE_PTE > because that is handled by pmd_mkhuge(). > > diff --git a/arch/powerpc/mm/book3s64/pgtable.c > b/arch/powerpc/mm/book3s64/pgtable.c > index 3b4da7c63e28..e18ae50a275c 100644 > --- a/arch/powerpc/mm/book3s64/pgtable.c > +++ b/arch/powerpc/mm/book3s64/pgtable.c > @@ -141,7 +141,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot) > unsigned long pmdv; > > pmdv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK; > - return __pmd(pmdv | pgprot_val(pgprot) | _PAGE_PTE); > + return pmd_set_protbits(__pmd(pmdv), pgprot); > } > > pmd_t mk_pmd(struct page *page, pgprot_t pgprot) > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 7d9f8e1d790f..cad61d22f33a 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -229,7 +229,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned > long pfn, pgprot_t prot) > > static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) > { > - pmd_t pmd = pfn_pmd(pfn, prot); > + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); > > if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) > return; > Cover letter does not mention which branch or tag this series applies on. Just assumed it to be 5.9-rc1. Should the above changes be captured as a pre-requisite patch ? Anyways, the series fails to be build on arm64. 
A) Without CONFIG_TRANSPARENT_HUGEPAGE mm/debug_vm_pgtable.c: In function 'debug_vm_pgtable': mm/debug_vm_pgtable.c:1045:2: error: too many arguments to function 'pmd_advanced_tests' pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep); ^~ mm/debug_vm_pgtable.c:366:20: note: declared here static void __init pmd_advanced_tests(struct mm_struct *mm, ^~ B) As mentioned previously, this should be solved by including <linux/io.h> mm/debug_vm_pgtable.c: In function 'pmd_huge_tests': mm/debug_vm_pgtable.c:215:7: error: implicit declaration of function 'arch_ioremap_pmd_supported'; did you mean 'arch_disable_smp_support'? [-Werror=implicit-function-declaration] if (!arch_ioremap_pmd_supported()) ^~ Please make sure that the series builds on all enabled platforms i.e. x86, arm64, ppc32, ppc64, arc, s390 along with selectively enabling/disabling all the features that make various #ifdefs in the test. - Anshuman
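For reference, the 'too many arguments' failure above is the classic symptom of a !CONFIG_TRANSPARENT_HUGEPAGE stub falling out of sync with the real function. The pattern looks roughly like this (parameter list hypothetical, only the shape matters):

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	static void __init pmd_advanced_tests(struct mm_struct *mm,
			struct vm_area_struct *vma, pmd_t *pmdp, unsigned long pfn,
			unsigned long vaddr, pgprot_t prot, pgtable_t pgtable)
	{
		/* real THP tests */
	}
	#else
	/* The stub must carry the identical parameter list, otherwise every
	 * caller breaks as soon as the config option is disabled.
	 */
	static void __init pmd_advanced_tests(struct mm_struct *mm,
			struct vm_area_struct *vma, pmd_t *pmdp, unsigned long pfn,
			unsigned long vaddr, pgprot_t prot, pgtable_t pgtable)
	{
	}
	#endif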
[PATCH] tty: hvcs: Don't NULL tty->driver_data until hvcs_cleanup()
The code currently NULLs tty->driver_data in hvcs_close() with the intent of informing the next call to hvcs_open() that the device needs to be reconfigured. However, when hvcs_cleanup() is called we copy hvcsd from tty->driver_data, which was previously NULLed by hvcs_close(), and our call to tty_port_put(&hvcsd->port) doesn't actually do anything since &hvcsd->port ends up translating to NULL by chance. This has the side effect that when hvcs_remove() is called we have one too many port references, preventing hvcs_destruct_port() from ever being called. This also prevents us from reusing the /dev/hvcsX node in a future hvcs_probe() and we can eventually run out of /dev/hvcsX devices. Fix this by waiting to NULL tty->driver_data until hvcs_cleanup(). Signed-off-by: Tyrel Datwyler --- drivers/tty/hvc/hvcs.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/tty/hvc/hvcs.c b/drivers/tty/hvc/hvcs.c index 55105ac38f89..509d1042825a 100644 --- a/drivers/tty/hvc/hvcs.c +++ b/drivers/tty/hvc/hvcs.c @@ -1216,13 +1216,6 @@ static void hvcs_close(struct tty_struct *tty, struct file *filp) tty_wait_until_sent(tty, HVCS_CLOSE_WAIT); - /* -* This line is important because it tells hvcs_open that this -* device needs to be re-configured the next time hvcs_open is -* called. -*/ - tty->driver_data = NULL; - free_irq(irq, hvcsd); return; } else if (hvcsd->port.count < 0) { @@ -1237,6 +1230,13 @@ static void hvcs_cleanup(struct tty_struct * tty) { struct hvcs_struct *hvcsd = tty->driver_data; + /* +* This line is important because it tells hvcs_open that this +* device needs to be re-configured the next time hvcs_open is +* called. +*/ + tty->driver_data = NULL; + tty_port_put(&hvcsd->port); } -- 2.27.0
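The 'by chance' in the description is worth unpacking: tty_port_put() ignores a NULL port, and &hvcsd->port only evaluates to NULL because the port is (assuming the layout sketched below) the first member of struct hvcs_struct. A sketch of the silent no-op:

	struct hvcs_struct {
		struct tty_port port;	/* assumed first member, offset 0 */
		/* ... */
	};

	/* after the old hvcs_close() has NULLed tty->driver_data: */
	struct hvcs_struct *hvcsd = NULL;
	struct tty_port *port = &hvcsd->port;	/* NULL + 0 == NULL */

	tty_port_put(port);	/* tty_port_put(NULL) returns early, so the kref
				 * is never dropped and the port reference leaks */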
Re: [PATCH net-next v2 0/4] refactoring of ibmvnic code
From: Lijun Pan Date: Wed, 19 Aug 2020 17:52:22 -0500 > This patch series refactor reset_init and init functions, > and make some other cosmetic changes to make the code > easier to read and debug. v2 removes __func__ and v1's 1/5. Series applied, thank you.
[RFT][PATCH 1/7] powerpc/iommu: Avoid overflow at boundary_size
The boundary_size might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So fix the potential overflow with this safer shortcut. Reported-by: Stephen Rothwell Signed-off-by: Nicolin Chen Cc: Christoph Hellwig --- arch/powerpc/kernel/iommu.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 9704f3f76e63..c01ccbf8afdd 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -236,15 +236,14 @@ static unsigned long iommu_range_alloc(struct device *dev, } } - if (dev) - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1, - 1 << tbl->it_page_shift); - else - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift); /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */ + boundary_size = dev ? dma_get_seg_boundary(dev) : U32_MAX; + + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (boundary_size >> tbl->it_page_shift) + 1; n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset, -boundary_size >> tbl->it_page_shift, align_mask); +boundary_size, align_mask); if (n == -1) { if (likely(pass == 0)) { /* First try the pool from the start */ -- 2.17.1
[RFT][PATCH 0/7] Avoid overflow at boundary_size
We are extending the default DMA segmentation boundary to its possible maximum value (ULONG_MAX) to indicate that a device doesn't specify a boundary limit. So all dma_get_seg_boundary callers should take precautions with the return value, since it can easily overflow. I scanned the entire kernel tree for all the existing callers and found that most of the callers may overflow in two ways: either "+ 1" or passing it to ALIGN() that does "+ mask". According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So this series of patches fixes the potential overflow with this overflow-free shortcut. As I don't think that I have these platforms, marking RFT. Thanks Nic Nicolin Chen (7): powerpc/iommu: Avoid overflow at boundary_size alpha: Avoid overflow at boundary_size ia64/sba_iommu: Avoid overflow at boundary_size s390/pci_dma: Avoid overflow at boundary_size sparc: Avoid overflow at boundary_size x86/amd_gart: Avoid overflow at boundary_size parisc: Avoid overflow at boundary_size arch/alpha/kernel/pci_iommu.c| 10 -- arch/ia64/hp/common/sba_iommu.c | 4 ++-- arch/powerpc/kernel/iommu.c | 11 +-- arch/s390/pci/pci_dma.c | 4 ++-- arch/sparc/kernel/iommu-common.c | 9 +++-- arch/sparc/kernel/iommu.c| 4 ++-- arch/sparc/kernel/pci_sun4v.c| 4 ++-- arch/x86/kernel/amd_gart_64.c| 4 ++-- drivers/parisc/ccio-dma.c| 4 ++-- drivers/parisc/sba_iommu.c | 4 ++-- 10 files changed, 26 insertions(+), 32 deletions(-) -- 2.17.1
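The derivation is easy to sanity-check in isolation. A userspace sketch (the ALIGN macros copied from the kernel definitions quoted above, everything else invented) showing how the old form wraps to zero at ULONG_MAX while the shortcut does not:

	#include <stdio.h>

	#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
	#define ALIGN(x, a)         ALIGN_MASK(x, (typeof(x))(a) - 1)

	int main(void)
	{
		unsigned long boundary = ~0UL;	/* "no boundary limit" */
		unsigned int shift = 12;	/* e.g. a 4K IOMMU page size */

		/* old form: boundary + 1 wraps to 0, so the result is 0 */
		unsigned long before = ALIGN(boundary + 1, 1UL << shift) >> shift;

		/* shortcut: (b >> s) + 1 has no intermediate overflow */
		unsigned long after = (boundary >> shift) + 1;

		printf("ALIGN form: %#lx, shortcut: %#lx\n", before, after);
		return 0;
	}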
Re: [PATCH v2 3/6] powerpc/32s: Only leave NX unset on segments used for modules
On Jun 29 2020, Christophe Leroy wrote: > Instead of leaving NX unset on all segments above the start > of vmalloc space, only leave NX unset on segments used for > modules. I'm getting this crash: kernel tried to execute exec-protected page (f294b000) - exploit attempt (uid: 0) BUG: Unable to handle kernel instruction fetch Faulting instruction address: 0xf294b000 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K MMU=Hash PowerMac Modules linked in: pata_macio(+) CPU: 0 PID: 87 Comm: udevd Not tainted 5.8.0-rc2-test #49 NIP: f294b000 LR: 0005c60 CTR: f294b000 REGS: f18d9cc0 TRAP: 0400 Not tainted (5.8.0-rc2-test) MSR: 10009032 CR: 84222422 XER: 2000 GPR00: c0005c14 f18d9d78 ef30ca20 efe0 c00993d0 ef6da038 005e GPR08: c09050b8 c08b f18d9d78 44222422 10072070 0fefaca4 GPR16: 1006a00c f294d50b 0120 0124 c0096ea8 000e ef2776c0 ef2776e4 GPR24: f18fd6e8 0001 c086fe64 c086fe04 c08b f294b000 NIP [f294b000] pata_macio_init+0x0/0xc0 [pata_macio] LR [c0005c60] do_one_initcall+0x6c/0x160 Call Trace: [f18d9d78] [c0005c14] do_one_initcall+0x20/0x160 (unreliable) [f18d9dd8] [c009a22c] do_init_module+0x60/0x1c0 [f18d9df8] [c00993d8] load_module+0x16a8/0x1c14 [f18d9ea8] [c0099aa4] sys_finit_module+0x8c/0x94 [f18d9f38] [c0012174] ret_from_syscall+0x0/0x34 --- interrupt: c01 at 0xfdb4318 LR = 0xfeee9c0 Instruction dump: <3d20c08b> 3d40c086 9421ffe0 8129106c ---[ end trace 85a98cc836109871 ]--- Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different."
Re: [PATCH v2 3/4] powerpc/memhotplug: Make lmb size 64bit
"Aneesh Kumar K.V" writes: > @@ -322,12 +322,16 @@ static int pseries_remove_mem_node(struct device_node > *np) > /* >* Find the base address and size of the memblock >*/ > - regs = of_get_property(np, "reg", NULL); > - if (!regs) > + prop = of_get_property(np, "reg", NULL); > + if (!prop) > return ret; > > - base = be64_to_cpu(*(unsigned long *)regs); > - lmb_size = be32_to_cpu(regs[3]); > + /* > + * "reg" property represents (addr,size) tuple. > + */ > + base = of_read_number(prop, mem_addr_cells); > + prop += mem_addr_cells; > + lmb_size = of_read_number(prop, mem_size_cells); Would of_n_size_cells() and of_n_addr_cells() work here?
Re: [PATCH v2 1/4] powerpc/drmem: Make lmb_size 64 bit
"Aneesh Kumar K.V" writes: > Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block > panic") > make sure different variables tracking lmb_size are updated to be 64 bit. > > This was found by code audit. > > Cc: sta...@vger.kernel.org > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/drmem.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/drmem.h > b/arch/powerpc/include/asm/drmem.h > index 17ccc6474ab6..d719cbac34b2 100644 > --- a/arch/powerpc/include/asm/drmem.h > +++ b/arch/powerpc/include/asm/drmem.h > @@ -21,7 +21,7 @@ struct drmem_lmb { > struct drmem_lmb_info { > struct drmem_lmb*lmbs; > int n_lmbs; > - u32 lmb_size; > + u64 lmb_size; > }; > > extern struct drmem_lmb_info *drmem_info; > @@ -67,7 +67,7 @@ struct of_drconf_cell_v2 { > #define DRCONF_MEM_RESERVED 0x0080 > #define DRCONF_MEM_HOTREMOVABLE 0x0100 > > -static inline u32 drmem_lmb_size(void) > +static inline u64 drmem_lmb_size(void) > { > return drmem_info->lmb_size; > } Looks fine. Acked-by: Nathan Lynch
Re: [PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero
On 20/08/2020 at 15:10, Guohua Zhong wrote: When cat /proc/pid/stat is run, do_task_stat will call into cputime_adjust, whose call stack looks like this: [17179954.674326]BookE Watchdog detected hard LOCKUP on cpu 0 [17179954.674331]dCPU: 0 PID: 1262 Comm: TICK Tainted: PW O4.4.176 #1 [17179954.674339]dtask: dc9d7040 task.stack: d3cb4000 [17179954.674344]NIP: c001b1a8 LR: c006a7ac CTR: [17179954.674349]REGS: e6fe1f10 TRAP: 3202 Tainted: PW O (4.4.176) [17179954.674355]MSR: 00021002 CR: 28002224 XER: [17179954.674364] GPR00: 0016 d3cb5cb0 dc9d7040 d3cb5cc0 025d ffe15b24 GPR08: de86aead 03ff 2800 0084d1c0 GPR16: b5929ca0 b4bb7a48 c0863c08 048d 0062 0062 000f GPR24: d3cb5d08 d3cb5d60 d3cb5d64 00029002 d3e9c214 f30e d3e9c20c [17179954.674410]NIP [c001b1a8] __div64_32+0x60/0xa0 [17179954.674422]LR [c006a7ac] cputime_adjust+0x124/0x138 [17179954.674434]Call Trace: [17179961.832693]Call Trace: [17179961.832695][d3cb5cb0] [c006a6dc] cputime_adjust+0x54/0x138 (unreliable) [17179961.832705][d3cb5cf0] [c006a818] task_cputime_adjusted+0x58/0x80 [17179961.832713][d3cb5d20] [c01dab44] do_task_stat+0x298/0x870 [17179961.832720][d3cb5de0] [c01d4948] proc_single_show+0x60/0xa4 [17179961.832728][d3cb5e10] [c01963d8] seq_read+0x2d8/0x52c [17179961.832736][d3cb5e80] [c01702fc] __vfs_read+0x40/0x114 [17179961.832744][d3cb5ef0] [c0170b1c] vfs_read+0x9c/0x10c [17179961.832751][d3cb5f10] [c0171440] SyS_read+0x68/0xc4 [17179961.832759][d3cb5f40] [c0010a40] ret_from_syscall+0x0/0x3c do_task_stat->task_cputime_adjusted->cputime_adjust->scale_stime->div_u64 ->div_u64_rem->do_div->__div64_32 In some corner cases, stime + utime = 0 on overflow. Even in the v5.8.2 kernel, where cputime has changed from unsigned long to the u64 data type, after about 200 days the lower 32 bits will be 0x. Because the divisor for __div64_32 is of unsigned long data type, which is 32 bits on 32-bit powerpc, the bug still exists. So it is also a bug in cputime_adjust, which does not check whether stime + utime = 0: time = scale_stime((__force u64)stime, (__force u64)rtime, (__force u64)(stime + utime)); The commit 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") in the mainline kernel may have fixed this case. But it is also better to check whether the divisor is 0 in __div64_32 for other situations. Signed-off-by: Guohua Zhong Fixes: 14cf11af6cf6 ("powerpc: Merge enough to start building in arch/powerpc.") Fixes: 94b212c29f68 ("powerpc: Move ppc64 boot wrapper code over to arch/powerpc") Cc: sta...@vger.kernel.org # v2.6.15+ --- arch/powerpc/boot/div64.S | 4 ++++ arch/powerpc/lib/div64.S | 4 ++++ 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/boot/div64.S b/arch/powerpc/boot/div64.S index 4354928ed62e..39a25b9712d1 100644 --- a/arch/powerpc/boot/div64.S +++ b/arch/powerpc/boot/div64.S @@ -13,6 +13,9 @@ .globl __div64_32 __div64_32: + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero In the generic version in lib/math/div64.c, there is no checking of 'base' either. Do we really want to add this check in the powerpc version only? The only user of __div64_32() is do_div() in include/asm-generic/div64.h. Wouldn't it be better to do the check there?
Christophe lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,6 +55,7 @@ __div64_32: 4:stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr /* diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S index 3d5426e7dcc4..1cc9bcabf678 100644 --- a/arch/powerpc/lib/div64.S +++ b/arch/powerpc/lib/div64.S @@ -13,6 +13,9 @@ #include _GLOBAL(__div64_32) + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,4 +55,5 @@ _GLOBAL(__div64_32) 4:stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr
Re: [PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero
On 20/08/2020 at 15:10, Guohua Zhong wrote: When cat /proc/pid/stat is run, do_task_stat will call into cputime_adjust, whose call stack looks like this: [17179954.674326]BookE Watchdog detected hard LOCKUP on cpu 0 [17179954.674331]dCPU: 0 PID: 1262 Comm: TICK Tainted: PW O4.4.176 #1 [17179954.674339]dtask: dc9d7040 task.stack: d3cb4000 [17179954.674344]NIP: c001b1a8 LR: c006a7ac CTR: [17179954.674349]REGS: e6fe1f10 TRAP: 3202 Tainted: PW O (4.4.176) [17179954.674355]MSR: 00021002 CR: 28002224 XER: [17179954.674364] GPR00: 0016 d3cb5cb0 dc9d7040 d3cb5cc0 025d ffe15b24 GPR08: de86aead 03ff 2800 0084d1c0 GPR16: b5929ca0 b4bb7a48 c0863c08 048d 0062 0062 000f GPR24: d3cb5d08 d3cb5d60 d3cb5d64 00029002 d3e9c214 f30e d3e9c20c [17179954.674410]NIP [c001b1a8] __div64_32+0x60/0xa0 [17179954.674422]LR [c006a7ac] cputime_adjust+0x124/0x138 [17179954.674434]Call Trace: [17179961.832693]Call Trace: [17179961.832695][d3cb5cb0] [c006a6dc] cputime_adjust+0x54/0x138 (unreliable) [17179961.832705][d3cb5cf0] [c006a818] task_cputime_adjusted+0x58/0x80 [17179961.832713][d3cb5d20] [c01dab44] do_task_stat+0x298/0x870 [17179961.832720][d3cb5de0] [c01d4948] proc_single_show+0x60/0xa4 [17179961.832728][d3cb5e10] [c01963d8] seq_read+0x2d8/0x52c [17179961.832736][d3cb5e80] [c01702fc] __vfs_read+0x40/0x114 [17179961.832744][d3cb5ef0] [c0170b1c] vfs_read+0x9c/0x10c [17179961.832751][d3cb5f10] [c0171440] SyS_read+0x68/0xc4 [17179961.832759][d3cb5f40] [c0010a40] ret_from_syscall+0x0/0x3c do_task_stat->task_cputime_adjusted->cputime_adjust->scale_stime->div_u64 ->div_u64_rem->do_div->__div64_32 In some corner cases, stime + utime = 0 on overflow. Even in the v5.8.2 kernel, where cputime has changed from unsigned long to the u64 data type, after about 200 days the lower 32 bits will be 0x. Because the divisor for __div64_32 is of unsigned long data type, which is 32 bits on 32-bit powerpc, the bug still exists. So it is also a bug in cputime_adjust, which does not check whether stime + utime = 0: time = scale_stime((__force u64)stime, (__force u64)rtime, (__force u64)(stime + utime)); The commit 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") in the mainline kernel may have fixed this case. But it is also better to check whether the divisor is 0 in __div64_32 for other situations. Signed-off-by: Guohua Zhong Fixes: 14cf11af6cf6 ("powerpc: Merge enough to start building in arch/powerpc.") Fixes: 94b212c29f68 ("powerpc: Move ppc64 boot wrapper code over to arch/powerpc") Cc: sta...@vger.kernel.org # v2.6.15+ --- arch/powerpc/boot/div64.S | 4 ++++ arch/powerpc/lib/div64.S | 4 ++++ 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/boot/div64.S b/arch/powerpc/boot/div64.S index 4354928ed62e..39a25b9712d1 100644 --- a/arch/powerpc/boot/div64.S +++ b/arch/powerpc/boot/div64.S @@ -13,6 +13,9 @@ .globl __div64_32 __div64_32: + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,6 +55,7 @@ __div64_32: 4:stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr /* diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S index 3d5426e7dcc4..1cc9bcabf678 100644 --- a/arch/powerpc/lib/div64.S +++ b/arch/powerpc/lib/div64.S @@ -13,6 +13,9 @@ #include _GLOBAL(__div64_32) + li r9,0 You don't need to load r9 with 0, use cmplwi instead.
+ cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero You should leave space between the compare and the branch (i.e. have other instructions in between when possible), so that the processor can prepare the branching and make a good prediction. Same as the compare below: you see that there are two other instructions between the cmplw and the blt. You could also use another cr field than cr0 in order to nest several test/branch pairs. This also matters because on recent powerpc32 cores, instructions are fetched and executed two by two. lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,4 +55,5 @@ _GLOBAL(__div64_32) 4:stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr Christophe
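Folding both suggestions together, the entry sequence would look something like this (untested sketch):

	_GLOBAL(__div64_32)
		cmplwi	r4,0		# compare divisor against immediate 0, no scratch register
		lwz	r5,0(r3)	# get the dividend into r5/r6 (these loads also
		lwz	r6,4(r3)	# keep the compare and the branch apart)
		beq	5f		# divisor is zero: branch to the common return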
[PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero
When cat /proc/pid/stat is run, do_task_stat will call into cputime_adjust, whose call stack looks like this: [17179954.674326]BookE Watchdog detected hard LOCKUP on cpu 0 [17179954.674331]dCPU: 0 PID: 1262 Comm: TICK Tainted: PW O4.4.176 #1 [17179954.674339]dtask: dc9d7040 task.stack: d3cb4000 [17179954.674344]NIP: c001b1a8 LR: c006a7ac CTR: [17179954.674349]REGS: e6fe1f10 TRAP: 3202 Tainted: PW O (4.4.176) [17179954.674355]MSR: 00021002 CR: 28002224 XER: [17179954.674364] GPR00: 0016 d3cb5cb0 dc9d7040 d3cb5cc0 025d ffe15b24 GPR08: de86aead 03ff 2800 0084d1c0 GPR16: b5929ca0 b4bb7a48 c0863c08 048d 0062 0062 000f GPR24: d3cb5d08 d3cb5d60 d3cb5d64 00029002 d3e9c214 f30e d3e9c20c [17179954.674410]NIP [c001b1a8] __div64_32+0x60/0xa0 [17179954.674422]LR [c006a7ac] cputime_adjust+0x124/0x138 [17179954.674434]Call Trace: [17179961.832693]Call Trace: [17179961.832695][d3cb5cb0] [c006a6dc] cputime_adjust+0x54/0x138 (unreliable) [17179961.832705][d3cb5cf0] [c006a818] task_cputime_adjusted+0x58/0x80 [17179961.832713][d3cb5d20] [c01dab44] do_task_stat+0x298/0x870 [17179961.832720][d3cb5de0] [c01d4948] proc_single_show+0x60/0xa4 [17179961.832728][d3cb5e10] [c01963d8] seq_read+0x2d8/0x52c [17179961.832736][d3cb5e80] [c01702fc] __vfs_read+0x40/0x114 [17179961.832744][d3cb5ef0] [c0170b1c] vfs_read+0x9c/0x10c [17179961.832751][d3cb5f10] [c0171440] SyS_read+0x68/0xc4 [17179961.832759][d3cb5f40] [c0010a40] ret_from_syscall+0x0/0x3c do_task_stat->task_cputime_adjusted->cputime_adjust->scale_stime->div_u64 ->div_u64_rem->do_div->__div64_32 In some corner cases, stime + utime = 0 on overflow. Even in the v5.8.2 kernel, where cputime has changed from unsigned long to the u64 data type, after about 200 days the lower 32 bits will be 0x. Because the divisor for __div64_32 is of unsigned long data type, which is 32 bits on 32-bit powerpc, the bug still exists. So it is also a bug in cputime_adjust, which does not check whether stime + utime = 0: time = scale_stime((__force u64)stime, (__force u64)rtime, (__force u64)(stime + utime)); The commit 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") in the mainline kernel may have fixed this case. But it is also better to check whether the divisor is 0 in __div64_32 for other situations.
Signed-off-by: Guohua Zhong Fixes: 14cf11af6cf6 ("powerpc: Merge enough to start building in arch/powerpc.") Fixes: 94b212c29f68 ("powerpc: Move ppc64 boot wrapper code over to arch/powerpc") Cc: sta...@vger.kernel.org # v2.6.15+ --- arch/powerpc/boot/div64.S | 4 ++++ arch/powerpc/lib/div64.S | 4 ++++ 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/boot/div64.S b/arch/powerpc/boot/div64.S index 4354928ed62e..39a25b9712d1 100644 --- a/arch/powerpc/boot/div64.S +++ b/arch/powerpc/boot/div64.S @@ -13,6 +13,9 @@ .globl __div64_32 __div64_32: + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,6 +55,7 @@ __div64_32: 4: stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr /* diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S index 3d5426e7dcc4..1cc9bcabf678 100644 --- a/arch/powerpc/lib/div64.S +++ b/arch/powerpc/lib/div64.S @@ -13,6 +13,9 @@ #include _GLOBAL(__div64_32) + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,4 +55,5 @@ _GLOBAL(__div64_32) 4: stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr -- 2.12.3
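The failure mode described above is easy to demonstrate in isolation; a userspace sketch (values invented) of how a 64-bit stime + utime sum can present a zero 32-bit divisor to do_div() on ppc32:

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical cputimes whose sum is an exact multiple of 2^32 */
		unsigned long long stime = 0x100000000ULL - 7;
		unsigned long long utime = 7;

		/* do_div() on 32-bit takes a 32-bit base, so only the low half
		 * of the sum reaches __div64_32 - here it is exactly 0 */
		unsigned int divisor = (unsigned int)(stime + utime);

		printf("divisor seen by __div64_32: %u\n", divisor);
		return 0;
	}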
Re: [PATCH v2 07/13] mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an existing pte entry
On 19/08/2020 at 15:01, Aneesh Kumar K.V wrote: set_pte_at() should not be used to set a pte entry at locations that already hold a valid pte entry. Architectures like ppc64 don't do TLB invalidate in set_pte_at() and hence expect it to be used to set locations that are not a valid PTE. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 35 +++ 1 file changed, 15 insertions(+), 20 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 76f4c713e5a3..9c7e2c9cfc76 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -74,15 +74,18 @@ static void __init pte_advanced_tests(struct mm_struct *mm, { pte_t pte = pfn_pte(pfn, prot); + /* +* Architectures optimize set_pte_at by avoiding TLB flush. +* This requires set_pte_at to be not used to update an +* existing pte entry. Clear pte before we do set_pte_at +*/ + pr_debug("Validating PTE advanced\n"); pte = pfn_pte(pfn, prot); set_pte_at(mm, vaddr, ptep, pte); ptep_set_wrprotect(mm, vaddr, ptep); pte = ptep_get(ptep); WARN_ON(pte_write(pte)); - - pte = pfn_pte(pfn, prot); - set_pte_at(mm, vaddr, ptep, pte); ptep_get_and_clear(mm, vaddr, ptep); pte = ptep_get(ptep); WARN_ON(!pte_none(pte)); @@ -96,13 +99,11 @@ static void __init pte_advanced_tests(struct mm_struct *mm, ptep_set_access_flags(vma, vaddr, ptep, pte, 1); pte = ptep_get(ptep); WARN_ON(!(pte_write(pte) && pte_dirty(pte))); - - pte = pfn_pte(pfn, prot); - set_pte_at(mm, vaddr, ptep, pte); ptep_get_and_clear_full(mm, vaddr, ptep, 1); pte = ptep_get(ptep); WARN_ON(!pte_none(pte)); + pte = pfn_pte(pfn, prot); pte = pte_mkyoung(pte); set_pte_at(mm, vaddr, ptep, pte); ptep_test_and_clear_young(vma, vaddr, ptep); @@ -164,9 +165,6 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, pmdp_set_wrprotect(mm, vaddr, pmdp); pmd = READ_ONCE(*pmdp); WARN_ON(pmd_write(pmd)); - - pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); - set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_huge_get_and_clear(mm, vaddr, pmdp); pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); @@ -180,13 +178,11 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, pmdp_set_access_flags(vma, vaddr, pmdp, pmd, 1); pmd = READ_ONCE(*pmdp); WARN_ON(!(pmd_write(pmd) && pmd_dirty(pmd))); - - pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); - set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_huge_get_and_clear_full(vma, vaddr, pmdp, 1); pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); pmd = pmd_mkyoung(pmd); set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_test_and_clear_young(vma, vaddr, pmdp); @@ -283,18 +279,10 @@ static void __init pud_advanced_tests(struct mm_struct *mm, WARN_ON(pud_write(pud)); #ifndef __PAGETABLE_PMD_FOLDED Same as below: once set_pud_at() is gone, I don't think this #ifndef __PAGETABLE_PMD_FOLDED is still needed; it should be possible to replace it with 'if (mm_pmd_folded())' - - pud = pud_mkhuge(pfn_pud(pfn, prot)); - set_pud_at(mm, vaddr, pudp, pud); pudp_huge_get_and_clear(mm, vaddr, pudp); pud = READ_ONCE(*pudp); WARN_ON(!pud_none(pud)); - pud = pud_mkhuge(pfn_pud(pfn, prot)); - set_pud_at(mm, vaddr, pudp, pud); - pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1); - pud = READ_ONCE(*pudp); - WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ pud = pud_mkhuge(pfn_pud(pfn, prot)); @@ -307,6 +295,13 @@ static void __init pud_advanced_tests(struct mm_struct *mm, pud = READ_ONCE(*pudp); WARN_ON(!(pud_write(pud) && pud_dirty(pud))); +#ifndef __PAGETABLE_PMD_FOLDED + pudp_huge_get_and_clear_full(vma, vaddr, pudp, 1); + pud = READ_ONCE(*pudp); + WARN_ON(!pud_none(pud));
+#endif /* __PAGETABLE_PMD_FOLDED */ pudp_huge_get_and_clear_full() and pud_none() are always defined, I think this #ifndef can be replaced by an 'if (mm_pmd_folded())' + + pud = pud_mkhuge(pfn_pud(pfn, prot)); pud = pud_mkyoung(pud); set_pud_at(mm, vaddr, pudp, pud); pudp_test_and_clear_young(vma, vaddr, pudp); Christophe
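Concretely, the suggestion amounts to something like the following (sketch only; note that mm_pmd_folded() takes the mm, and whether it is appropriate at this exact spot is for the author to confirm):

	if (!mm_pmd_folded(mm)) {
		pudp_huge_get_and_clear_full(vma, vaddr, pudp, 1);
		pud = READ_ONCE(*pudp);
		WARN_ON(!pud_none(pud));
	}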
Re: [PATCH v2] powerpc/pseries: Do not initiate shutdown when system is running on UPS
On Thu, 20 Aug 2020 11:48:44 +0530, Vasant Hegde wrote: > As per PAPR we have to look for both EPOW sensor value and event modifier to > identify type of event and take appropriate action. > > Sensor value = 3 (EPOW_SYSTEM_SHUTDOWN) schedule system to be shutdown after > OS defined delay (default 10 mins). > > EPOW Event Modifier for sensor value = 3: >We have to initiate immediate shutdown for most of the event modifier > except >value = 2 (system running on UPS). > > [...] Applied to powerpc/fixes. [1/1] powerpc/pseries: Do not initiate shutdown when system is running on UPS https://git.kernel.org/powerpc/c/90a9b102eddf6a3f987d15f4454e26a2532c1c98 cheers
Re: [PATCH] powerpc/perf: Account for interrupts during PMC overflow for an invalid SIAR check
On Thu, 6 Aug 2020 08:46:32 -0400, Athira Rajeev wrote: > Performance monitor interrupt handler checks if any counter has overflown > and calls `record_and_restart` in core-book3s which invokes > `perf_event_overflow` to record the sample information. > Apart from creating sample, perf_event_overflow also does the interrupt > and period checks via perf_event_account_interrupt. > > Currently we record information only if the SIAR valid bit is set > ( using `siar_valid` check ) and hence the interrupt check. > But it is possible that we do sampling for some events that are not > generating valid SIAR and hence there is no chance to disable the event > if interrupts is more than max_samples_per_tick. This leads to soft lockup. > > [...] Applied to powerpc/fixes. [1/1] powerpc/perf: Fix soft lockups due to missed interrupt accounting https://git.kernel.org/powerpc/c/17899eaf88d689529b866371344c8f269ba79b5f cheers
Re: [PATCH] powerpc/powernv/pci: Fix typo when releasing DMA resources
On Wed, 19 Aug 2020 15:07:41 +0200, Frederic Barrat wrote: > Fix typo introduced during recent code cleanup, which could lead to > silently not freeing resources or oops message (on PCI hotplug or CAPI > reset). > Only impacts ioda2, the code path for ioda1 is correct. Applied to powerpc/fixes. [1/1] powerpc/powernv/pci: Fix possible crash when releasing DMA resources https://git.kernel.org/powerpc/c/e17a7c0e0aebb956719ce2a8465f649859c2da7d cheers
Re: [PATCH v2] powerpc/pseries: Do not initiate shutdown when system is running on UPS
Vasant Hegde writes: > As per PAPR we have to look for both EPOW sensor value and event modifier to > identify type of event and take appropriate action. > > Sensor value = 3 (EPOW_SYSTEM_SHUTDOWN) schedule system to be shutdown after > OS defined delay (default 10 mins). > > EPOW Event Modifier for sensor value = 3: >We have to initiate immediate shutdown for most of the event modifier > except >value = 2 (system running on UPS). > > Checking with firmware document its clear that we have to wait for predefined > time before initiating shutdown. If power is restored within time we should > cancel the shutdown process. I think commit 79872e35 accidently enabled > immediate poweroff for EPOW_SHUTDOWN_ON_UPS event. It's not that clear to me :) LoPAPR v1.1 section 10.2.2 includes table 136 "EPOW Action Codes": SYSTEM_SHUTDOWN 3 The system must be shut down. An EPOW-aware OS logs the EPOW error log information, then schedules the system to be shut down to begin after an OS defined delay internal (default is 10 minutes.) And then in section 10.3.2.2.8 there is table 146 "Platform Event Log Format, Version 6, EPOW Section", which includes the "EPOW Event Modifier": For EPOW sensor value = 3 0x01 = Normal system shutdown with no additional delay 0x02 = Loss of utility power, system is running on UPS/Battery 0x03 = Loss of system critical functions, system should be shutdown 0x04 = Ambient temperature too high All other values = reserved There is also section 7.3.6.4 which includes a note saying: 2. The report that a system needs to be shutdown due to running under a UPS would be given by the platform as an EPOW event with EPOW event modifier being given as, 0x02 = Loss of utility power, system is running on UPS/Battery, as described in section Section 10.3.2.2.8‚ “Platform Event Log Format, EPOW Section‚” on page 308. So the only mention of the 10 minutes is in relation to all SYSTEM_SHUTDOWN events. ie. according to that we should not be doing an immediate shutdown for any of the events. > We have user space tool (rtas_errd) on LPAR to monitor for > EPOW_SHUTDOWN_ON_UPS. > Once it gets event it initiates shutdown after predefined time. Also starts > monitoring for any new EPOW events. If it receives "Power restored" event > before predefined time it will cancel the shutdown. Otherwise after > predefined time it will shutdown the system. What event are you referring to as the "Power restored" event? AFAICS PAPR just says we "may" receive an EPOW_RESET. I can't see anything else about what we're supposed to do if power is restored. Anyway I'm not opposed to the change, but I don't think it's correct to say that PAPR defines the behaviour. Rather we used to implement a certain behaviour, and we have at least one customer who relies on that old behaviour and dislikes the new behaviour. It's also generally good to defer decisions like this to userspace, so that administrators can customise the behaviour. Anyway I'll massage the change log a bit to incorporate some of the above and apply it. cheers > Fixes: 79872e35 (powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must > initiate shutdown) > Cc: sta...@vger.kernel.org # v4.0+ > Cc: Tyrel Datwyler > Cc: Michael Ellerman > Signed-off-by: Vasant Hegde > --- > Changes in v2: > - Updated patch description based on mpe, Tyrel comment. 
> > -Vasant > arch/powerpc/platforms/pseries/ras.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/powerpc/platforms/pseries/ras.c > b/arch/powerpc/platforms/pseries/ras.c > index f3736fcd98fc..13c86a292c6d 100644 > --- a/arch/powerpc/platforms/pseries/ras.c > +++ b/arch/powerpc/platforms/pseries/ras.c > @@ -184,7 +184,6 @@ static void handle_system_shutdown(char event_modifier) > case EPOW_SHUTDOWN_ON_UPS: > pr_emerg("Loss of system power detected. System is running on" >" UPS/battery. Check RTAS error log for details\n"); > - orderly_poweroff(true); > break; > > case EPOW_SHUTDOWN_LOSS_OF_CRITICAL_FUNCTIONS: > -- > 2.26.2
Re: [PATCH] kernel/watchdog: fix warning -Wunused-variable for watchdog_allowed_mask in ppc64
On Fri 2020-08-14 19:03:30, Balamuruhan S wrote: > In ppc64 config if `CONFIG_SOFTLOCKUP_DETECTOR` is not set then it > warns for unused declaration of `watchdog_allowed_mask` while building, > move the declaration inside ifdef later in the code. > > ``` > kernel/watchdog.c:47:23: warning: 'watchdog_allowed_mask' defined but not > used [-Wunused-variable] > static struct cpumask watchdog_allowed_mask __read_mostly; > ``` > > Signed-off-by: Balamuruhan S > --- > kernel/watchdog.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > index 5abb5b22ad13..33c9b8a3d51b 100644 > --- a/kernel/watchdog.c > +++ b/kernel/watchdog.c > @@ -44,7 +44,6 @@ int __read_mostly soft_watchdog_user_enabled = 1; > int __read_mostly watchdog_thresh = 10; > static int __read_mostly nmi_watchdog_available; > > -static struct cpumask watchdog_allowed_mask __read_mostly; > > struct cpumask watchdog_cpumask __read_mostly; > unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask); > @@ -166,6 +165,7 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace; > unsigned int __read_mostly softlockup_panic = > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE; > > +static struct cpumask watchdog_allowed_mask __read_mostly; I could confirm that the variable is used only in code that is built when CONFIG_SOFTLOCKUP_DETECTOR is enabled. Note that the problem can't be seen on x86. There the softlockup detector is enforced together with the hardlockup detector via HARDLOCKUP_DETECTOR_PERF. Reviewed-by: Petr Mladek Best Regards, Petr
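An alternative to relocating the line would have been an explicit guard at the original location, along these lines (sketch):

	#ifdef CONFIG_SOFTLOCKUP_DETECTOR
	static struct cpumask watchdog_allowed_mask __read_mostly;
	#endif

Moving the declaration next to its users, as the patch does, avoids adding another #ifdef block and reads the same way.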
Re: [PATCH] powerpc/powernv/pci: Fix typo when releasing DMA resources
On 20/08/2020 at 06:18, Michael Ellerman wrote: I changed the subject to: powerpc/powernv/pci: Fix possible crash when releasing DMA resources Much better, thanks! Fred