Re: kernel since 5.6 do not boot anymore on Apple PowerBook
Hi Giuseppe,

On 08/07/2020 at 20:44, Christophe Leroy wrote:
> On 08/07/2020 at 19:36, Giuseppe Sacco wrote:
>> Hi Christophe,
>>
>> On Wed, 08/07/2020 at 19:09 +0200, Christophe Leroy wrote:
>>> Hi,
>>>
>>> On 08/07/2020 at 19:00, Giuseppe Sacco wrote:
>>>> Hello,
>>>> while trying to debug a problem using git bisect, I am now at a point
>>>> where I cannot build the kernel at all. This is the error message I get:
>>>>
>>>> $ LANG=C make ARCH=powerpc \
>>>>       CROSS_COMPILE=powerpc-linux- \
>>>>       CONFIG_MODULE_COMPRESS_GZIP=true \
>>>>       INSTALL_MOD_STRIP=1 CONFIG_MODULE_COMPRESS=1 \
>>>>       -j4 INSTALL_MOD_PATH=$BOOT INSTALL_PATH=$BOOT \
>>>>       CONFIG_DEBUG_INFO_COMPRESSED=1 \
>>>>       install modules_install
>>>> make[2]: *** No rule to make target 'vmlinux', needed by
>>>
>>> Surprising. Did you make any change to Makefiles?
>>
>> No
>>
>>> Are you in the middle of a bisect? If so, if the previous builds
>>> worked, I'd do 'git bisect skip'
>>
>> Yes, the previous one worked.
>>
>>> What's the result with:
>>> LANG=C make ARCH=powerpc CROSS_COMPILE=powerpc-linux- vmlinux
>>
>> $ LANG=C make ARCH=powerpc CROSS_COMPILE=powerpc-linux- vmlinux
>>   CALL    scripts/checksyscalls.sh
>>   CALL    scripts/atomic/check-atomics.sh
>>   CHK     include/generated/compile.h
>>   CC      kernel/module.o
>> kernel/module.c: In function 'do_init_module':
>> kernel/module.c:3593:2: error: implicit declaration of function
>> 'module_enable_ro'; did you mean 'module_enable_x'?
>> [-Werror=implicit-function-declaration]
>>  3593 |  module_enable_ro(mod, true);
>>       |  ^~~~
>>       |  module_enable_x
>> cc1: some warnings being treated as errors
>> make[1]: *** [scripts/Makefile.build:267: kernel/module.o] Error 1
>> make: *** [Makefile:1735: kernel] Error 2
>>
>> So, should I 'git bisect skip'?
>
> Ah yes, I had the exact same problem last time I bisected. So yes, do
> 'git bisect skip'. You'll probably hit this problem half a dozen times,
> but at the end you should get a useful bisect anyway.

Were you able to progress?

Christophe
Re: [PATCH v2 00/13] mm/debug_vm_pgtable fixes
On 8/21/20 9:03 AM, Anshuman Khandual wrote:
> On 08/19/2020 07:15 PM, Aneesh Kumar K.V wrote:
>> "Aneesh Kumar K.V" writes:
>>> This patch series includes fixes for debug_vm_pgtable test code so
>>> that they follow page table update rules correctly. The first two
>>> patches introduce changes w.r.t. ppc64. The patches are included in
>>> this series for completeness. We can merge them via the ppc64 tree if
>>> required.
>>>
>>> The hugetlb test is disabled on ppc64 because that needs a larger
>>> change to satisfy page table update rules.
>>>
>>> Changes from V1:
>>> * Address review feedback
>>> * Drop test-specific pfn_pte and pfn_pmd.
>>> * Update ppc64 page table helper to add _PAGE_PTE
>>>
>>> Aneesh Kumar K.V (13):
>>>   powerpc/mm: Add DEBUG_VM WARN for pmd_clear
>>>   powerpc/mm: Move setting pte specific flags to pfn_pte
>>>   mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value
>>>   mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge
>>>     vmap support.
>>>   mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with
>>>     CONFIG_NUMA_BALANCING
>>>   mm/debug_vm_pgtable/THP: Mark the pte entry huge before using
>>>     set_pmd/pud_at
>>>   mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an
>>>     existing pte entry
>>>   mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP
>>>   mm/debug_vm_pgtable/locks: Move non page table modifying test together
>>>   mm/debug_vm_pgtable/locks: Take correct page table lock
>>>   mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries
>>>   mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64
>>>   mm/debug_vm_pgtable: populate a pte entry before fetching it
>>>
>>>  arch/powerpc/include/asm/book3s/64/pgtable.h |  29 +++-
>>>  arch/powerpc/include/asm/nohash/pgtable.h    |   5 -
>>>  arch/powerpc/mm/book3s64/pgtable.c           |   2 +-
>>>  arch/powerpc/mm/pgtable.c                    |   5 -
>>>  include/linux/io.h                           |  12 ++
>>>  mm/debug_vm_pgtable.c                        | 151 +++
>>>  6 files changed, 127 insertions(+), 77 deletions(-)
>>
>> BTW I picked a wrong branch when sending this. Attaching the diff
>> against what I want to send. pfn_pmd() no longer updates _PAGE_PTE
>> because that is handled by pmd_mkhuge().
>>
>> diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
>> index 3b4da7c63e28..e18ae50a275c 100644
>> --- a/arch/powerpc/mm/book3s64/pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/pgtable.c
>> @@ -141,7 +141,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot)
>>  	unsigned long pmdv;
>>
>>  	pmdv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK;
>> -	return __pmd(pmdv | pgprot_val(pgprot) | _PAGE_PTE);
>> +	return pmd_set_protbits(__pmd(pmdv), pgprot);
>>  }
>>
>>  pmd_t mk_pmd(struct page *page, pgprot_t pgprot)
>> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
>> index 7d9f8e1d790f..cad61d22f33a 100644
>> --- a/mm/debug_vm_pgtable.c
>> +++ b/mm/debug_vm_pgtable.c
>> @@ -229,7 +229,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned long pfn, pgprot_t prot)
>>  static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot)
>>  {
>> -	pmd_t pmd = pfn_pmd(pfn, prot);
>> +	pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot));
>>
>>  	if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
>>  		return;
>
> The cover letter does not mention which branch or tag this series
> applies on. I just assumed it to be 5.9-rc1. Should the above changes be
> captured as a pre-requisite patch? Anyway, the series fails to build on
> arm64.
>
> A) Without CONFIG_TRANSPARENT_HUGEPAGE
>
> mm/debug_vm_pgtable.c: In function 'debug_vm_pgtable':
> mm/debug_vm_pgtable.c:1045:2: error: too many arguments to function 'pmd_advanced_tests'
>   pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep);
>   ^~
> mm/debug_vm_pgtable.c:366:20: note: declared here
>  static void __init pmd_advanced_tests(struct mm_struct *mm,
>                     ^~
>
> B) As mentioned previously, this should be solved by including
>
> mm/debug_vm_pgtable.c: In function 'pmd_huge_tests':
> mm/debug_vm_pgtable.c:215:7: error: implicit declaration of function
> 'arch_ioremap_pmd_supported'; did you mean 'arch_disable_smp_support'?
> [-Werror=implicit-function-declaration]
>   if (!arch_ioremap_pmd_supported())
>       ^~
>
> Please make sure that the series builds on all enabled platforms, i.e.
> x86, arm64, ppc32, ppc64, arc, s390, along with selectively
> enabling/disabling all the features that gate the various #ifdefs in
> the test.

I was hoping to get a kernel test robot build report to verify that. But
if you can help with that, I have pushed a branch to github with the
reported build failure fixes.

https://github.com/kvaneesh/linux/tree/debug_vm_pgtable

I still haven't looked at the PMD_FOLDED feedback from Christophe,
because I am not sure I follow why we are checking for PMD folded there.

-aneesh
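Build failure (A) above is the usual hazard with test code split by
#ifdef: when the CONFIG_TRANSPARENT_HUGEPAGE variant of a helper gains a
parameter, the stub in the #else branch must change in lockstep or !THP
builds fail exactly as reported. A minimal sketch of the pattern (the
argument list here is abbreviated and hypothetical, not the file's real
signature):

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	static void __init pmd_advanced_tests(struct mm_struct *mm, pmd_t *pmdp,
					      unsigned long pfn, pgprot_t prot,
					      pgtable_t pgtable)
	{
		/* the real tests exercise set_pmd_at(), pmd_mkhuge(), etc. */
	}
	#else
	/* stub for !THP builds: must keep the same signature as above */
	static void __init pmd_advanced_tests(struct mm_struct *mm, pmd_t *pmdp,
					      unsigned long pfn, pgprot_t prot,
					      pgtable_t pgtable)
	{
	}
	#endif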
Re: [PATCH v2 3/6] powerpc/32s: Only leave NX unset on segments used for modules
On 08/21/2020 05:11 AM, Christophe Leroy wrote:
> On 21/08/2020 at 00:00, Andreas Schwab wrote:
>> On Jun 29 2020, Christophe Leroy wrote:
>>> Instead of leaving NX unset on all segments above the start of vmalloc
>>> space, only leave NX unset on segments used for modules.
>>
>> I'm getting this crash:
>>
>> kernel tried to execute exec-protected page (f294b000) - exploit attempt (uid: 0)
>> BUG: Unable to handle kernel instruction fetch
>> Faulting instruction address: 0xf294b000
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> BE PAGE_SIZE=4K MMU=Hash PowerMac
>> Modules linked in: pata_macio(+)
>> CPU: 0 PID: 87 Comm: udevd Not tainted 5.8.0-rc2-test #49
>> NIP: f294b000 LR: c0005c60 CTR: f294b000
>> REGS: f18d9cc0 TRAP: 0400 Not tainted (5.8.0-rc2-test)
>> MSR: 10009032 CR: 84222422 XER: 2000
>> GPR00: c0005c14 f18d9d78 ef30ca20 efe0 c00993d0 ef6da038 005e
>> GPR08: c09050b8 c08b f18d9d78 44222422 10072070 0fefaca4
>> GPR16: 1006a00c f294d50b 0120 0124 c0096ea8 000e ef2776c0 ef2776e4
>> GPR24: f18fd6e8 0001 c086fe64 c086fe04 c08b f294b000
>> NIP [f294b000] pata_macio_init+0x0/0xc0 [pata_macio]
>> LR [c0005c60] do_one_initcall+0x6c/0x160
>> Call Trace:
>> [f18d9d78] [c0005c14] do_one_initcall+0x20/0x160 (unreliable)
>> [f18d9dd8] [c009a22c] do_init_module+0x60/0x1c0
>> [f18d9df8] [c00993d8] load_module+0x16a8/0x1c14
>> [f18d9ea8] [c0099aa4] sys_finit_module+0x8c/0x94
>> [f18d9f38] [c0012174] ret_from_syscall+0x0/0x34
>> --- interrupt: c01 at 0xfdb4318
>>     LR = 0xfeee9c0
>> Instruction dump:
>> <3d20c08b> 3d40c086 9421ffe0 8129106c
>> ---[ end trace 85a98cc836109871 ]---
>
> Please try the patch at
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/07884ed033c31e074747b7eb8eaa329d15db07ec.1596641219.git.christophe.le...@csgroup.eu/
>
> And if you are using KAsan, also take
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/6eddca2d5611fd57312a88eae31278c87a8fc99d.1596641224.git.christophe.le...@csgroup.eu/
>
> Although I have some doubt that it will fix it, because the faulting
> instruction address is at 0xf294b000, which is within the vmalloc area.
>
> In the likely case the patch doesn't fix the issue, can you provide your
> .config and a dump of /sys/kernel/debug/powerpc/segment_registers (you
> have to have CONFIG_PPC_PTDUMP enabled for that), and also the below
> part from the boot log.
>
> [    0.000000] Memory: 509556K/524288K available (7088K kernel code, 592K rwdata, 1304K rodata, 356K init, 803K bss, 14732K reserved, 0K cma-reserved)
> [    0.000000] Kernel virtual memory layout:
> [    0.000000]   * 0xff7ff000..0xfffff000  : fixmap
> [    0.000000]   * 0xff7fd000..0xff7ff000  : early ioremap
> [    0.000000]   * 0xe1000000..0xff7fd000  : vmalloc & ioremap

I found the issue: when VMALLOC_END is above 0xf0000000,
ALIGN(VMALLOC_END, SZ_256M) is 0, so the test is always false. The below
change should fix it:

diff --git a/arch/powerpc/mm/book3s32/mmu.c b/arch/powerpc/mm/book3s32/mmu.c
index 82ae9e06a773..d426eaf76bb0 100644
--- a/arch/powerpc/mm/book3s32/mmu.c
+++ b/arch/powerpc/mm/book3s32/mmu.c
@@ -194,12 +194,12 @@ static bool is_module_segment(unsigned long addr)
 #ifdef MODULES_VADDR
 	if (addr < ALIGN_DOWN(MODULES_VADDR, SZ_256M))
 		return false;
-	if (addr >= ALIGN(MODULES_END, SZ_256M))
+	if (addr > ALIGN(MODULES_END, SZ_256M) - 1)
 		return false;
 #else
 	if (addr < ALIGN_DOWN(VMALLOC_START, SZ_256M))
 		return false;
-	if (addr >= ALIGN(VMALLOC_END, SZ_256M))
+	if (addr > ALIGN(VMALLOC_END, SZ_256M) - 1)
 		return false;
 #endif
 	return true;

Christophe
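The wraparound Christophe describes is easy to reproduce by replaying
the kernel's ALIGN() arithmetic in a standalone program (a userspace
sketch, not kernel code; the VMALLOC_END value is taken from the boot
log above). With ALIGN() evaluating to 0, "addr >= 0" is always true and
is_module_segment() always returns false; "addr > 0xffffffff" never is:

	#include <stdint.h>
	#include <stdio.h>

	#define SZ_256M		0x10000000u
	/* the kernel's ALIGN(), evaluated in 32-bit arithmetic as on ppc32 */
	#define ALIGN32(x, a)	(((uint32_t)(x) + ((a) - 1)) & ~((uint32_t)(a) - 1))

	int main(void)
	{
		uint32_t vmalloc_end = 0xff7fd000u;	/* VMALLOC_END per the boot log */

		/* 0xff7fd000 + 0x0fffffff overflows 32 bits and wraps past zero */
		printf("ALIGN(VMALLOC_END, SZ_256M)     = %#010x\n",
		       ALIGN32(vmalloc_end, SZ_256M));		/* prints 0x00000000 */
		printf("ALIGN(VMALLOC_END, SZ_256M) - 1 = %#010x\n",
		       ALIGN32(vmalloc_end, SZ_256M) - 1u);	/* prints 0xffffffff */
		return 0;
	}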
Re: [PATCH v5 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
On 21/08/2020 at 06:44, Nicholas Piggin wrote:
> This is a generic kernel virtual memory mapper, not specific to ioremap.
>
> Signed-off-by: Nicholas Piggin
> ---
>  include/linux/vmalloc.h |   2 +
>  mm/ioremap.c            | 192
>  mm/vmalloc.c            | 191 +++
>  3 files changed, 193 insertions(+), 192 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 787d77ad7536..e3590e93bfff 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr);
>  extern struct vm_struct *find_vm_area(const void *addr);
>
>  #ifdef CONFIG_MMU
> +extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> +			unsigned int max_page_shift);

The extern keyword is useless on function prototypes and is deprecated.
Please don't add new function prototypes with that keyword.

>  extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
>  				    pgprot_t prot, struct page **pages);
>  int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot,

Christophe
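To make the style point concrete (an illustrative pair of declarations,
not a hunk from the patch): the two prototypes below mean exactly the
same thing to the compiler, which is why the storage-class specifier is
just noise on a prototype:

	extern int vmap_range(unsigned long addr, unsigned long end,
			      phys_addr_t phys_addr, pgprot_t prot,
			      unsigned int max_page_shift);

	/* identical meaning; preferred form for new prototypes */
	int vmap_range(unsigned long addr, unsigned long end,
		       phys_addr_t phys_addr, pgprot_t prot,
		       unsigned int max_page_shift);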
Re: [PATCH v5 0/8] huge vmalloc mappings
On 21/08/2020 at 06:44, Nicholas Piggin wrote:
> I made this powerpc-only for the time being. It shouldn't be too hard to
> add support for other archs that define HUGE_VMAP. I have booted x86
> with it enabled, just may not have audited everything.

I like this series, but if I understand correctly it enables huge
vmalloc mappings only for hugepage sizes matching a page directory
level, i.e. on PPC32 it would work only for 4M hugepages. On the 8xx, we
only have 8M and 512k hugepages. Any chance that it can support these as
well one day?

Christophe

> Hi Andrew, would you care to put this in your tree?
>
> Thanks,
> Nick
>
> Since v4:
> - Fixed an off-by-page-order bug in v4
> - Several minor cleanups.
> - Added page order to /proc/vmallocinfo
> - Added hugepage to alloc_large_system_hash output.
> - Made an architecture config option, powerpc only for now.
>
> Since v3:
> - Fixed an off-by-one bug in a loop
> - Fix !CONFIG_HAVE_ARCH_HUGE_VMAP build fail
> - Hopefully this time fix the arm64 vmap stack bug, thanks Jonathan
>   Cameron for debugging the cause of this (hopefully).
>
> Since v2:
> - Rebased on vmalloc cleanups, split series into simpler pieces.
> - Fixed several compile errors and warnings
> - Keep the page array and accounting in small page units because
>   struct vm_struct is an interface (this should fix x86 vmap stack
>   debug assert). [Thanks Zefan]
>
> Nicholas Piggin (8):
>   mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
>   mm: apply_to_pte_range warn and fail if a large pte is encountered
>   mm/vmalloc: rename vmap_*_range vmap_pages_*_range
>   lib/ioremap: rename ioremap_*_range to vmap_*_range
>   mm: HUGE_VMAP arch support cleanup
>   mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
>   mm/vmalloc: add vmap_range_noflush variant
>   mm/vmalloc: Hugepage vmalloc mappings
>
>  .../admin-guide/kernel-parameters.txt    |   2 +
>  arch/Kconfig                             |   4 +
>  arch/arm64/mm/mmu.c                      |  12 +-
>  arch/powerpc/Kconfig                     |   1 +
>  arch/powerpc/mm/book3s64/radix_pgtable.c |  10 +-
>  arch/x86/mm/ioremap.c                    |  12 +-
>  include/linux/io.h                       |   9 -
>  include/linux/vmalloc.h                  |  13 +
>  init/main.c                              |   1 -
>  mm/ioremap.c                             | 231 +
>  mm/memory.c                              |  60 ++-
>  mm/page_alloc.c                          |   4 +-
>  mm/vmalloc.c                             | 456 +++---
>  13 files changed, 476 insertions(+), 339 deletions(-)
Re: [PATCH v5 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
On Fri, Aug 21, 2020 at 02:44:25PM +1000, Nicholas Piggin wrote:
> This is a generic kernel virtual memory mapper, not specific to ioremap.

lib/ioremap doesn't exist any more.

> Signed-off-by: Nicholas Piggin
> ---
>  include/linux/vmalloc.h |   2 +
>  mm/ioremap.c            | 192
>  mm/vmalloc.c            | 191 +++
>  3 files changed, 193 insertions(+), 192 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 787d77ad7536..e3590e93bfff 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr);
>  extern struct vm_struct *find_vm_area(const void *addr);
>
>  #ifdef CONFIG_MMU
> +extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> +			unsigned int max_page_shift);

Please avoid the pointlessly long line. And don't add the pointless
extern.
Re: [PATCH v5 5/8] mm: HUGE_VMAP arch support cleanup
> static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
> -		phys_addr_t phys_addr, pgprot_t prot)
> +		phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift)
> {

... and here.
Re: [PATCH v5 4/8] lib/ioremap: rename ioremap_*_range to vmap_*_range
On Fri, Aug 21, 2020 at 02:44:23PM +1000, Nicholas Piggin wrote:
> This will be moved to mm/ and used as a generic kernel virtual mapping
> function, so re-name it in preparation.
>
> Signed-off-by: Nicholas Piggin
> ---
>  mm/ioremap.c | 55 ++--
>  1 file changed, 23 insertions(+), 32 deletions(-)
>
> diff --git a/mm/ioremap.c b/mm/ioremap.c
> index 5fa1ab41d152..6016ae3227ad 100644
> --- a/mm/ioremap.c
> +++ b/mm/ioremap.c
> @@ -61,9 +61,8 @@ static inline int ioremap_pud_enabled(void) { return 0; }
>  static inline int ioremap_pmd_enabled(void) { return 0; }
>  #endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
>
> -static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
> -		unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
> -		pgtbl_mod_mask *mask)
> +static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> +		phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask)

Same here.
Re: [PATCH v5 3/8] mm/vmalloc: rename vmap_*_range vmap_pages_*_range
On Fri, Aug 21, 2020 at 02:44:22PM +1000, Nicholas Piggin wrote:
> The vmalloc mapper operates on a struct page * array rather than a
> linear physical address, re-name it to make this distinction clear.
>
> Signed-off-by: Nicholas Piggin
> ---
>  mm/vmalloc.c | 28
>  1 file changed, 12 insertions(+), 16 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 49f225b0f855..3a1e45fd1626 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -190,9 +190,8 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size)
>  	arch_sync_kernel_mappings(start, end);
>  }
>
> -static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
> -		unsigned long end, pgprot_t prot, struct page **pages, int *nr,
> -		pgtbl_mod_mask *mask)
> +static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> +		pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask)

Please don't add lines longer than 80 columns without a good reason.
Re: [PATCH v5 5/8] mm: HUGE_VMAP arch support cleanup
Le 21/08/2020 à 06:44, Nicholas Piggin a écrit : This changes the awkward approach where architectures provide init functions to determine which levels they can provide large mappings for, to one where the arch is queried for each call. This removes code and indirection, and allows constant-folding of dead code for unsupported levels. I think that in order to allow constant-folding of dead code for unsupported levels, you must define arch_vmap_xxx_supported() as static inline in a .h If you have them in .c files, you'll get calls to tiny functions that will always return false, but will still be called and dead code won't be eliminated. And performance wise, that's probably not optimal either. Christophe This also adds a prot argument to the arch query. This is unused currently but could help with some architectures (e.g., some powerpc processors can't map uncacheable memory with large pages). Signed-off-by: Nicholas Piggin --- arch/arm64/mm/mmu.c | 12 +-- arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ++- arch/x86/mm/ioremap.c| 12 +-- include/linux/io.h | 9 --- include/linux/vmalloc.h | 10 +++ init/main.c | 1 - mm/ioremap.c | 96 +++- 7 files changed, 73 insertions(+), 77 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62fea1b6..bbb3ccf6a7ce 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1304,12 +1304,13 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot) return dt_virt; } -int __init arch_ioremap_p4d_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } -int __init arch_ioremap_pud_supported(void) +bool arch_vmap_pud_supported(pgprot_t prot) { /* * Only 4k granule supports level 1 block mappings. @@ -1319,11 +1320,12 @@ int __init arch_ioremap_pud_supported(void) !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { - /* See arch_ioremap_pud_supported() */ + /* See arch_vmap_pud_supported() */ return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); } +#endif int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot) { diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index ae823bba29f2..7d3a620c5adf 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1182,13 +1182,14 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma, set_pte_at(mm, addr, ptep, pte); } -int __init arch_ioremap_pud_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_pud_supported(pgprot_t prot) { /* HPT does not cope with large pages in the vmalloc area */ return radix_enabled(); } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { return radix_enabled(); } @@ -1197,6 +1198,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) { return 0; } +#endif int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) { @@ -1282,7 +1284,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) return 1; } -int __init arch_ioremap_p4d_supported(void) +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 84d85dbd1dad..5b8b495ab4ed 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -481,24 +481,26 @@ void iounmap(volatile void __iomem *addr) } EXPORT_SYMBOL(iounmap); -int __init arch_ioremap_p4d_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP 
+bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } -int __init arch_ioremap_pud_supported(void) +bool arch_vmap_pud_supported(pgprot_t prot) { #ifdef CONFIG_X86_64 return boot_cpu_has(X86_FEATURE_GBPAGES); #else - return 0; + return false; #endif } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { return boot_cpu_has(X86_FEATURE_PSE); } +#endif /* * Convert a physical pointer to a virtual kernel pointer for /dev/mem diff --git a/include/linux/io.h b/include/linux/io.h index 8394c56babc2..f1effd4d7a3c 100644 --- a/include/linux/io.h +++ b/include/linux/io.h @@ -31,15 +31,6 @@ static inline int ioremap_page_range(unsigned long addr, unsigned long end, } #endif -#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP -void __init ioremap_huge_init(void); -int arch_ioremap_p4d_supported(void); -int arch_ioremap_pud_supported(void); -int arch_ioremap_pmd_supported(void); -#else -static inline voi
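On the constant-folding point above: moving the helpers into an arch
header as static inline makes the result visible at the call site, so
the compiler can discard the unsupported-level paths outright. A sketch
of the idea for x86 (the file placement is an assumption, not part of
the posted series):

	/* e.g. in arch/x86/include/asm/vmalloc.h -- hypothetical placement */
	#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
	static inline bool arch_vmap_p4d_supported(pgprot_t prot)
	{
		return false;	/* compile-time constant: huge-p4d paths fold away */
	}

	static inline bool arch_vmap_pud_supported(pgprot_t prot)
	{
	#ifdef CONFIG_X86_64
		return boot_cpu_has(X86_FEATURE_GBPAGES);
	#else
		return false;
	#endif
	}

	static inline bool arch_vmap_pmd_supported(pgprot_t prot)
	{
		return boot_cpu_has(X86_FEATURE_PSE);
	}
	#endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */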
[PATCH 2/2] powerpc/64s: Disallow PROT_SAO in LPARs by default
Since migration of guests using SAO to ISA 3.1 hosts may cause issues,
disable PROT_SAO in LPARs by default and introduce a new Kconfig option
PPC_PROT_SAO_LPAR to allow users to enable it if desired.

Signed-off-by: Shawn Anastasio
---
 arch/powerpc/Kconfig            | 12 ++++++++++++
 arch/powerpc/include/asm/mman.h |  9 +++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..65bed1fdeaad 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -860,6 +860,18 @@ config PPC_SUBPAGE_PROT

 	  If unsure, say N here.

+config PPC_PROT_SAO_LPAR
+	bool "Support PROT_SAO mappings in LPARs"
+	depends on PPC_BOOK3S_64
+	help
+	  This option adds support for PROT_SAO mappings from userspace
+	  inside LPARs on supported CPUs.
+
+	  This may cause issues when performing guest migration from
+	  a CPU that supports SAO to one that does not.
+
+	  If unsure, say N here.
+
 config PPC_COPRO_BASE
 	bool

diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
index 4ba303ea27f5..7cb6d18f5cd6 100644
--- a/arch/powerpc/include/asm/mman.h
+++ b/arch/powerpc/include/asm/mman.h
@@ -40,8 +40,13 @@ static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
 {
 	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
 		return false;
-	if ((prot & PROT_SAO) && !cpu_has_feature(CPU_FTR_SAO))
-		return false;
+	if (prot & PROT_SAO) {
+		if (!cpu_has_feature(CPU_FTR_SAO))
+			return false;
+		if (firmware_has_feature(FW_FEATURE_LPAR) &&
+		    !IS_ENABLED(CONFIG_PPC_PROT_SAO_LPAR))
+			return false;
+	}
 	return true;
 }
 #define arch_validate_prot arch_validate_prot
-- 
2.28.0
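For reference, userspace requests SAO semantics by passing the flag
straight to mmap(), which is exactly what arch_validate_prot() above is
vetting. A minimal userspace sketch (PROT_SAO is powerpc-specific; the
0x10 value matches the powerpc uapi header, and the call is simply
refused with EINVAL where SAO is unavailable or disabled by this patch):

	#include <sys/mman.h>
	#include <stdio.h>

	#ifndef PROT_SAO
	#define PROT_SAO	0x10	/* powerpc: strong access ordering */
	#endif

	int main(void)
	{
		void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_SAO,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap(PROT_SAO)");	/* EINVAL when SAO is unavailable */
			return 1;
		}
		/* loads/stores through p are strongly ordered on CPU_FTR_SAO CPUs */
		munmap(p, 4096);
		return 0;
	}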
[PATCH 1/2] Revert "powerpc/64s: Remove PROT_SAO support"
This reverts commit 5c9fa16e8abd342ce04dc830c1ebb2a03abf6c05. Since PROT_SAO can still be useful for certain classes of software, reintroduce it. Concerns about guest migration for LPARs using SAO will be addressed next. Signed-off-by: Shawn Anastasio --- arch/powerpc/include/asm/book3s/64/pgtable.h | 8 ++-- arch/powerpc/include/asm/cputable.h | 10 ++--- arch/powerpc/include/asm/mman.h | 26 ++-- arch/powerpc/include/asm/nohash/64/pgtable.h | 2 + arch/powerpc/include/uapi/asm/mman.h | 2 +- arch/powerpc/kernel/dt_cpu_ftrs.c | 2 +- arch/powerpc/mm/book3s64/hash_utils.c | 2 + include/linux/mm.h| 2 + include/trace/events/mmflags.h| 2 + mm/ksm.c | 4 ++ tools/testing/selftests/powerpc/mm/.gitignore | 1 + tools/testing/selftests/powerpc/mm/Makefile | 4 +- tools/testing/selftests/powerpc/mm/prot_sao.c | 42 +++ 13 files changed, 90 insertions(+), 17 deletions(-) create mode 100644 tools/testing/selftests/powerpc/mm/prot_sao.c diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 6de56c3b33c4..495fc0ccb453 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -20,13 +20,9 @@ #define _PAGE_RW (_PAGE_READ | _PAGE_WRITE) #define _PAGE_RWX (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC) #define _PAGE_PRIVILEGED 0x8 /* kernel access only */ - -#define _PAGE_CACHE_CTL0x00030 /* Bits for the folowing cache modes */ - /* No bits set is normal cacheable memory */ - /* 0x00010 unused, is SAO bit on radix POWER9 */ +#define _PAGE_SAO 0x00010 /* Strong access order */ #define _PAGE_NON_IDEMPOTENT 0x00020 /* non idempotent memory */ #define _PAGE_TOLERANT 0x00030 /* tolerant memory, cache inhibited */ - #define _PAGE_DIRTY0x00080 /* C: page changed */ #define _PAGE_ACCESSED 0x00100 /* R: page referenced */ /* @@ -828,6 +824,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr, return hash__set_pte_at(mm, addr, ptep, pte, percpu); } +#define _PAGE_CACHE_CTL(_PAGE_SAO | _PAGE_NON_IDEMPOTENT | _PAGE_TOLERANT) + #define pgprot_noncached pgprot_noncached static inline pgprot_t pgprot_noncached(pgprot_t prot) { diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h index fdddb822d564..f89205eff691 100644 --- a/arch/powerpc/include/asm/cputable.h +++ b/arch/powerpc/include/asm/cputable.h @@ -191,7 +191,7 @@ static inline void cpu_feature_keys_init(void) { } #define CPU_FTR_SPURR LONG_ASM_CONST(0x0100) #define CPU_FTR_DSCR LONG_ASM_CONST(0x0200) #define CPU_FTR_VSXLONG_ASM_CONST(0x0400) -// Free LONG_ASM_CONST(0x0800) +#define CPU_FTR_SAOLONG_ASM_CONST(0x0800) #define CPU_FTR_CP_USE_DCBTZ LONG_ASM_CONST(0x1000) #define CPU_FTR_UNALIGNED_LD_STD LONG_ASM_CONST(0x2000) #define CPU_FTR_ASYM_SMT LONG_ASM_CONST(0x4000) @@ -436,7 +436,7 @@ static inline void cpu_feature_keys_init(void) { } CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | CPU_FTR_ASYM_SMT | \ + CPU_FTR_DSCR | CPU_FTR_SAO | CPU_FTR_ASYM_SMT | \ CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ CPU_FTR_CFAR | CPU_FTR_HVMODE | \ CPU_FTR_VMX_COPY | CPU_FTR_HAS_PPR | CPU_FTR_DABRX ) @@ -445,7 +445,7 @@ static inline void cpu_feature_keys_init(void) { } CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | \ + CPU_FTR_DSCR | CPU_FTR_SAO | \ CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ CPU_FTR_CFAR | CPU_FTR_HVMODE | 
CPU_FTR_VMX_COPY | \ CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_DAWR | \ @@ -456,7 +456,7 @@ static inline void cpu_feature_keys_init(void) { } CPU_FTR_MMCRA | CPU_FTR_SMT | \ CPU_FTR_COHERENT_ICACHE | \ CPU_FTR_PURR | CPU_FTR_SPURR | CPU_FTR_REAL_LE | \ - CPU_FTR_DSCR | \ + CPU_FTR_DSCR | CPU_FTR_SAO | \ CPU_FTR_STCX_CHECKS_ADDRESS | CPU_FTR_POPCNTB | CPU_FTR_POPCNTD | \ CPU_FTR_CFAR | CPU_FTR_HVMODE | CPU_FTR_VMX_COPY | \ CPU_FTR_DBELL | CPU_FTR_HAS_PPR | CPU_FTR_ARCH_207S | \ @@ -474,7 +474,7 @@ static inline void cpu_feature_keys_init
[PATCH 0/2] Reintroduce PROT_SAO
This set re-introduces the PROT_SAO prot flag removed in commit
5c9fa16e8abd ("powerpc/64s: Remove PROT_SAO support").

To address concerns regarding live migration of guests using SAO to P10
hosts without SAO support, the flag is disabled by default in LPARs. A
new config option, PPC_PROT_SAO_LPAR, was added to allow users to
explicitly enable it if they will not be running in an environment where
this is a concern.

Shawn Anastasio (2):
  Revert "powerpc/64s: Remove PROT_SAO support"
  powerpc/64s: Disallow PROT_SAO in LPARs by default

 arch/powerpc/Kconfig                          | 12 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  8 ++--
 arch/powerpc/include/asm/cputable.h           | 10 ++---
 arch/powerpc/include/asm/mman.h               | 31 --
 arch/powerpc/include/asm/nohash/64/pgtable.h  |  2 +
 arch/powerpc/include/uapi/asm/mman.h          |  2 +-
 arch/powerpc/kernel/dt_cpu_ftrs.c             |  2 +-
 arch/powerpc/mm/book3s64/hash_utils.c         |  2 +
 include/linux/mm.h                            |  2 +
 include/trace/events/mmflags.h                |  2 +
 mm/ksm.c                                      |  4 ++
 tools/testing/selftests/powerpc/mm/.gitignore |  1 +
 tools/testing/selftests/powerpc/mm/Makefile   |  4 +-
 tools/testing/selftests/powerpc/mm/prot_sao.c | 42 +++++++
 14 files changed, 107 insertions(+), 17 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/mm/prot_sao.c

-- 
2.28.0
[powerpc:fixes-test] BUILD SUCCESS 90a9b102eddf6a3f987d15f4454e26a2532c1c98
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git fixes-test branch HEAD: 90a9b102eddf6a3f987d15f4454e26a2532c1c98 powerpc/pseries: Do not initiate shutdown when system is running on UPS elapsed time: 927m configs tested: 75 configs skipped: 75 The following configs have been built successfully. More configs may be tested in the coming days. arm defconfig arm64allyesconfig arm64 defconfig arm allyesconfig arm allmodconfig m68k m5275evb_defconfig armkeystone_defconfig s390 alldefconfig ia64 allmodconfig ia64defconfig ia64 allyesconfig m68k allmodconfig m68kdefconfig m68k allyesconfig nios2 defconfig arc allyesconfig nds32 allnoconfig c6x allyesconfig nds32 defconfig nios2allyesconfig cskydefconfig alpha defconfig alphaallyesconfig xtensa allyesconfig h8300allyesconfig arc defconfig sh allmodconfig parisc defconfig s390 allyesconfig parisc allyesconfig s390defconfig i386 allyesconfig sparcallyesconfig sparc defconfig i386defconfig mips allyesconfig mips allmodconfig powerpc allyesconfig powerpc allmodconfig powerpc allnoconfig powerpc defconfig i386 randconfig-a002-20200820 i386 randconfig-a004-20200820 i386 randconfig-a005-20200820 i386 randconfig-a003-20200820 i386 randconfig-a006-20200820 i386 randconfig-a001-20200820 x86_64 randconfig-a015-20200820 x86_64 randconfig-a012-20200820 x86_64 randconfig-a016-20200820 x86_64 randconfig-a014-20200820 x86_64 randconfig-a011-20200820 x86_64 randconfig-a013-20200820 i386 randconfig-a013-20200820 i386 randconfig-a012-20200820 i386 randconfig-a011-20200820 i386 randconfig-a016-20200820 i386 randconfig-a014-20200820 i386 randconfig-a015-20200820 i386 randconfig-a013-20200821 i386 randconfig-a012-20200821 i386 randconfig-a011-20200821 i386 randconfig-a016-20200821 i386 randconfig-a014-20200821 i386 randconfig-a015-20200821 riscvallyesconfig riscv allnoconfig riscv defconfig riscvallmodconfig x86_64 rhel x86_64 allyesconfig x86_64rhel-7.6-kselftests x86_64 defconfig x86_64 rhel-8.3 x86_64 kexec --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
[powerpc:merge] BUILD SUCCESS 7c25bda14d66718f9fa428808dae289dd84f1da3
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git merge branch HEAD: 7c25bda14d66718f9fa428808dae289dd84f1da3 Automatic merge of 'master', 'next' and 'fixes' (2020-08-20 23:20) elapsed time: 926m configs tested: 69 configs skipped: 2 The following configs have been built successfully. More configs may be tested in the coming days. arm defconfig arm64allyesconfig arm64 defconfig arm allyesconfig arm allmodconfig m68k m5275evb_defconfig armkeystone_defconfig s390 alldefconfig ia64 allmodconfig ia64defconfig ia64 allyesconfig m68k allmodconfig m68kdefconfig m68k allyesconfig nios2 defconfig arc allyesconfig nds32 allnoconfig c6x allyesconfig nds32 defconfig nios2allyesconfig cskydefconfig alpha defconfig alphaallyesconfig xtensa allyesconfig h8300allyesconfig arc defconfig sh allmodconfig parisc defconfig s390 allyesconfig parisc allyesconfig s390defconfig i386 allyesconfig sparcallyesconfig sparc defconfig i386defconfig mips allyesconfig mips allmodconfig powerpc allyesconfig powerpc allmodconfig powerpc allnoconfig powerpc defconfig i386 randconfig-a002-20200820 i386 randconfig-a004-20200820 i386 randconfig-a005-20200820 i386 randconfig-a003-20200820 i386 randconfig-a006-20200820 i386 randconfig-a001-20200820 x86_64 randconfig-a015-20200820 x86_64 randconfig-a012-20200820 x86_64 randconfig-a016-20200820 x86_64 randconfig-a014-20200820 x86_64 randconfig-a011-20200820 x86_64 randconfig-a013-20200820 i386 randconfig-a013-20200820 i386 randconfig-a012-20200820 i386 randconfig-a011-20200820 i386 randconfig-a016-20200820 i386 randconfig-a014-20200820 i386 randconfig-a015-20200820 riscvallyesconfig riscv allnoconfig riscv defconfig riscvallmodconfig x86_64 rhel x86_64 allyesconfig x86_64rhel-7.6-kselftests x86_64 defconfig x86_64 rhel-8.3 x86_64 kexec --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
[PATCH v5 8/8] mm/vmalloc: Hugepage vmalloc mappings
On platforms that define HAVE_ARCH_HUGE_VMAP and support PMD vmaps, vmalloc will attempt to allocate PMD-sized pages first, before falling back to small pages. Allocations which use something other than PAGE_KERNEL protections are not permitted to use huge pages yet, not all callers expect this (e.g., module allocations vs strict module rwx). This reduces TLB misses by nearly 30x on a `git diff` workload on a 2-node POWER9 (59,800 -> 2,100) and reduces CPU cycles by 0.54%. This can result in more internal fragmentation and memory overhead for a given allocation, an option nohugevmalloc is added to disable at boot. Signed-off-by: Nicholas Piggin --- .../admin-guide/kernel-parameters.txt | 2 + arch/Kconfig | 4 + arch/powerpc/Kconfig | 1 + include/linux/vmalloc.h | 1 + mm/page_alloc.c | 4 +- mm/vmalloc.c | 188 +- 6 files changed, 152 insertions(+), 48 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index bdc1f33fd3d1..6f0b41289a90 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3190,6 +3190,8 @@ nohugeiomap [KNL,X86,PPC] Disable kernel huge I/O mappings. + nohugevmalloc [PPC] Disable kernel huge vmalloc mappings. + nosmt [KNL,S390] Disable symmetric multithreading (SMT). Equivalent to smt=1. diff --git a/arch/Kconfig b/arch/Kconfig index af14a567b493..b2b89d629317 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -616,6 +616,10 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD config HAVE_ARCH_HUGE_VMAP bool +config HAVE_ARCH_HUGE_VMALLOC + depends on HAVE_ARCH_HUGE_VMAP + bool + config ARCH_WANT_HUGE_PMD_SHARE bool diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 95dfd8ef3d4b..044e5a94967a 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -175,6 +175,7 @@ config PPC select GENERIC_TIME_VSYSCALL select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_HUGE_VMAP if PPC_BOOK3S_64 && PPC_RADIX_MMU + select HAVE_ARCH_HUGE_VMALLOC if HAVE_ARCH_HUGE_VMAP select HAVE_ARCH_JUMP_LABEL select HAVE_ARCH_KASAN if PPC32 && PPC_PAGE_SHIFT <= 14 select HAVE_ARCH_KASAN_VMALLOC if PPC32 && PPC_PAGE_SHIFT <= 14 diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index e3590e93bfff..8f25dbaca0a1 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -58,6 +58,7 @@ struct vm_struct { unsigned long size; unsigned long flags; struct page **pages; + unsigned intpage_order; unsigned intnr_pages; phys_addr_t phys_addr; const void *caller; diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0e2bab486fea..d785e5335529 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -8102,6 +8102,7 @@ void *__init alloc_large_system_hash(const char *tablename, void *table = NULL; gfp_t gfp_flags; bool virt; + bool huge; /* allow the kernel cmdline to have a say */ if (!numentries) { @@ -8169,6 +8170,7 @@ void *__init alloc_large_system_hash(const char *tablename, } else if (get_order(size) >= MAX_ORDER || hashdist) { table = __vmalloc(size, gfp_flags); virt = true; + huge = (find_vm_area(table)->page_order > 0); } else { /* * If bucketsize is not a power-of-two, we may free @@ -8185,7 +8187,7 @@ void *__init alloc_large_system_hash(const char *tablename, pr_info("%s hash table entries: %ld (order: %d, %lu bytes, %s)\n", tablename, 1UL << log2qty, ilog2(size) - PAGE_SHIFT, size, - virt ? "vmalloc" : "linear"); + virt ? (huge ? 
"vmalloc hugepage" : "vmalloc") : "linear"); if (_hash_shift) *_hash_shift = log2qty; diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 4e5cb7c7f780..564d7497e551 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -45,6 +45,19 @@ #include "internal.h" #include "pgalloc-track.h" +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC +static bool __ro_after_init vmap_allow_huge = true; + +static int __init set_nohugevmalloc(char *str) +{ + vmap_allow_huge = false; + return 0; +} +early_param("nohugevmalloc", set_nohugevmalloc); +#else /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ +static const bool vmap_allow_huge = false; +#endif /* CONFIG_HAVE_ARCH_HUGE_VMALLOC */ + bool is_vmalloc_addr(const void *x) { unsigned long addr = (unsigned long)x; @@ -468,31
[PATCH v5 7/8] mm/vmalloc: add vmap_range_noflush variant
As a side-effect, the order of the flush_cache_vmap() and
arch_sync_kernel_mappings() calls is switched, but that now matches the
other callers in this file.

Signed-off-by: Nicholas Piggin
---
 mm/vmalloc.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 129f10545bb1..4e5cb7c7f780 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -234,8 +234,8 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end,
 	return 0;
 }

-int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
-			unsigned int max_page_shift)
+static int vmap_range_noflush(unsigned long addr, unsigned long end, phys_addr_t phys_addr,
+			pgprot_t prot, unsigned int max_page_shift)
 {
 	pgd_t *pgd;
 	unsigned long start;
@@ -255,14 +255,23 @@ int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgp
 		break;
 	} while (pgd++, phys_addr += (next - addr), addr = next, addr != end);

-	flush_cache_vmap(start, end);
-
 	if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
 		arch_sync_kernel_mappings(start, end);

 	return err;
 }

+int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot,
+		unsigned int max_page_shift)
+{
+	int err;
+
+	err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+	flush_cache_vmap(addr, end);
+
+	return err;
+}
+
 static void vunmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			     pgtbl_mod_mask *mask)
 {
-- 
2.23.0
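The *_noflush split follows the existing convention in this file: a
caller that maps several sub-ranges can defer the cache flush and issue
it once over the whole span. A hedged sketch of the calling pattern
(hypothetical helper, usable only inside mm/vmalloc.c where the static
vmap_range_noflush() is visible):

	/* hypothetical: map two adjacent physical extents, flush the cache once */
	static int vmap_two_extents(unsigned long addr, unsigned long mid,
				    unsigned long end, phys_addr_t phys0,
				    phys_addr_t phys1, pgprot_t prot,
				    unsigned int max_page_shift)
	{
		int err;

		err = vmap_range_noflush(addr, mid, phys0, prot, max_page_shift);
		if (!err)
			err = vmap_range_noflush(mid, end, phys1, prot, max_page_shift);
		flush_cache_vmap(addr, end);	/* single flush covering both ranges */
		return err;
	}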
[PATCH v5 6/8] mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c
This is a generic kernel virtual memory mapper, not specific to ioremap. Signed-off-by: Nicholas Piggin --- include/linux/vmalloc.h | 2 + mm/ioremap.c| 192 mm/vmalloc.c| 191 +++ 3 files changed, 193 insertions(+), 192 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 787d77ad7536..e3590e93bfff 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -181,6 +181,8 @@ extern struct vm_struct *remove_vm_area(const void *addr); extern struct vm_struct *find_vm_area(const void *addr); #ifdef CONFIG_MMU +extern int vmap_range(unsigned long addr, unsigned long end, phys_addr_t phys_addr, pgprot_t prot, + unsigned int max_page_shift); extern int map_kernel_range_noflush(unsigned long start, unsigned long size, pgprot_t prot, struct page **pages); int map_kernel_range(unsigned long start, unsigned long size, pgprot_t prot, diff --git a/mm/ioremap.c b/mm/ioremap.c index b0032dbadaf7..cdda0e022740 100644 --- a/mm/ioremap.c +++ b/mm/ioremap.c @@ -28,198 +28,6 @@ early_param("nohugeiomap", set_nohugeiomap); static const bool iomap_allow_huge = false; #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ -static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask) -{ - pte_t *pte; - u64 pfn; - - pfn = phys_addr >> PAGE_SHIFT; - pte = pte_alloc_kernel_track(pmd, addr, mask); - if (!pte) - return -ENOMEM; - do { - BUG_ON(!pte_none(*pte)); - set_pte_at(&init_mm, addr, pte, pfn_pte(pfn, prot)); - pfn++; - } while (pte++, addr += PAGE_SIZE, addr != end); - *mask |= PGTBL_PTE_MODIFIED; - return 0; -} - -static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift) -{ - if (max_page_shift < PMD_SHIFT) - return 0; - - if (!arch_vmap_pmd_supported(prot)) - return 0; - - if ((end - addr) != PMD_SIZE) - return 0; - - if (!IS_ALIGNED(addr, PMD_SIZE)) - return 0; - - if (!IS_ALIGNED(phys_addr, PMD_SIZE)) - return 0; - - if (pmd_present(*pmd) && !pmd_free_pte_page(pmd, addr)) - return 0; - - return pmd_set_huge(pmd, phys_addr, prot); -} - -static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift, - pgtbl_mod_mask *mask) -{ - pmd_t *pmd; - unsigned long next; - - pmd = pmd_alloc_track(&init_mm, pud, addr, mask); - if (!pmd) - return -ENOMEM; - do { - next = pmd_addr_end(addr, end); - - if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot, max_page_shift)) { - *mask |= PGTBL_PMD_MODIFIED; - continue; - } - - if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask)) - return -ENOMEM; - } while (pmd++, phys_addr += (next - addr), addr = next, addr != end); - return 0; -} - -static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift) -{ - if (max_page_shift < PUD_SHIFT) - return 0; - - if (!arch_vmap_pud_supported(prot)) - return 0; - - if ((end - addr) != PUD_SIZE) - return 0; - - if (!IS_ALIGNED(addr, PUD_SIZE)) - return 0; - - if (!IS_ALIGNED(phys_addr, PUD_SIZE)) - return 0; - - if (pud_present(*pud) && !pud_free_pmd_page(pud, addr)) - return 0; - - return pud_set_huge(pud, phys_addr, prot); -} - -static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, - phys_addr_t phys_addr, pgprot_t prot, unsigned int max_page_shift, - pgtbl_mod_mask *mask) -{ - pud_t *pud; - unsigned long next; - - pud = pud_alloc_track(&init_mm, p4d, 
addr, mask); - if (!pud) - return -ENOMEM; - do { - next = pud_addr_end(addr, end); - - if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot, max_page_shift)) { - *mask |= PGTBL_PUD_MODIFIED; - continue; - } - - if (vmap_pmd_range(pud, addr, next, phys_addr, prot, max_page_shift, mask)) - return -ENOMEM; - } while (pud++, phys_addr += (next - addr), addr = next, addr != end); - return
[PATCH v5 5/8] mm: HUGE_VMAP arch support cleanup
This changes the awkward approach where architectures provide init functions to determine which levels they can provide large mappings for, to one where the arch is queried for each call. This removes code and indirection, and allows constant-folding of dead code for unsupported levels. This also adds a prot argument to the arch query. This is unused currently but could help with some architectures (e.g., some powerpc processors can't map uncacheable memory with large pages). Signed-off-by: Nicholas Piggin --- arch/arm64/mm/mmu.c | 12 +-- arch/powerpc/mm/book3s64/radix_pgtable.c | 10 ++- arch/x86/mm/ioremap.c| 12 +-- include/linux/io.h | 9 --- include/linux/vmalloc.h | 10 +++ init/main.c | 1 - mm/ioremap.c | 96 +++- 7 files changed, 73 insertions(+), 77 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 75df62fea1b6..bbb3ccf6a7ce 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1304,12 +1304,13 @@ void *__init fixmap_remap_fdt(phys_addr_t dt_phys, int *size, pgprot_t prot) return dt_virt; } -int __init arch_ioremap_p4d_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } -int __init arch_ioremap_pud_supported(void) +bool arch_vmap_pud_supported(pgprot_t prot) { /* * Only 4k granule supports level 1 block mappings. @@ -1319,11 +1320,12 @@ int __init arch_ioremap_pud_supported(void) !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { - /* See arch_ioremap_pud_supported() */ + /* See arch_vmap_pud_supported() */ return !IS_ENABLED(CONFIG_PTDUMP_DEBUGFS); } +#endif int pud_set_huge(pud_t *pudp, phys_addr_t phys, pgprot_t prot) { diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index ae823bba29f2..7d3a620c5adf 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1182,13 +1182,14 @@ void radix__ptep_modify_prot_commit(struct vm_area_struct *vma, set_pte_at(mm, addr, ptep, pte); } -int __init arch_ioremap_pud_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_pud_supported(pgprot_t prot) { /* HPT does not cope with large pages in the vmalloc area */ return radix_enabled(); } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { return radix_enabled(); } @@ -1197,6 +1198,7 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr) { return 0; } +#endif int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot) { @@ -1282,7 +1284,7 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) return 1; } -int __init arch_ioremap_p4d_supported(void) +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c index 84d85dbd1dad..5b8b495ab4ed 100644 --- a/arch/x86/mm/ioremap.c +++ b/arch/x86/mm/ioremap.c @@ -481,24 +481,26 @@ void iounmap(volatile void __iomem *addr) } EXPORT_SYMBOL(iounmap); -int __init arch_ioremap_p4d_supported(void) +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_p4d_supported(pgprot_t prot) { - return 0; + return false; } -int __init arch_ioremap_pud_supported(void) +bool arch_vmap_pud_supported(pgprot_t prot) { #ifdef CONFIG_X86_64 return boot_cpu_has(X86_FEATURE_GBPAGES); #else - return 0; + return false; #endif } -int __init arch_ioremap_pmd_supported(void) +bool arch_vmap_pmd_supported(pgprot_t prot) { return boot_cpu_has(X86_FEATURE_PSE); } +#endif /* * Convert a 
physical pointer to a virtual kernel pointer for /dev/mem diff --git a/include/linux/io.h b/include/linux/io.h index 8394c56babc2..f1effd4d7a3c 100644 --- a/include/linux/io.h +++ b/include/linux/io.h @@ -31,15 +31,6 @@ static inline int ioremap_page_range(unsigned long addr, unsigned long end, } #endif -#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP -void __init ioremap_huge_init(void); -int arch_ioremap_p4d_supported(void); -int arch_ioremap_pud_supported(void); -int arch_ioremap_pmd_supported(void); -#else -static inline void ioremap_huge_init(void) { } -#endif - /* * Managed iomap interface */ diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 0221f852a7e1..787d77ad7536 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -84,6 +84,16 @@ struct vmap_area { }; }; +#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP +bool arch_vmap_p4d_supported(pgprot_t prot); +bool arch_vmap_pud_supported(pgprot_t prot); +bool arch_vmap_pmd_supported(pgprot_t prot); +#else +static inline bool arch_vmap_p4d_suppo
[PATCH v5 4/8] lib/ioremap: rename ioremap_*_range to vmap_*_range
This will be moved to mm/ and used as a generic kernel virtual mapping function, so re-name it in preparation. Signed-off-by: Nicholas Piggin --- mm/ioremap.c | 55 ++-- 1 file changed, 23 insertions(+), 32 deletions(-) diff --git a/mm/ioremap.c b/mm/ioremap.c index 5fa1ab41d152..6016ae3227ad 100644 --- a/mm/ioremap.c +++ b/mm/ioremap.c @@ -61,9 +61,8 @@ static inline int ioremap_pud_enabled(void) { return 0; } static inline int ioremap_pmd_enabled(void) { return 0; } #endif /* CONFIG_HAVE_ARCH_HUGE_VMAP */ -static int ioremap_pte_range(pmd_t *pmd, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, pgprot_t prot, - pgtbl_mod_mask *mask) +static int vmap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask) { pte_t *pte; u64 pfn; @@ -81,9 +80,8 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr, return 0; } -static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, - pgprot_t prot) +static int vmap_try_huge_pmd(pmd_t *pmd, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot) { if (!ioremap_pmd_enabled()) return 0; @@ -103,9 +101,8 @@ static int ioremap_try_huge_pmd(pmd_t *pmd, unsigned long addr, return pmd_set_huge(pmd, phys_addr, prot); } -static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, pgprot_t prot, - pgtbl_mod_mask *mask) +static int vmap_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask) { pmd_t *pmd; unsigned long next; @@ -116,20 +113,19 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr, do { next = pmd_addr_end(addr, end); - if (ioremap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) { + if (vmap_try_huge_pmd(pmd, addr, next, phys_addr, prot)) { *mask |= PGTBL_PMD_MODIFIED; continue; } - if (ioremap_pte_range(pmd, addr, next, phys_addr, prot, mask)) + if (vmap_pte_range(pmd, addr, next, phys_addr, prot, mask)) return -ENOMEM; } while (pmd++, phys_addr += (next - addr), addr = next, addr != end); return 0; } -static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, - pgprot_t prot) +static int vmap_try_huge_pud(pud_t *pud, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot) { if (!ioremap_pud_enabled()) return 0; @@ -149,9 +145,8 @@ static int ioremap_try_huge_pud(pud_t *pud, unsigned long addr, return pud_set_huge(pud, phys_addr, prot); } -static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, pgprot_t prot, - pgtbl_mod_mask *mask) +static int vmap_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot, pgtbl_mod_mask *mask) { pud_t *pud; unsigned long next; @@ -162,20 +157,19 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr, do { next = pud_addr_end(addr, end); - if (ioremap_try_huge_pud(pud, addr, next, phys_addr, prot)) { + if (vmap_try_huge_pud(pud, addr, next, phys_addr, prot)) { *mask |= PGTBL_PUD_MODIFIED; continue; } - if (ioremap_pmd_range(pud, addr, next, phys_addr, prot, mask)) + if (vmap_pmd_range(pud, addr, next, phys_addr, prot, mask)) return -ENOMEM; } while (pud++, phys_addr += (next - addr), addr = next, addr != end); return 0; } -static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, - pgprot_t 
prot) +static int vmap_try_huge_p4d(p4d_t *p4d, unsigned long addr, unsigned long end, + phys_addr_t phys_addr, pgprot_t prot) { if (!ioremap_p4d_enabled()) return 0; @@ -195,9 +189,8 @@ static int ioremap_try_huge_p4d(p4d_t *p4d, unsigned long addr, return p4d_set_huge(p4d, phys_addr, prot); } -static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr, - unsigned long end, phys_addr_t phys_addr, pgprot_t p
[PATCH v5 3/8] mm/vmalloc: rename vmap_*_range vmap_pages_*_range
The vmalloc mapper operates on a struct page * array rather than a linear physical address, re-name it to make this distinction clear. Signed-off-by: Nicholas Piggin --- mm/vmalloc.c | 28 1 file changed, 12 insertions(+), 16 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 49f225b0f855..3a1e45fd1626 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -190,9 +190,8 @@ void unmap_kernel_range_noflush(unsigned long start, unsigned long size) arch_sync_kernel_mappings(start, end); } -static int vmap_pte_range(pmd_t *pmd, unsigned long addr, - unsigned long end, pgprot_t prot, struct page **pages, int *nr, - pgtbl_mod_mask *mask) +static int vmap_pages_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask) { pte_t *pte; @@ -218,9 +217,8 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr, return 0; } -static int vmap_pmd_range(pud_t *pud, unsigned long addr, - unsigned long end, pgprot_t prot, struct page **pages, int *nr, - pgtbl_mod_mask *mask) +static int vmap_pages_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask) { pmd_t *pmd; unsigned long next; @@ -230,15 +228,14 @@ static int vmap_pmd_range(pud_t *pud, unsigned long addr, return -ENOMEM; do { next = pmd_addr_end(addr, end); - if (vmap_pte_range(pmd, addr, next, prot, pages, nr, mask)) + if (vmap_pages_pte_range(pmd, addr, next, prot, pages, nr, mask)) return -ENOMEM; } while (pmd++, addr = next, addr != end); return 0; } -static int vmap_pud_range(p4d_t *p4d, unsigned long addr, - unsigned long end, pgprot_t prot, struct page **pages, int *nr, - pgtbl_mod_mask *mask) +static int vmap_pages_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask) { pud_t *pud; unsigned long next; @@ -248,15 +245,14 @@ static int vmap_pud_range(p4d_t *p4d, unsigned long addr, return -ENOMEM; do { next = pud_addr_end(addr, end); - if (vmap_pmd_range(pud, addr, next, prot, pages, nr, mask)) + if (vmap_pages_pmd_range(pud, addr, next, prot, pages, nr, mask)) return -ENOMEM; } while (pud++, addr = next, addr != end); return 0; } -static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, - unsigned long end, pgprot_t prot, struct page **pages, int *nr, - pgtbl_mod_mask *mask) +static int vmap_pages_p4d_range(pgd_t *pgd, unsigned long addr, unsigned long end, + pgprot_t prot, struct page **pages, int *nr, pgtbl_mod_mask *mask) { p4d_t *p4d; unsigned long next; @@ -266,7 +262,7 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr, return -ENOMEM; do { next = p4d_addr_end(addr, end); - if (vmap_pud_range(p4d, addr, next, prot, pages, nr, mask)) + if (vmap_pages_pud_range(p4d, addr, next, prot, pages, nr, mask)) return -ENOMEM; } while (p4d++, addr = next, addr != end); return 0; @@ -307,7 +303,7 @@ int map_kernel_range_noflush(unsigned long addr, unsigned long size, next = pgd_addr_end(addr, end); if (pgd_bad(*pgd)) mask |= PGTBL_PGD_MODIFIED; - err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr, &mask); + err = vmap_pages_p4d_range(pgd, addr, next, prot, pages, &nr, &mask); if (err) return err; } while (pgd++, addr = next, addr != end); -- 2.23.0
[PATCH v5 2/8] mm: apply_to_pte_range warn and fail if a large pte is encountered
Signed-off-by: Nicholas Piggin --- mm/memory.c | 60 +++-- 1 file changed, 44 insertions(+), 16 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index f95edbb77326..19986af291e0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2261,13 +2261,20 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud, } do { next = pmd_addr_end(addr, end); - if (create || !pmd_none_or_clear_bad(pmd)) { - err = apply_to_pte_range(mm, pmd, addr, next, fn, data, -create); - if (err) - break; + if (pmd_none(*pmd) && !create) + continue; + if (WARN_ON_ONCE(pmd_leaf(*pmd))) + return -EINVAL; + if (WARN_ON_ONCE(pmd_bad(*pmd))) { + if (!create) + continue; + pmd_clear_bad(pmd); } + err = apply_to_pte_range(mm, pmd, addr, next, fn, data, create); + if (err) + break; } while (pmd++, addr = next, addr != end); + return err; } @@ -2288,13 +2295,20 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d, } do { next = pud_addr_end(addr, end); - if (create || !pud_none_or_clear_bad(pud)) { - err = apply_to_pmd_range(mm, pud, addr, next, fn, data, -create); - if (err) - break; + if (pud_none(*pud) && !create) + continue; + if (WARN_ON_ONCE(pud_leaf(*pud))) + return -EINVAL; + if (WARN_ON_ONCE(pud_bad(*pud))) { + if (!create) + continue; + pud_clear_bad(pud); } + err = apply_to_pmd_range(mm, pud, addr, next, fn, data, create); + if (err) + break; } while (pud++, addr = next, addr != end); + return err; } @@ -2315,13 +2329,20 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd, } do { next = p4d_addr_end(addr, end); - if (create || !p4d_none_or_clear_bad(p4d)) { - err = apply_to_pud_range(mm, p4d, addr, next, fn, data, -create); - if (err) - break; + if (p4d_none(*p4d) && !create) + continue; + if (WARN_ON_ONCE(p4d_leaf(*p4d))) + return -EINVAL; + if (WARN_ON_ONCE(p4d_bad(*p4d))) { + if (!create) + continue; + p4d_clear_bad(p4d); } + err = apply_to_pud_range(mm, p4d, addr, next, fn, data, create); + if (err) + break; } while (p4d++, addr = next, addr != end); + return err; } @@ -2340,8 +2361,15 @@ static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr, pgd = pgd_offset(mm, addr); do { next = pgd_addr_end(addr, end); - if (!create && pgd_none_or_clear_bad(pgd)) + if (pgd_none(*pgd) && !create) continue; + if (WARN_ON_ONCE(pgd_leaf(*pgd))) + return -EINVAL; + if (WARN_ON_ONCE(pgd_bad(*pgd))) { + if (!create) + continue; + pgd_clear_bad(pgd); + } err = apply_to_p4d_range(mm, pgd, addr, next, fn, data, create); if (err) break; -- 2.23.0
[PATCH v5 1/8] mm/vmalloc: fix vmalloc_to_page for huge vmap mappings
vmalloc_to_page returns NULL for addresses mapped by larger pages[*]. Whether or not a vmap is huge depends on the architecture details, alignments, boot options, etc., which the caller can not be expected to know. Therefore HUGE_VMAP is a regression for vmalloc_to_page. This change teaches vmalloc_to_page about larger pages, and returns the struct page that corresponds to the offset within the large page. This makes the API agnostic to mapping implementation details. [*] As explained by commit 029c54b095995 ("mm/vmalloc.c: huge-vmap: fail gracefully on unexpected huge vmap mappings") Signed-off-by: Nicholas Piggin --- mm/vmalloc.c | 40 ++-- 1 file changed, 26 insertions(+), 14 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index b482d240f9a2..49f225b0f855 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -38,6 +38,7 @@ #include #include +#include #include #include @@ -343,7 +344,9 @@ int is_vmalloc_or_module_addr(const void *x) } /* - * Walk a vmap address to the struct page it maps. + * Walk a vmap address to the struct page it maps. Huge vmap mappings will + * return the tail page that corresponds to the base page address, which + * matches small vmap mappings. */ struct page *vmalloc_to_page(const void *vmalloc_addr) { @@ -363,25 +366,33 @@ struct page *vmalloc_to_page(const void *vmalloc_addr) if (pgd_none(*pgd)) return NULL; + if (WARN_ON_ONCE(pgd_leaf(*pgd))) + return NULL; /* XXX: no allowance for huge pgd */ + if (WARN_ON_ONCE(pgd_bad(*pgd))) + return NULL; + p4d = p4d_offset(pgd, addr); if (p4d_none(*p4d)) return NULL; - pud = pud_offset(p4d, addr); + if (p4d_leaf(*p4d)) + return p4d_page(*p4d) + ((addr & ~P4D_MASK) >> PAGE_SHIFT); + if (WARN_ON_ONCE(p4d_bad(*p4d))) + return NULL; - /* -* Don't dereference bad PUD or PMD (below) entries. This will also -* identify huge mappings, which we may encounter on architectures -* that define CONFIG_HAVE_ARCH_HUGE_VMAP=y. Such regions will be -* identified as vmalloc addresses by is_vmalloc_addr(), but are -* not [unambiguously] associated with a struct page, so there is -* no correct value to return for them. -*/ - WARN_ON_ONCE(pud_bad(*pud)); - if (pud_none(*pud) || pud_bad(*pud)) + pud = pud_offset(p4d, addr); + if (pud_none(*pud)) + return NULL; + if (pud_leaf(*pud)) + return pud_page(*pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); + if (WARN_ON_ONCE(pud_bad(*pud))) return NULL; + pmd = pmd_offset(pud, addr); - WARN_ON_ONCE(pmd_bad(*pmd)); - if (pmd_none(*pmd) || pmd_bad(*pmd)) + if (pmd_none(*pmd)) + return NULL; + if (pmd_leaf(*pmd)) + return pmd_page(*pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); + if (WARN_ON_ONCE(pmd_bad(*pmd))) return NULL; ptep = pte_offset_map(pmd, addr); @@ -389,6 +400,7 @@ struct page *vmalloc_to_page(const void *vmalloc_addr) if (pte_present(pte)) page = pte_page(pte); pte_unmap(ptep); + return page; } EXPORT_SYMBOL(vmalloc_to_page); -- 2.23.0
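The offset arithmetic in the new leaf cases is worth a worked example. A standalone sketch with invented numbers, assuming 4K base pages (PAGE_SHIFT = 12) and a 2M PMD leaf (PMD_SHIFT = 21):

	#include <stdio.h>

	int main(void)
	{
		unsigned long pmd_mask = ~((1UL << 21) - 1);	/* PMD_MASK for a 2M leaf */
		unsigned long addr = 0xc000000000305000UL;	/* hypothetical vmap address */
		unsigned long idx = (addr & ~pmd_mask) >> 12;	/* offset in small pages */

		/* prints 0x105: pmd_page(*pmd) + 0x105 is the same struct page
		 * that a small-page mapping of this address would have yielded */
		printf("page index within the 2M mapping: %#lx\n", idx);
		return 0;
	}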
[PATCH v5 0/8] huge vmalloc mappings
I made this powerpc-only for the time being. It shouldn't be too hard to add support for other archs that define HUGE_VMAP. I have booted x86 with it enabled, though I may not have audited everything. Hi Andrew, would you care to put this in your tree? Thanks, Nick Since v4: - Fixed an off-by-page-order bug in v4 - Several minor cleanups. - Added page order to /proc/vmallocinfo - Added hugepage to alloc_large_system_hash output. - Made it an architecture config option, powerpc only for now. Since v3: - Fixed an off-by-one bug in a loop - Fixed a !CONFIG_HAVE_ARCH_HUGE_VMAP build failure - Hopefully this time fix the arm64 vmap stack bug, thanks Jonathan Cameron for debugging the cause of this (hopefully). Since v2: - Rebased on vmalloc cleanups, split series into simpler pieces. - Fixed several compile errors and warnings - Keep the page array and accounting in small page units because struct vm_struct is an interface (this should fix the x86 vmap stack debug assert). [Thanks Zefan] Nicholas Piggin (8): mm/vmalloc: fix vmalloc_to_page for huge vmap mappings mm: apply_to_pte_range warn and fail if a large pte is encountered mm/vmalloc: rename vmap_*_range vmap_pages_*_range lib/ioremap: rename ioremap_*_range to vmap_*_range mm: HUGE_VMAP arch support cleanup mm: Move vmap_range from lib/ioremap.c to mm/vmalloc.c mm/vmalloc: add vmap_range_noflush variant mm/vmalloc: Hugepage vmalloc mappings .../admin-guide/kernel-parameters.txt | 2 + arch/Kconfig | 4 + arch/arm64/mm/mmu.c | 12 +- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/book3s64/radix_pgtable.c | 10 +- arch/x86/mm/ioremap.c | 12 +- include/linux/io.h| 9 - include/linux/vmalloc.h | 13 + init/main.c | 1 - mm/ioremap.c | 231 + mm/memory.c | 60 ++- mm/page_alloc.c | 4 +- mm/vmalloc.c | 456 +++--- 13 files changed, 476 insertions(+), 339 deletions(-) -- 2.23.0
Re: [PATCH v2 00/13] mm/debug_vm_pgtable fixes
On 08/21/2020 09:03 AM, Anshuman Khandual wrote: > > > On 08/19/2020 07:15 PM, Aneesh Kumar K.V wrote: >> "Aneesh Kumar K.V" writes: >> >>> This patch series includes fixes for debug_vm_pgtable test code so that >>> they follow page table updates rules correctly. The first two patches >>> introduce >>> changes w.r.t ppc64. The patches are included in this series for >>> completeness. We can >>> merge them via ppc64 tree if required. >>> >>> Hugetlb test is disabled on ppc64 because that needs larger change to >>> satisfy >>> page table update rules. >>> >>> Changes from V1: >>> * Address review feedback >>> * drop test specific pfn_pte and pfn_pmd. >>> * Update ppc64 page table helper to add _PAGE_PTE >>> >>> Aneesh Kumar K.V (13): >>> powerpc/mm: Add DEBUG_VM WARN for pmd_clear >>> powerpc/mm: Move setting pte specific flags to pfn_pte >>> mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value >>> mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge >>> vmap support. >>> mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with >>> CONFIG_NUMA_BALANCING >>> mm/debug_vm_pgtable/THP: Mark the pte entry huge before using >>> set_pmd/pud_at >>> mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an >>> existing pte entry >>> mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP >>> mm/debug_vm_pgtable/locks: Move non page table modifying test together >>> mm/debug_vm_pgtable/locks: Take correct page table lock >>> mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries >>> mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64 >>> mm/debug_vm_pgtable: populate a pte entry before fetching it >>> >>> arch/powerpc/include/asm/book3s/64/pgtable.h | 29 +++- >>> arch/powerpc/include/asm/nohash/pgtable.h| 5 - >>> arch/powerpc/mm/book3s64/pgtable.c | 2 +- >>> arch/powerpc/mm/pgtable.c| 5 - >>> include/linux/io.h | 12 ++ >>> mm/debug_vm_pgtable.c| 151 +++ >>> 6 files changed, 127 insertions(+), 77 deletions(-) >>> >> >> BTW I picked a wrong branch when sending this. Attaching the diff >> against what I want to send. pfn_pmd() no more updates _PAGE_PTE >> because that is handled by pmd_mkhuge(). >> >> diff --git a/arch/powerpc/mm/book3s64/pgtable.c >> b/arch/powerpc/mm/book3s64/pgtable.c >> index 3b4da7c63e28..e18ae50a275c 100644 >> --- a/arch/powerpc/mm/book3s64/pgtable.c >> +++ b/arch/powerpc/mm/book3s64/pgtable.c >> @@ -141,7 +141,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot) >> unsigned long pmdv; >> >> pmdv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK; >> -return __pmd(pmdv | pgprot_val(pgprot) | _PAGE_PTE); >> +return pmd_set_protbits(__pmd(pmdv), pgprot); >> } >> >> pmd_t mk_pmd(struct page *page, pgprot_t pgprot) >> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c >> index 7d9f8e1d790f..cad61d22f33a 100644 >> --- a/mm/debug_vm_pgtable.c >> +++ b/mm/debug_vm_pgtable.c >> @@ -229,7 +229,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned >> long pfn, pgprot_t prot) >> >> static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) >> { >> -pmd_t pmd = pfn_pmd(pfn, prot); >> +pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); >> >> if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) >> return; >> > > Cover letter does not mention which branch or tag this series applies on. > Just assumed it to be 5.9-rc1. Should the above changes be captured as a > pre-requisite patch ? > > Anyways, the series fails to be build on arm64. 
> > A) Without CONFIG_TRANSPARENT_HUGEPAGE > > mm/debug_vm_pgtable.c: In function 'debug_vm_pgtable': > mm/debug_vm_pgtable.c:1045:2: error: too many arguments to function > 'pmd_advanced_tests' > pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep); > ^~ > mm/debug_vm_pgtable.c:366:20: note: declared here > static void __init pmd_advanced_tests(struct mm_struct *mm, > ^~ > > B) As mentioned previously, this should be solved by including <linux/io.h> > > mm/debug_vm_pgtable.c: In function 'pmd_huge_tests': > mm/debug_vm_pgtable.c:215:7: error: implicit declaration of function > 'arch_ioremap_pmd_supported'; did you mean 'arch_disable_smp_support'? > [-Werror=implicit-function-declaration] > if (!arch_ioremap_pmd_supported()) >^~ > > Please make sure that the series builds on all enabled platforms i.e. x86, > arm64, ppc32, ppc64, arc, s390 along with selectively enabling/disabling > all the features that make various #ifdefs in the test. > > - Anshuman Here is another build failure on x86. mm/debug_vm_pgtable.c: In function 'pud_advanced_tests': mm/debug_vm_pgtable.c:306:31: error: passing argument 1 of 'pudp_huge_get_and_clear_full' from incompatible pointer type [-Werror=incompatible-pointer-types]
Re: [PATCH v2 00/13] mm/debug_vm_pgtable fixes
On 08/19/2020 07:15 PM, Aneesh Kumar K.V wrote: > "Aneesh Kumar K.V" writes: > >> This patch series includes fixes for debug_vm_pgtable test code so that >> they follow page table updates rules correctly. The first two patches >> introduce >> changes w.r.t ppc64. The patches are included in this series for >> completeness. We can >> merge them via ppc64 tree if required. >> >> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy >> page table update rules. >> >> Changes from V1: >> * Address review feedback >> * drop test specific pfn_pte and pfn_pmd. >> * Update ppc64 page table helper to add _PAGE_PTE >> >> Aneesh Kumar K.V (13): >> powerpc/mm: Add DEBUG_VM WARN for pmd_clear >> powerpc/mm: Move setting pte specific flags to pfn_pte >> mm/debug_vm_pgtable/ppc64: Avoid setting top bits in radom value >> mm/debug_vm_pgtables/hugevmap: Use the arch helper to identify huge >> vmap support. >> mm/debug_vm_pgtable/savedwrite: Enable savedwrite test with >> CONFIG_NUMA_BALANCING >> mm/debug_vm_pgtable/THP: Mark the pte entry huge before using >> set_pmd/pud_at >> mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an >> existing pte entry >> mm/debug_vm_pgtable/thp: Use page table depost/withdraw with THP >> mm/debug_vm_pgtable/locks: Move non page table modifying test together >> mm/debug_vm_pgtable/locks: Take correct page table lock >> mm/debug_vm_pgtable/pmd_clear: Don't use pmd/pud_clear on pte entries >> mm/debug_vm_pgtable/hugetlb: Disable hugetlb test on ppc64 >> mm/debug_vm_pgtable: populate a pte entry before fetching it >> >> arch/powerpc/include/asm/book3s/64/pgtable.h | 29 +++- >> arch/powerpc/include/asm/nohash/pgtable.h| 5 - >> arch/powerpc/mm/book3s64/pgtable.c | 2 +- >> arch/powerpc/mm/pgtable.c| 5 - >> include/linux/io.h | 12 ++ >> mm/debug_vm_pgtable.c| 151 +++ >> 6 files changed, 127 insertions(+), 77 deletions(-) >> > > BTW I picked a wrong branch when sending this. Attaching the diff > against what I want to send. pfn_pmd() no more updates _PAGE_PTE > because that is handled by pmd_mkhuge(). > > diff --git a/arch/powerpc/mm/book3s64/pgtable.c > b/arch/powerpc/mm/book3s64/pgtable.c > index 3b4da7c63e28..e18ae50a275c 100644 > --- a/arch/powerpc/mm/book3s64/pgtable.c > +++ b/arch/powerpc/mm/book3s64/pgtable.c > @@ -141,7 +141,7 @@ pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot) > unsigned long pmdv; > > pmdv = (pfn << PAGE_SHIFT) & PTE_RPN_MASK; > - return __pmd(pmdv | pgprot_val(pgprot) | _PAGE_PTE); > + return pmd_set_protbits(__pmd(pmdv), pgprot); > } > > pmd_t mk_pmd(struct page *page, pgprot_t pgprot) > diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c > index 7d9f8e1d790f..cad61d22f33a 100644 > --- a/mm/debug_vm_pgtable.c > +++ b/mm/debug_vm_pgtable.c > @@ -229,7 +229,7 @@ static void __init pmd_huge_tests(pmd_t *pmdp, unsigned > long pfn, pgprot_t prot) > > static void __init pmd_savedwrite_tests(unsigned long pfn, pgprot_t prot) > { > - pmd_t pmd = pfn_pmd(pfn, prot); > + pmd_t pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); > > if (!IS_ENABLED(CONFIG_NUMA_BALANCING)) > return; > Cover letter does not mention which branch or tag this series applies on. Just assumed it to be 5.9-rc1. Should the above changes be captured as a pre-requisite patch ? Anyways, the series fails to be build on arm64. 
A) Without CONFIG_TRANSPARENT_HUGEPAGE mm/debug_vm_pgtable.c: In function 'debug_vm_pgtable': mm/debug_vm_pgtable.c:1045:2: error: too many arguments to function 'pmd_advanced_tests' pmd_advanced_tests(mm, vma, pmdp, pmd_aligned, vaddr, prot, saved_ptep); ^~ mm/debug_vm_pgtable.c:366:20: note: declared here static void __init pmd_advanced_tests(struct mm_struct *mm, ^~ B) As mentioned previously, this should be solved by including <linux/io.h> mm/debug_vm_pgtable.c: In function 'pmd_huge_tests': mm/debug_vm_pgtable.c:215:7: error: implicit declaration of function 'arch_ioremap_pmd_supported'; did you mean 'arch_disable_smp_support'? [-Werror=implicit-function-declaration] if (!arch_ioremap_pmd_supported()) ^~ Please make sure that the series builds on all enabled platforms i.e. x86, arm64, ppc32, ppc64, arc, s390 along with selectively enabling/disabling all the features that make various #ifdefs in the test. - Anshuman
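For reference, the 'too many arguments' failure above is the classic symptom of a !CONFIG_TRANSPARENT_HUGEPAGE stub falling out of sync with the real function. The pattern looks roughly like this (parameter list hypothetical, only the shape matters):

	#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	static void __init pmd_advanced_tests(struct mm_struct *mm,
			struct vm_area_struct *vma, pmd_t *pmdp, unsigned long pfn,
			unsigned long vaddr, pgprot_t prot, pgtable_t pgtable)
	{
		/* real THP tests */
	}
	#else
	/* The stub must carry the identical parameter list, otherwise every
	 * caller breaks as soon as the config option is disabled.
	 */
	static void __init pmd_advanced_tests(struct mm_struct *mm,
			struct vm_area_struct *vma, pmd_t *pmdp, unsigned long pfn,
			unsigned long vaddr, pgprot_t prot, pgtable_t pgtable)
	{
	}
	#endif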
[PATCH] tty: hvcs: Don't NULL tty->driver_data until hvcs_cleanup()
The code currently NULLs tty->driver_data in hvcs_close() with the intent of informing the next call to hvcs_open() that the device needs to be reconfigured. However, when hvcs_cleanup() is called we copy hvcsd from tty->driver_data, which was previously NULLed by hvcs_close(), and our call to tty_port_put(&hvcsd->port) doesn't actually do anything since &hvcsd->port ends up translating to NULL by chance. This has the side effect that when hvcs_remove() is called we have one too many port references, preventing hvcs_destruct_port() from ever being called. This also prevents us from reusing the /dev/hvcsX node in a future hvcs_probe() and we can eventually run out of /dev/hvcsX devices. Fix this by waiting to NULL tty->driver_data until hvcs_cleanup(). Signed-off-by: Tyrel Datwyler --- drivers/tty/hvc/hvcs.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/tty/hvc/hvcs.c b/drivers/tty/hvc/hvcs.c index 55105ac38f89..509d1042825a 100644 --- a/drivers/tty/hvc/hvcs.c +++ b/drivers/tty/hvc/hvcs.c @@ -1216,13 +1216,6 @@ static void hvcs_close(struct tty_struct *tty, struct file *filp) tty_wait_until_sent(tty, HVCS_CLOSE_WAIT); - /* -* This line is important because it tells hvcs_open that this -* device needs to be re-configured the next time hvcs_open is -* called. -*/ - tty->driver_data = NULL; - free_irq(irq, hvcsd); return; } else if (hvcsd->port.count < 0) { @@ -1237,6 +1230,13 @@ static void hvcs_cleanup(struct tty_struct * tty) { struct hvcs_struct *hvcsd = tty->driver_data; + /* +* This line is important because it tells hvcs_open that this +* device needs to be re-configured the next time hvcs_open is +* called. +*/ + tty->driver_data = NULL; + tty_port_put(&hvcsd->port); } -- 2.27.0
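The 'by chance' in the description is worth unpacking: tty_port_put() ignores a NULL port, and &hvcsd->port only evaluates to NULL because the port is (assuming the layout sketched below) the first member of struct hvcs_struct. A sketch of the silent no-op:

	struct hvcs_struct {
		struct tty_port port;	/* assumed first member, offset 0 */
		/* ... */
	};

	/* after the old hvcs_close() has NULLed tty->driver_data: */
	struct hvcs_struct *hvcsd = NULL;
	struct tty_port *port = &hvcsd->port;	/* NULL + 0 == NULL */

	tty_port_put(port);	/* tty_port_put(NULL) returns early, so the kref
				 * is never dropped and the port reference leaks */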
Re: [PATCH net-next v2 0/4] refactoring of ibmvnic code
From: Lijun Pan Date: Wed, 19 Aug 2020 17:52:22 -0500 > This patch series refactor reset_init and init functions, > and make some other cosmetic changes to make the code > easier to read and debug. v2 removes __func__ and v1's 1/5. Series applied, thank you.
[RFT][PATCH 1/7] powerpc/iommu: Avoid overflow at boundary_size
The boundary_size might be as large as ULONG_MAX, which means that a device has no specific boundary limit. So either "+ 1" or passing it to ALIGN() would potentially overflow. According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So fix the potential overflow with this safer shortcut. Reported-by: Stephen Rothwell Signed-off-by: Nicolin Chen Cc: Christoph Hellwig --- arch/powerpc/kernel/iommu.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 9704f3f76e63..c01ccbf8afdd 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -236,15 +236,14 @@ static unsigned long iommu_range_alloc(struct device *dev, } } - if (dev) - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1, - 1 << tbl->it_page_shift); - else - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift); /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */ + boundary_size = dev ? dma_get_seg_boundary(dev) : U32_MAX; + + /* Overflow-free shortcut for: ALIGN(b + 1, 1 << s) >> s */ + boundary_size = (boundary_size >> tbl->it_page_shift) + 1; n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset, -boundary_size >> tbl->it_page_shift, align_mask); +boundary_size, align_mask); if (n == -1) { if (likely(pass == 0)) { /* First try the pool from the start */ -- 2.17.1
[RFT][PATCH 0/7] Avoid overflow at boundary_size
We are extending the default DMA segmentation boundary to its possible maximum value (ULONG_MAX) to indicate that a device doesn't specify a boundary limit. So all dma_get_seg_boundary callers should take precautions with the return value, since it can easily overflow. I scanned the entire kernel tree for all the existing callers and found that most of the callers may overflow in two ways: either "+ 1" or passing it to ALIGN() that does "+ mask". According to kernel defines: #define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask)) #define ALIGN(x, a) ALIGN_MASK(x, (typeof(x))(a) - 1) We can simplify the logic here: ALIGN(boundary + 1, 1 << shift) >> shift = ALIGN_MASK(b + 1, (1 << s) - 1) >> s = {[b + 1 + (1 << s) - 1] & ~[(1 << s) - 1]} >> s = [b + 1 + (1 << s) - 1] >> s = [b + (1 << s)] >> s = (b >> s) + 1 So this series of patches fixes the potential overflow with this overflow-free shortcut. As I don't think that I have these platforms, marking RFT. Thanks Nic Nicolin Chen (7): powerpc/iommu: Avoid overflow at boundary_size alpha: Avoid overflow at boundary_size ia64/sba_iommu: Avoid overflow at boundary_size s390/pci_dma: Avoid overflow at boundary_size sparc: Avoid overflow at boundary_size x86/amd_gart: Avoid overflow at boundary_size parisc: Avoid overflow at boundary_size arch/alpha/kernel/pci_iommu.c| 10 -- arch/ia64/hp/common/sba_iommu.c | 4 ++-- arch/powerpc/kernel/iommu.c | 11 +-- arch/s390/pci/pci_dma.c | 4 ++-- arch/sparc/kernel/iommu-common.c | 9 +++-- arch/sparc/kernel/iommu.c| 4 ++-- arch/sparc/kernel/pci_sun4v.c| 4 ++-- arch/x86/kernel/amd_gart_64.c| 4 ++-- drivers/parisc/ccio-dma.c| 4 ++-- drivers/parisc/sba_iommu.c | 4 ++-- 10 files changed, 26 insertions(+), 32 deletions(-) -- 2.17.1
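The derivation is easy to sanity-check in isolation. A userspace sketch (the ALIGN macros copied from the kernel definitions quoted above, everything else invented) showing how the old form wraps to zero at ULONG_MAX while the shortcut does not:

	#include <stdio.h>

	#define ALIGN_MASK(x, mask) (((x) + (mask)) & ~(mask))
	#define ALIGN(x, a)         ALIGN_MASK(x, (typeof(x))(a) - 1)

	int main(void)
	{
		unsigned long boundary = ~0UL;	/* "no boundary limit" */
		unsigned int shift = 12;	/* e.g. a 4K IOMMU page size */

		/* old form: boundary + 1 wraps to 0, so the result is 0 */
		unsigned long before = ALIGN(boundary + 1, 1UL << shift) >> shift;

		/* shortcut: (b >> s) + 1 has no intermediate overflow */
		unsigned long after = (boundary >> shift) + 1;

		printf("ALIGN form: %#lx, shortcut: %#lx\n", before, after);
		return 0;
	}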
Re: [PATCH v2 3/6] powerpc/32s: Only leave NX unset on segments used for modules
On Jun 29 2020, Christophe Leroy wrote: > Instead of leaving NX unset on all segments above the start > of vmalloc space, only leave NX unset on segments used for > modules. I'm getting this crash: kernel tried to execute exec-protected page (f294b000) - exploit attempt (uid: 0) BUG: Unable to handle kernel instruction fetch Faulting instruction address: 0xf294b000 Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K MMU=Hash PowerMac Modules linked in: pata_macio(+) CPU: 0 PID: 87 Comm: udevd Not tainted 5.8.0-rc2-test #49 NIP: f294b000 LR: 0005c60 CTR: f294b000 REGS: f18d9cc0 TRAP: 0400 Not tainted (5.8.0-rc2-test) MSR: 10009032 CR: 84222422 XER: 2000 GPR00: c0005c14 f18d9d78 ef30ca20 efe0 c00993d0 ef6da038 005e GPR08: c09050b8 c08b f18d9d78 44222422 10072070 0fefaca4 GPR16: 1006a00c f294d50b 0120 0124 c0096ea8 000e ef2776c0 ef2776e4 GPR24: f18fd6e8 0001 c086fe64 c086fe04 c08b f294b000 NIP [f294b000] pata_macio_init+0x0/0xc0 [pata_macio] LR [c0005c60] do_one_initcall+0x6c/0x160 Call Trace: [f18d9d78] [c0005c14] do_one_initcall+0x20/0x160 (unreliable) [f18d9dd8] [c009a22c] do_init_module+0x60/0x1c0 [f18d9df8] [c00993d8] load_module+0x16a8/0x1c14 [f18d9ea8] [c0099aa4] sys_finit_module+0x8c/0x94 [f18d9f38] [c0012174] ret_from_syscall+0x0/0x34 --- interrupt: c01 at 0xfdb4318 LR = 0xfeee9c0 Instruction dump: <3d20c08b> 3d40c086 9421ffe0 8129106c ---[ end trace 85a98cc836109871 ]--- Andreas. -- Andreas Schwab, sch...@linux-m68k.org GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1 "And now for something completely different."
Re: [PATCH v2 3/4] powerpc/memhotplug: Make lmb size 64bit
"Aneesh Kumar K.V" writes: > @@ -322,12 +322,16 @@ static int pseries_remove_mem_node(struct device_node > *np) > /* >* Find the base address and size of the memblock >*/ > - regs = of_get_property(np, "reg", NULL); > - if (!regs) > + prop = of_get_property(np, "reg", NULL); > + if (!prop) > return ret; > > - base = be64_to_cpu(*(unsigned long *)regs); > - lmb_size = be32_to_cpu(regs[3]); > + /* > + * "reg" property represents (addr,size) tuple. > + */ > + base = of_read_number(prop, mem_addr_cells); > + prop += mem_addr_cells; > + lmb_size = of_read_number(prop, mem_size_cells); Would of_n_size_cells() and of_n_addr_cells() work here?
Re: [PATCH v2 1/4] powerpc/drmem: Make lmb_size 64 bit
"Aneesh Kumar K.V" writes: > Similar to commit 89c140bbaeee ("pseries: Fix 64 bit logical memory block > panic") > make sure different variables tracking lmb_size are updated to be 64 bit. > > This was found by code audit. > > Cc: sta...@vger.kernel.org > Signed-off-by: Aneesh Kumar K.V > --- > arch/powerpc/include/asm/drmem.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/arch/powerpc/include/asm/drmem.h > b/arch/powerpc/include/asm/drmem.h > index 17ccc6474ab6..d719cbac34b2 100644 > --- a/arch/powerpc/include/asm/drmem.h > +++ b/arch/powerpc/include/asm/drmem.h > @@ -21,7 +21,7 @@ struct drmem_lmb { > struct drmem_lmb_info { > struct drmem_lmb*lmbs; > int n_lmbs; > - u32 lmb_size; > + u64 lmb_size; > }; > > extern struct drmem_lmb_info *drmem_info; > @@ -67,7 +67,7 @@ struct of_drconf_cell_v2 { > #define DRCONF_MEM_RESERVED 0x0080 > #define DRCONF_MEM_HOTREMOVABLE 0x0100 > > -static inline u32 drmem_lmb_size(void) > +static inline u64 drmem_lmb_size(void) > { > return drmem_info->lmb_size; > } Looks fine. Acked-by: Nathan Lynch
Re: [PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero
On 20/08/2020 at 15:10, Guohua Zhong wrote: When cat /proc/pid/stat is run, do_task_stat will call into cputime_adjust, whose call stack looks like this: [17179954.674326]BookE Watchdog detected hard LOCKUP on cpu 0 [17179954.674331]dCPU: 0 PID: 1262 Comm: TICK Tainted: PW O4.4.176 #1 [17179954.674339]dtask: dc9d7040 task.stack: d3cb4000 [17179954.674344]NIP: c001b1a8 LR: c006a7ac CTR: [17179954.674349]REGS: e6fe1f10 TRAP: 3202 Tainted: PW O (4.4.176) [17179954.674355]MSR: 00021002 CR: 28002224 XER: [17179954.674364] GPR00: 0016 d3cb5cb0 dc9d7040 d3cb5cc0 025d ffe15b24 GPR08: de86aead 03ff 2800 0084d1c0 GPR16: b5929ca0 b4bb7a48 c0863c08 048d 0062 0062 000f GPR24: d3cb5d08 d3cb5d60 d3cb5d64 00029002 d3e9c214 f30e d3e9c20c [17179954.674410]NIP [c001b1a8] __div64_32+0x60/0xa0 [17179954.674422]LR [c006a7ac] cputime_adjust+0x124/0x138 [17179954.674434]Call Trace: [17179961.832693]Call Trace: [17179961.832695][d3cb5cb0] [c006a6dc] cputime_adjust+0x54/0x138 (unreliable) [17179961.832705][d3cb5cf0] [c006a818] task_cputime_adjusted+0x58/0x80 [17179961.832713][d3cb5d20] [c01dab44] do_task_stat+0x298/0x870 [17179961.832720][d3cb5de0] [c01d4948] proc_single_show+0x60/0xa4 [17179961.832728][d3cb5e10] [c01963d8] seq_read+0x2d8/0x52c [17179961.832736][d3cb5e80] [c01702fc] __vfs_read+0x40/0x114 [17179961.832744][d3cb5ef0] [c0170b1c] vfs_read+0x9c/0x10c [17179961.832751][d3cb5f10] [c0171440] SyS_read+0x68/0xc4 [17179961.832759][d3cb5f40] [c0010a40] ret_from_syscall+0x0/0x3c do_task_stat->task_cputime_adjusted->cputime_adjust->scale_stime->div_u64 ->div_u64_rem->do_div->__div64_32 In some corner cases, stime + utime = 0 on overflow. Even in the v5.8.2 kernel, where cputime has changed from unsigned long to the u64 data type, after about 200 days the lower 32 bits will be 0x. Because the divisor for __div64_32 is of unsigned long data type, which is 32 bits on 32-bit powerpc, the bug still exists. So it is also a bug in cputime_adjust, which does not check whether stime + utime = 0: time = scale_stime((__force u64)stime, (__force u64)rtime, (__force u64)(stime + utime)); The commit 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") in the mainline kernel may have fixed this case. But it is also better to check whether the divisor is 0 in __div64_32 for other situations. Signed-off-by: Guohua Zhong Fixes: 14cf11af6cf6 ("powerpc: Merge enough to start building in arch/powerpc.") Fixes: 94b212c29f68 ("powerpc: Move ppc64 boot wrapper code over to arch/powerpc") Cc: sta...@vger.kernel.org # v2.6.15+ --- arch/powerpc/boot/div64.S | 4 ++++ arch/powerpc/lib/div64.S | 4 ++++ 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/boot/div64.S b/arch/powerpc/boot/div64.S index 4354928ed62e..39a25b9712d1 100644 --- a/arch/powerpc/boot/div64.S +++ b/arch/powerpc/boot/div64.S @@ -13,6 +13,9 @@ .globl __div64_32 __div64_32: + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero In the generic version in lib/math/div64.c, there is no checking of 'base' either. Do we really want to add this check in the powerpc version only? The only user of __div64_32() is do_div() in include/asm-generic/div64.h. Wouldn't it be better to do the check there?
Christophe lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,6 +55,7 @@ __div64_32: 4:stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr /* diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S index 3d5426e7dcc4..1cc9bcabf678 100644 --- a/arch/powerpc/lib/div64.S +++ b/arch/powerpc/lib/div64.S @@ -13,6 +13,9 @@ #include _GLOBAL(__div64_32) + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,4 +55,5 @@ _GLOBAL(__div64_32) 4:stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr
Re: [PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero
On 20/08/2020 at 15:10, Guohua Zhong wrote: When cat /proc/pid/stat is run, do_task_stat will call into cputime_adjust, whose call stack looks like this: [17179954.674326]BookE Watchdog detected hard LOCKUP on cpu 0 [17179954.674331]dCPU: 0 PID: 1262 Comm: TICK Tainted: PW O4.4.176 #1 [17179954.674339]dtask: dc9d7040 task.stack: d3cb4000 [17179954.674344]NIP: c001b1a8 LR: c006a7ac CTR: [17179954.674349]REGS: e6fe1f10 TRAP: 3202 Tainted: PW O (4.4.176) [17179954.674355]MSR: 00021002 CR: 28002224 XER: [17179954.674364] GPR00: 0016 d3cb5cb0 dc9d7040 d3cb5cc0 025d ffe15b24 GPR08: de86aead 03ff 2800 0084d1c0 GPR16: b5929ca0 b4bb7a48 c0863c08 048d 0062 0062 000f GPR24: d3cb5d08 d3cb5d60 d3cb5d64 00029002 d3e9c214 f30e d3e9c20c [17179954.674410]NIP [c001b1a8] __div64_32+0x60/0xa0 [17179954.674422]LR [c006a7ac] cputime_adjust+0x124/0x138 [17179954.674434]Call Trace: [17179961.832693]Call Trace: [17179961.832695][d3cb5cb0] [c006a6dc] cputime_adjust+0x54/0x138 (unreliable) [17179961.832705][d3cb5cf0] [c006a818] task_cputime_adjusted+0x58/0x80 [17179961.832713][d3cb5d20] [c01dab44] do_task_stat+0x298/0x870 [17179961.832720][d3cb5de0] [c01d4948] proc_single_show+0x60/0xa4 [17179961.832728][d3cb5e10] [c01963d8] seq_read+0x2d8/0x52c [17179961.832736][d3cb5e80] [c01702fc] __vfs_read+0x40/0x114 [17179961.832744][d3cb5ef0] [c0170b1c] vfs_read+0x9c/0x10c [17179961.832751][d3cb5f10] [c0171440] SyS_read+0x68/0xc4 [17179961.832759][d3cb5f40] [c0010a40] ret_from_syscall+0x0/0x3c do_task_stat->task_cputime_adjusted->cputime_adjust->scale_stime->div_u64 ->div_u64_rem->do_div->__div64_32 In some corner cases, stime + utime = 0 on overflow. Even in the v5.8.2 kernel, where cputime has changed from unsigned long to the u64 data type, after about 200 days the lower 32 bits will be 0x. Because the divisor for __div64_32 is of unsigned long data type, which is 32 bits on 32-bit powerpc, the bug still exists. So it is also a bug in cputime_adjust, which does not check whether stime + utime = 0: time = scale_stime((__force u64)stime, (__force u64)rtime, (__force u64)(stime + utime)); The commit 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") in the mainline kernel may have fixed this case. But it is also better to check whether the divisor is 0 in __div64_32 for other situations. Signed-off-by: Guohua Zhong Fixes: 14cf11af6cf6 ("powerpc: Merge enough to start building in arch/powerpc.") Fixes: 94b212c29f68 ("powerpc: Move ppc64 boot wrapper code over to arch/powerpc") Cc: sta...@vger.kernel.org # v2.6.15+ --- arch/powerpc/boot/div64.S | 4 ++++ arch/powerpc/lib/div64.S | 4 ++++ 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/boot/div64.S b/arch/powerpc/boot/div64.S index 4354928ed62e..39a25b9712d1 100644 --- a/arch/powerpc/boot/div64.S +++ b/arch/powerpc/boot/div64.S @@ -13,6 +13,9 @@ .globl __div64_32 __div64_32: + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,6 +55,7 @@ __div64_32: 4:stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr /* diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S index 3d5426e7dcc4..1cc9bcabf678 100644 --- a/arch/powerpc/lib/div64.S +++ b/arch/powerpc/lib/div64.S @@ -13,6 +13,9 @@ #include _GLOBAL(__div64_32) + li r9,0 You don't need to load r9 with 0, use cmplwi instead.
+ cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero You should leave space between the compare and the branch (i.e. have other instructions in between when possible), so that the processor can prepare the branching and make a good prediction. Same as the compare below: you see that there are two other instructions between the cmplw and the blt. You could also use another cr field than cr0 in order to nest several test/branch pairs. This also matters because on recent powerpc32 cores, instructions are fetched and executed two by two. lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,4 +55,5 @@ _GLOBAL(__div64_32) 4:stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr Christophe
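Folding both suggestions together, the entry sequence would look something like this (untested sketch):

	_GLOBAL(__div64_32)
		cmplwi	r4,0		# compare divisor against immediate 0, no scratch register
		lwz	r5,0(r3)	# get the dividend into r5/r6 (these loads also
		lwz	r6,4(r3)	# keep the compare and the branch apart)
		beq	5f		# divisor is zero: branch to the common return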
[PATCH] powerpc: Fix a bug in __div64_32 if divisor is zero
When cat /proc/pid/stat is run, do_task_stat will call into cputime_adjust, whose call stack looks like this: [17179954.674326]BookE Watchdog detected hard LOCKUP on cpu 0 [17179954.674331]dCPU: 0 PID: 1262 Comm: TICK Tainted: PW O4.4.176 #1 [17179954.674339]dtask: dc9d7040 task.stack: d3cb4000 [17179954.674344]NIP: c001b1a8 LR: c006a7ac CTR: [17179954.674349]REGS: e6fe1f10 TRAP: 3202 Tainted: PW O (4.4.176) [17179954.674355]MSR: 00021002 CR: 28002224 XER: [17179954.674364] GPR00: 0016 d3cb5cb0 dc9d7040 d3cb5cc0 025d ffe15b24 GPR08: de86aead 03ff 2800 0084d1c0 GPR16: b5929ca0 b4bb7a48 c0863c08 048d 0062 0062 000f GPR24: d3cb5d08 d3cb5d60 d3cb5d64 00029002 d3e9c214 f30e d3e9c20c [17179954.674410]NIP [c001b1a8] __div64_32+0x60/0xa0 [17179954.674422]LR [c006a7ac] cputime_adjust+0x124/0x138 [17179954.674434]Call Trace: [17179961.832693]Call Trace: [17179961.832695][d3cb5cb0] [c006a6dc] cputime_adjust+0x54/0x138 (unreliable) [17179961.832705][d3cb5cf0] [c006a818] task_cputime_adjusted+0x58/0x80 [17179961.832713][d3cb5d20] [c01dab44] do_task_stat+0x298/0x870 [17179961.832720][d3cb5de0] [c01d4948] proc_single_show+0x60/0xa4 [17179961.832728][d3cb5e10] [c01963d8] seq_read+0x2d8/0x52c [17179961.832736][d3cb5e80] [c01702fc] __vfs_read+0x40/0x114 [17179961.832744][d3cb5ef0] [c0170b1c] vfs_read+0x9c/0x10c [17179961.832751][d3cb5f10] [c0171440] SyS_read+0x68/0xc4 [17179961.832759][d3cb5f40] [c0010a40] ret_from_syscall+0x0/0x3c do_task_stat->task_cputime_adjusted->cputime_adjust->scale_stime->div_u64 ->div_u64_rem->do_div->__div64_32 In some corner cases, stime + utime = 0 on overflow. Even in the v5.8.2 kernel, where cputime has changed from unsigned long to the u64 data type, after about 200 days the lower 32 bits will be 0x. Because the divisor for __div64_32 is of unsigned long data type, which is 32 bits on 32-bit powerpc, the bug still exists. So it is also a bug in cputime_adjust, which does not check whether stime + utime = 0: time = scale_stime((__force u64)stime, (__force u64)rtime, (__force u64)(stime + utime)); The commit 3dc167ba5729 ("sched/cputime: Improve cputime_adjust()") in the mainline kernel may have fixed this case. But it is also better to check whether the divisor is 0 in __div64_32 for other situations.
Signed-off-by: Guohua Zhong Fixes: 14cf11af6cf6 ("powerpc: Merge enough to start building in arch/powerpc.") Fixes: 94b212c29f68 ("powerpc: Move ppc64 boot wrapper code over to arch/powerpc") Cc: sta...@vger.kernel.org # v2.6.15+ --- arch/powerpc/boot/div64.S | 4 ++++ arch/powerpc/lib/div64.S | 4 ++++ 2 files changed, 8 insertions(+) diff --git a/arch/powerpc/boot/div64.S b/arch/powerpc/boot/div64.S index 4354928ed62e..39a25b9712d1 100644 --- a/arch/powerpc/boot/div64.S +++ b/arch/powerpc/boot/div64.S @@ -13,6 +13,9 @@ .globl __div64_32 __div64_32: + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,6 +55,7 @@ __div64_32: 4: stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr /* diff --git a/arch/powerpc/lib/div64.S b/arch/powerpc/lib/div64.S index 3d5426e7dcc4..1cc9bcabf678 100644 --- a/arch/powerpc/lib/div64.S +++ b/arch/powerpc/lib/div64.S @@ -13,6 +13,9 @@ #include _GLOBAL(__div64_32) + li r9,0 + cmplw r4,r9 # check if divisor r4 is zero + beq 5f # jump to label 5 if r4(divisor) is zero lwz r5,0(r3)# get the dividend into r5/r6 lwz r6,4(r3) cmplw r5,r4 @@ -52,4 +55,5 @@ _GLOBAL(__div64_32) 4: stw r7,0(r3)# return the quotient in *r3 stw r8,4(r3) mr r3,r6 # return the remainder in r3 +5: # return if divisor r4 is zero blr -- 2.12.3
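The failure mode described above is easy to demonstrate in isolation; a userspace sketch (values invented) of how a 64-bit stime + utime sum can present a zero 32-bit divisor to do_div() on ppc32:

	#include <stdio.h>

	int main(void)
	{
		/* hypothetical cputimes whose sum is an exact multiple of 2^32 */
		unsigned long long stime = 0x100000000ULL - 7;
		unsigned long long utime = 7;

		/* do_div() on 32-bit takes a 32-bit base, so only the low half
		 * of the sum reaches __div64_32 - here it is exactly 0 */
		unsigned int divisor = (unsigned int)(stime + utime);

		printf("divisor seen by __div64_32: %u\n", divisor);
		return 0;
	}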
Re: [PATCH v2 07/13] mm/debug_vm_pgtable/set_pte/pmd/pud: Don't use set_*_at to update an existing pte entry
On 19/08/2020 at 15:01, Aneesh Kumar K.V wrote: set_pte_at() should not be used to set a pte entry at locations that already hold a valid pte entry. Architectures like ppc64 don't do TLB invalidate in set_pte_at() and hence expect it to be used to set locations that are not a valid PTE. Signed-off-by: Aneesh Kumar K.V --- mm/debug_vm_pgtable.c | 35 +++ 1 file changed, 15 insertions(+), 20 deletions(-) diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index 76f4c713e5a3..9c7e2c9cfc76 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -74,15 +74,18 @@ static void __init pte_advanced_tests(struct mm_struct *mm, { pte_t pte = pfn_pte(pfn, prot); + /* +* Architectures optimize set_pte_at by avoiding TLB flush. +* This requires set_pte_at to be not used to update an +* existing pte entry. Clear pte before we do set_pte_at +*/ + pr_debug("Validating PTE advanced\n"); pte = pfn_pte(pfn, prot); set_pte_at(mm, vaddr, ptep, pte); ptep_set_wrprotect(mm, vaddr, ptep); pte = ptep_get(ptep); WARN_ON(pte_write(pte)); - - pte = pfn_pte(pfn, prot); - set_pte_at(mm, vaddr, ptep, pte); ptep_get_and_clear(mm, vaddr, ptep); pte = ptep_get(ptep); WARN_ON(!pte_none(pte)); @@ -96,13 +99,11 @@ static void __init pte_advanced_tests(struct mm_struct *mm, ptep_set_access_flags(vma, vaddr, ptep, pte, 1); pte = ptep_get(ptep); WARN_ON(!(pte_write(pte) && pte_dirty(pte))); - - pte = pfn_pte(pfn, prot); - set_pte_at(mm, vaddr, ptep, pte); ptep_get_and_clear_full(mm, vaddr, ptep, 1); pte = ptep_get(ptep); WARN_ON(!pte_none(pte)); + pte = pfn_pte(pfn, prot); pte = pte_mkyoung(pte); set_pte_at(mm, vaddr, ptep, pte); ptep_test_and_clear_young(vma, vaddr, ptep); @@ -164,9 +165,6 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, pmdp_set_wrprotect(mm, vaddr, pmdp); pmd = READ_ONCE(*pmdp); WARN_ON(pmd_write(pmd)); - - pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); - set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_huge_get_and_clear(mm, vaddr, pmdp); pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); @@ -180,13 +178,11 @@ static void __init pmd_advanced_tests(struct mm_struct *mm, pmdp_set_access_flags(vma, vaddr, pmdp, pmd, 1); pmd = READ_ONCE(*pmdp); WARN_ON(!(pmd_write(pmd) && pmd_dirty(pmd))); - - pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); - set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_huge_get_and_clear_full(vma, vaddr, pmdp, 1); pmd = READ_ONCE(*pmdp); WARN_ON(!pmd_none(pmd)); + pmd = pmd_mkhuge(pfn_pmd(pfn, prot)); pmd = pmd_mkyoung(pmd); set_pmd_at(mm, vaddr, pmdp, pmd); pmdp_test_and_clear_young(vma, vaddr, pmdp); @@ -283,18 +279,10 @@ static void __init pud_advanced_tests(struct mm_struct *mm, WARN_ON(pud_write(pud)); #ifndef __PAGETABLE_PMD_FOLDED Same as below: once set_pud_at() is gone, I don't think this #ifndef __PAGETABLE_PMD_FOLDED is still needed; it should be possible to replace it with 'if (mm_pmd_folded())' - - pud = pud_mkhuge(pfn_pud(pfn, prot)); - set_pud_at(mm, vaddr, pudp, pud); pudp_huge_get_and_clear(mm, vaddr, pudp); pud = READ_ONCE(*pudp); WARN_ON(!pud_none(pud)); - pud = pud_mkhuge(pfn_pud(pfn, prot)); - set_pud_at(mm, vaddr, pudp, pud); - pudp_huge_get_and_clear_full(mm, vaddr, pudp, 1); - pud = READ_ONCE(*pudp); - WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ pud = pud_mkhuge(pfn_pud(pfn, prot)); @@ -307,6 +295,13 @@ static void __init pud_advanced_tests(struct mm_struct *mm, pud = READ_ONCE(*pudp); WARN_ON(!(pud_write(pud) && pud_dirty(pud))); +#ifndef __PAGETABLE_PMD_FOLDED + pudp_huge_get_and_clear_full(vma, vaddr, pudp, 1); + pud = READ_ONCE(*pudp); + WARN_ON(!pud_none(pud));
+#endif /* __PAGETABLE_PMD_FOLDED */ pudp_huge_get_and_clear_full() and pud_none() are always defined, I think this #ifndef can be replaced by an 'if (mm_pmd_folded())' + + pud = pud_mkhuge(pfn_pud(pfn, prot)); pud = pud_mkyoung(pud); set_pud_at(mm, vaddr, pudp, pud); pudp_test_and_clear_young(vma, vaddr, pudp); Christophe
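Concretely, the suggestion amounts to something like the following (sketch only; note that mm_pmd_folded() takes the mm, and whether it is appropriate at this exact spot is for the author to confirm):

	if (!mm_pmd_folded(mm)) {
		pudp_huge_get_and_clear_full(vma, vaddr, pudp, 1);
		pud = READ_ONCE(*pudp);
		WARN_ON(!pud_none(pud));
	}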
Re: [PATCH v2] powerpc/pseries: Do not initiate shutdown when system is running on UPS
On Thu, 20 Aug 2020 11:48:44 +0530, Vasant Hegde wrote: > As per PAPR we have to look for both EPOW sensor value and event modifier to > identify type of event and take appropriate action. > > Sensor value = 3 (EPOW_SYSTEM_SHUTDOWN) schedule system to be shutdown after > OS defined delay (default 10 mins). > > EPOW Event Modifier for sensor value = 3: >We have to initiate immediate shutdown for most of the event modifier > except >value = 2 (system running on UPS). > > [...] Applied to powerpc/fixes. [1/1] powerpc/pseries: Do not initiate shutdown when system is running on UPS https://git.kernel.org/powerpc/c/90a9b102eddf6a3f987d15f4454e26a2532c1c98 cheers
Re: [PATCH] powerpc/perf: Account for interrupts during PMC overflow for an invalid SIAR check
On Thu, 6 Aug 2020 08:46:32 -0400, Athira Rajeev wrote: > Performance monitor interrupt handler checks if any counter has overflown > and calls `record_and_restart` in core-book3s which invokes > `perf_event_overflow` to record the sample information. > Apart from creating sample, perf_event_overflow also does the interrupt > and period checks via perf_event_account_interrupt. > > Currently we record information only if the SIAR valid bit is set > ( using `siar_valid` check ) and hence the interrupt check. > But it is possible that we do sampling for some events that are not > generating valid SIAR and hence there is no chance to disable the event > if interrupts is more than max_samples_per_tick. This leads to soft lockup. > > [...] Applied to powerpc/fixes. [1/1] powerpc/perf: Fix soft lockups due to missed interrupt accounting https://git.kernel.org/powerpc/c/17899eaf88d689529b866371344c8f269ba79b5f cheers
Re: [PATCH] powerpc/powernv/pci: Fix typo when releasing DMA resources
On Wed, 19 Aug 2020 15:07:41 +0200, Frederic Barrat wrote: > Fix typo introduced during recent code cleanup, which could lead to > silently not freeing resources or oops message (on PCI hotplug or CAPI > reset). > Only impacts ioda2, the code path for ioda1 is correct. Applied to powerpc/fixes. [1/1] powerpc/powernv/pci: Fix possible crash when releasing DMA resources https://git.kernel.org/powerpc/c/e17a7c0e0aebb956719ce2a8465f649859c2da7d cheers
Re: [PATCH v2] powerpc/pseries: Do not initiate shutdown when system is running on UPS
Vasant Hegde writes: > As per PAPR we have to look for both EPOW sensor value and event modifier to > identify type of event and take appropriate action. > > Sensor value = 3 (EPOW_SYSTEM_SHUTDOWN) schedule system to be shutdown after > OS defined delay (default 10 mins). > > EPOW Event Modifier for sensor value = 3: >We have to initiate immediate shutdown for most of the event modifier > except >value = 2 (system running on UPS). > > Checking with firmware document its clear that we have to wait for predefined > time before initiating shutdown. If power is restored within time we should > cancel the shutdown process. I think commit 79872e35 accidently enabled > immediate poweroff for EPOW_SHUTDOWN_ON_UPS event. It's not that clear to me :) LoPAPR v1.1 section 10.2.2 includes table 136 "EPOW Action Codes": SYSTEM_SHUTDOWN 3 The system must be shut down. An EPOW-aware OS logs the EPOW error log information, then schedules the system to be shut down to begin after an OS defined delay internal (default is 10 minutes.) And then in section 10.3.2.2.8 there is table 146 "Platform Event Log Format, Version 6, EPOW Section", which includes the "EPOW Event Modifier": For EPOW sensor value = 3 0x01 = Normal system shutdown with no additional delay 0x02 = Loss of utility power, system is running on UPS/Battery 0x03 = Loss of system critical functions, system should be shutdown 0x04 = Ambient temperature too high All other values = reserved There is also section 7.3.6.4 which includes a note saying: 2. The report that a system needs to be shutdown due to running under a UPS would be given by the platform as an EPOW event with EPOW event modifier being given as, 0x02 = Loss of utility power, system is running on UPS/Battery, as described in section Section 10.3.2.2.8‚ “Platform Event Log Format, EPOW Section‚” on page 308. So the only mention of the 10 minutes is in relation to all SYSTEM_SHUTDOWN events. ie. according to that we should not be doing an immediate shutdown for any of the events. > We have user space tool (rtas_errd) on LPAR to monitor for > EPOW_SHUTDOWN_ON_UPS. > Once it gets event it initiates shutdown after predefined time. Also starts > monitoring for any new EPOW events. If it receives "Power restored" event > before predefined time it will cancel the shutdown. Otherwise after > predefined time it will shutdown the system. What event are you referring to as the "Power restored" event? AFAICS PAPR just says we "may" receive an EPOW_RESET. I can't see anything else about what we're supposed to do if power is restored. Anyway I'm not opposed to the change, but I don't think it's correct to say that PAPR defines the behaviour. Rather we used to implement a certain behaviour, and we have at least one customer who relies on that old behaviour and dislikes the new behaviour. It's also generally good to defer decisions like this to userspace, so that administrators can customise the behaviour. Anyway I'll massage the change log a bit to incorporate some of the above and apply it. cheers > Fixes: 79872e35 (powerpc/pseries: All events of EPOW_SYSTEM_SHUTDOWN must > initiate shutdown) > Cc: sta...@vger.kernel.org # v4.0+ > Cc: Tyrel Datwyler > Cc: Michael Ellerman > Signed-off-by: Vasant Hegde > --- > Changes in v2: > - Updated patch description based on mpe, Tyrel comment. 
> > -Vasant > arch/powerpc/platforms/pseries/ras.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/powerpc/platforms/pseries/ras.c > b/arch/powerpc/platforms/pseries/ras.c > index f3736fcd98fc..13c86a292c6d 100644 > --- a/arch/powerpc/platforms/pseries/ras.c > +++ b/arch/powerpc/platforms/pseries/ras.c > @@ -184,7 +184,6 @@ static void handle_system_shutdown(char event_modifier) > case EPOW_SHUTDOWN_ON_UPS: > pr_emerg("Loss of system power detected. System is running on" >" UPS/battery. Check RTAS error log for details\n"); > - orderly_poweroff(true); > break; > > case EPOW_SHUTDOWN_LOSS_OF_CRITICAL_FUNCTIONS: > -- > 2.26.2
Re: [PATCH] kernel/watchdog: fix warning -Wunused-variable for watchdog_allowed_mask in ppc64
On Fri 2020-08-14 19:03:30, Balamuruhan S wrote: > In ppc64 config if `CONFIG_SOFTLOCKUP_DETECTOR` is not set then it > warns for unused declaration of `watchdog_allowed_mask` while building, > move the declaration inside ifdef later in the code. > > ``` > kernel/watchdog.c:47:23: warning: 'watchdog_allowed_mask' defined but not > used [-Wunused-variable] > static struct cpumask watchdog_allowed_mask __read_mostly; > ``` > > Signed-off-by: Balamuruhan S > --- > kernel/watchdog.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > index 5abb5b22ad13..33c9b8a3d51b 100644 > --- a/kernel/watchdog.c > +++ b/kernel/watchdog.c > @@ -44,7 +44,6 @@ int __read_mostly soft_watchdog_user_enabled = 1; > int __read_mostly watchdog_thresh = 10; > static int __read_mostly nmi_watchdog_available; > > -static struct cpumask watchdog_allowed_mask __read_mostly; > > struct cpumask watchdog_cpumask __read_mostly; > unsigned long *watchdog_cpumask_bits = cpumask_bits(&watchdog_cpumask); > @@ -166,6 +165,7 @@ int __read_mostly sysctl_softlockup_all_cpu_backtrace; > unsigned int __read_mostly softlockup_panic = > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE; > > +static struct cpumask watchdog_allowed_mask __read_mostly; I could confirm that the variable is used only in code that is built when CONFIG_SOFTLOCKUP_DETECTOR is enabled. Note that the problem can't be seen on x86. There the softlockup detector is enforced together with the hardlockup detector via HARDLOCKUP_DETECTOR_PERF. Reviewed-by: Petr Mladek Best Regards, Petr
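An alternative to relocating the line would have been an explicit guard at the original location, along these lines (sketch):

	#ifdef CONFIG_SOFTLOCKUP_DETECTOR
	static struct cpumask watchdog_allowed_mask __read_mostly;
	#endif

Moving the declaration next to its users, as the patch does, avoids adding another #ifdef block and reads the same way.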
Re: [PATCH] powerpc/powernv/pci: Fix typo when releasing DMA resources
On 20/08/2020 at 06:18, Michael Ellerman wrote: I changed the subject to: powerpc/powernv/pci: Fix possible crash when releasing DMA resources Much better, thanks! Fred