Re: [PATCH 0/6] Memory corruption may occur due to incorrect tlb flush
On Thu, Feb 20, 2020 at 11:04:51AM +0530, Santosh Sivaraj wrote:
> The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve TLB/PWC
> flushes) may result in random memory corruption. Any concurrent page-table
> walk could end up with a Use-after-Free. Even on UP this might give issues,
> since mmu_gather is preemptible these days. An interrupt or preempted task
> accessing user pages might stumble into the free page if the hardware
> caches page directories.
>
> The series is a backport of the fix sent by Peter [1].
>
> The first three patches are dependencies for the last patch (avoid potential
> double flush). If the performance impact due to double flush is considered
> trivial then the first three patches and last patch may be dropped.
>
> [1] https://patchwork.kernel.org/cover/11284843/

Can you resend these with the git commit ids of the upstream patches in
them, and say what stable tree(s) you wish to have them applied to?

thanks,

greg k-h
[PATCH 6/6] asm-generic/tlb: avoid potential double flush
From: Peter Zijlstra

Aneesh reported that:

	tlb_flush_mmu()
	  tlb_flush_mmu_tlbonly()
	    tlb_flush()			<-- #1
	  tlb_flush_mmu_free()
	    tlb_table_flush()
	      tlb_table_invalidate()
	        tlb_flush_mmu_tlbonly()
	          tlb_flush()		<-- #2

does two TLBIs when tlb->fullmm, because __tlb_reset_range() will not
clear tlb->end in that case.

Observe that any caller to __tlb_adjust_range() also sets at least one
of the tlb->freed_tables || tlb->cleared_p* bits, and those are
unconditionally cleared by __tlb_reset_range().

Change the condition for actually issuing TLBI to having one of those
bits set, as opposed to having tlb->end != 0.

Link: http://lkml.kernel.org/r/20200116064531.483522-4-aneesh.ku...@linux.ibm.com
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Aneesh Kumar K.V
Reported-by: "Aneesh Kumar K.V"
Cc: # 4.19
Signed-off-by: Santosh Sivaraj
[santosh: backported to 4.19 stable]
---
 include/asm-generic/tlb.h | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 19934cdd143e..427a70c56ddd 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -179,7 +179,12 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-	if (!tlb->end)
+	/*
+	 * Anything calling __tlb_adjust_range() also sets at least one of
+	 * these bits.
+	 */
+	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
+	      tlb->cleared_puds || tlb->cleared_p4ds))
 		return;
 
 	tlb_flush(tlb);
-- 
2.24.1
[PATCH 5/6] mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush
From: Peter Zijlstra

Architectures for which we have hardware walkers of the Linux page
tables should flush the TLB on mmu_gather batch allocation failures and
batch flush. Some architectures like POWER support multiple translation
modes (hash and radix), and in the case of POWER only the radix
translation mode needs the above TLBI. This is because for the hash
translation mode the kernel wants to avoid this extra flush, since there
are no hardware walkers of the Linux page tables. With radix
translation, the hardware also walks the Linux page tables, and with
that the kernel needs to make sure to TLB-invalidate the page walk cache
before page table pages are freed.

More details in commit d86564a2f085 ("mm/tlb, x86/mm: Support
invalidating TLB caches for RCU_TABLE_FREE")

The changes to sparc are to make sure we keep the old behavior since we
are now removing HAVE_RCU_TABLE_NO_INVALIDATE. The default value for
tlb_needs_table_invalidate is to always force an invalidate, and sparc
can avoid the table invalidate. Hence we define
tlb_needs_table_invalidate to false for the sparc architecture.
Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.ku...@linux.ibm.com
Fixes: a46cc7a90fd8 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel)
Cc: # 4.19
Signed-off-by: Santosh Sivaraj
[santosh: backported to 4.19 stable]
---
 arch/Kconfig                    | 3 ---
 arch/powerpc/Kconfig            | 1 -
 arch/powerpc/include/asm/tlb.h  | 11 +++
 arch/sparc/Kconfig              | 1 -
 arch/sparc/include/asm/tlb_64.h | 9 +
 include/asm-generic/tlb.h       | 15 +++
 mm/memory.c                     | 16
 7 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 061a12b8140e..3abbdb0cea44 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,9 +363,6 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
 	bool
 
-config HAVE_RCU_TABLE_NO_INVALIDATE
-	bool
-
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index fa231130eee1..b6429f53835e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,7 +216,6 @@ config PPC
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE
-	select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
 	select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index f0e571b2dc7c..63418275f402 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -30,6 +30,17 @@
 #define tlb_remove_check_page_size_change tlb_remove_check_page_size_change
 
 extern void tlb_flush(struct mmu_gather *tlb);
 
+/*
+ * book3s:
+ * Hash does not use the linux page-tables, so we can avoid
+ * the TLB invalidate for page-table freeing, Radix otoh does use the
+ * page-tables and needs the TLBI.
+ *
+ * nohash:
+ * We still do TLB invalidate in the __pte_free_tlb routine before we
+ * add the page table pages to mmu gather table batch.
+ */
+#define tlb_needs_table_invalidate()	radix_enabled()
+
 /* Get the generic bits... */
 #include
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index d90d632868aa..e6f2a38d2e61 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,7 +64,6 @@ config SPARC64
 	select HAVE_KRETPROBES
 	select HAVE_KPROBES
 	select HAVE_RCU_TABLE_FREE if SMP
-	select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_DYNAMIC_FTRACE
diff --git a/arch/sparc/include/asm/tlb_64.h b/arch/sparc/include/asm/tlb_64.h
index a2f3fa61ee36..8cb8f3833239 100644
--- a/arch/sparc/include/asm/tlb_64.h
+++ b/arch/sparc/include/asm/tlb_64.h
@@ -28,6 +28,15 @@ void flush_tlb_pending(void);
 #define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
 #define tlb_flush(tlb)	flush_tlb_pending()
 
+/*
+ * SPARC64's hardware TLB fill does not use the Linux page-tables
+ * and therefore we don't need a TLBI when freeing page-table pages.
+ */
+
+#ifdef CONFIG_HAVE_RCU_TABLE_FREE
+#define tlb_needs_table_invalidate()	(false)
+#endif
+
 #include
 
 #endif /* _SPARC64_TLB_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index f2b9dc9cbaf8..19934cdd143e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -61,8 +61,23 @@ struct mmu_table_batch {
 extern void tlb_table_flush(struct mmu_gather *tlb);
 extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
 
+/*
+ * This allows an architecture that does not use the linux page-tables for
+ * hardware to skip the TLBI when freeing page tables.
+ */
+#ifndef tlb_needs_table_invalidate
+#define tlb_needs_table_invalidate() (true)
+#e
[PATCH 4/6] powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case
From: "Aneesh Kumar K.V"

Patch series "Fixup page directory freeing", v4.

This is a repost of the patch series from Peter with the arch-specific
changes except ppc64 dropped. ppc64 changes are added here because we
are redoing the patch series on top of ppc64 changes. This makes it easy
to backport these changes. Only the first 2 patches need to be
backported to stable.

The thing is, on anything SMP, freeing page directories should observe
the exact same order as normal page freeing:

 1) unhook page/directory
 2) TLB invalidate
 3) free page/directory

Without this, any concurrent page-table walk could end up with a
Use-after-Free. This is esp. trivial for anything that has software
page-table walkers (HAVE_FAST_GUP / software TLB fill) or the hardware
caches partial page-walks (ie. caches page directories).

Even on UP this might give issues since mmu_gather is preemptible these
days. An interrupt or preempted task accessing user pages might stumble
into the free page if the hardware caches page directories.

This patch series fixes ppc64 and adds generic MMU_GATHER changes to
support the conversion of other architectures. I haven't added patches
w.r.t. other architectures because they are yet to be acked.

This patch (of 9):

A followup patch is going to make sure we correctly invalidate the page
walk cache before we free page table pages. In order to keep things
simple, enable RCU_TABLE_FREE even for !SMP so that we don't have to fix
up the !SMP case differently in the followup patch.

The !SMP case is right now broken for radix translation w.r.t. the page
walk cache flush. We can get interrupted in between page table free and
that would imply we have page walk cache entries pointing to tables
which got freed already. Michael said "both our platforms that run on
Power9 force SMP on in Kconfig, so the !SMP case is unlikely to be a
problem for anyone in practice, unless they've hacked their kernel to
build it !SMP."
Link: http://lkml.kernel.org/r/20200116064531.483522-2-aneesh.ku...@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V
Cc: # 4.19
Signed-off-by: Santosh Sivaraj
[santosh: backported for 4.19 stable]
---
 arch/powerpc/Kconfig                         | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 --
 arch/powerpc/mm/pgtable-book3s64.c           | 7 ---
 4 files changed, 1 insertion(+), 18 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index f7f046ff6407..fa231130eee1 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -215,7 +215,7 @@ config PPC
 	select HAVE_HARDLOCKUP_DETECTOR_PERF	if PERF_EVENTS && HAVE_PERF_EVENTS_NMI && !HAVE_HARDLOCKUP_DETECTOR_ARCH
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
-	select HAVE_RCU_TABLE_FREE if SMP
+	select HAVE_RCU_TABLE_FREE
 	select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 82e44b1a00ae..79ba3fbb512e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -110,7 +110,6 @@ static inline void pgtable_free(void *table, unsigned index_size)
 #define check_pgt_cache()	do { } while (0)
 #define get_hugepd_cache_index(x)	(x)
 
-#ifdef CONFIG_SMP
 static inline void pgtable_free_tlb(struct mmu_gather *tlb,
 				    void *table, int shift)
 {
@@ -127,13 +126,6 @@ static inline void __tlb_remove_table(void *_table)
 	pgtable_free(table, shift);
 }
 
-#else
-static inline void pgtable_free_tlb(struct mmu_gather *tlb,
-				    void *table, int shift)
-{
-	pgtable_free(table, shift);
-}
-#endif
 
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
 				  unsigned long address)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index f9019b579903..1013c0214213 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -47,9 +47,7 @@ extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned long);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
-#ifdef CONFIG_SMP
 extern void __tlb_remove_table(void *_table);
-#endif
 
 static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 297db665d953..5b4e9fd8990c 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -432,7 +432,6 @@ static inline void pgtable_free(void
[PATCH 3/6] asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
From: Peter Zijlstra

Make issuing a TLB invalidate for page-table pages the normal case.

The reason is twofold:

 - too many invalidates is safer than too few,
 - most architectures use the linux page-tables natively
   and would thus require this.

Make it an opt-out, instead of an opt-in.

No change in behavior intended.

Signed-off-by: Peter Zijlstra (Intel)
Cc: # 4.19
Signed-off-by: Santosh Sivaraj
[santosh: prerequisite for upcoming tlbflush backports]
---
 arch/Kconfig         | 2 +-
 arch/powerpc/Kconfig | 1 +
 arch/sparc/Kconfig   | 1 +
 arch/x86/Kconfig     | 1 -
 mm/memory.c          | 2 +-
 5 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index a336548487e6..061a12b8140e 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -363,7 +363,7 @@ config HAVE_ARCH_JUMP_LABEL
 config HAVE_RCU_TABLE_FREE
 	bool
 
-config HAVE_RCU_TABLE_INVALIDATE
+config HAVE_RCU_TABLE_NO_INVALIDATE
 	bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a80669209155..f7f046ff6407 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -216,6 +216,7 @@ config PPC
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE if SMP
+	select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE if PPC64 && CPU_LITTLE_ENDIAN
 	select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e6f2a38d2e61..d90d632868aa 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -64,6 +64,7 @@ config SPARC64
 	select HAVE_KRETPROBES
 	select HAVE_KPROBES
 	select HAVE_RCU_TABLE_FREE if SMP
+	select HAVE_RCU_TABLE_NO_INVALIDATE if HAVE_RCU_TABLE_FREE
 	select HAVE_MEMBLOCK_NODE_MAP
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_DYNAMIC_FTRACE
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index af35f5caadbe..181d0d522977 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -181,7 +181,6 @@ config X86
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_RCU_TABLE_FREE if PARAVIRT
-	select HAVE_RCU_TABLE_INVALIDATE	if HAVE_RCU_TABLE_FREE
 	select HAVE_REGS_AND_STACK_ACCESS_API
 	select HAVE_RELIABLE_STACKTRACE if X86_64 && (UNWINDER_FRAME_POINTER || UNWINDER_ORC) && STACK_VALIDATION
 	select HAVE_STACKPROTECTOR if CC_HAS_SANE_STACKPROTECTOR
diff --git a/mm/memory.c b/mm/memory.c
index 1832c5ed6ac0..ba5689610c04 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -327,7 +327,7 @@ bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page, int page_
  */
 static inline void tlb_table_invalidate(struct mmu_gather *tlb)
 {
-#ifdef CONFIG_HAVE_RCU_TABLE_INVALIDATE
+#ifndef CONFIG_HAVE_RCU_TABLE_NO_INVALIDATE
 	/*
 	 * Invalidate page-table caches used by hardware walkers. Then we still
 	 * need to RCU-sched wait while freeing the pages because software
-- 
2.24.1
[PATCH 2/6] asm-generic/tlb: Track which levels of the page tables have been cleared
From: Will Deacon

It is common for architectures with hugepage support to require only a
single TLB invalidation operation per hugepage during unmap(), rather
than iterating through the mapping at a PAGE_SIZE increment. Currently,
however, the level in the page table where the unmap() operation occurs
is not stored in the mmu_gather structure, therefore forcing
architectures to issue additional TLB invalidation operations or to give
up and over-invalidate by e.g. invalidating the entire TLB.

Ideally, we could add an interval rbtree to the mmu_gather structure,
which would allow us to associate the correct mapping granule with the
various sub-mappings within the range being invalidated. However, this
is costly in terms of book-keeping and memory management, so instead we
approximate by keeping track of the page table levels that are cleared
and provide a means to query the smallest granule required for
invalidation.

Signed-off-by: Will Deacon
Cc: # 4.19
Signed-off-by: Santosh Sivaraj
[santosh: prerequisite for upcoming tlbflush backports]
---
 include/asm-generic/tlb.h | 58 +--
 mm/memory.c               |  4 ++-
 2 files changed, 53 insertions(+), 9 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 97306b32d8d2..f2b9dc9cbaf8 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -114,6 +114,14 @@ struct mmu_gather {
 	 */
 	unsigned int		freed_tables : 1;
 
+	/*
+	 * at which levels have we cleared entries?
+	 */
+	unsigned int		cleared_ptes : 1;
+	unsigned int		cleared_pmds : 1;
+	unsigned int		cleared_puds : 1;
+	unsigned int		cleared_p4ds : 1;
+
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
 	struct page		*__pages[MMU_GATHER_BUNDLE];
@@ -148,6 +156,10 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 		tlb->end = 0;
 	}
 	tlb->freed_tables = 0;
+	tlb->cleared_ptes = 0;
+	tlb->cleared_pmds = 0;
+	tlb->cleared_puds = 0;
+	tlb->cleared_p4ds = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -197,6 +209,25 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 }
 #endif
 
+static inline unsigned long tlb_get_unmap_shift(struct mmu_gather *tlb)
+{
+	if (tlb->cleared_ptes)
+		return PAGE_SHIFT;
+	if (tlb->cleared_pmds)
+		return PMD_SHIFT;
+	if (tlb->cleared_puds)
+		return PUD_SHIFT;
+	if (tlb->cleared_p4ds)
+		return P4D_SHIFT;
+
+	return PAGE_SHIFT;
+}
+
+static inline unsigned long tlb_get_unmap_size(struct mmu_gather *tlb)
+{
+	return 1UL << tlb_get_unmap_shift(tlb);
+}
+
 /*
  * In the case of tlb vma handling, we can optimise these away in the
  * case where we're doing a full MM flush. When we're doing a munmap,
@@ -230,13 +261,19 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_tlb_entry(tlb, ptep, address)		\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->cleared_ptes = 1;				\
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
-	do {							\
-		__tlb_adjust_range(tlb, address, huge_page_size(h)); \
-		__tlb_remove_tlb_entry(tlb, ptep, address);	\
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
+	do {							\
+		unsigned long _sz = huge_page_size(h);		\
+		__tlb_adjust_range(tlb, address, _sz);		\
+		if (_sz == PMD_SIZE)				\
+			tlb->cleared_pmds = 1;			\
+		else if (_sz == PUD_SIZE)			\
+			tlb->cleared_puds = 1;			\
+		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
 /**
@@ -250,6 +287,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
 	do {							\
 		__tlb_adjust_range(tlb, address, HPAGE_PMD_SIZE); \
+		tlb->cleared_pmds = 1;				\
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
 	} while (0)
 
@@ -264,6 +302,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define tlb_remove_pud_tlb_
[PATCH 0/6] Memory corruption may occur due to incorrect tlb flush
The TLB flush optimisation (a46cc7a90f: powerpc/mm/radix: Improve
TLB/PWC flushes) may result in random memory corruption. Any concurrent
page-table walk could end up with a Use-after-Free. Even on UP this
might give issues, since mmu_gather is preemptible these days. An
interrupt or preempted task accessing user pages might stumble into the
free page if the hardware caches page directories.

The series is a backport of the fix sent by Peter [1].

The first three patches are dependencies for the last patch (avoid
potential double flush). If the performance impact due to double flush
is considered trivial then the first three patches and last patch may be
dropped.

[1] https://patchwork.kernel.org/cover/11284843/

--

Aneesh Kumar K.V (1):
  powerpc/mmu_gather: enable RCU_TABLE_FREE even for !SMP case

Peter Zijlstra (4):
  asm-generic/tlb: Track freeing of page-table directories in struct
    mmu_gather
  asm-generic/tlb, arch: Invert CONFIG_HAVE_RCU_TABLE_INVALIDATE
  mm/mmu_gather: invalidate TLB correctly on batch allocation failure
    and flush
  asm-generic/tlb: avoid potential double flush

Will Deacon (1):
  asm-generic/tlb: Track which levels of the page tables have been
    cleared

 arch/Kconfig                                 | 3 -
 arch/powerpc/Kconfig                         | 2 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h | 8 --
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 2 -
 arch/powerpc/include/asm/tlb.h               | 11 ++
 arch/powerpc/mm/pgtable-book3s64.c           | 7 --
 arch/sparc/include/asm/tlb_64.h              | 9 ++
 arch/x86/Kconfig                             | 1 -
 include/asm-generic/tlb.h                    | 103 ---
 mm/memory.c                                  | 20 ++--
 10 files changed, 122 insertions(+), 44 deletions(-)

-- 
2.24.1
[PATCH 1/6] asm-generic/tlb: Track freeing of page-table directories in struct mmu_gather
From: Peter Zijlstra

Some architectures require different TLB invalidation instructions
depending on whether it is only the last-level of page table being
changed, or whether there are also changes to the intermediate
(directory) entries higher up the tree.

Add a new bit to the flags bitfield in struct mmu_gather so that the
architecture code can operate accordingly if it's the intermediate
levels being invalidated.

Signed-off-by: Peter Zijlstra
Signed-off-by: Will Deacon
Cc: # 4.19
Signed-off-by: Santosh Sivaraj
[santosh: prerequisite for tlbflush backports]
---
 include/asm-generic/tlb.h | 31 +++
 1 file changed, 23 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b3353e21f3b3..97306b32d8d2 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -97,12 +97,22 @@ struct mmu_gather {
 #endif
 	unsigned long		start;
 	unsigned long		end;
-	/* we are in the middle of an operation to clear
-	 * a full mm and can make some optimizations */
-	unsigned int		fullmm : 1,
-	/* we have performed an operation which
-	 * requires a complete flush of the tlb */
-				need_flush_all : 1;
+	/*
+	 * we are in the middle of an operation to clear
+	 * a full mm and can make some optimizations
+	 */
+	unsigned int		fullmm : 1;
+
+	/*
+	 * we have performed an operation which
+	 * requires a complete flush of the tlb
+	 */
+	unsigned int		need_flush_all : 1;
+
+	/*
+	 * we have removed page directories
+	 */
+	unsigned int		freed_tables : 1;
 
 	struct mmu_gather_batch *active;
 	struct mmu_gather_batch	local;
@@ -137,6 +147,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 		tlb->start = TASK_SIZE;
 		tlb->end = 0;
 	}
+	tlb->freed_tables = 0;
 }
 
 static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -278,6 +289,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 #endif
 
@@ -285,7 +297,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef pmd_free_tlb
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 #endif
 
@@ -295,6 +308,7 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
 
@@ -304,7 +318,8 @@ static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 #ifndef p4d_free_tlb
 #define p4d_free_tlb(tlb, pudp, address)			\
 	do {							\
-		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		__tlb_adjust_range(tlb, address, PAGE_SIZE);	\
+		tlb->freed_tables = 1;				\
 		__p4d_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
-- 
2.24.1
Re: [PATCH v3 0/6] implement KASLR for powerpc/fsl_booke/64
ping...

on 2020/2/13 11:00, Jason Yan wrote:
> Hi everyone, any comments or suggestions?
>
> Thanks,
> Jason
>
> on 2020/2/6 10:58, Jason Yan wrote:
>> This is a try to implement KASLR for Freescale BookE64 which is based
>> on my earlier implementation for Freescale BookE32:
>> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=131718
>>
>> The implementation for Freescale BookE64 is similar to BookE32. One
>> difference is that Freescale BookE64 sets up a TLB mapping of 1G
>> during booting. Another difference is that ppc64 needs the kernel to
>> be 64K-aligned. So we can randomize the kernel in this 1G mapping and
>> make it 64K-aligned. This can save some code to create another TLB map
>> at early boot. The disadvantage is that we only have about
>> 1G/64K = 16384 slots to put the kernel in.
>>
>>     KERNELBASE
>>
>>         64K                      |--> kernel <--|
>>          |                       |              |
>>     +--+--+--+    +--+--+--+--+--+--+--+--+--+    +--+--+
>>     |  |  |  |    |  |  |  |  |  |  |  |  |  |    |  |  |
>>     +--+--+--+    +--+--+--+--+--+--+--+--+--+    +--+--+
>>     |                            |                     1G
>>     |->          offset        <-|
>>
>>                                  kernstart_virt_addr
>>
>> I'm not sure if the number of slots is enough or the design has any
>> defects. If you have some better ideas, I would be happy to hear that.
>>
>> Thank you all.
>>
>> v2->v3:
>>   Fix build error when KASLR is disabled.
>> v1->v2:
>>   Add __kaslr_offset for the secondary cpu boot up.
>> Jason Yan (6):
>>   powerpc/fsl_booke/kaslr: refactor kaslr_legal_offset() and
>>     kaslr_early_init()
>>   powerpc/fsl_booke/64: introduce reloc_kernel_entry() helper
>>   powerpc/fsl_booke/64: implement KASLR for fsl_booke64
>>   powerpc/fsl_booke/64: do not clear the BSS for the second pass
>>   powerpc/fsl_booke/64: clear the original kernel if randomized
>>   powerpc/fsl_booke/kaslr: rename kaslr-booke32.rst to kaslr-booke.rst
>>     and add 64bit part
>>
>>  .../{kaslr-booke32.rst => kaslr-booke.rst} | 35 +++--
>>  arch/powerpc/Kconfig                       | 2 +-
>>  arch/powerpc/kernel/exceptions-64e.S       | 23 ++
>>  arch/powerpc/kernel/head_64.S              | 14
>>  arch/powerpc/kernel/setup_64.c             | 4 +-
>>  arch/powerpc/mm/mmu_decl.h                 | 19 ++---
>>  arch/powerpc/mm/nohash/kaslr_booke.c       | 71 +--
>>  7 files changed, 132 insertions(+), 36 deletions(-)
>>  rename Documentation/powerpc/{kaslr-booke32.rst => kaslr-booke.rst} (59%)
Re: [PATCH 0/3] soc: fsl: dpio: Enable QMAN batch enqueuing
On Thu, Feb 6, 2020 at 2:41 PM Roy Pledge wrote:
>
> On 12/12/2019 12:01 PM, Youri Querry wrote:
> > This patch set consists of:
> > - We added an interface to enqueue several packets at a time and
> >   improve performance.
> > - Make the algorithm decisions once at initialization and use
> >   function pointers to improve performance.
> > - Replaced the QMAN enqueue array mode algorithm with a ring
> >   mode algorithm. This is to make the enqueue of several frames
> >   at a time more effective.
> >
> > Youri Querry (3):
> >   soc: fsl: dpio: Adding QMAN multiple enqueue interface.
> >   soc: fsl: dpio: QMAN performance improvement. Function pointer
> >     indirection.
> >   soc: fsl: dpio: Replace QMAN array mode by ring mode enqueue.
> >
> >  drivers/soc/fsl/dpio/dpio-service.c | 69 +++-
> >  drivers/soc/fsl/dpio/qbman-portal.c | 766
> >  drivers/soc/fsl/dpio/qbman-portal.h | 158 +++-
> >  include/soc/fsl/dpaa2-io.h          | 6 +-
> >  4 files changed, 907 insertions(+), 92 deletions(-)
> >
> Acked-by: Roy Pledge

Series applied with some clean up and typo fix in the title/commit
message.

Regards,
Leo
Re: MCE handler gets NIP wrong on MPC8378
On 02/19/2020 at 4:21 PM Christophe Leroy wrote:
> Radu Rendec wrote:
>> On 02/19/2020 at 10:11 AM Radu Rendec wrote:
>>> On 02/18/2020 at 1:08 PM Christophe Leroy wrote:
>>>> On 18/02/2020 at 18:07, Radu Rendec wrote:
>>>>> The saved NIP seems to be broken inside machine_check_exception() on
>>>>> MPC8378, running Linux 4.9.191. The value is 0x900 most of the times,
>>>>> but I have seen other weird values.
>>>>>
>>>>> I've been able to track down the entry code to head_32.S (vector
>>>>> 0x200), but I'm not sure where/how the NIP value (where the exception
>>>>> occurred) is captured.
>>>>
>>>> NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
>>>> saved into _NIP(r11) in transfer_to_handler in entry_32.S
>>>>
>>>> Can something clobber r12 at some point ?
>>>
>>> I did something even simpler: I added the following
>>>
>>>     lis r12,0x1234
>>>
>>> ... right after
>>>
>>>     mfspr r12,SPRN_SRR0
>>>
>>> ... and now the NIP value I see in the crash dump is 0x1234. This
>>> means r12 is not clobbered and most likely the NIP value I normally
>>> see is the actual SRR0 value.
>>
>> I apologize for the noise. I just found out accidentally that the saved
>> NIP value is correct if interrupts are disabled at the time when the
>> faulty access that triggers the MCE occurs. This seems to happen
>> consistently.
>>
>> By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so
>> it's basically enough to wrap ioread32 to get the NIP value right.
>>
>> Does this make any sense? Maybe it's not a silicon bug after all, or
>> maybe it is and I just found a workaround. Could this happen on other
>> PowerPC CPUs as well?
>
> Interesting.
>
> 0x900 is the address of the timer interrupt.
>
> Would the MCE occur just after the timer interrupt ?

I doubt that. I'm using a small test module to artificially trigger the
MCE.
Basically it's just this (the full code is in my original post):

    bad_addr_base = ioremap(0xf000, 0x100);
    x = ioread32(bad_addr_base);

I find it hard to believe that every time I load the module the lwbrx
instruction that triggers the MCE is executed exactly after the timer
interrupt (or that the timer interrupt always occurs close to the lwbrx
instruction).

> Can you tell how are configured your IO busses, etc ... ?

Nothing special. The device tree is mostly similar to mpc8379_rdb.dts,
but I can provide the actual dts if you think it's relevant.

> And what's the value of SERSR after the machine check ?

I'm assuming you're talking about the IPIC SERSR register. I modified
machine_check_exception and added a call to ipic_get_mcp_status, which
seems to read IPIC_SERSR. The value is 0, both with interrupts enabled
and disabled (which makes sense, since disabling/enabling interrupts is
local to the CPU core).

> Do you use the local bus monitoring driver ?

I don't. In fact, I'm not even aware of it. What driver is that?

Best regards,
Radu
Re: MCE handler gets NIP wrong on MPC8378
Christophe Leroy wrote:
> Radu Rendec wrote:
>> On 02/19/2020 at 10:11 AM Radu Rendec wrote:
>>> On 02/18/2020 at 1:08 PM Christophe Leroy wrote:
>>>> On 18/02/2020 at 18:07, Radu Rendec wrote:
>>>>> The saved NIP seems to be broken inside machine_check_exception() on
>>>>> MPC8378, running Linux 4.9.191. The value is 0x900 most of the times,
>>>>> but I have seen other weird values.
>>>>>
>>>>> I've been able to track down the entry code to head_32.S (vector
>>>>> 0x200), but I'm not sure where/how the NIP value (where the exception
>>>>> occurred) is captured.
>>>>
>>>> NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
>>>> saved into _NIP(r11) in transfer_to_handler in entry_32.S
>>>>
>>>> Can something clobber r12 at some point ?
>>>
>>> I did something even simpler: I added the following
>>>
>>>     lis r12,0x1234
>>>
>>> ... right after
>>>
>>>     mfspr r12,SPRN_SRR0
>>>
>>> ... and now the NIP value I see in the crash dump is 0x1234. This
>>> means r12 is not clobbered and most likely the NIP value I normally
>>> see is the actual SRR0 value.
>>
>> I apologize for the noise. I just found out accidentally that the saved
>> NIP value is correct if interrupts are disabled at the time when the
>> faulty access that triggers the MCE occurs. This seems to happen
>> consistently.
>>
>> By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so
>> it's basically enough to wrap ioread32 to get the NIP value right.
>>
>> Does this make any sense? Maybe it's not a silicon bug after all, or
>> maybe it is and I just found a workaround. Could this happen on other
>> PowerPC CPUs as well?

Interesting.

0x900 is the address of the timer interrupt.

Would the MCE occur just after the timer interrupt ?

Can you tell how are configured your IO busses, etc ... ?

And what's the value of SERSR after the machine check ?

Do you use the local bus monitoring driver ?

Christophe
Re: MCE handler gets NIP wrong on MPC8378
Radu Rendec wrote:
> On 02/19/2020 at 10:11 AM Radu Rendec wrote:
>> On 02/18/2020 at 1:08 PM Christophe Leroy wrote:
>>> On 18/02/2020 at 18:07, Radu Rendec wrote:
>>>> The saved NIP seems to be broken inside machine_check_exception() on
>>>> MPC8378, running Linux 4.9.191. The value is 0x900 most of the times,
>>>> but I have seen other weird values.
>>>>
>>>> I've been able to track down the entry code to head_32.S (vector
>>>> 0x200), but I'm not sure where/how the NIP value (where the exception
>>>> occurred) is captured.
>>>
>>> NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
>>> saved into _NIP(r11) in transfer_to_handler in entry_32.S
>>>
>>> Can something clobber r12 at some point ?
>>
>> I did something even simpler: I added the following
>>
>>     lis r12,0x1234
>>
>> ... right after
>>
>>     mfspr r12,SPRN_SRR0
>>
>> ... and now the NIP value I see in the crash dump is 0x1234. This
>> means r12 is not clobbered and most likely the NIP value I normally
>> see is the actual SRR0 value.
>
> I apologize for the noise. I just found out accidentally that the saved
> NIP value is correct if interrupts are disabled at the time when the
> faulty access that triggers the MCE occurs. This seems to happen
> consistently.
>
> By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so
> it's basically enough to wrap ioread32 to get the NIP value right.
>
> Does this make any sense? Maybe it's not a silicon bug after all, or
> maybe it is and I just found a workaround. Could this happen on other
> PowerPC CPUs as well?

Interesting.

0x900 is the address of the timer interrupt.

Would the MCE occur just after the timer interrupt ?

Can you tell how are configured your IO busses, etc ... ?

Christophe
Re: [PATCH v2 2/3] ASoC: dt-bindings: fsl_easrc: Add document for EASRC
On Tue, Feb 18, 2020 at 02:39:36PM +0800, Shengjiu Wang wrote: > EASRC (Enhanced Asynchronous Sample Rate Converter) is a new > IP module found on i.MX8MN. > > Signed-off-by: Shengjiu Wang > --- > .../devicetree/bindings/sound/fsl,easrc.txt | 57 +++ > 1 file changed, 57 insertions(+) > create mode 100644 Documentation/devicetree/bindings/sound/fsl,easrc.txt Bindings are now in DT schema format. See Documentation/devicetree/writing-schema.rst.
Re: MCE handler gets NIP wrong on MPC8378
On 02/19/2020 at 10:11 AM Radu Rendec wrote: > On 02/18/2020 at 1:08 PM Christophe Leroy wrote: > > On 18/02/2020 at 18:07, Radu Rendec wrote: > > > The saved NIP seems to be broken inside machine_check_exception() on > > > MPC8378, running Linux 4.9.191. The value is 0x900 most of the times, > > > but I have seen other weird values. > > > > > > I've been able to track down the entry code to head_32.S (vector 0x200), > > > but I'm not sure where/how the NIP value (where the exception occurred) > > > is captured. > > > > NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and > > saved into _NIP(r11) in transfer_to_handler in entry_32.S > > > > Can something clobber r12 at some point ? > > > > I did something even simpler: I added the following > > lis r12,0x1234 > > ... right after > > mfspr r12,SPRN_SRR0 > > ... and now the NIP value I see in the crash dump is 0x1234. This > means r12 is not clobbered and most likely the NIP value I normally see > is the actual SRR0 value. I apologize for the noise. I just found out accidentally that the saved NIP value is correct if interrupts are disabled at the time when the faulty access that triggers the MCE occurs. This seems to happen consistently. By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so it's basically enough to wrap ioread32 to get the NIP value right. Does this make any sense? Maybe it's not a silicon bug after all, or maybe it is and I just found a workaround. Could this happen on other PowerPC CPUs as well? Best regards, Radu
Re: [PATCH] KVM: PPC: Book3S HV: Use RADIX_PTE_INDEX_SIZE in Radix MMU code
Hello Michael, On Tue, 2020-02-18 at 15:36 +1100, Michael Ellerman wrote: > In kvmppc_unmap_free_pte() in book3s_64_mmu_radix.c, we use the > non-constant value PTE_INDEX_SIZE to clear a PTE page. > > We can instead use the constant RADIX_PTE_INDEX_SIZE, because we know > this code will only be running when the Radix MMU is active. > > Note that we already use RADIX_PTE_INDEX_SIZE for the allocation of > kvm_pte_cache. > > Signed-off-by: Michael Ellerman > --- > arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c > b/arch/powerpc/kvm/book3s_64_mmu_radix.c > index 803940d79b73..134fbc1f029f 100644 > --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c > +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c > @@ -425,7 +425,7 @@ static void kvmppc_unmap_free_pte(struct kvm *kvm, pte_t > *pte, bool full, > unsigned int lpid) > { > if (full) { > - memset(pte, 0, sizeof(long) << PTE_INDEX_SIZE); > + memset(pte, 0, sizeof(long) << RADIX_PTE_INDEX_SIZE); > } else { > pte_t *p = pte; > unsigned long it; Looks fine to me. For book3s_64, pgtable.h says: extern unsigned long __pte_index_size; #define PTE_INDEX_SIZE __pte_index_size powerpc/mm/pgtable_64.c defines/exports the variable: unsigned long __pte_index_size; EXPORT_SYMBOL(__pte_index_size); And book3s64/radix_pgtable.c sets the value in radix__early_init_mmu(). __pte_index_size = RADIX_PTE_INDEX_SIZE; So I think it's ok to use the value directly in book3s_64_mmu_radix.c. The include dependency looks fine for that to work. FWIW: Reviewed-by: Leonardo Bras
[RESEND PATCH v2 9/9] ath5k: Constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have inconsistent interface. On some architectures void *__iomem address argument is a pointer to const, on some not. Implementations of ioreadX() do not modify the memory under the address so they can be converted to a "const" version for const-safety and consistency among architectures. Signed-off-by: Krzysztof Kozlowski Acked-by: Kalle Valo --- drivers/net/wireless/ath/ath5k/ahb.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/net/wireless/ath/ath5k/ahb.c b/drivers/net/wireless/ath/ath5k/ahb.c index 2c9cec8b53d9..8bd01df369fb 100644 --- a/drivers/net/wireless/ath/ath5k/ahb.c +++ b/drivers/net/wireless/ath/ath5k/ahb.c @@ -138,18 +138,18 @@ static int ath_ahb_probe(struct platform_device *pdev) if (bcfg->devid >= AR5K_SREV_AR2315_R6) { /* Enable WMAC AHB arbitration */ - reg = ioread32((void __iomem *) AR5K_AR2315_AHB_ARB_CTL); + reg = ioread32((const void __iomem *) AR5K_AR2315_AHB_ARB_CTL); reg |= AR5K_AR2315_AHB_ARB_CTL_WLAN; iowrite32(reg, (void __iomem *) AR5K_AR2315_AHB_ARB_CTL); /* Enable global WMAC swapping */ - reg = ioread32((void __iomem *) AR5K_AR2315_BYTESWAP); + reg = ioread32((const void __iomem *) AR5K_AR2315_BYTESWAP); reg |= AR5K_AR2315_BYTESWAP_WMAC; iowrite32(reg, (void __iomem *) AR5K_AR2315_BYTESWAP); } else { /* Enable WMAC DMA access (assuming 5312 or 231x*/ /* TODO: check other platforms */ - reg = ioread32((void __iomem *) AR5K_AR5312_ENABLE); + reg = ioread32((const void __iomem *) AR5K_AR5312_ENABLE); if (to_platform_device(ah->dev)->id == 0) reg |= AR5K_AR5312_ENABLE_WLAN0; else @@ -202,12 +202,12 @@ static int ath_ahb_remove(struct platform_device *pdev) if (bcfg->devid >= AR5K_SREV_AR2315_R6) { /* Disable WMAC AHB arbitration */ - reg = ioread32((void __iomem *) AR5K_AR2315_AHB_ARB_CTL); + reg = ioread32((const void __iomem *) AR5K_AR2315_AHB_ARB_CTL); reg &= ~AR5K_AR2315_AHB_ARB_CTL_WLAN; iowrite32(reg, (void __iomem *) AR5K_AR2315_AHB_ARB_CTL); } else { /*Stop DMA access */ - reg = ioread32((void __iomem *) AR5K_AR5312_ENABLE); + reg = ioread32((const void __iomem *) AR5K_AR5312_ENABLE); if (to_platform_device(ah->dev)->id == 0) reg &= ~AR5K_AR5312_ENABLE_WLAN0; else -- 2.17.1
[RESEND PATCH v2 8/9] media: fsl-viu: Constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have inconsistent interface. On some architectures void *__iomem address argument is a pointer to const, on some not. Implementations of ioreadX() do not modify the memory under the address so they can be converted to a "const" version for const-safety and consistency among architectures. Signed-off-by: Krzysztof Kozlowski --- drivers/media/platform/fsl-viu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/media/platform/fsl-viu.c b/drivers/media/platform/fsl-viu.c index 81a8faedbba6..991d9dc82749 100644 --- a/drivers/media/platform/fsl-viu.c +++ b/drivers/media/platform/fsl-viu.c @@ -34,7 +34,7 @@ /* Allow building this driver with COMPILE_TEST */ #if !defined(CONFIG_PPC) && !defined(CONFIG_MICROBLAZE) #define out_be32(v, a) iowrite32be(a, (void __iomem *)v) -#define in_be32(a) ioread32be((void __iomem *)a) +#define in_be32(a) ioread32be((const void __iomem *)a) #endif #define BUFFER_TIMEOUT msecs_to_jiffies(500) /* 0.5 seconds */ -- 2.17.1
[RESEND PATCH v2 7/9] drm/nouveau: Constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have inconsistent interface. On some architectures void *__iomem address argument is a pointer to const, on some not. Implementations of ioreadX() do not modify the memory under the address so they can be converted to a "const" version for const-safety and consistency among architectures. Signed-off-by: Krzysztof Kozlowski --- drivers/gpu/drm/nouveau/nouveau_bo.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index 1b62ccc57aef..d95bdd65dbca 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -613,7 +613,7 @@ nouveau_bo_rd32(struct nouveau_bo *nvbo, unsigned index) mem += index; if (is_iomem) - return ioread32_native((void __force __iomem *)mem); + return ioread32_native((const void __force __iomem *)mem); else return *mem; } -- 2.17.1
[RESEND PATCH v2 6/9] drm/mgag200: Constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have inconsistent interface. On some architectures void *__iomem address argument is a pointer to const, on some not. Implementations of ioreadX() do not modify the memory under the address so they can be converted to a "const" version for const-safety and consistency among architectures. Signed-off-by: Krzysztof Kozlowski Reviewed-by: Thomas Zimmermann --- Changes since v1: 1. Add Thomas' review. --- drivers/gpu/drm/mgag200/mgag200_drv.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/mgag200/mgag200_drv.h b/drivers/gpu/drm/mgag200/mgag200_drv.h index aa32aad222c2..6512b3af4fb7 100644 --- a/drivers/gpu/drm/mgag200/mgag200_drv.h +++ b/drivers/gpu/drm/mgag200/mgag200_drv.h @@ -34,9 +34,9 @@ #define MGAG200FB_CONN_LIMIT 1 -#define RREG8(reg) ioread8(((void __iomem *)mdev->rmmio) + (reg)) +#define RREG8(reg) ioread8(((const void __iomem *)mdev->rmmio) + (reg)) #define WREG8(reg, v) iowrite8(v, ((void __iomem *)mdev->rmmio) + (reg)) -#define RREG32(reg) ioread32(((void __iomem *)mdev->rmmio) + (reg)) +#define RREG32(reg) ioread32(((const void __iomem *)mdev->rmmio) + (reg)) #define WREG32(reg, v) iowrite32(v, ((void __iomem *)mdev->rmmio) + (reg)) #define ATTR_INDEX 0x1fc0 -- 2.17.1
[RESEND PATCH v2 5/9] arc: Constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have inconsistent interface. On some architectures void *__iomem address argument is a pointer to const, on some not. Implementations of ioreadX() do not modify the memory under the address so they can be converted to a "const" version for const-safety and consistency among architectures. Signed-off-by: Krzysztof Kozlowski Acked-by: Alexey Brodkin --- Changes since v1: 1. Add Alexey's ack. --- arch/arc/plat-axs10x/axs10x.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arc/plat-axs10x/axs10x.c b/arch/arc/plat-axs10x/axs10x.c index 63ea5a606ecd..180c260a8221 100644 --- a/arch/arc/plat-axs10x/axs10x.c +++ b/arch/arc/plat-axs10x/axs10x.c @@ -84,7 +84,7 @@ static void __init axs10x_print_board_ver(unsigned int creg, const char *str) unsigned int val; } board; - board.val = ioread32((void __iomem *)creg); + board.val = ioread32((const void __iomem *)creg); pr_info("AXS: %s FPGA Date: %u-%u-%u\n", str, board.d, board.m, board.y); } @@ -95,7 +95,7 @@ static void __init axs10x_early_init(void) char mb[32]; /* Determine motherboard version */ - if (ioread32((void __iomem *) CREG_MB_CONFIG) & (1 << 28)) + if (ioread32((const void __iomem *) CREG_MB_CONFIG) & (1 << 28)) mb_rev = 3; /* HT-3 (rev3.0) */ else mb_rev = 2; /* HT-2 (rev2.0) */ -- 2.17.1
[RESEND PATCH v2 4/9] virtio: pci: Constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have inconsistent interface. On some architectures void *__iomem address argument is a pointer to const, on some not. Implementations of ioreadX() do not modify the memory under the address so they can be converted to a "const" version for const-safety and consistency among architectures. Signed-off-by: Krzysztof Kozlowski Reviewed-by: Geert Uytterhoeven --- Changes since v1: 1. Add Geert's review. --- drivers/virtio/virtio_pci_modern.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index 7abcc50838b8..fc58db4ab6c3 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -26,16 +26,16 @@ * method, i.e. 32-bit accesses for 32-bit fields, 16-bit accesses * for 16-bit fields and 8-bit accesses for 8-bit fields. */ -static inline u8 vp_ioread8(u8 __iomem *addr) +static inline u8 vp_ioread8(const u8 __iomem *addr) { return ioread8(addr); } -static inline u16 vp_ioread16 (__le16 __iomem *addr) +static inline u16 vp_ioread16 (const __le16 __iomem *addr) { return ioread16(addr); } -static inline u32 vp_ioread32(__le32 __iomem *addr) +static inline u32 vp_ioread32(const __le32 __iomem *addr) { return ioread32(addr); } -- 2.17.1
[RESEND PATCH v2 3/9] ntb: intel: Constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have inconsistent interface. On some architectures void *__iomem address argument is a pointer to const, on some not. Implementations of ioreadX() do not modify the memory under the address so they can be converted to a "const" version for const-safety and consistency among architectures. Signed-off-by: Krzysztof Kozlowski Reviewed-by: Geert Uytterhoeven Acked-by: Dave Jiang --- Changes since v1: 1. Add Geert's review. 2. Add Dave's ack. --- drivers/ntb/hw/intel/ntb_hw_gen1.c | 2 +- drivers/ntb/hw/intel/ntb_hw_gen3.h | 2 +- drivers/ntb/hw/intel/ntb_hw_intel.h | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/ntb/hw/intel/ntb_hw_gen1.c b/drivers/ntb/hw/intel/ntb_hw_gen1.c index bb57ec239029..9202502a9787 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen1.c +++ b/drivers/ntb/hw/intel/ntb_hw_gen1.c @@ -1202,7 +1202,7 @@ int intel_ntb_peer_spad_write(struct ntb_dev *ntb, int pidx, int sidx, ndev->peer_reg->spad); } -static u64 xeon_db_ioread(void __iomem *mmio) +static u64 xeon_db_ioread(const void __iomem *mmio) { return (u64)ioread16(mmio); } diff --git a/drivers/ntb/hw/intel/ntb_hw_gen3.h b/drivers/ntb/hw/intel/ntb_hw_gen3.h index 75fb86ca27bb..d1455f24ec99 100644 --- a/drivers/ntb/hw/intel/ntb_hw_gen3.h +++ b/drivers/ntb/hw/intel/ntb_hw_gen3.h @@ -91,7 +91,7 @@ #define GEN3_DB_TOTAL_SHIFT33 #define GEN3_SPAD_COUNT16 -static inline u64 gen3_db_ioread(void __iomem *mmio) +static inline u64 gen3_db_ioread(const void __iomem *mmio) { return ioread64(mmio); } diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.h b/drivers/ntb/hw/intel/ntb_hw_intel.h index e071e28bca3f..3c0a5a2da241 100644 --- a/drivers/ntb/hw/intel/ntb_hw_intel.h +++ b/drivers/ntb/hw/intel/ntb_hw_intel.h @@ -102,7 +102,7 @@ struct intel_ntb_dev; struct intel_ntb_reg { int (*poll_link)(struct intel_ntb_dev *ndev); int (*link_is_up)(struct intel_ntb_dev *ndev); - u64 (*db_ioread)(void __iomem *mmio); + u64 (*db_ioread)(const void __iomem *mmio); void (*db_iowrite)(u64 db_bits, void __iomem *mmio); unsigned long ntb_ctl; resource_size_t db_size; -- 2.17.1
[RESEND PATCH v2 2/9] rtl818x: Constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() helpers have inconsistent interface. On some architectures void *__iomem address argument is a pointer to const, on some not. Implementations of ioreadX() do not modify the memory under the address so they can be converted to a "const" version for const-safety and consistency among architectures. Signed-off-by: Krzysztof Kozlowski Reviewed-by: Geert Uytterhoeven Acked-by: Kalle Valo --- Changes since v1: 1. Add Geert's review. 2. Add Kalle's ack. Fix subject prefix. --- drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h b/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h index 7948a2da195a..2ff00800d45b 100644 --- a/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h +++ b/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h @@ -150,17 +150,17 @@ void rtl8180_write_phy(struct ieee80211_hw *dev, u8 addr, u32 data); void rtl8180_set_anaparam(struct rtl8180_priv *priv, u32 anaparam); void rtl8180_set_anaparam2(struct rtl8180_priv *priv, u32 anaparam2); -static inline u8 rtl818x_ioread8(struct rtl8180_priv *priv, u8 __iomem *addr) +static inline u8 rtl818x_ioread8(struct rtl8180_priv *priv, const u8 __iomem *addr) { return ioread8(addr); } -static inline u16 rtl818x_ioread16(struct rtl8180_priv *priv, __le16 __iomem *addr) +static inline u16 rtl818x_ioread16(struct rtl8180_priv *priv, const __le16 __iomem *addr) { return ioread16(addr); } -static inline u32 rtl818x_ioread32(struct rtl8180_priv *priv, __le32 __iomem *addr) +static inline u32 rtl818x_ioread32(struct rtl8180_priv *priv, const __le32 __iomem *addr) { return ioread32(addr); } -- 2.17.1
[RESEND PATCH v2 1/9] iomap: Constify ioreadX() iomem argument (as in generic implementation)
The ioreadX() and ioreadX_rep() helpers have inconsistent interface. On some architectures void *__iomem address argument is a pointer to const, on some not. Implementations of ioreadX() do not modify the memory under the address so they can be converted to a "const" version for const-safety and consistency among architectures. Suggested-by: Geert Uytterhoeven Signed-off-by: Krzysztof Kozlowski Reviewed-by: Geert Uytterhoeven Reviewed-by: Arnd Bergmann --- Changes since v1: 1. Constify also ioreadX_rep() and mmio_insX(), 2. Squash lib+alpha+powerpc+parisc+sh into one patch for bisectability, 3. Add Geert's review. 4. Add Arnd's review. --- arch/alpha/include/asm/core_apecs.h | 6 +-- arch/alpha/include/asm/core_cia.h | 6 +-- arch/alpha/include/asm/core_lca.h | 6 +-- arch/alpha/include/asm/core_marvel.h | 4 +- arch/alpha/include/asm/core_mcpcia.h | 6 +-- arch/alpha/include/asm/core_t2.h | 2 +- arch/alpha/include/asm/io.h | 12 ++--- arch/alpha/include/asm/io_trivial.h | 16 +++--- arch/alpha/include/asm/jensen.h | 2 +- arch/alpha/include/asm/machvec.h | 6 +-- arch/alpha/kernel/core_marvel.c | 2 +- arch/alpha/kernel/io.c | 12 ++--- arch/parisc/include/asm/io.h | 4 +- arch/parisc/lib/iomap.c | 72 +-- arch/powerpc/kernel/iomap.c | 28 +-- arch/sh/kernel/iomap.c | 22 include/asm-generic/iomap.h | 28 +-- include/linux/io-64-nonatomic-hi-lo.h | 4 +- include/linux/io-64-nonatomic-lo-hi.h | 4 +- lib/iomap.c | 30 +-- 20 files changed, 136 insertions(+), 136 deletions(-) diff --git a/arch/alpha/include/asm/core_apecs.h b/arch/alpha/include/asm/core_apecs.h index 0a07055bc0fe..2d9726fc02ef 100644 --- a/arch/alpha/include/asm/core_apecs.h +++ b/arch/alpha/include/asm/core_apecs.h @@ -384,7 +384,7 @@ struct el_apecs_procdata } \ } while (0) -__EXTERN_INLINE unsigned int apecs_ioread8(void __iomem *xaddr) +__EXTERN_INLINE unsigned int apecs_ioread8(const void __iomem *xaddr) { unsigned long addr = (unsigned long) xaddr; unsigned long result, base_and_type; @@ -420,7 +420,7 @@ __EXTERN_INLINE void apecs_iowrite8(u8 b, void __iomem *xaddr) *(vuip) ((addr << 5) + base_and_type) = w; } -__EXTERN_INLINE unsigned int apecs_ioread16(void __iomem *xaddr) +__EXTERN_INLINE unsigned int apecs_ioread16(const void __iomem *xaddr) { unsigned long addr = (unsigned long) xaddr; unsigned long result, base_and_type; @@ -456,7 +456,7 @@ __EXTERN_INLINE void apecs_iowrite16(u16 b, void __iomem *xaddr) *(vuip) ((addr << 5) + base_and_type) = w; } -__EXTERN_INLINE unsigned int apecs_ioread32(void __iomem *xaddr) +__EXTERN_INLINE unsigned int apecs_ioread32(const void __iomem *xaddr) { unsigned long addr = (unsigned long) xaddr; if (addr < APECS_DENSE_MEM) diff --git a/arch/alpha/include/asm/core_cia.h b/arch/alpha/include/asm/core_cia.h index c706a7f2b061..cb22991f6761 100644 --- a/arch/alpha/include/asm/core_cia.h +++ b/arch/alpha/include/asm/core_cia.h @@ -342,7 +342,7 @@ struct el_CIA_sysdata_mcheck { #define vuip volatile unsigned int __force * #define vulp volatile unsigned long __force * -__EXTERN_INLINE unsigned int cia_ioread8(void __iomem *xaddr) +__EXTERN_INLINE unsigned int cia_ioread8(const void __iomem *xaddr) { unsigned long addr = (unsigned long) xaddr; unsigned long result, base_and_type; @@ -374,7 +374,7 @@ __EXTERN_INLINE void cia_iowrite8(u8 b, void __iomem *xaddr) *(vuip) ((addr << 5) + base_and_type) = w; } -__EXTERN_INLINE unsigned int cia_ioread16(void __iomem *xaddr) +__EXTERN_INLINE unsigned int cia_ioread16(const void __iomem *xaddr) { unsigned long addr = (unsigned long) xaddr; unsigned long result, base_and_type; @@ -404,7 +404,7 @@ __EXTERN_INLINE void cia_iowrite16(u16 b, void __iomem *xaddr) *(vuip) ((addr << 5) + base_and_type) = w; } -__EXTERN_INLINE unsigned int cia_ioread32(void __iomem *xaddr) +__EXTERN_INLINE unsigned int cia_ioread32(const void __iomem *xaddr) { unsigned long addr = (unsigned long) xaddr; if (addr < CIA_DENSE_MEM) diff --git a/arch/alpha/include/asm/core_lca.h b/arch/alpha/include/asm/core_lca.h index 84d5e5b84f4f..ec86314418cb 100644 --- a/arch/alpha/include/asm/core_lca.h +++ b/arch/alpha/include/asm/core_lca.h @@ -230,7 +230,7 @@ union el_lca { } while (0) -__EXTERN_INLINE unsigned int lca_ioread8(void __iomem *xaddr) +__EXTERN_INLINE unsigned int lca_ioread8(const void __iomem *xaddr) { unsigned long addr = (unsigned long) xaddr; unsigned long result, base_and_type; @@ -266,7 +266,7 @@ __EXTERN_INLINE void lca_iowrite8(u8 b, void __iomem *xaddr) *(vuip) ((addr << 5) + base_and_type) =
[RESEND PATCH v2 0/9] iomap: Constify ioreadX() iomem argument
Hi, Changes since v1 https://lore.kernel.org/lkml/1578415992-24054-1-git-send-email-k...@kernel.org/ 1. Constify also ioreadX_rep() and mmio_insX(), 2. Squash lib+alpha+powerpc+parisc+sh into one patch for bisectability, 3. Add acks and reviews, 4. Re-order patches so all optional driver changes are at the end. Description === The ioread8/16/32() and others have an inconsistent interface among the architectures: some taking the address as const, some not. It seems there is nothing really stopping all of them from taking a pointer to const. Patchset was only compile-tested on affected architectures. No real testing. volatile === There is still interface inconsistency between architectures around the "volatile" qualifier: - include/asm-generic/io.h:static inline u32 ioread32(const volatile void __iomem *addr) - include/asm-generic/iomap.h:extern unsigned int ioread32(const void __iomem *); This is still discussed and out of scope of this patchset. Merging === Multiple architectures are affected in the first patch so acks are welcome. 1. All patches depend on the first patch, 2. Patches 2-4 unify the interface also in a few drivers, 3. Patches 5-9 are optional cleanup, without actual impact.
Best regards, Krzysztof Krzysztof Kozlowski (9): iomap: Constify ioreadX() iomem argument (as in generic implementation) rtl818x: Constify ioreadX() iomem argument (as in generic implementation) ntb: intel: Constify ioreadX() iomem argument (as in generic implementation) virtio: pci: Constify ioreadX() iomem argument (as in generic implementation) arc: Constify ioreadX() iomem argument (as in generic implementation) drm/mgag200: Constify ioreadX() iomem argument (as in generic implementation) drm/nouveau: Constify ioreadX() iomem argument (as in generic implementation) media: fsl-viu: Constify ioreadX() iomem argument (as in generic implementation) ath5k: Constify ioreadX() iomem argument (as in generic implementation) arch/alpha/include/asm/core_apecs.h | 6 +- arch/alpha/include/asm/core_cia.h | 6 +- arch/alpha/include/asm/core_lca.h | 6 +- arch/alpha/include/asm/core_marvel.h | 4 +- arch/alpha/include/asm/core_mcpcia.h | 6 +- arch/alpha/include/asm/core_t2.h | 2 +- arch/alpha/include/asm/io.h | 12 ++-- arch/alpha/include/asm/io_trivial.h | 16 ++--- arch/alpha/include/asm/jensen.h | 2 +- arch/alpha/include/asm/machvec.h | 6 +- arch/alpha/kernel/core_marvel.c | 2 +- arch/alpha/kernel/io.c| 12 ++-- arch/arc/plat-axs10x/axs10x.c | 4 +- arch/parisc/include/asm/io.h | 4 +- arch/parisc/lib/iomap.c | 72 +-- arch/powerpc/kernel/iomap.c | 28 arch/sh/kernel/iomap.c| 22 +++--- drivers/gpu/drm/mgag200/mgag200_drv.h | 4 +- drivers/gpu/drm/nouveau/nouveau_bo.c | 2 +- drivers/media/platform/fsl-viu.c | 2 +- drivers/net/wireless/ath/ath5k/ahb.c | 10 +-- .../realtek/rtl818x/rtl8180/rtl8180.h | 6 +- drivers/ntb/hw/intel/ntb_hw_gen1.c| 2 +- drivers/ntb/hw/intel/ntb_hw_gen3.h| 2 +- drivers/ntb/hw/intel/ntb_hw_intel.h | 2 +- drivers/virtio/virtio_pci_modern.c| 6 +- include/asm-generic/iomap.h | 28 include/linux/io-64-nonatomic-hi-lo.h | 4 +- include/linux/io-64-nonatomic-lo-hi.h | 4 +- lib/iomap.c | 30 30 files changed, 156 insertions(+), 156 deletions(-) -- 2.17.1
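The const-safety the series is after can be seen in miniature with a userspace sketch. Plain pointers stand in for `__iomem` ones and the names are made up for illustration: a read-only accessor that takes a pointer to const accepts const-qualified callers without casting const away, while the write side keeps a non-const argument because it really stores.

```c
#include <assert.h>
#include <stdint.h>

/* Read-only accessor taking a pointer to const, mirroring the series'
 * direction: it never writes through the pointer, so callers holding
 * const-qualified register/buffer pointers need no cast. */
static inline uint32_t my_ioread32(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/* The write side keeps a non-const argument, as it actually stores. */
static inline void my_iowrite32(uint32_t val, volatile void *addr)
{
	*(volatile uint32_t *)addr = val;
}
```

With a non-const parameter in `my_ioread32()`, a caller holding `const volatile uint32_t *` would need a cast that silently discards the qualifier, which is exactly the inconsistency the patchset removes.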
[PATCH] powerpc/kasan: Fix error detection on memory allocation
In case (k_start & PAGE_MASK) doesn't equal (k_start), 'va' will never be NULL although 'block' is NULL. Check the return of memblock_alloc() directly instead of the resulting address in the loop. Fixes: 509cd3f2b473 ("powerpc/32: Simplify KASAN init") Signed-off-by: Christophe Leroy --- arch/powerpc/mm/kasan/kasan_init_32.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index b533e7a8319d..e998c3576ae1 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -78,15 +78,14 @@ static int __init kasan_init_region(void *start, size_t size) return ret; block = memblock_alloc(k_end - k_start, PAGE_SIZE); + if (!block) + return -ENOMEM; for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) { pmd_t *pmd = pmd_offset(pud_offset(pgd_offset_k(k_cur), k_cur), k_cur); void *va = block + k_cur - k_start; pte_t pte = pfn_pte(PHYS_PFN(__pa(va)), PAGE_KERNEL); - if (!va) - return -ENOMEM; - __set_pte_at(&init_mm, k_cur, pte_offset_kernel(pmd, k_cur), pte, 0); } flush_tlb_kernel_range(k_start, k_end); -- 2.25.0
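The bug pattern here is general: testing a pointer derived from a failed allocation instead of the allocation itself. A minimal userspace analogue of the fixed flow (the function name and fill value are invented; `malloc` stands in for `memblock_alloc` and the byte fill for the per-page mapping loop):

```c
#include <assert.h>
#include <stdlib.h>

/* Analogue of the fixed kasan_init_region() flow: validate the
 * allocator's return value itself, before deriving any address
 * from it inside the loop. */
static int fill_region(size_t len, unsigned char **out)
{
	unsigned char *block = malloc(len);

	if (!block)		/* the check the patch moves up front */
		return -1;	/* -ENOMEM in the kernel */

	for (size_t off = 0; off < len; off++)
		block[off] = 0xAA;	/* stand-in for __set_pte_at() per page */

	*out = block;
	return 0;
}
```

In the buggy version the equivalent of `block + off` was checked instead; for any non-zero offset that sum is non-NULL even when `block` is NULL, so the error path never fired.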
[PATCH] powerpc/kasan: Fix shadow memory protection with CONFIG_KASAN_VMALLOC
With CONFIG_KASAN_VMALLOC, new page tables are created at the time shadow memory for the vmalloc area is unmapped. If some parts of the page table still have entries to the zero page shadow memory, the entries are wrongly marked RW. Make sure new page tables are populated with RO entries once kasan_remap_early_shadow_ro() has run. Fixes: 3d4247fcc938 ("powerpc/32: Add support of KASAN_VMALLOC") Signed-off-by: Christophe Leroy --- arch/powerpc/mm/kasan/kasan_init_32.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/mm/kasan/kasan_init_32.c b/arch/powerpc/mm/kasan/kasan_init_32.c index 16dd95bd0749..b533e7a8319d 100644 --- a/arch/powerpc/mm/kasan/kasan_init_32.c +++ b/arch/powerpc/mm/kasan/kasan_init_32.c @@ -30,11 +30,13 @@ static void __init kasan_populate_pte(pte_t *ptep, pgprot_t prot) __set_pte_at(&init_mm, va, ptep, pfn_pte(PHYS_PFN(pa), prot), 0); } -static int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end) +static int __init +kasan_init_shadow_page_tables(unsigned long k_start, unsigned long k_end, bool is_late) { pmd_t *pmd; unsigned long k_cur, k_next; pte_t *new = NULL; + pgprot_t prot = is_late ? kasan_prot_ro() : PAGE_KERNEL; pmd = pmd_offset(pud_offset(pgd_offset_k(k_start), k_start), k_start); @@ -48,7 +50,7 @@ static int __init kasan_init_shadow_page_tables(unsigned long k_start, unsigned if (!new) return -ENOMEM; - kasan_populate_pte(new, PAGE_KERNEL); + kasan_populate_pte(new, prot); smp_wmb(); /* See comment in __pte_alloc */ @@ -71,7 +73,7 @@ static int __init kasan_init_region(void *start, size_t size) int ret; void *block; - ret = kasan_init_shadow_page_tables(k_start, k_end); + ret = kasan_init_shadow_page_tables(k_start, k_end, false); if (ret) return ret; @@ -121,7 +123,7 @@ static void __init kasan_unmap_early_shadow_vmalloc(void) phys_addr_t pa = __pa(kasan_early_shadow_page); if (!early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) { - int ret = kasan_init_shadow_page_tables(k_start, k_end); + int ret = kasan_init_shadow_page_tables(k_start, k_end, true); if (ret) panic("kasan: kasan_init_shadow_page_tables() failed"); @@ -144,7 +146,8 @@ void __init kasan_mmu_init(void) struct memblock_region *reg; if (early_mmu_has_feature(MMU_FTR_HPTE_TABLE)) { - ret = kasan_init_shadow_page_tables(KASAN_SHADOW_START, KASAN_SHADOW_END); + ret = kasan_init_shadow_page_tables(KASAN_SHADOW_START, KASAN_SHADOW_END, + false); if (ret) panic("kasan: kasan_init_shadow_page_tables() failed"); -- 2.25.0
Re: MCE handler gets NIP wrong on MPC8378
On 02/18/2020 at 1:08 PM Christophe Leroy wrote: > On 18/02/2020 at 18:07, Radu Rendec wrote: > > The saved NIP seems to be broken inside machine_check_exception() on > > MPC8378, running Linux 4.9.191. The value is 0x900 most of the times, > > but I have seen other weird values. > > > > I've been able to track down the entry code to head_32.S (vector 0x200), > > but I'm not sure where/how the NIP value (where the exception occurred) > > is captured. > > NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and > saved into _NIP(r11) in transfer_to_handler in entry_32.S Thank you so much for the information, it is extremely helpful! > Can something clobber r12 at some point ? > > Maybe add the following at some place to trap when it happens ? > > tweqi r12, 0x900 > > If you put it just after reading SRR0, and just before writing into > NIP(r11), you'll see if it's wrong from the beginning or if it is > overwritten later. I did something even simpler: I added the following lis r12,0x1234 ... right after mfspr r12,SPRN_SRR0 ... and now the NIP value I see in the crash dump is 0x1234. This means r12 is not clobbered and most likely the NIP value I normally see is the actual SRR0 value. Just to be sure that SRR0 is not clobbered before it's even saved to r12 (very unlikely though) I changed the code to save SRR0 to r8 at the very beginning of the handler (first instruction, at address 0x200) and then load r12 from r8 later. This of course clobbers r8, but it's good for testing. Now in the crash dump I see 0x900 in both NIP and r8. So I think I ruled out any problem in the Linux MCE handler. MPC8378 has an e300 core and I double checked with the e300 core reference manual (e300coreRM.pdf from NXP). I couldn't find anything weird there either. Quoting from the RM: | 5.5.2.1 Machine Check Interrupt Enabled (MSR[ME] = 1) | | When a machine check interrupt is taken, registers are updated as | shown in Table 5-14. | | Table 5-14. Machine Check Interrupt—Register Settings | | SRR0 Set to the address of the next instruction that would have been | completed in the interrupted instruction stream. Neither this | instruction nor any others beyond it will have been completed. | All preceding instructions will have been completed. At this point I'm assuming a silicon bug, although I couldn't find anything interesting in the Errata provided by NXP. Best regards, Radu
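The workaround described in this thread (performing the I/O read with interrupts disabled so the machine check reports the right NIP) can be packaged as a small wrapper. This is a hypothetical sketch, not code from the thread; the kernel primitives are stubbed with stand-ins so it compiles on its own, whereas in the kernel they come from the irqflags and io accessor headers:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel primitives, purely so this
 * sketch is self-contained and compilable. */
#define local_irq_save(flags)		((void)((flags) = 1))
#define local_irq_restore(flags)	((void)(flags))

static inline uint32_t ioread32(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/* Hypothetical wrapper for the observed workaround: do the MMIO read
 * with interrupts disabled, so a machine check raised by the access
 * reports the faulting NIP instead of 0x900 on this MPC8378 setup. */
static uint32_t ioread32_mce_safe(const volatile void *addr)
{
	unsigned long flags;
	uint32_t val;

	local_irq_save(flags);
	val = ioread32(addr);
	local_irq_restore(flags);
	return val;
}
```

Whether this is a real fix or merely masks a silicon erratum is exactly the open question above; the wrapper only encodes the empirical observation.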
[PATCHv3] powerpc/crashkernel: take "mem=" option into account
The "mem=" option is an easy way to put high pressure on memory during testing. Hence after applying the memory limit, the actual usable memory, instead of the total memory, should be considered when reserving memory for crashkernel. Otherwise boot-up may run into OOM issues. E.g. it would reserve 4G prior to the change and 512M afterward, if passing crashkernel="2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G", and mem=5G on a 256G machine. This issue is powerpc specific because it puts higher priority on fadump and kdump reservation than on "mem=". Refer to the following code: if (fadump_reserve_mem() == 0) reserve_crashkernel(); ... /* Ensure that total memory size is page-aligned. */ limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE); memblock_enforce_memory_limit(limit); While on other arches, the effect of "mem=" takes a higher priority and passes through memblock_phys_mem_size() before calling reserve_crashkernel(). Signed-off-by: Pingfan Liu To: linuxppc-dev@lists.ozlabs.org Cc: Hari Bathini Cc: Michael Ellerman Cc: ke...@lists.infradead.org --- v2 -> v3: improve commit log arch/powerpc/kernel/machine_kexec.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/kernel/machine_kexec.c b/arch/powerpc/kernel/machine_kexec.c index c4ed328..eec96dc 100644 --- a/arch/powerpc/kernel/machine_kexec.c +++ b/arch/powerpc/kernel/machine_kexec.c @@ -114,11 +114,12 @@ void machine_kexec(struct kimage *image) void __init reserve_crashkernel(void) { - unsigned long long crash_size, crash_base; + unsigned long long crash_size, crash_base, total_mem_sz; int ret; + total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size(); /* use common parsing */ - ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(), + ret = parse_crashkernel(boot_command_line, total_mem_sz, &crash_size, &crash_base); if (ret == 0 && crash_size > 0) { crashk_res.start = crash_base; @@ -185,7 +186,7 @@ void __init reserve_crashkernel(void) "for crashkernel (System RAM: %ldMB)\n", (unsigned long)(crash_size >> 20), (unsigned long)(crashk_res.start >> 20), - (unsigned long)(memblock_phys_mem_size() >> 20)); + (unsigned long)(total_mem_sz >> 20)); if (!memblock_is_region_memory(crashk_res.start, crash_size) || memblock_reserve(crashk_res.start, crash_size)) { -- 2.7.5
Re: [PATCH v2 07/13] powerpc: add support for folded p4d page tables
On Wed, Feb 19, 2020 at 01:07:55PM +0100, Christophe Leroy wrote: > > Le 16/02/2020 à 09:18, Mike Rapoport a écrit : > > diff --git a/arch/powerpc/mm/ptdump/ptdump.c > > b/arch/powerpc/mm/ptdump/ptdump.c > > index 206156255247..7bd4b81d5b5d 100644 > > --- a/arch/powerpc/mm/ptdump/ptdump.c > > +++ b/arch/powerpc/mm/ptdump/ptdump.c > > @@ -277,9 +277,9 @@ static void walk_pmd(struct pg_state *st, pud_t *pud, > > unsigned long start) > > } > > } > > -static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start) > > +static void walk_pud(struct pg_state *st, p4d_t *p4d, unsigned long start) > > { > > - pud_t *pud = pud_offset(pgd, 0); > > + pud_t *pud = pud_offset(p4d, 0); > > unsigned long addr; > > unsigned int i; > > @@ -293,6 +293,22 @@ static void walk_pud(struct pg_state *st, pgd_t *pgd, > > unsigned long start) > > } > > } > > +static void walk_p4d(struct pg_state *st, pgd_t *pgd, unsigned long start) > > +{ > > + p4d_t *p4d = p4d_offset(pgd, 0); > > + unsigned long addr; > > + unsigned int i; > > + > > + for (i = 0; i < PTRS_PER_P4D; i++, p4d++) { > > + addr = start + i * P4D_SIZE; > > + if (!p4d_none(*p4d) && !p4d_is_leaf(*p4d)) > > + /* p4d exists */ > > + walk_pud(st, p4d, addr); > > + else > > + note_page(st, addr, 2, p4d_val(*p4d)); > > Level 2 is already used by walk_pud(). > > I think you have to increment the level used in walk_pud() and walk_pmd() > and walk_pte() Thanks for catching this! I'll fix the numbers in the next version. > > + } > > +} > > + > > static void walk_pagetables(struct pg_state *st) > > { > > unsigned int i; > > @@ -306,7 +322,7 @@ static void walk_pagetables(struct pg_state *st) > > for (i = pgd_index(addr); i < PTRS_PER_PGD; i++, pgd++, addr += > > PGDIR_SIZE) { > > if (!pgd_none(*pgd) && !pgd_is_leaf(*pgd)) > > /* pgd exists */ > > - walk_pud(st, pgd, addr); > > + walk_p4d(st, pgd, addr); > > else > > note_page(st, addr, 1, pgd_val(*pgd)); > > } > > Christophe -- Sincerely yours, Mike.
Re: Surprising code generated for vdso_read_begin()
On Wed, Feb 19, 2020 at 10:52:16AM +0100, Arnd Bergmann wrote: > On Wed, Feb 19, 2020 at 9:45 AM Christophe Leroy > wrote: > > Le 16/02/2020 à 19:10, Arnd Bergmann a écrit : > > > On Sat, Jan 11, 2020 at 12:33 PM Segher Boessenkool > > > wrote: > > >> > > >> On Fri, Jan 10, 2020 at 07:45:44AM +0100, Christophe Leroy wrote: > > >>> Le 09/01/2020 à 21:07, Segher Boessenkool a écrit : > > It looks like the compiler did loop peeling. What GCC version is this? > > Please try current trunk (to become GCC 10), or at least GCC 9? > > >>> > > >>> It is with GCC 5.5 > > >>> > > >>> https://mirrors.edge.kernel.org/pub/tools/crosstool/ doesn't have more > > >>> recent than 8.1 > > >> > > >> Arnd, can you update the tools? We are at 8.3 and 9.2 now :-) Or is > > >> this hard and/or painful to do? > > > > > > To follow up on this older thread, I have now uploaded 6.5, 7.5, 8.3 and > > > 9.2 > > > binaries, as well as a recent 10.0 snapshot. > > > > > > > Thanks Arnd, > > > > I have built the VDSO with 9.2, I get less performant results than with > > 8.2 (same performance as with 5.5). > > > > After a quick look, I see: > > - Irrelevant NOPs to align loops and stuff, although -mcpu=860 should > > avoid that. > > - A stack frame is set for saving r31 in __c_kernel_clock_gettime. GCC > > 8.1 doesn't need that, all VDSO functions are frameless with 8.1 > > If you think it should be fixed in gcc, maybe try to reproduce it in > https://godbolt.org/ (Feel free to skip this step; and don't put links to godbolt (or anything else external) in our bugzilla, please; such links go stale before you know it.) > and open a gcc bug against that. Yes please :-) > Also, please try the gcc-10 snapshot, which has the highest chance > of getting fixes if it shows the same issue (or worse). If it is a regression, chances are it will be backported. (But not to 9.3, which is due in just a few weeks, just like 8.4). 
If it is just a side effect of some other change, it will probably *not* be undone, not on trunk (GCC 10) either. It depends. But sure, always test trunk if you can. Segher
Re: [PATCH] powerpc/entry: Fix a #if which should be a #ifdef in entry_32.S
On Tue, 2020-02-18 at 14:09:29 UTC, Christophe Leroy wrote: > Fixes: 12c3f1fd87bf ("powerpc/32s: get rid of CPU_FTR_601 feature") > Cc: sta...@vger.kernel.org > Signed-off-by: Christophe Leroy Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/9eb425b2e04e0e3006adffea5bf5f227a896f128 cheers
Re: [PATCH] powerpc/xmon: Fix whitespace handling in getstring()
On Mon, 2020-02-17 at 04:13:43 UTC, Oliver O'Halloran wrote: > The ls (lookup symbol) and zr (reboot) commands use xmon's getstring() > helper to read a string argument from the xmon prompt. This function skips > over leading whitespace, but doesn't check if the first "non-whitespace" > character is a newline, which causes some odd behaviour (<enter> indicates > the enter key was pressed): > > 0:mon> ls printk > printk: c01680c4 > > 0:mon> ls > printk > Symbol ' > printk' not found. > 0:mon> > > With commit 2d9b332d99b ("powerpc/xmon: Allow passing an argument > to ppc_md.restart()") we have a similar problem with the zr command. > Previously zr took no arguments, so "zr<enter>" would trigger a reboot. > With that patch applied a second newline needs to be sent in order for > the reboot to occur. Fix this by checking if the leading whitespace > ended on a newline: > > 0:mon> ls > Symbol '' not found. > > Fixes: 2d9b332d99b ("powerpc/xmon: Allow passing an argument to > ppc_md.restart()") > Reported-by: Michael Ellerman > Signed-off-by: Oliver O'Halloran Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/066bc3576e653b615ee3f5230a89d69c8ebeeb71 cheers
Re: [PATCH v4] powerpc/32s: Fix DSI and ISI exceptions for CONFIG_VMAP_STACK
On Sat, 2020-02-15 at 10:14:25 UTC, Christophe Leroy wrote: > hash_page() needs to read page tables from kernel memory. When the entire > kernel memory is mapped by BATs, which is normally the case when > CONFIG_STRICT_KERNEL_RWX is not set, it works even if the page hosting > the page table is not referenced in the MMU hash table. > > However, if the page where the page table resides is not covered by > a BAT, a DSI fault can be encountered from hash_page(), and it loops > forever. This can happen when CONFIG_STRICT_KERNEL_RWX is selected > and the alignment of the different regions is too small to allow > covering the entire memory with BATs. This also happens when > CONFIG_DEBUG_PAGEALLOC is selected or when booting with the 'nobats' > flag. > > Also, if the page containing the kernel stack is not present in the > MMU hash table, registers cannot be saved and a recursive DSI fault > is encountered. > > To allow hash_page() to properly do its job at all times and load the > MMU hash table whenever needed, it must run with the data MMU disabled. > This means it must be called before re-enabling the data MMU. To allow > this, registers clobbered by hash_page() and create_hpte() have to > be saved in the thread struct together with SRR0, SRR1, DAR and DSISR. > It is also necessary to ensure that the DSI prolog doesn't overwrite > regs saved by the prolog of the currently running exception. That means: > - DSI can only use SPRN_SPRG_SCRATCH0 > - Exceptions must free SPRN_SPRG_SCRATCH0 before writing to the stack. > > This also fixes the Oops reported by Erhard when create_hpte() is > called by add_hash_page(). > > Due to prolog size increase, a few more exceptions had to get split > in two parts. > > Fixes: cd08f109e262 ("powerpc/32s: Enable CONFIG_VMAP_STACK") > Reported-by: Erhard F. > Link: https://bugzilla.kernel.org/show_bug.cgi?id=206501 > Signed-off-by: Christophe Leroy > Tested-by: Erhard F. > Tested-by: Larry Finger Applied to powerpc fixes, thanks. 
https://git.kernel.org/powerpc/c/232ca1eecafed8c54491017f0612c33d8c742d74 cheers
Re: [PATCH RESEND with Fixes: tag] powerpc/chrp: Fix enter_rtas() with CONFIG_VMAP_STACK
On Fri, 2020-02-14 at 08:39:50 UTC, Christophe Leroy wrote: > With CONFIG_VMAP_STACK, data MMU has to be enabled > to read data on the stack. > > Fixes: cd08f109e262 ("powerpc/32s: Enable CONFIG_VMAP_STACK") > Signed-off-by: Christophe Leroy Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/5a528eb67908bcf493f3583cb4726fbd8e129605 cheers
Re: [PATCH] powerpc/6xx: Fix power_save_ppc32_restore() with CONFIG_VMAP_STACK
On Fri, 2020-02-14 at 06:53:00 UTC, Christophe Leroy wrote: > power_save_ppc32_restore() is called during exception entry, before > re-enabling the MMU. It subtracts KERNELBASE from the address > of nap_save_msscr0 to access it. > > With CONFIG_VMAP_STACK enabled, data MMU translation has already been > re-enabled, so power_save_ppc32_restore() has to access > nap_save_msscr0 by its virtual address. > > Reported-by: Larry Finger > Signed-off-by: Christophe Leroy > Fixes: cd08f109e262 ("powerpc/32s: Enable CONFIG_VMAP_STACK") > Tested-by: Larry Finger Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/477f3488a94e35380c82a7498d46f10fa5f3edd2 cheers
Re: [PATCH v3 1/3] powerpc/tm: Fix clearing MSR[TS] in current when reclaiming on signal delivery
On Tue, 2020-02-11 at 03:38:29 UTC, Gustavo Luiz Duarte wrote: > After a treclaim, we expect to be in non-transactional state. If we don't > clear > the current thread's MSR[TS] before we get preempted, then > tm_recheckpoint_new_task() will recheckpoint and we get rescheduled in > suspended transaction state. > > When handling a signal caught in transactional state, handle_rt_signal64() > calls get_tm_stackpointer() that treclaims the transaction using > tm_reclaim_current() but without clearing the thread's MSR[TS]. This can cause > the TM Bad Thing exception below if later we pagefault and get preempted > trying > to access the user's sigframe, using __put_user(). Afterwards, when we are > rescheduled back into do_page_fault() (but now in suspended state since the > thread's MSR[TS] was not cleared), upon executing 'rfid' after completion of > the page fault handling, the exception is raised because a transition from > suspended to non-transactional state is invalid. > > Unexpected TM Bad Thing exception at c000de44 (msr > 0x800302a03031) tm_scratch=80010280b033 > Oops: Unrecoverable exception, sig: 6 [#1] > LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries > Modules linked in: nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 > nf_defrag_ipv4 ip6_tables ip_tables nft_compat ip_set nf_tables nfnetlink xts > vmx_crypto sg virtio_balloon > r_mod cdrom virtio_net net_failover virtio_blk virtio_scsi failover > dm_mirror dm_region_hash dm_log dm_mod > CPU: 25 PID: 15547 Comm: a.out Not tainted 5.4.0-rc2 #32 > NIP: c000de44 LR: c0034728 CTR: > REGS: c0003fe7bd70 TRAP: 0700 Not tainted (5.4.0-rc2) > MSR: 800302a03031 CR: 44000884 > XER: > CFAR: c000dda4 IRQMASK: 0 > PACATMSCRATCH: 80010280b033 > GPR00: c0034728 c00f65a17c80 c1662800 > 7fffacf3fd78 > GPR04: 1000 1000 > c00f611f8af0 > GPR08: 78006001 > 000c > GPR12: c00f611f84b0 c0003ffcb200 > > GPR16: > > GPR20: > c00f611f8140 > GPR24: 7fffacf3fd68 c00f65a17d90 > c00f611f7800 > GPR28: c00f65a17e90 c00f65a17e90 
c1685e18 > 7fffacf3f000 > NIP [c000de44] fast_exception_return+0xf4/0x1b0 > LR [c0034728] handle_rt_signal64+0x78/0xc50 > Call Trace: > [c00f65a17c80] [c0034710] handle_rt_signal64+0x60/0xc50 > (unreliable) > [c00f65a17d30] [c0023640] do_notify_resume+0x330/0x460 > [c00f65a17e20] [c000dcc4] ret_from_except_lite+0x70/0x74 > Instruction dump: > 7c4ff120 e8410170 7c5a03a6 3840 f8410060 e8010070 e8410080 e8610088 > 6000 6000 e8810090 e8210078 <4c24> 4800 e8610178 > 88ed0989 > ---[ end trace 93094aa44b442f87 ]--- > > The simplified sequence of events that triggers the above exception is: > > ... # userspace in NON-TRANSACTIONAL state > tbegin # userspace in TRANSACTIONAL state > signal delivery # kernelspace in SUSPENDED state > handle_rt_signal64() > get_tm_stackpointer() > treclaim# kernelspace in NON-TRANSACTIONAL state > __put_user() > page fault happens. We will never get back here because of the TM Bad > Thing exception. > > page fault handling kicks in and we voluntarily preempt ourselves > do_page_fault() > __schedule() > __switch_to(other_task) > > our task is rescheduled and we recheckpoint because the thread's MSR[TS] > was not cleared > __switch_to(our_task) > switch_to_tm() > tm_recheckpoint_new_task() > trechkpt # kernelspace in SUSPENDED state > > The page fault handling resumes, but now we are in suspended transaction > state > do_page_fault()completes > rfid <- trying to get back where the page fault happened (we were > non-transactional back then) > TM Bad Thing# illegal transition from suspended to > non-transactional > > This patch fixes that issue by clearing the current thread's MSR[TS] just > after > treclaim in get_tm_stackpointer() so that we stay in non-transactional state > in > case we are preempted. In order to make treclaim and clearing the thread's > MSR[TS] atomic from a preemption perspective when CONFIG_PREEMPT is set, > preempt_disable/enable() is used. 
It's also necessary to save the previous > value of the thread's MSR before get_tm_stackpointer() is called so that it > can > be exposed to the signal handler later in setup_t
Re: [PATCH] powerpc/8xx: Fix clearing of bits 20-23 in ITLB miss
On Sun, 2020-02-09 at 18:14:42 UTC, Christophe Leroy wrote: > In the ITLB miss handler, the line supposed to clear bits 20-23 of the > L2 ITLB entry is buggy and in fact does nothing, leading to an undefined > value which could allow execution when it shouldn't. > > Properly do the clearing with the relevant instruction. > > Fixes: 74fabcadfd43 ("powerpc/8xx: don't use r12/SPRN_SPRG_SCRATCH2 in TLB > Miss handlers") > Cc: sta...@vger.kernel.org > Signed-off-by: Christophe Leroy Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/a4031afb9d10d97f4d0285844abbc0ab04245304 cheers
Re: [PATCH v2] powerpc/hugetlb: Fix 8M hugepages on 8xx
On Sun, 2020-02-09 at 16:02:41 UTC, Christophe Leroy wrote: > With HW assistance all page tables must be 4k aligned; the 8xx > drops the last 12 bits during the walk. > > Redefine HUGEPD_SHIFT_MASK to mask the last 12 bits out. > HUGEPD_SHIFT_MASK is used for alignment of the page table cache. > > Fixes: 22569b881d37 ("powerpc/8xx: Enable 8M hugepage support with HW > assistance") > Cc: sta...@vger.kernel.org > Signed-off-by: Christophe Leroy Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/50a175dd18de7a647e72aca7daf4744e3a5a81e3 cheers
Re: [PATCH 1/1] powerpc/eeh: fix deadlock handling dead PHB
On Fri, 2020-02-07 at 04:57:31 UTC, Sam Bobroff wrote: > Recovering a dead PHB can currently cause a deadlock as the PCI > rescan/remove lock is taken twice. > > This is caused as part of an existing bug in > eeh_handle_special_event(). The pe is processed while traversing the > PHBs even though the pe is unrelated to the loop. This causes the pe > to be, incorrectly, processed more than once. > > Untangling this section can move the pe processing out of the loop and > also outside the locked section, correcting both problems. > > Signed-off-by: Sam Bobroff Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/d4f194ed9eb9841a8f978710e4d24296f791a85b cheers
Re: [PATCH] powerpc/hugetlb: Fix 512k hugepages on 8xx with 16k page size
On Thu, 2020-02-06 at 13:50:28 UTC, Christophe Leroy wrote: > Commit 55c8fc3f4930 ("powerpc/8xx: reintroduce 16K pages with HW > assistance") redefined pte_t as a struct of 4 pte_basic_t, because > in 16K pages mode there are four identical entries in the > page table. But the size of hugepage tables is calculated based > on the size of (void *). Therefore, we end up with page tables > of size 1k instead of 4k for 512k pages. > > As 512k hugepage tables are the same size as standard page tables, > i.e. 4k, use the standard page tables instead of PGT_CACHE tables. > > Fixes: 3fb69c6a1a13 ("powerpc/8xx: Enable 512k hugepage support with HW > assistance") > Cc: sta...@vger.kernel.org > Signed-off-by: Christophe Leroy Applied to powerpc fixes, thanks. https://git.kernel.org/powerpc/c/f2b67ef90b0d5eca0f2255e02cf2f620bc0ddcdb cheers
Re: [PATCH v2 07/13] powerpc: add support for folded p4d page tables
Le 16/02/2020 à 09:18, Mike Rapoport a écrit : From: Mike Rapoport Implement primitives necessary for the 4th level folding, add walks of p4d level where appropriate and replace 5level-fixup.h with pgtable-nop4d.h. Signed-off-by: Mike Rapoport Tested-by: Christophe Leroy # 8xx and 83xx --- arch/powerpc/include/asm/book3s/32/pgtable.h | 1 - arch/powerpc/include/asm/book3s/64/hash.h | 4 +- arch/powerpc/include/asm/book3s/64/pgalloc.h | 4 +- arch/powerpc/include/asm/book3s/64/pgtable.h | 58 ++ arch/powerpc/include/asm/book3s/64/radix.h| 6 +- arch/powerpc/include/asm/nohash/32/pgtable.h | 1 - arch/powerpc/include/asm/nohash/64/pgalloc.h | 2 +- .../include/asm/nohash/64/pgtable-4k.h| 32 +- arch/powerpc/include/asm/nohash/64/pgtable.h | 6 +- arch/powerpc/include/asm/pgtable.h| 8 +++ arch/powerpc/kvm/book3s_64_mmu_radix.c| 59 --- arch/powerpc/lib/code-patching.c | 7 ++- arch/powerpc/mm/book3s32/mmu.c| 2 +- arch/powerpc/mm/book3s32/tlb.c| 4 +- arch/powerpc/mm/book3s64/hash_pgtable.c | 4 +- arch/powerpc/mm/book3s64/radix_pgtable.c | 19 -- arch/powerpc/mm/book3s64/subpage_prot.c | 6 +- arch/powerpc/mm/hugetlbpage.c | 28 + arch/powerpc/mm/kasan/kasan_init_32.c | 8 +-- arch/powerpc/mm/mem.c | 4 +- arch/powerpc/mm/nohash/40x.c | 4 +- arch/powerpc/mm/nohash/book3e_pgtable.c | 15 +++-- arch/powerpc/mm/pgtable.c | 25 +++- arch/powerpc/mm/pgtable_32.c | 28 + arch/powerpc/mm/pgtable_64.c | 10 ++-- arch/powerpc/mm/ptdump/hashpagetable.c| 20 ++- arch/powerpc/mm/ptdump/ptdump.c | 22 ++- arch/powerpc/xmon/xmon.c | 17 +- 28 files changed, 284 insertions(+), 120 deletions(-) diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c index 206156255247..7bd4b81d5b5d 100644 --- a/arch/powerpc/mm/ptdump/ptdump.c +++ b/arch/powerpc/mm/ptdump/ptdump.c @@ -277,9 +277,9 @@ static void walk_pmd(struct pg_state *st, pud_t *pud, unsigned long start) } } -static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start) +static void walk_pud(struct pg_state *st, p4d_t 
*p4d, unsigned long start) { - pud_t *pud = pud_offset(pgd, 0); + pud_t *pud = pud_offset(p4d, 0); unsigned long addr; unsigned int i; @@ -293,6 +293,22 @@ static void walk_pud(struct pg_state *st, pgd_t *pgd, unsigned long start) } } +static void walk_p4d(struct pg_state *st, pgd_t *pgd, unsigned long start) +{ + p4d_t *p4d = p4d_offset(pgd, 0); + unsigned long addr; + unsigned int i; + + for (i = 0; i < PTRS_PER_P4D; i++, p4d++) { + addr = start + i * P4D_SIZE; + if (!p4d_none(*p4d) && !p4d_is_leaf(*p4d)) + /* p4d exists */ + walk_pud(st, p4d, addr); + else + note_page(st, addr, 2, p4d_val(*p4d)); Level 2 is already used by walk_pud(). I think you have to increment the level used in walk_pud() and walk_pmd() and walk_pte() + } +} + static void walk_pagetables(struct pg_state *st) { unsigned int i; @@ -306,7 +322,7 @@ static void walk_pagetables(struct pg_state *st) for (i = pgd_index(addr); i < PTRS_PER_PGD; i++, pgd++, addr += PGDIR_SIZE) { if (!pgd_none(*pgd) && !pgd_is_leaf(*pgd)) /* pgd exists */ - walk_pud(st, pgd, addr); + walk_p4d(st, pgd, addr); else note_page(st, addr, 1, pgd_val(*pgd)); } Christophe
Re: powerpc Linux scv support and scv system call ABI proposal
Tulio Magno Quites Machado Filho's on January 30, 2020 1:51 am: > Nicholas Piggin writes: > >> Adhemerval Zanella's on January 29, 2020 3:26 am: >>> >>> We already had to push a similar hack where glibc used to abort transactions >>> prior to syscalls to avoid some side-effects on the kernel (commit 56cf2763819d2f). >>> It was eventually removed from syscall handling by f0458cf4f9ff3d870, where >>> we only enable TLE if the kernel supports PPC_FEATURE2_HTM_NOSC. >>> >>> The transaction syscall abort used to read a variable directly from TCB, >>> so this could be an option. I would expect that we could optimize it where >>> if glibc is building against a recent kernel and the compiler is building >>> for an ISA 3.0+ cpu we could remove the 'sc' code. >>> >> >> We would just have to be careful of running on ISA 3.0 CPUs on older >> kernels which do not support scv. > > Can we assume that, if a syscall is available through sc it's also available > in scv 0? Was on vacation, thanks for waiting. Yes, except for the difference in calling convention, we would require that the syscalls available to `sc` are exactly the same as for `scv 0`. This happens as a natural consequence of the kernel implementation, which re-uses code to select the syscall. > > Because if that's true, I believe your suggestion to interpret > PPC_FEATURE2_SCV > as scv 0 support would be helpful to provide this support via IFUNC even > when glibc is built using --with-cpu=power8, which is the most common scenario > in ppc64le. > > In that scenario, it seems new HWCAP bits for new vectors wouldn't be too > frequent, which was the only downside of this proposal. Okay, good feedback, thanks. Thanks, Nick
[PATCH] powerpc/xmon: Lower limits on nidump and ndump
In xmon we have two variables that are used by the dump commands. There's ndump which is the number of bytes to dump using 'd', and nidump which is the number of instructions to dump using 'di'. ndump starts as 64 and nidump starts as 16, but both can be set by the user. It's fairly common to be pasting addresses into xmon when trying to debug something, and if you inadvertently double paste an address like so: 0:mon> di c2101f6c c2101f6c The second value is interpreted as the number of instructions to dump. Luckily it doesn't dump 13 quintillion instructions; the value is limited to MAX_DUMP (128K). But as each instruction is dumped on a single line, that's still a lot of output. On a slow console that can take multiple minutes to print. If you were "just popping in and out of xmon quickly before the RCU/hardlockup detector fires" you are now having a bad day. Things are not as bad with 'd' because we print 16 bytes per line, so it's fewer lines. But it's still quite a lot. So shrink the maximum for 'd' to 64K (one page), which is 4096 lines. For 'di', add a new limit which is the above / 4, because instructions are 4 bytes, meaning again we can dump at most one page. 
Signed-off-by: Michael Ellerman --- arch/powerpc/xmon/xmon.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c index e8c84d265602..722bf7ed0eda 100644 --- a/arch/powerpc/xmon/xmon.c +++ b/arch/powerpc/xmon/xmon.c @@ -81,8 +81,9 @@ static bool xmon_is_ro = IS_ENABLED(CONFIG_XMON_DEFAULT_RO_MODE); static unsigned long adrs; static int size = 1; -#define MAX_DUMP (128 * 1024) +#define MAX_DUMP (64 * 1024) static unsigned long ndump = 64; +#define MAX_IDUMP (MAX_DUMP >> 2) static unsigned long nidump = 16; static unsigned long ncsum = 4096; static int termch; @@ -2756,8 +2757,8 @@ dump(void) scanhex(&nidump); if (nidump == 0) nidump = 16; - else if (nidump > MAX_DUMP) - nidump = MAX_DUMP; + else if (nidump > MAX_IDUMP) + nidump = MAX_IDUMP; adrs += ppc_inst_dump(adrs, nidump, 1); last_cmd = "di\n"; } else if (c == 'l') { -- 2.21.1
Re: Surprising code generated for vdso_read_begin()
On Wed, Feb 19, 2020 at 9:45 AM Christophe Leroy wrote: > Le 16/02/2020 à 19:10, Arnd Bergmann a écrit : > > On Sat, Jan 11, 2020 at 12:33 PM Segher Boessenkool > > wrote: > >> > >> On Fri, Jan 10, 2020 at 07:45:44AM +0100, Christophe Leroy wrote: > >>> Le 09/01/2020 à 21:07, Segher Boessenkool a écrit : > It looks like the compiler did loop peeling. What GCC version is this? > Please try current trunk (to become GCC 10), or at least GCC 9? > >>> > >>> It is with GCC 5.5 > >>> > >>> https://mirrors.edge.kernel.org/pub/tools/crosstool/ doesn't have more > >>> recent than 8.1 > >> > >> Arnd, can you update the tools? We are at 8.3 and 9.2 now :-) Or is > >> this hard and/or painful to do? > > > > To follow up on this older thread, I have now uploaded 6.5, 7.5, 8.3 and 9.2 > > binaries, as well as a recent 10.0 snapshot. > > > > Thanks Arnd, > > I have built the VDSO with 9.2, I get less performant results than with > 8.2 (same performance as with 5.5). > > After a quick look, I see: > - Irrelevant NOPs to align loops and stuff, although -mcpu=860 should > avoid that. > - A stack frame is set for saving r31 in __c_kernel_clock_gettime. GCC > 8.1 doesn't need that, all VDSO functions are frameless with 8.1 If you think it should be fixed in gcc, maybe try to reproduce it in https://godbolt.org/ and open a gcc bug against that. Also, please try the gcc-10 snapshot, which has the highest chance of getting fixes if it shows the same issue (or worse). Arnd
Re: Surprising code generated for vdso_read_begin()
Le 16/02/2020 à 19:10, Arnd Bergmann a écrit : On Sat, Jan 11, 2020 at 12:33 PM Segher Boessenkool wrote: On Fri, Jan 10, 2020 at 07:45:44AM +0100, Christophe Leroy wrote: Le 09/01/2020 à 21:07, Segher Boessenkool a écrit : It looks like the compiler did loop peeling. What GCC version is this? Please try current trunk (to become GCC 10), or at least GCC 9? It is with GCC 5.5 https://mirrors.edge.kernel.org/pub/tools/crosstool/ doesn't have more recent than 8.1 Arnd, can you update the tools? We are at 8.3 and 9.2 now :-) Or is this hard and/or painful to do? To follow up on this older thread, I have now uploaded 6.5, 7.5, 8.3 and 9.2 binaries, as well as a recent 10.0 snapshot. Thanks Arnd, I have built the VDSO with 9.2, I get less performant results than with 8.2 (same performance as with 5.5). After a quick look, I see: - Irrelevant NOPs to align loops and stuff, although -mcpu=860 should avoid that. - A stack frame is set for saving r31 in __c_kernel_clock_gettime. GCC 8.1 doesn't need that, all VDSO functions are frameless with 8.1 Christophe
[PATCH] powerpc/kprobes: Blacklist functions running with MMU disabled on PPC32
kprobes does not handle events happening in real mode, so all functions running with the MMU disabled have to be blacklisted. As already done for PPC64, do it for PPC32. Signed-off-by: Christophe Leroy --- arch/powerpc/include/asm/ppc_asm.h | 10 +++ arch/powerpc/kernel/cpu_setup_6xx.S | 4 +- arch/powerpc/kernel/entry_32.S | 68 arch/powerpc/kernel/fpu.S| 1 + arch/powerpc/kernel/idle_6xx.S | 2 +- arch/powerpc/kernel/idle_e500.S | 2 +- arch/powerpc/kernel/l2cr_6xx.S | 2 +- arch/powerpc/kernel/misc.S | 2 + arch/powerpc/kernel/misc_32.S| 4 +- arch/powerpc/kernel/swsusp_32.S | 6 +- arch/powerpc/kernel/vector.S | 1 + arch/powerpc/mm/book3s32/hash_low.S | 38 +-- arch/powerpc/mm/mem.c| 2 + arch/powerpc/platforms/52xx/lite5200_sleep.S | 2 + arch/powerpc/platforms/82xx/pq2.c| 1 + arch/powerpc/platforms/83xx/suspend-asm.S| 1 + arch/powerpc/platforms/powermac/cache.S | 2 + arch/powerpc/platforms/powermac/sleep.S | 13 ++-- 18 files changed, 85 insertions(+), 76 deletions(-) diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h index 6b03dff61a05..e8f34ba89497 100644 --- a/arch/powerpc/include/asm/ppc_asm.h +++ b/arch/powerpc/include/asm/ppc_asm.h @@ -267,8 +267,18 @@ GLUE(.,name): .pushsection "_kprobe_blacklist","aw"; \ PPC_LONG (entry) ; \ .popsection +#define _NOKPROBE_ENTRY(entry) \ + _ASM_NOKPROBE_SYMBOL(entry) \ + _ENTRY(entry) +#define _NOKPROBE_GLOBAL(entry)\ + _ASM_NOKPROBE_SYMBOL(entry) \ + _GLOBAL(entry) #else #define _ASM_NOKPROBE_SYMBOL(entry) +#define _NOKPROBE_ENTRY(entry) \ + _ENTRY(entry) +#define _NOKPROBE_GLOBAL(entry)\ + _GLOBAL(entry) #endif #define FUNC_START(name) _GLOBAL(name) diff --git a/arch/powerpc/kernel/cpu_setup_6xx.S b/arch/powerpc/kernel/cpu_setup_6xx.S index f6517f67265a..1cb947268546 100644 --- a/arch/powerpc/kernel/cpu_setup_6xx.S +++ b/arch/powerpc/kernel/cpu_setup_6xx.S @@ -276,7 +276,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_NO_DPM) * in some 750 cpus where using a not yet initialized FPU register after * power on reset may 
hang the CPU */ -_GLOBAL(__init_fpu_registers) +_NOKPROBE_GLOBAL(__init_fpu_registers) mfmsr r10 ori r11,r10,MSR_FP mtmsr r11 @@ -381,7 +381,7 @@ _GLOBAL(__save_cpu_setup) * restore CPU state as backed up by the previous * function. This does not include cache setting */ -_GLOBAL(__restore_cpu_setup) +_NOKPROBE_GLOBAL(__restore_cpu_setup) /* Some CR fields are volatile, we back it up all */ mfcrr7 diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S index 0713daa651d9..cf9a7640abf0 100644 --- a/arch/powerpc/kernel/entry_32.S +++ b/arch/powerpc/kernel/entry_32.S @@ -44,24 +44,21 @@ .align 12 #ifdef CONFIG_BOOKE - .globl mcheck_transfer_to_handler -mcheck_transfer_to_handler: +_NOKPROBE_ENTRY(mcheck_transfer_to_handler) mfspr r0,SPRN_DSRR0 stw r0,_DSRR0(r11) mfspr r0,SPRN_DSRR1 stw r0,_DSRR1(r11) /* fall through */ - .globl debug_transfer_to_handler -debug_transfer_to_handler: +_NOKPROBE_ENTRY(debug_transfer_to_handler) mfspr r0,SPRN_CSRR0 stw r0,_CSRR0(r11) mfspr r0,SPRN_CSRR1 stw r0,_CSRR1(r11) /* fall through */ - .globl crit_transfer_to_handler -crit_transfer_to_handler: +_NOKPROBE_ENTRY(crit_transfer_to_handler) #ifdef CONFIG_PPC_BOOK3E_MMU mfspr r0,SPRN_MAS0 stw r0,MAS0(r11) @@ -97,8 +94,7 @@ crit_transfer_to_handler: #endif #ifdef CONFIG_40x - .globl crit_transfer_to_handler -crit_transfer_to_handler: +_NOKPROBE_ENTRY(crit_transfer_to_handler) lwz r0,crit_r10@l(0) stw r0,GPR10(r11) lwz r0,crit_r11@l(0) @@ -124,13 +120,11 @@ crit_transfer_to_handler: * Note that we rely on the caller having set cr0.eq iff the exception * occurred in kernel mode (i.e. MSR:PR = 0). 
*/ - .globl transfer_to_handler_full -transfer_to_handler_full: +_NOKPROBE_ENTRY(transfer_to_handler_full) SAVE_NVGPRS(r11) /* fall through */ - .globl transfer_to_handler -transfer_to_handler: +_NOKPROBE_ENTRY(transfer_to_handler) stw r2,GPR2(r11) stw r12,_NIP(r11) stw r9,_MSR(r11) @@ -194,8 +188,7 @@ transfer_to_handler: bt- 31-TLF_NAPPING,4f bt- 31-TLF_SLEEPING,7f #endif /* CONFIG_PPC_BOOK3S_32 || CONFIG_E500 */ - .globl transfer_t
[PATCH v2] powerpc/kprobes: Remove redundant code
At the time being we have something like if (something) { p = get(); if (p) { if (something_wrong) goto out; ... return; } else if (a != b) { if (some_error) goto out; ... } goto out; } p = get(); if (!p) { if (a != b) { if (some_error) goto out; ... } goto out; } This is similar to p = get(); if (!p) { if (a != b) { if (some_error) goto out; ... } goto out; } if (something) { if (something_wrong) goto out; ... return; } Signed-off-by: Christophe Leroy --- v2: Reverse the logic by testing (!p) before kprobe_running() as suggested by Naveen. --- arch/powerpc/kernel/kprobes.c | 80 ++- 1 file changed, 32 insertions(+), 48 deletions(-) diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index 9b340af02c38..84567406b53d 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -274,54 +274,6 @@ int kprobe_handler(struct pt_regs *regs) preempt_disable(); kcb = get_kprobe_ctlblk(); - /* Check we're not actually recursing */ - if (kprobe_running()) { - p = get_kprobe(addr); - if (p) { - kprobe_opcode_t insn = *p->ainsn.insn; - if (kcb->kprobe_status == KPROBE_HIT_SS && - is_trap(insn)) { - /* Turn off 'trace' bits */ - regs->msr &= ~MSR_SINGLESTEP; - regs->msr |= kcb->kprobe_saved_msr; - goto no_kprobe; - } - /* We have reentered the kprobe_handler(), since -* another probe was hit while within the handler. -* We here save the original kprobes variables and -* just single step on the instruction of the new probe -* without calling any user handlers. 
-*/ - save_previous_kprobe(kcb); - set_current_kprobe(p, regs, kcb); - kprobes_inc_nmissed_count(p); - kcb->kprobe_status = KPROBE_REENTER; - if (p->ainsn.boostable >= 0) { - ret = try_to_emulate(p, regs); - - if (ret > 0) { - restore_previous_kprobe(kcb); - preempt_enable_no_resched(); - return 1; - } - } - prepare_singlestep(p, regs); - return 1; - } else if (*addr != BREAKPOINT_INSTRUCTION) { - /* If trap variant, then it belongs not to us */ - kprobe_opcode_t cur_insn = *addr; - - if (is_trap(cur_insn)) - goto no_kprobe; - /* The breakpoint instruction was removed by -* another cpu right after we hit, no further -* handling of this interrupt is appropriate -*/ - ret = 1; - } - goto no_kprobe; - } - p = get_kprobe(addr); if (!p) { if (*addr != BREAKPOINT_INSTRUCTION) { @@ -346,6 +298,38 @@ int kprobe_handler(struct pt_regs *regs) goto no_kprobe; } + /* Check we're not actually recursing */ + if (kprobe_running()) { + kprobe_opcode_t insn = *p->ainsn.insn; + if (kcb->kprobe_status == KPROBE_HIT_SS && is_trap(insn)) { + /* Turn off 'trace' bits */ + regs->msr &= ~MSR_SINGLESTEP; + regs->msr |= kcb->kprobe_saved_msr; + goto no_kprobe; + } + /* We have reentered the kprobe_handler(), since +* another probe was hit while within the handler. +* We here save the original kprobes variables and +* just single step on the instruction of the new probe +* without calling any user handlers. +*/ +