[PATCH v8 6/6] powerpc/code-patching: Use CPU local patch address directly

2022-10-20 Thread Benjamin Gray
With the isolated mm context support, there is a CPU local variable that
can hold the patch address. Use it instead of adding a level of
indirection through the text_poke_area vm_struct.

Signed-off-by: Benjamin Gray 
---
 arch/powerpc/lib/code-patching.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index eabdd74a26c0..ce58c1b3fcf1 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -122,6 +122,7 @@ static int text_area_cpu_up(unsigned int cpu)
unmap_patch_area(addr);
 
this_cpu_write(text_poke_area, area);
+   this_cpu_write(cpu_patching_addr, addr);
 
return 0;
 }
@@ -365,7 +366,7 @@ static int __do_patch_instruction(u32 *addr, ppc_inst_t instr)
pte_t *pte;
unsigned long pfn = get_patch_pfn(addr);
 
-   text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr & PAGE_MASK;
+   text_poke_addr = (unsigned long)__this_cpu_read(cpu_patching_addr) & PAGE_MASK;
patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
 
pte = virt_to_kpte(text_poke_addr);
-- 
2.37.3



[PATCH v8 5/6] powerpc/code-patching: Use temporary mm for Radix MMU

2022-10-20 Thread Benjamin Gray
From: "Christopher M. Riedl" 

x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. Another benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use - this may include breakpoints set by perf.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE.
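
As a rough illustration of the address selection described above, the
sketch below maps a random number onto a page-aligned address within
[PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE]. It is a userspace sketch
under assumptions, not the code in this patch: the SKETCH_* constants
hard-code 64K pages and a 128TB window, sketch_patching_addr() is a
hypothetical name, and the random value is supplied by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration: 64K pages, 128TB map window. */
#define SKETCH_PAGE_SHIFT 16
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_MAP_WINDOW (UINT64_C(128) << 40)

/*
 * Map an arbitrary random value onto a page-aligned address in
 * [PAGE_SIZE, MAP_WINDOW - PAGE_SIZE]. Hypothetical helper.
 */
static uint64_t sketch_patching_addr(uint64_t random)
{
	uint64_t nr_pages = SKETCH_MAP_WINDOW / SKETCH_PAGE_SIZE;
	/* Valid page indices run from 1 to nr_pages - 1 inclusive. */
	uint64_t index = 1 + random % (nr_pages - 1);

	assert(index >= 1 && index < nr_pages);
	return index << SKETCH_PAGE_SHIFT;
}
```

Working with a page index and shifting at the end keeps the page
alignment obvious. The modulo introduces a slight bias, which this
sketch ignores.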

Bits of entropy with 64K page size on BOOK3S_64:

bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
bits of entropy = log2(128TB / 64K)
bits of entropy = 31
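
The arithmetic above can be checked with a small userspace sketch. The
helper names are hypothetical and the window and page sizes are passed
in by the caller; nothing here is taken from the kernel sources.

```c
#include <assert.h>
#include <stdint.h>

/* Floor of log2(x) for x > 0, computed by repeated shifting. */
static unsigned int sketch_ilog2(uint64_t x)
{
	unsigned int bits = 0;

	assert(x > 0);
	while (x >>= 1)
		bits++;
	return bits;
}

/* Bits of entropy when choosing one page out of a window. */
static unsigned int sketch_entropy_bits(uint64_t window, uint64_t page_size)
{
	assert(page_size > 0 && window >= page_size);
	return sketch_ilog2(window / page_size);
}
```

With a 128TB window this gives 31 bits for 64K pages; the same window
with 4K pages would give 35 bits.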

The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
operates - by default the space above DEFAULT_MAP_WINDOW is not
available. Currently the Hash MMU does not use a temporary mm so
technically this upper limit isn't necessary; however, a larger
randomization range does not further "harden" this overall approach and
future work may introduce patching with a temporary mm on Hash as well.

Randomization occurs only once during initialization for each CPU as it
comes online.

The patching page is mapped with PAGE_KERNEL to set EAA[0] for the PTE
which ignores the AMR (so no need to unlock/lock KUAP) according to
PowerISA v3.0b Figure 35 on Radix.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

---

Synchronisation is done according to Book 3 Chapter 13 "Synchronization
Requirements for Context Alterations". Switching the mm is a change to
the PID, which requires a context synchronising instruction before and
after the change, and a hwsync between the change and the last
instruction that performs address translation for an associated storage
access.

Instruction fetch is an associated storage access, but the instruction
address mappings are not being changed, so it should not matter which
context they use. We must still perform a hwsync to guard arbitrary
prior code that may have accessed a userspace address.

TLB invalidation is local and VA specific. Local because only this core
used the patching mm, and VA specific because we only care that the
writable mapping is purged. Leaving the other mappings intact is more
efficient, especially when performing many code patches in a row (e.g.,
as ftrace would).

Signed-off-by: Benjamin Gray 
---
 arch/powerpc/lib/code-patching.c | 226 ++-
 1 file changed, 221 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 9b9eba574d7e..eabdd74a26c0 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -4,12 +4,17 @@
  */
 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -42,11 +47,59 @@ int raw_patch_instruction(u32 *addr, ppc_inst_t instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
+static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
+static DEFINE_PER_CPU(pte_t *, cpu_patching_pte);
 
 static int map_patch_area(void *addr, unsigned long text_poke_addr);
 static void unmap_patch_area(unsigned long addr);
 
+struct temp_mm_state {
+   struct mm_struct *mm;
+};
+
+static bool mm_patch_enabled(void)
+{
+   return IS_ENABLED(CONFIG_SMP) && radix_enabled();
+}
+
+/*
+ * The following applies for Radix MMU. Hash MMU has different requirements,
+ * and so is not supported.
+ *
+ * Changing mm requires context synchronising instructions on both sides of
+ * the context switch, as well as a hwsync between the last instruction for
+ * which the address of an associated storage access was translated using
+ * the current context.
+ *
+ * switch_mm_irqs_off performs an isync after the context switch. It is
+ * the responsibility of the caller to perform the CSI and hwsync before
+ * starting/stopping the 

[PATCH v8 4/6] powerpc/tlb: Add local flush for page given mm_struct and psize

2022-10-20 Thread Benjamin Gray
Adds a local TLB flush operation that works given an mm_struct, VA to
flush, and page size representation.

This removes the need to create a vm_area_struct, which the temporary
patching mm work does not need.

Signed-off-by: Benjamin Gray 
---
 arch/powerpc/include/asm/book3s/32/tlbflush.h  | 9 +
 arch/powerpc/include/asm/book3s/64/tlbflush-hash.h | 5 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h  | 8 
 arch/powerpc/include/asm/nohash/tlbflush.h | 1 +
 4 files changed, 23 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/32/tlbflush.h b/arch/powerpc/include/asm/book3s/32/tlbflush.h
index ba1743c52b56..e5a688cebf69 100644
--- a/arch/powerpc/include/asm/book3s/32/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/32/tlbflush.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_POWERPC_BOOK3S_32_TLBFLUSH_H
 #define _ASM_POWERPC_BOOK3S_32_TLBFLUSH_H
 
+#include 
+
 #define MMU_NO_CONTEXT  (0)
 /*
  * TLB flushing for "classic" hash-MMU 32-bit CPUs, 6xx, 7xx, 7xxx
@@ -74,6 +76,13 @@ static inline void local_flush_tlb_page(struct vm_area_struct *vma,
 {
flush_tlb_page(vma, vmaddr);
 }
+
+static inline void local_flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr, int psize)
+{
+   BUILD_BUG_ON(psize != MMU_PAGE_4K);
+   flush_range(mm, vmaddr, vmaddr + PAGE_SIZE);
+}
+
 static inline void local_flush_tlb_mm(struct mm_struct *mm)
 {
flush_tlb_mm(mm);
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
index fab8332fe1ad..8fd9dc49b2a1 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush-hash.h
@@ -94,6 +94,11 @@ static inline void hash__local_flush_tlb_page(struct vm_area_struct *vma,
 {
 }
 
+static inline void hash__local_flush_tlb_page_psize(struct mm_struct *mm,
+   unsigned long vmaddr, int psize)
+{
+}
+
 static inline void hash__flush_tlb_page(struct vm_area_struct *vma,
unsigned long vmaddr)
 {
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 67655cd60545..2d839dd5c08c 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -92,6 +92,14 @@ static inline void local_flush_tlb_page(struct vm_area_struct *vma,
return hash__local_flush_tlb_page(vma, vmaddr);
 }
 
+static inline void local_flush_tlb_page_psize(struct mm_struct *mm,
+ unsigned long vmaddr, int psize)
+{
+   if (radix_enabled())
+   return radix__local_flush_tlb_page_psize(mm, vmaddr, psize);
+   return hash__local_flush_tlb_page_psize(mm, vmaddr, psize);
+}
+
 static inline void local_flush_all_mm(struct mm_struct *mm)
 {
if (radix_enabled())
diff --git a/arch/powerpc/include/asm/nohash/tlbflush.h b/arch/powerpc/include/asm/nohash/tlbflush.h
index bdaf34ad41ea..59bce0ebdcf4 100644
--- a/arch/powerpc/include/asm/nohash/tlbflush.h
+++ b/arch/powerpc/include/asm/nohash/tlbflush.h
@@ -58,6 +58,7 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
 extern void local_flush_tlb_mm(struct mm_struct *mm);
 extern void local_flush_tlb_page(struct vm_area_struct *vma, unsigned long vmaddr);
+extern void local_flush_tlb_page_psize(struct mm_struct *mm, unsigned long vmaddr, int psize);
 
 extern void __local_flush_tlb_page(struct mm_struct *mm, unsigned long vmaddr,
   int tsize, int ind);
-- 
2.37.3



[PATCH v8 3/6] powerpc/code-patching: Verify instruction patch succeeded

2022-10-20 Thread Benjamin Gray
Verifies that if the instruction patching did not return an error then
the value stored at the given address to patch is now equal to the
instruction we patched it to.
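
The shape of that check can be sketched in plain C. This is an
illustration only: sketch_patch() is a hypothetical stand-in for the
real patching routine and simply stores a 32-bit word, and
sketch_patch_mismatch() mirrors the condition that the WARN_ON() in
this patch tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in patching routine: 0 on success, negative on error. */
static int sketch_patch(uint32_t *addr, uint32_t instr)
{
	if (addr == NULL)
		return -1;
	*addr = instr;
	return 0;
}

/*
 * True when patching reported success but the word now stored at
 * addr differs from the instruction that was requested.
 */
static bool sketch_patch_mismatch(int err, const uint32_t *addr, uint32_t instr)
{
	return !err && *addr != instr;
}
```

A failed patch is deliberately not reported as a mismatch, since the
caller already receives the error code in that case.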

Signed-off-by: Benjamin Gray 
---
 arch/powerpc/lib/code-patching.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 34fc7ac34d91..9b9eba574d7e 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -186,6 +186,8 @@ static int do_patch_instruction(u32 *addr, ppc_inst_t instr)
err = __do_patch_instruction(addr, instr);
local_irq_restore(flags);
 
+   WARN_ON(!err && !ppc_inst_equal(instr, ppc_inst_read(addr)));
+
return err;
 }
 #else /* !CONFIG_STRICT_KERNEL_RWX */
-- 
2.37.3



[PATCH v8 2/6] powerpc/code-patching: Use WARN_ON and fix check in poking_init

2022-10-20 Thread Benjamin Gray
From: "Christopher M. Riedl" 

The latest kernel docs list BUG_ON() as 'deprecated' and say that it
should be replaced with WARN_ON() (or pr_warn()) when possible. The
BUG_ON() in poking_init() warrants a WARN_ON() rather than a pr_warn()
since the error condition is deemed "unreachable".

Also take this opportunity to fix the failure check in the WARN_ON():
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ...) returns a positive integer
on success and a negative integer on failure.
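
To see why the old check was wrong, compare the two conditions on
sample return values. This is a standalone sketch with hypothetical
helper names; it assumes only what is stated above, namely a positive
value on success and a negative value (such as -ENOMEM) on failure.

```c
#include <assert.h>
#include <stdbool.h>

/* Old condition, as in BUG_ON(!ret): fires only when ret is exactly 0. */
static bool sketch_old_check_fires(int ret)
{
	return !ret;
}

/* New condition, as in WARN_ON(ret < 0): fires on any negative value. */
static bool sketch_new_check_fires(int ret)
{
	return ret < 0;
}
```

For a negative return value the old condition stays silent, so a
failed setup went unnoticed; the new condition catches it.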

Signed-off-by: Benjamin Gray 
---
 arch/powerpc/lib/code-patching.c | 13 +
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index ad0cf3108dd0..34fc7ac34d91 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -81,16 +81,13 @@ static int text_area_cpu_down(unsigned int cpu)
 
 static __ro_after_init DEFINE_STATIC_KEY_FALSE(poking_init_done);
 
-/*
- * Although BUG_ON() is rude, in this case it should only happen if ENOMEM, and
- * we judge it as being preferable to a kernel that will crash later when
- * someone tries to use patch_instruction().
- */
 void __init poking_init(void)
 {
-   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
-   "powerpc/text_poke:online", text_area_cpu_up,
-   text_area_cpu_down));
+   WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+ "powerpc/text_poke:online",
+ text_area_cpu_up,
+ text_area_cpu_down) < 0);
+
static_branch_enable(&poking_init_done);
 }
 
-- 
2.37.3



[PATCH v8 1/6] powerpc: Allow clearing and restoring registers independent of saved breakpoint state

2022-10-20 Thread Benjamin Gray
From: Jordan Niethe 

For the coming temporary mm used for instruction patching, the
breakpoint registers need to be cleared to prevent them from
accidentally being triggered. As soon as the patching is done, the
breakpoints will be restored. The breakpoint state is stored in the per
cpu variable current_brk[]. Add a pause_breakpoints() function which will
clear the breakpoint registers without touching the state in
current_brk[]. Add a paired function unpause_breakpoints() which will
copy the state in current_brk[] back to the registers.
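
The split between saved state and hardware state can be modelled in
userspace. Everything in this sketch is hypothetical: sketch_saved[]
plays the role of current_brk[], sketch_hw[] stands in for the
breakpoint registers, and the struct fields are invented for the
example.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_NR_SLOTS 2

struct sketch_brk {
	unsigned long address;
	unsigned int type;
};

/* Saved state (role of current_brk[]) and mock hardware registers. */
static struct sketch_brk sketch_saved[SKETCH_NR_SLOTS];
static struct sketch_brk sketch_hw[SKETCH_NR_SLOTS];

/* Program the mock hardware only; the saved state is untouched. */
static void sketch_write_hw(int nr, const struct sketch_brk *brk)
{
	assert(nr >= 0 && nr < SKETCH_NR_SLOTS);
	sketch_hw[nr] = *brk;
}

/* Record the state and program the hardware. */
static void sketch_set(int nr, const struct sketch_brk *brk)
{
	assert(nr >= 0 && nr < SKETCH_NR_SLOTS);
	memcpy(&sketch_saved[nr], brk, sizeof(*brk));
	sketch_write_hw(nr, brk);
}

/* Clear the hardware without touching the saved state. */
static void sketch_pause(void)
{
	struct sketch_brk empty = {0};

	for (int i = 0; i < SKETCH_NR_SLOTS; i++)
		sketch_write_hw(i, &empty);
}

/* Copy the saved state back into the hardware. */
static void sketch_unpause(void)
{
	for (int i = 0; i < SKETCH_NR_SLOTS; i++)
		sketch_write_hw(i, &sketch_saved[i]);
}
```

Because pausing never writes to the saved array, unpausing needs no
extra bookkeeping to restore the previous breakpoints.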

Signed-off-by: Jordan Niethe 
Signed-off-by: Benjamin Gray 
---
 arch/powerpc/include/asm/debug.h |  2 ++
 arch/powerpc/kernel/process.c| 36 +---
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 86a14736c76c..83f2dc3785e8 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,6 +46,8 @@ static inline int debugger_fault_handler(struct pt_regs *regs) { return 0; }
 #endif
 
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
+void pause_breakpoints(void);
+void unpause_breakpoints(void);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 67da147fe34d..7aee1b30e73c 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -685,6 +685,7 @@ DEFINE_INTERRUPT_HANDLER(do_break)
 
 static DEFINE_PER_CPU(struct arch_hw_breakpoint, current_brk[HBP_NUM_MAX]);
 
+
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 /*
  * Set the debug registers back to their default "safe" values.
@@ -862,10 +863,8 @@ static inline int set_breakpoint_8xx(struct arch_hw_breakpoint *brk)
return 0;
 }
 
-void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+static void set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 {
-   memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
-
if (dawr_enabled())
// Power8 or later
set_dawr(nr, brk);
@@ -879,6 +878,12 @@ void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
WARN_ON_ONCE(1);
 }
 
+void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+{
+   memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
+   set_breakpoint(nr, brk);
+}
+
 /* Check if we have DAWR or DABR hardware */
 bool ppc_breakpoint_available(void)
 {
@@ -891,6 +896,31 @@ bool ppc_breakpoint_available(void)
 }
 EXPORT_SYMBOL_GPL(ppc_breakpoint_available);
 
+/* Disable the breakpoint in hardware without touching current_brk[] */
+void pause_breakpoints(void)
+{
+   struct arch_hw_breakpoint brk = {0};
+   int i;
+
+   if (!ppc_breakpoint_available())
+   return;
+
+   for (i = 0; i < nr_wp_slots(); i++)
+   set_breakpoint(i, &brk);
+}
+
+/* Re-enable the breakpoint in hardware from current_brk[] */
+void unpause_breakpoints(void)
+{
+   int i;
+
+   if (!ppc_breakpoint_available())
+   return;
+
+   for (i = 0; i < nr_wp_slots(); i++)
+   set_breakpoint(i, this_cpu_ptr(&current_brk[i]));
+}
+
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 
 static inline bool tm_enabled(struct task_struct *tsk)
-- 
2.37.3



[PATCH v8 0/6] Use per-CPU temporary mappings for patching

2022-10-20 Thread Benjamin Gray
This is a revision of Chris and Jordan's series to introduce a per-CPU temporary
mm to be used for patching with strict RWX on Radix MMUs.

It is just rebased on powerpc/next. I am aware there are several code patching
patches on the list and can rebase when necessary. For now I figure this'll get
changes requested for a v9 either way.

v8: * Merge the temp mm 'introduction' and usage into one patch.
  x86 split it because their temp MMU swap mechanism may be
  used for other purposes, but ours cannot (it is local to
  code-patching.c).
* Shuffle v7,3/5 cpu_patching_addr usage to the end (v8,4/4)
  after cpu_patching_addr is actually introduced.
* Clearer formatting of the cpuhp_setup_state arguments
* Only allocate patching resources as CPU comes online. Free
  them when CPU goes offline or if an error occurs during allocation.
* Refactored the random address calculation to make the page
  alignment more obvious.
* Manually perform the allocation page walk to avoid taking locks
  (which, given they are not necessary to take, is misleading) and
  prevent memory leaks if page tree allocation fails.
* Cache the pte pointer.
* Stop using the patching mm first, then clear the patching PTE & TLB.
* Only clear the VA with the writable mapping from the TLB. Leaving
  the other TLB entries helps performance, especially when patching
  many times in a row (e.g., ftrace activation).
* Instruction patch verification moved to its own patch, onto the shared
  path with the existing mechanism.
* Detect missing patching_mm and return an error for the caller to
  decide what to do.
* Comment the purposes of each synchronisation, and why it is safe to
  omit some at certain points.

Previous versions:
v7: https://lore.kernel.org/all/2020003717.1150965-1-jniet...@gmail.com/
v6: https://lore.kernel.org/all/20210911022904.30962-1-...@bluescreens.de/
v5: https://lore.kernel.org/all/20210713053113.4632-1-...@linux.ibm.com/
v4: https://lore.kernel.org/all/20210429072057.8870-1-...@bluescreens.de/
v3: https://lore.kernel.org/all/20200827052659.24922-1-...@codefail.de/
v2: https://lore.kernel.org/all/20200709040316.12789-1-...@informatik.wtf/
v1: https://lore.kernel.org/all/20200603051912.23296-1-...@informatik.wtf/
RFC: https://lore.kernel.org/all/20200323045205.20314-1-...@informatik.wtf/
x86: https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.a...@gmail.com/

Benjamin Gray (5):
  powerpc/code-patching: Use WARN_ON and fix check in poking_init
  powerpc/code-patching: Verify instruction patch succeeded
  powerpc/tlb: Add local flush for page given mm_struct and psize
  powerpc/code-patching: Use temporary mm for Radix MMU
  powerpc/code-patching: Use CPU local patch address directly

Jordan Niethe (1):
  powerpc: Allow clearing and restoring registers independent of saved
breakpoint state

 arch/powerpc/include/asm/book3s/32/tlbflush.h |   9 +
 .../include/asm/book3s/64/tlbflush-hash.h |   5 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/include/asm/debug.h  |   2 +
 arch/powerpc/include/asm/nohash/tlbflush.h|   1 +
 arch/powerpc/kernel/process.c |  36 ++-
 arch/powerpc/lib/code-patching.c  | 236 +-
 7 files changed, 284 insertions(+), 13 deletions(-)


base-commit: 8636df94ec917019c4cb744ba0a1f94cf9057790
prerequisite-patch-id: b8387303be6478fdf94264d485d5e08994f305c7
prerequisite-patch-id: 06e54849e6c9e45a9b24668fa12cc0ece3f831a7
prerequisite-patch-id: f4be9e7d613761fba33fb2f7a81839cef36fe0fe
prerequisite-patch-id: 4ea0e36de5c393f9f6ae6243cb21a0ddb364c263
prerequisite-patch-id: 47a1294f0a5d5531ec5c32a761269cb5a1158515
prerequisite-patch-id: d72e371d3d820fdf529f03d2544c7f7f8bb6327a
prerequisite-patch-id: 3024e700433cb6a20dc1e1c6476ea1e98409d8b7
prerequisite-patch-id: f136637f7a8fe92dc4f60b908e2e7aa24aac3f43
--
2.37.3


RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Luck, Tony
>> When we do return to user mode the task is going to be busy servicing
>> a SIGBUS ... so shouldn't try to touch the poison page before the
>> memory_failure() called by the worker thread cleans things up.
>
> What about an RT process on a busy system?
> The worker threads are pretty low priority.

Most tasks don't have a SIGBUS handler ... so they just die without possibility
of accessing poison.

If this task DOES have a SIGBUS handler, and that for some bizarre reason just
does a "return" so the task jumps back to the instruction that caused the COW,
then there is a 63/64 likelihood that it is touching a different cache line
from the poisoned one.

In the 1/64 case ... it's probably a simple store (since there was a COW, we
know it was trying to modify the page) ... so won't generate another machine
check (those only happen for reads).

But maybe it is some RMW instruction ... then, if all the above options didn't
happen ... we could get another machine check from the same address. But then
we just follow the usual recovery path.
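
The 63/64 figure presumably assumes 4096-byte pages and 64-byte cache
lines; the message does not state the sizes, so treat them as an
assumption. Under that assumption the count can be sketched as follows,
with hypothetical helper names:

```c
#include <assert.h>
#include <stdbool.h>

/* Number of cache lines that make up one page. */
static unsigned int sketch_lines_per_page(unsigned int page_size,
					  unsigned int line_size)
{
	assert(line_size > 0 && page_size % line_size == 0);
	return page_size / line_size;
}

/* True when two byte offsets within a page share a cache line. */
static bool sketch_same_line(unsigned int a, unsigned int b,
			     unsigned int line_size)
{
	assert(line_size > 0);
	return a / line_size == b / line_size;
}
```

With 64 lines per page, a write to a uniformly chosen offset lands in
the poisoned line 1 time in 64 and in some other line 63 times in 64.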

-Tony


RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread David Laight
From: Tony Luck
> Sent: 21 October 2022 05:08

> When we do return to user mode the task is going to be busy servicing
> a SIGBUS ... so shouldn't try to touch the poison page before the
> memory_failure() called by the worker thread cleans things up.

What about an RT process on a busy system?
The worker threads are pretty low priority.

David



Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Tony Luck
On Fri, Oct 21, 2022 at 09:52:01AM +0800, Shuai Xue wrote:
> 
> 
> > On 2022/10/21 4:05 AM, Tony Luck wrote:
> > On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
> >>
> >>
> >> On 2022/10/20 1:08 AM, Tony Luck wrote:

> > I'm experimenting with using schedule_work() to handle the call to
> > memory_failure() (echoing what the machine check handler does using
> > task_work_add() to avoid the same problem of not being able to directly
> > call memory_failure()).
> 
> Work queues permit work to be deferred outside of the interrupt context
> into the kernel process context. If we return to user-space before the
> queued memory_failure() work is processed, we will take the fault again,
> as we discussed recently.
> 
> commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for 
> synchronous errors
> commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to 
> avoid memory leak
> 
> So, in my opinion, we should add memory failure as a task work, like
> do_machine_check does, e.g.
> 
> queue_task_work(, msg, kill_me_maybe);

Maybe ... but this case isn't heading back to a user instruction
that is trying to READ the poison memory address. The task is just
trying to WRITE to any address within the page.

So this is much more like a patrol scrub error found asynchronously
by the memory controller (in this case found asynchronously by the
Linux page copy function).  So I don't feel that it's really the
responsibility of the current task.

When we do return to user mode the task is going to be busy servicing
a SIGBUS ... so shouldn't try to touch the poison page before the
memory_failure() called by the worker thread cleans things up.

> > +   INIT_WORK(&p->work, do_sched_memory_failure);
> > +   p->pfn = pfn;
> > +   schedule_work(&p->work);
> > +}
> 
> I think there is already a function to do such work in mm/memory-failure.c.
> 
>   void memory_failure_queue(unsigned long pfn, int flags)

Also pointed out by Miaohe Lin ... this does
exactly what I want, and is working well in tests so far. So perhaps
a cleaner solution than making the kill_me_maybe() function globally
visible.

-Tony


RE: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Luck, Tony
>> +INIT_WORK(&p->work, do_sched_memory_failure);
>> +p->pfn = pfn;
>> +schedule_work(&p->work);
>
> There is already memory_failure_queue() that can do this. Can we use it 
> directly?

Miaohe Lin,

Yes, can use that. A thousand thanks for pointing it out. I just tried it, and
it works perfectly.

I think I'll need to add an empty stub version for the CONFIG_MEMORY_FAILURE=n
build. But that's trivial.

-Tony



Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue



On 2022/10/21 4:05 AM, Tony Luck wrote:
> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>
>>
>> On 2022/10/20 1:08 AM, Tony Luck wrote:
>>> If the kernel is copying a page as the result of a copy-on-write
>>> fault and runs into an uncorrectable error, Linux will crash because
>>> it does not have recovery code for this case where poison is consumed
>>> by the kernel.
>>>
>>> It is easy to set up a test case. Just inject an error into a private
>>> page, fork(2), and have the child process write to the page.
>>>
>>> I wrapped that neatly into a test at:
>>>
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>>>
>>> just enable ACPI error injection and run:
>>>
>>>   # ./einj_mem-uc -f copy-on-write
>>>
>>> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
>>> on architectures where that is available (currently x86 and powerpc).
>>> When an error is detected during the page copy, return VM_FAULT_HWPOISON
>>> to caller of wp_page_copy(). This propagates up the call stack. Both x86
>>> and powerpc have code in their fault handler to deal with this code by
>>> sending a SIGBUS to the application.
>>
>> Does it send SIGBUS to only child process or both parent and child process?
> 
> This only sends a SIGBUS to the process that wrote the page (typically
> the child, but also possible that the parent is the one that does the
> write that causes the COW).


Thanks for your explanation.

> 
>>>
>>> Note that this patch avoids a system crash and signals the process that
>>> triggered the copy-on-write action. It does not take any action for the
>>> memory error that is still in the shared page. To handle that a call to
>>> memory_failure() is needed. 
>>
>> If the error page is not poisoned, should the return value of wp_page_copy
>> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or
>> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller.
>> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS.
> 
> The page has uncorrected data in it, but this patch doesn't mark it
> as poisoned.  Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS
> that doesn't include the BUS_MCEERR_AR and "lsb" information. It would
> also skip the:
> 
>   "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n"
> 
> console message. So might result in confusion and attempts to debug a
> s/w problem with the application instead of blaming the death on a bad
> DIMM.

I see your point. Thank you.

> 
>>> But this cannot be done from wp_page_copy()
>>> because it holds mmap_lock(). Perhaps the architecture fault handlers
>>> can deal with this loose end in a subsequent patch?
> 
> I started looking at this for x86 ... but I have changed my mind
> about this being a good place for a fix. When control returns back
> to the architecture fault handler it no longer has easy access to
> the physical page frame number. It has the virtual address, so it
> could descend back into some new mm/memory.c function to get the
> physical address ... but that seems silly.
> 
> I'm experimenting with using schedule_work() to handle the call to
> memory_failure() (echoing what the machine check handler does using
> task_work_add() to avoid the same problem of not being able to directly
> call memory_failure()).

Work queues permit work to be deferred outside of the interrupt context
into the kernel process context. If we return to user-space before the
queued memory_failure() work is processed, we will take the fault again,
as we discussed recently.

commit 7f17b4a121d0d ACPI: APEI: Kick the memory_failure() queue for synchronous errors
commit 415fed694fe11 ACPI: APEI: do not add task_work to kernel thread to avoid memory leak

So, in my opinion, we should add memory failure as a task work, like
do_machine_check does, e.g.

queue_task_work(, msg, kill_me_maybe);

> 
> So far it seems to be working. Patch below (goes on top of original
> patch ... well on top of the internal version with mods based on
> feedback from Dan Williams ... but should show the general idea)
> 
> With this patch applied the page does get unmapped from all users.
> Other tasks that shared the page will get a SIGBUS if they attempt
> to access it later (from the page fault handler because of
> is_hwpoison_entry(), as you mention above).
> 
> -Tony
> 
> From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001
> From: Tony Luck 
> Date: Thu, 20 Oct 2022 09:57:28 -0700
> Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW
>  failure
> 
> Cannot call memory_failure() directly from the fault handler because
> mmap_lock (and others) are held.
> 
> It is important, but not urgent, to mark the source page as h/w poisoned
> and unmap it from other tasks.
> 
> Use schedule_work() to queue a request to call memory_failure() for the
> page with the error.
> 
> Signed-off-by: Tony Luck 
> ---
>  mm/memory.c | 35 

Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Miaohe Lin
On 2022/10/21 4:05, Tony Luck wrote:
> On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
>>
>>
>> On 2022/10/20 1:08 AM, Tony Luck wrote:
>>> If the kernel is copying a page as the result of a copy-on-write
>>> fault and runs into an uncorrectable error, Linux will crash because
>>> it does not have recovery code for this case where poison is consumed
>>> by the kernel.
>>>
>>> It is easy to set up a test case. Just inject an error into a private
>>> page, fork(2), and have the child process write to the page.
>>>
>>> I wrapped that neatly into a test at:
>>>
>>>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
>>>
>>> just enable ACPI error injection and run:
>>>
>>>   # ./einj_mem-uc -f copy-on-write
>>>
>>> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
>>> on architectures where that is available (currently x86 and powerpc).
>>> When an error is detected during the page copy, return VM_FAULT_HWPOISON
>>> to caller of wp_page_copy(). This propagates up the call stack. Both x86
>>> and powerpc have code in their fault handler to deal with this code by
>>> sending a SIGBUS to the application.
>>
>> Does it send SIGBUS to only child process or both parent and child process?
> 
> This only sends a SIGBUS to the process that wrote the page (typically
> the child, but also possible that the parent is the one that does the
> write that causes the COW).
> 
>>>
>>> Note that this patch avoids a system crash and signals the process that
>>> triggered the copy-on-write action. It does not take any action for the
>>> memory error that is still in the shared page. To handle that a call to
>>> memory_failure() is needed. 
>>
>> If the error page is not poisoned, should the return value of wp_page_copy
>> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or
>> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller.
>> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS.
> 
> The page has uncorrected data in it, but this patch doesn't mark it
> as poisoned.  Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS
> that doesn't include the BUS_MCEERR_AR and "lsb" information. It would
> also skip the:
> 
>   "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n"
> 
> console message. So might result in confusion and attempts to debug a
> s/w problem with the application instead of blaming the death on a bad
> DIMM.
> 
>>> But this cannot be done from wp_page_copy()
>>> because it holds mmap_lock(). Perhaps the architecture fault handlers
>>> can deal with this loose end in a subsequent patch?
> 
> I started looking at this for x86 ... but I have changed my mind
> about this being a good place for a fix. When control returns back
> to the architecture fault handler it no longer has easy access to
> the physical page frame number. It has the virtual address, so it
> could descend back into some new mm/memory.c function to get the
> physical address ... but that seems silly.
> 
> I'm experimenting with using schedule_work() to handle the call to
> memory_failure() (echoing what the machine check handler does using
> task_work_add() to avoid the same problem of not being able to directly
> call memory_failure()).
> 
> So far it seems to be working. Patch below (goes on top of original
> patch ... well on top of the internal version with mods based on
> feedback from Dan Williams ... but should show the general idea)
> 
> With this patch applied the page does get unmapped from all users.
> Other tasks that shared the page will get a SIGBUS if they attempt
> to access it later (from the page fault handler because of
> is_hwpoison_entry() as you mention above).
> 
> -Tony
> 
> From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001
> From: Tony Luck 
> Date: Thu, 20 Oct 2022 09:57:28 -0700
> Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW
>  failure
> 
> Cannot call memory_failure() directly from the fault handler because
> mmap_lock (and others) are held.
> 
> It is important, but not urgent, to mark the source page as h/w poisoned
> and unmap it from other tasks.
> 
> Use schedule_work() to queue a request to call memory_failure() for the
> page with the error.
> 
> Signed-off-by: Tony Luck 
> ---
>  mm/memory.c | 35 ++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b6056eef2f72..4a1304cf1f4e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
>   return same;
>  }
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +struct pfn_work {
> + struct work_struct work;
> + unsigned long pfn;
> +};
> +
> +static void do_sched_memory_failure(struct work_struct *w)
> +{
> + struct pfn_work *p = container_of(w, struct pfn_work, work);
> +
> + memory_failure(p->pfn, 0);
> + kfree(p);
> +}
> 

Re: [kbuild-all] Re: [PATCH] PCI: Remove unnecessary of_irq.h includes

2022-10-20 Thread Chen, Rong A




On 10/20/2022 11:07 PM, Bjorn Helgaas wrote:

On Thu, Oct 20, 2022 at 10:13:10PM +0800, kernel test robot wrote:

Hi Bjorn,

I love your patch! Yet something to improve:

[auto build test ERROR on helgaas-pci/next]
[also build test ERROR on xilinx-xlnx/master rockchip/for-next linus/master 
v6.1-rc1 next-20221020]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
patch link:
https://lore.kernel.org/r/20221019195452.37606-1-helgaas%40kernel.org
patch subject: [PATCH] PCI: Remove unnecessary of_irq.h includes
config: s390-randconfig-r044-20221019
compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 
791a7ae1ba3efd6bca96338e10ffde557ba83920)
reproduce (this is a W=1 build):
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 # install s390 cross compiling tool for clang build
 # apt-get install binutils-s390x-linux-gnu
 # 
https://github.com/intel-lab-lkp/linux/commit/273a24b16a40ffd6a64c6c55aecbfae00a1cd996
 git remote add linux-review https://github.com/intel-lab-lkp/linux
 git fetch --no-tags linux-review 
Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
 git checkout 273a24b16a40ffd6a64c6c55aecbfae00a1cd996
 # save the config file
 mkdir build_dir && cp config build_dir/.config
 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 
O=build_dir ARCH=s390 SHELL=/bin/bash drivers/pci/controller/


Maybe more user error?

   $ COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir 
ARCH=s390 SHELL=/bin/bash drivers/pci/controller/
   Compiler will be installed in /home/bjorn/0day
   make --keep-going HOSTCC=/home/bjorn/0day/clang/bin/clang 
CC=/home/bjorn/0day/clang/bin/clang OBJCOPY=/usr/s390x-linux-gnu/bin/objcopy 
AR=llvm-ar NM=llvm-nm STRIP=llvm-strip OBJDUMP=llvm-objdump OBJSIZE=llvm-size 
READELF=llvm-readelf HOSTCXX=clang++ HOSTAR=llvm-ar 
CROSS_COMPILE=s390x-linux-gnu- --jobs=16 W=1 O=build_dir ARCH=s390 
SHELL=/bin/bash drivers/pci/controller/
   make[1]: Entering directory '/home/bjorn/linux/build_dir'
     SYNC    include/config/auto.conf.cmd
     GEN     Makefile
   scripts/Kconfig.include:40: linker 's390x-linux-gnu-ld' not found



Hi Bjorn,

You may need to install the package below, or a similar package on other OSes:

$ dpkg -S /usr/bin/s390x-linux-gnu-ld
binutils-s390x-linux-gnu: /usr/bin/s390x-linux-gnu-ld

>>  # install s390 cross compiling tool for clang build
>>  # apt-get install binutils-s390x-linux-gnu

Best Regards,
Rong Chen


   make[3]: *** [../scripts/kconfig/Makefile:77: syncconfig] Error 1
   make[2]: *** [../Makefile:697: syncconfig] Error 2
   make[1]: *** [/home/bjorn/linux/Makefile:798: include/config/auto.conf.cmd] 
Error 2
   make[1]: Failed to remake makefile 'include/config/auto.conf.cmd'.
   make[1]: Failed to remake makefile 'include/config/auto.conf'.
 GEN Makefile
   Error: kernelrelease not valid - run 'make prepare' to update it
   ../scripts/mkcompile_h: 19: s390x-linux-gnu-ld: not found
   make[1]: Target 'drivers/pci/controller/' not remade because of errors.
   make[1]: Leaving directory '/home/bjorn/linux/build_dir'
   make: *** [Makefile:231: __sub-make] Error 2
   make: Target 'drivers/pci/controller/' not remade because of errors.
___
kbuild-all mailing list -- kbuild-...@lists.01.org
To unsubscribe send an email to kbuild-all-le...@lists.01.org



Re: [kbuild-all] Re: [PATCH] PCI: Remove unnecessary of_irq.h includes

2022-10-20 Thread Chen, Rong A




On 10/20/2022 9:41 PM, Bjorn Helgaas wrote:

On Thu, Oct 20, 2022 at 04:09:37PM +0800, kernel test robot wrote:

Hi Bjorn,

I love your patch! Yet something to improve:

[auto build test ERROR on helgaas-pci/next]
[also build test ERROR on xilinx-xlnx/master rockchip/for-next linus/master 
v6.1-rc1 next-20221020]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
patch link:
https://lore.kernel.org/r/20221019195452.37606-1-helgaas%40kernel.org
patch subject: [PATCH] PCI: Remove unnecessary of_irq.h includes
config: ia64-randconfig-r026-20221020
compiler: ia64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 # 
https://github.com/intel-lab-lkp/linux/commit/273a24b16a40ffd6a64c6c55aecbfae00a1cd996
 git remote add linux-review https://github.com/intel-lab-lkp/linux
 git fetch --no-tags linux-review 
Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
 git checkout 273a24b16a40ffd6a64c6c55aecbfae00a1cd996
 # save the config file
 mkdir build_dir && cp config build_dir/.config
 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 
O=build_dir ARCH=ia64 SHELL=/bin/bash drivers/pci/controller/


FYI, the instructions above didn't work for me.  Missing "config".

   $ git remote add linux-review https://github.com/intel-lab-lkp/linux
   $ git fetch --no-tags linux-review 
Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
   $ git checkout 273a24b16a40ffd6a64c6c55aecbfae00a1cd996
   HEAD is now at 273a24b16a40 PCI: Remove unnecessary of_irq.h includes
   $ mkdir build_dir && cp config build_dir/.config
   cp: cannot stat 'config': No such file or directory

   $ COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 
O=build_dir ARCH=ia64 SHELL=/bin/bash drivers/pci/controller/
   Compiler will be installed in /home/bjorn/0day
   Cannot find ia64-linux under https://download.01.org/0day-ci/cross-package 
check /tmp/0day-ci-crosstool-files


Hi Bjorn,

Sorry for the inconvenience; the 01.org website has been unstable recently.
Could you try
"URL=https://cdn.kernel.org/pub/tools/crosstool/files/bin/x86_64 
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 
O=build_dir ARCH=ia64 SHELL=/bin/bash drivers/pci/controller/"?


Best Regards,
Rong Chen


   Please set new url, e.g. export 
URL=https://cdn.kernel.org/pub/tools/crosstool/files/bin/x86_64
   gcc crosstool install failed
   Install gcc cross compiler failed
   setup_crosstool failed



Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Tony Luck
On Thu, Oct 20, 2022 at 09:57:04AM +0800, Shuai Xue wrote:
> 
> 
> On 2022/10/20 1:08 AM, Tony Luck wrote:
> > If the kernel is copying a page as the result of a copy-on-write
> > fault and runs into an uncorrectable error, Linux will crash because
> > it does not have recovery code for this case where poison is consumed
> > by the kernel.
> > 
> > It is easy to set up a test case. Just inject an error into a private
> > page, fork(2), and have the child process write to the page.
> > 
> > I wrapped that neatly into a test at:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> > 
> > just enable ACPI error injection and run:
> > 
> >   # ./einj_mem-uc -f copy-on-write
> > 
> > Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> > on architectures where that is available (currently x86 and powerpc).
> > When an error is detected during the page copy, return VM_FAULT_HWPOISON
> > to caller of wp_page_copy(). This propagates up the call stack. Both x86
> > and powerpc have code in their fault handler to deal with this code by
> > sending a SIGBUS to the application.
> 
> Does it send SIGBUS to only child process or both parent and child process?

This only sends a SIGBUS to the process that wrote the page (typically
the child, but also possible that the parent is the one that does the
write that causes the COW).

> > 
> > Note that this patch avoids a system crash and signals the process that
> > triggered the copy-on-write action. It does not take any action for the
> > memory error that is still in the shared page. To handle that a call to
> > memory_failure() is needed. 
> 
> If the error page is not poisoned, should the return value of wp_page_copy
> be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or
> PageHWPoison(page) is true, do_swap_page return VM_FAULT_HWPOISON to caller.
> And when is_swapin_error_entry is true, do_swap_page return VM_FAULT_SIGBUS.

The page has uncorrected data in it, but this patch doesn't mark it
as poisoned.  Returning VM_FAULT_SIGBUS would send an "ordinary" SIGBUS
that doesn't include the BUS_MCEERR_AR and "lsb" information. It would
also skip the:

"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n"

console message. So it might result in confusion and attempts to debug a
s/w problem with the application instead of blaming the death on a bad
DIMM.
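
As a rough userspace illustration of the distinction above, here is a
sketch of how a process could tell a memory-error SIGBUS from an ordinary
one. The helper names (sigbus_is_hw_error, sigbus_granule,
describe_sigbus) are hypothetical and not part of the patch; the si_code
values and the si_addr_lsb field are the ones documented in sigaction(2)
for Linux. The fallback #defines assume the Linux values 4 and 5 in case
the libc headers do not provide them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdio.h>

/* Linux si_code values for SIGBUS; fallbacks are an assumption for old headers. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical helper: non-zero if this SIGBUS reports a hardware memory error. */
static int sigbus_is_hw_error(int si_code)
{
	return si_code == BUS_MCEERR_AR || si_code == BUS_MCEERR_AO;
}

/* Hypothetical helper: size in bytes of the affected region, from si_addr_lsb. */
static unsigned long sigbus_granule(int lsb)
{
	assert(lsb >= 0 && lsb < (int)(8 * sizeof(unsigned long)));
	return 1UL << lsb;
}

/* Hypothetical helper: format a one-line description of a SIGBUS siginfo. */
static int describe_sigbus(const siginfo_t *info, char *buf, size_t len)
{
	if (sigbus_is_hw_error(info->si_code))
		return snprintf(buf, len, "memory error at %p, %lu bytes affected",
				info->si_addr, sigbus_granule(info->si_addr_lsb));
	return snprintf(buf, len, "ordinary SIGBUS at %p", info->si_addr);
}
```

A handler installed with SA_SIGINFO could pass its siginfo_t to
describe_sigbus(). With a plain VM_FAULT_SIGBUS the si_code would not be
one of the BUS_MCEERR_* values, so the application would only see the
"ordinary" branch.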

> > But this cannot be done from wp_page_copy()
> > because it holds mmap_lock(). Perhaps the architecture fault handlers
> > can deal with this loose end in a subsequent patch?

I started looking at this for x86 ... but I have changed my mind
about this being a good place for a fix. When control returns back
to the architecture fault handler it no longer has easy access to
the physical page frame number. It has the virtual address, so it
could descend back into some new mm/memory.c function to get the
physical address ... but that seems silly.

I'm experimenting with using schedule_work() to handle the call to
memory_failure() (echoing what the machine check handler does using
task_work_add() to avoid the same problem of not being able to directly
call memory_failure()).

So far it seems to be working. Patch below (goes on top of original
patch ... well on top of the internal version with mods based on
feedback from Dan Williams ... but should show the general idea)

With this patch applied the page does get unmapped from all users.
Other tasks that shared the page will get a SIGBUS if they attempt
to access it later (from the page fault handler because of
is_hwpoison_entry() as you mention above).

-Tony
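
The deferral idea described above (record the pfn now, do the heavy
work later from a context that holds no locks) can be modelled in plain
userspace C. This is only a toy sketch: struct work, queue_work_item(),
run_pending(), defer_pfn() and last_handled_pfn are hypothetical
stand-ins, not kernel APIs, and nothing here is taken from the patch
beyond the container_of() pattern and the pfn_work layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Toy work item: a callback plus a link in a pending list. */
struct work {
	void (*fn)(struct work *w);
	struct work *next;
};

static struct work *pending;	/* queued but not yet executed items */

static void queue_work_item(struct work *w)
{
	assert(w->fn != NULL);
	w->next = pending;
	pending = w;
}

/* Run everything queued so far; returns how many items ran. */
static int run_pending(void)
{
	int n = 0;

	while (pending) {
		struct work *w = pending;

		pending = w->next;
		w->fn(w);
		n++;
	}
	return n;
}

/* Same shape as the patch: generic work item embedded next to the payload. */
struct pfn_work {
	struct work work;
	unsigned long pfn;
};

static unsigned long last_handled_pfn;	/* stands in for memory_failure()'s effect */

static void handle_pfn(struct work *w)
{
	struct pfn_work *p = container_of(w, struct pfn_work, work);

	last_handled_pfn = p->pfn;
	free(p);
}

/* Called from the "fault path": only allocates and queues, never handles. */
static int defer_pfn(unsigned long pfn)
{
	struct pfn_work *p = malloc(sizeof *p);

	if (!p)
		return -1;
	p->pfn = pfn;
	p->work.fn = handle_pfn;
	queue_work_item(&p->work);
	return 0;
}
```

The point of the embedding is that the queue only ever sees a struct
work pointer, while the callback recovers the enclosing pfn_work (and
so the pfn) with container_of() and frees it once it is done.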

From d3879e83bf91cd6c61e12d32d3e15eb6ef069204 Mon Sep 17 00:00:00 2001
From: Tony Luck 
Date: Thu, 20 Oct 2022 09:57:28 -0700
Subject: [PATCH] mm, hwpoison: Call memory_failure() for source page of COW
 failure

Cannot call memory_failure() directly from the fault handler because
mmap_lock (and others) are held.

It is important, but not urgent, to mark the source page as h/w poisoned
and unmap it from other tasks.

Use schedule_work() to queue a request to call memory_failure() for the
page with the error.

Signed-off-by: Tony Luck 
---
 mm/memory.c | 35 ++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index b6056eef2f72..4a1304cf1f4e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2848,6 +2848,37 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
return same;
 }
 
+#ifdef CONFIG_MEMORY_FAILURE
+struct pfn_work {
+   struct work_struct work;
+   unsigned long pfn;
+};
+
+static void do_sched_memory_failure(struct work_struct *w)
+{
+   struct pfn_work *p = container_of(w, struct pfn_work, work);
+
+   memory_failure(p->pfn, 0);
+   kfree(p);
+}
+
+static void sched_memory_failure(unsigned long pfn)
+{
+   struct pfn_work *p;
+
+   p = kmalloc(sizeof *p, GFP_KERNEL);
+   if (!p)
+   return;
+   

[PATCH 3/5] powerpc/kprobes: Use preempt_enable() rather than the no_resched variant

2022-10-20 Thread Naveen N. Rao
preempt_enable_no_resched() is just the same as preempt_enable() when we
are in an irqs-disabled context. kprobe_handler() and the post/fault
handlers are all called with irqs disabled. As such, convert those to
just use preempt_enable().

Reported-by: Nicholas Piggin 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 88f42de681e1f8..86ca5a61ea9afb 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -369,7 +369,7 @@ int kprobe_handler(struct pt_regs *regs)
 
if (ret > 0) {
restore_previous_kprobe(kcb);
-   preempt_enable_no_resched();
+   preempt_enable();
return 1;
}
}
@@ -382,7 +382,7 @@ int kprobe_handler(struct pt_regs *regs)
if (p->pre_handler && p->pre_handler(p, regs)) {
/* handler changed execution path, so skip ss setup */
reset_current_kprobe();
-   preempt_enable_no_resched();
+   preempt_enable();
return 1;
}
 
@@ -395,7 +395,7 @@ int kprobe_handler(struct pt_regs *regs)
 
kcb->kprobe_status = KPROBE_HIT_SSDONE;
reset_current_kprobe();
-   preempt_enable_no_resched();
+   preempt_enable();
return 1;
}
}
@@ -404,7 +404,7 @@ int kprobe_handler(struct pt_regs *regs)
return 1;
 
 no_kprobe:
-   preempt_enable_no_resched();
+   preempt_enable();
return ret;
 }
 NOKPROBE_SYMBOL(kprobe_handler);
@@ -490,7 +490,7 @@ int kprobe_post_handler(struct pt_regs *regs)
}
reset_current_kprobe();
 out:
-   preempt_enable_no_resched();
+   preempt_enable();
 
/*
 * if somebody else is singlestepping across a probe point, msr
@@ -529,7 +529,7 @@ int kprobe_fault_handler(struct pt_regs *regs, int trapnr)
restore_previous_kprobe(kcb);
else
reset_current_kprobe();
-   preempt_enable_no_resched();
+   preempt_enable();
break;
case KPROBE_HIT_ACTIVE:
case KPROBE_HIT_SSDONE:
-- 
2.38.0
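
The reasoning in the commit message of patch 3/5 can be illustrated with
a toy model. This is a simplified sketch, not kernel code: struct
cpu_model and the model_* functions are hypothetical, and the model
assumes that preempt_enable() only reschedules when the preempt count
drops to zero, a reschedule is pending, and interrupts are enabled.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the per-CPU state that preempt_enable() consults. */
struct cpu_model {
	int preempt_count;	/* nesting depth of preempt_disable() */
	bool irqs_disabled;
	bool need_resched;
	int reschedules;	/* how many times the model "scheduled" */
};

/* Preemption is possible only with a zero count and irqs enabled. */
static bool model_preemptible(const struct cpu_model *c)
{
	return c->preempt_count == 0 && !c->irqs_disabled;
}

/* Drops the count and never checks for a pending reschedule. */
static void model_preempt_enable_no_resched(struct cpu_model *c)
{
	assert(c->preempt_count > 0);
	c->preempt_count--;
}

/* Drops the count and reschedules if that is now allowed and needed. */
static void model_preempt_enable(struct cpu_model *c)
{
	assert(c->preempt_count > 0);
	c->preempt_count--;
	if (c->need_resched && model_preemptible(c))
		c->reschedules++;
}
```

In this model the two variants leave identical state whenever
irqs_disabled is true, which is the situation the commit message
describes for kprobe_handler() and the post/fault handlers. They differ
only when interrupts are enabled and a reschedule is pending.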



[PATCH 5/5] powerpc/kprobes: Remove unnecessary headers from kprobes

2022-10-20 Thread Naveen N. Rao
Many of these headers are not necessary since those are included
indirectly, or the code using those headers has been removed.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes-ftrace.c | 4 
 arch/powerpc/kernel/kprobes.c| 2 --
 2 files changed, 6 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes-ftrace.c 
b/arch/powerpc/kernel/kprobes-ftrace.c
index 072ebe7f290ba7..08ed8a158fd724 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -7,10 +7,6 @@
  *   IBM Corporation
  */
 #include 
-#include 
-#include 
-#include 
-#include 
 
 /* Ftrace callback handler for kprobes */
 void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 86ca5a61ea9afb..3bf2507f07e6c6 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -14,8 +14,6 @@
  */
 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
-- 
2.38.0



[PATCH 4/5] powerpc/kprobes: Setup consistent pt_regs across kprobes, optprobes and KPROBES_ON_FTRACE

2022-10-20 Thread Naveen N. Rao
Ensure a more consistent pt_regs across kprobes, optprobes and
KPROBES_ON_FTRACE:
- Drop setting trap to 0x700 under optprobes. This is not accurate and
  is unnecessary. Instead, zero it out for both optprobes and
  KPROBES_ON_FTRACE.
- Save irq soft mask in the ftrace handler, similar to what we do in
  optprobes and trap-based kprobes.
- Drop setting orig_gpr3 and result to zero in optprobes. These are not
  relevant under kprobes and should not be used by the handlers.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/optprobes_head.S| 5 +
 arch/powerpc/kernel/trace/ftrace_mprofile.S | 6 ++
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/optprobes_head.S 
b/arch/powerpc/kernel/optprobes_head.S
index cd4e7bc32609d3..06df09b4e8b155 100644
--- a/arch/powerpc/kernel/optprobes_head.S
+++ b/arch/powerpc/kernel/optprobes_head.S
@@ -49,11 +49,8 @@ optprobe_template_entry:
/* Save SPRS */
mfmsr   r5
PPC_STL r5,_MSR(r1)
-   li  r5,0x700
-   PPC_STL r5,_TRAP(r1)
li  r5,0
-   PPC_STL r5,ORIG_GPR3(r1)
-   PPC_STL r5,RESULT(r1)
+   PPC_STL r5,_TRAP(r1)
mfctr   r5
PPC_STL r5,_CTR(r1)
mflr    r5
diff --git a/arch/powerpc/kernel/trace/ftrace_mprofile.S 
b/arch/powerpc/kernel/trace/ftrace_mprofile.S
index d031093bc43671..f82004089426e6 100644
--- a/arch/powerpc/kernel/trace/ftrace_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_mprofile.S
@@ -107,6 +107,12 @@
PPC_STL r9, _CTR(r1)
PPC_STL r10, _XER(r1)
PPC_STL r11, _CCR(r1)
+#ifdef CONFIG_PPC64
+   lbz r7, PACAIRQSOFTMASK(r13)
+   std r7, SOFTE(r1)
+#endif
+   li  r8, 0
+   PPC_STL r8, _TRAP(r1)
.endif
 
/* Load _regs in r6 for call below */
-- 
2.38.0



[PATCH 2/5] powerpc/kprobes: Have optimized_callback() use preempt_enable()

2022-10-20 Thread Naveen N. Rao
Similar to x86 commit 2e62024c265aa6 ("kprobes/x86: Use preempt_enable()
in optimized_callback()"), change powerpc optprobes to use
preempt_enable() rather than preempt_enable_no_resched() since powerpc
also removed irq disabling for optprobes in commit f72180cc93a2c6
("powerpc/kprobes: Do not disable interrupts for optprobes and
kprobes_on_ftrace").

Reported-by: Nicholas Piggin 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/optprobes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index 3b1c2236cbee57..004fae2044a3e0 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -112,7 +112,7 @@ static void optimized_callback(struct optimized_kprobe *op,
__this_cpu_write(current_kprobe, NULL);
}
 
-   preempt_enable_no_resched();
+   preempt_enable();
 }
 NOKPROBE_SYMBOL(optimized_callback);
 
-- 
2.38.0



[PATCH 1/5] powerpc/kprobes: Remove preempt disable around call to get_kprobe() in arch_prepare_kprobe()

2022-10-20 Thread Naveen N. Rao
arch_prepare_kprobe() is called from register_kprobe() via
prepare_kprobe(), or through register_aggr_kprobe(), both with the
kprobe_mutex held. Per the comment for get_kprobe():
  /*
   * This routine is called either:
   *   - under the 'kprobe_mutex' - during kprobe_[un]register().
   *                     OR
   *   - with preemption disabled - from architecture specific code.
   */

As such, there is no need to disable preemption around the call to
get_kprobe(). Drop the same.

Reported-by: Nicholas Piggin 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/kprobes.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index bd7b1a03545948..88f42de681e1f8 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -158,9 +158,7 @@ int arch_prepare_kprobe(struct kprobe *p)
printk("Cannot register a kprobe on the second word of prefixed 
instruction\n");
ret = -EINVAL;
}
-   preempt_disable();
prev = get_kprobe(p->addr - 1);
-   preempt_enable_no_resched();
 
/*
 * When prev is a ftrace-based kprobe, we don't have an insn, and it
-- 
2.38.0



[PATCH 0/5] powerpc/kprobes: preempt related changes and cleanups

2022-10-20 Thread Naveen N. Rao
This series attempts to address some of the concerns raised in 
https://github.com/linuxppc/issues/issues/440

The last two patches are minor cleanups in related kprobes code.

- Naveen


Naveen N. Rao (5):
  powerpc/kprobes: Remove preempt disable around call to get_kprobe() in
arch_prepare_kprobe()
  powerpc/kprobes: Have optimized_callback() use preempt_enable()
  powerpc/kprobes: Use preempt_enable() rather than the no_resched
variant
  powerpc/kprobes: Setup consistent pt_regs across kprobes, optprobes
and KPROBES_ON_FTRACE
  powerpc/kprobes: Remove unnecessary headers from kprobes

 arch/powerpc/kernel/kprobes-ftrace.c|  4 
 arch/powerpc/kernel/kprobes.c   | 16 ++--
 arch/powerpc/kernel/optprobes.c |  2 +-
 arch/powerpc/kernel/optprobes_head.S|  5 +
 arch/powerpc/kernel/trace/ftrace_mprofile.S |  6 ++
 5 files changed, 14 insertions(+), 19 deletions(-)


base-commit: 7dc2a00fdd44a4d0c3bac9fd10558b3933586a0c
-- 
2.38.0



Re: [PATCH] PCI: Remove unnecessary of_irq.h includes

2022-10-20 Thread Pali Rohár
On Thursday 20 October 2022 08:45:47 Bjorn Helgaas wrote:
> [+cc Pali, heads-up for trivial addition of  to
> pci-mvebu.c]
...
> pci-mvebu.c also relies on getting  via
> , but it actually depends on of_irq.h, so I'll just
> add an irqdomain.h include there.
> 
> Bjorn
> 

Ok, that is fine!


Re: [PATCH v7 0/8] phy: Add support for Lynx 10G SerDes

2022-10-20 Thread Sean Anderson



On 10/19/22 10:44 PM, Bagas Sanjaya wrote:
> On 10/19/22 06:11, Sean Anderson wrote:
>> This adds support for the Lynx 10G SerDes found on the QorIQ T-series
>> and Layerscape series. Due to limited time and hardware, only support
>> for the LS1046ARDB is added in this initial series. There is a sketch
>> for LS1088ARDB support, but it is incomplete.
>> 
>> Dynamic reconfiguration does not work. That is, the configuration must
>> match what is set in the RCW. From my testing, SerDes register settings
>> appear identical. The issue appears to be between the PCS and the MAC.
>> The link itself comes up at both ends, and a mac loopback succeeds.
>> However, a PCS loopback results in dropped packets. Perhaps there is
>> some undocumented register in the PCS?
>> 
>> I suspect this driver is around 95% complete, but, unfortunately, I no
>> longer have time to investigate this further. At the very least it is
>> useful for two cases:
>> - Although this is untested, it should support 2.5G SGMII as well as
>>   1000BASE-KX. The latter needs MAC and PCS support, but the former
>>   should work out of the box.
>> - It allows for clock configurations not supported by the RCW. This is
>>   very useful if you want to use e.g. SRDS_PRTCL_S1=0x and =0x1133
>>   on the same board. This is because the former setting will use PLL1
>>   as the 1G reference, but the latter will use PLL1 as the 10G
>>   reference. Because we can reconfigure the PLLs, it is possible to
>>   always use PLL1 as the 1G reference.
>> 
>> Changes in v7:
>> - Use double quotes everywhere in yaml
>> - Break out call order into generic documentation
>> - Refuse to switch "major" protocols
>> - Update Kconfig to reflect restrictions
>> - Remove set/clear of "pcs reset" bit, since it doesn't seem to fix
>>   anything.
>> 
>> Changes in v6:
>> - Bump PHY_TYPE_2500BASEX to 13, since PHY_TYPE_USXGMII was added in the
>>   meantime
>> - fsl,type -> phy-type
>> - frequence -> frequency
>> - Update MAINTAINERS to include new files
>> - Include bitfield.h and slab.h to allow compilation on non-arm64
>>   arches.
>> - Depend on COMMON_CLK and either layerscape/ppc
>> - XGI.9 -> XFI.9
>> 
>> Changes in v5:
>> - Update commit description
>> - Dual id header
>> - Remove references to PHY_INTERFACE_MODE_1000BASEKX to allow this
>>   series to be applied directly to linux/master.
>> - Add fsl,lynx-10g.h to MAINTAINERS
>> 
>> Changes in v4:
>> - Add 2500BASE-X and 10GBASE-R phy types
>> - Use subnodes to describe lane configuration, instead of describing
>>   PCCRs. This is the same style used by phy-cadence-sierra et al.
>> - Add ids for Lynx 10g PLLs
>> - Rework all debug statements to remove use of __func__. Additional
>>   information has been provided as necessary.
>> - Consider alternative parent rates in round_rate and not in set_rate.
>>   Trying to modify our parent's rate in set_rate will deadlock.
>> - Explicitly perform a stop/reset sequence in set_rate. This way we
>>   always ensure that the PLL is properly stopped.
>> - Set the power-down bit when disabling the PLL. We can do this now that
>>   enable/disable aren't abused during the set rate sequence.
>> - Fix typos in QSGMII_OFFSET and XFI_OFFSET
>> - Rename LNmTECR0_TEQ_TYPE_PRE to LNmTECR0_TEQ_TYPE_POST to better
>>   reflect its function (adding post-cursor equalization).
>> - Use of_clk_hw_onecell_get instead of a custom function.
>> - Return struct clks from lynx_clks_init instead of embedding lynx_clk
>>   in lynx_priv.
>> - Rework PCCR helper functions; T-series SoCs differ from Layerscape SoCs
>>   primarily in the layout and offset of the PCCRs. This will help bring a
>>   cleaner abstraction layer. The caps have been removed, since this
>>   handles the only current usage.
>> - Convert to use new binding format. As a result of this, we no longer
>>   need to have protocols for PCIe or SATA. Additionally, modes now live
>>   in lynx_group instead of lynx_priv.
>>   instead of lynx_priv.
>> - Remove teq from lynx_proto_params, since it can be determined from
>>   preq_ratio/postq_ratio.
>> - Fix an early return from lynx_set_mode not releasing serdes->lock.
>> - Rename lynx_priv.conf to .cfg, since I kept mistyping it.
>> 
>> Changes in v3:
>> - Manually expand yaml references
>> - Add mode configuration to device tree
>> - Rename remaining references to QorIQ SerDes to Lynx 10G
>> - Fix PLL enable sequence by waiting for our reset request to be cleared
>>   before continuing. Do the same for the lock, even though it isn't as
>>   critical. Because we will delay for 1.5ms on average, use prepare
>>   instead of enable so we can sleep.
>> - Document the status of each protocol
>> - Fix offset of several bitfields in RECR0
>> - Take into account PLLRST_B, SDRST_B, and SDEN when considering whether
>>   a PLL is "enabled."
>> - Only power off unused lanes.
>> - Split mode lane mask into first/last lane (like group)
>> - Read modes from device tree
>> - Use caps to determine whether KX/KR are supported
>> - 

Re: [PATCH] PCI: Remove unnecessary of_irq.h includes

2022-10-20 Thread Bjorn Helgaas
On Thu, Oct 20, 2022 at 10:13:10PM +0800, kernel test robot wrote:
> Hi Bjorn,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on helgaas-pci/next]
> [also build test ERROR on xilinx-xlnx/master rockchip/for-next linus/master 
> v6.1-rc1 next-20221020]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:
> https://github.com/intel-lab-lkp/linux/commits/Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
> patch link:
> https://lore.kernel.org/r/20221019195452.37606-1-helgaas%40kernel.org
> patch subject: [PATCH] PCI: Remove unnecessary of_irq.h includes
> config: s390-randconfig-r044-20221019
> compiler: clang version 16.0.0 (https://github.com/llvm/llvm-project 
> 791a7ae1ba3efd6bca96338e10ffde557ba83920)
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # install s390 cross compiling tool for clang build
> # apt-get install binutils-s390x-linux-gnu
> # 
> https://github.com/intel-lab-lkp/linux/commit/273a24b16a40ffd6a64c6c55aecbfae00a1cd996
> git remote add linux-review https://github.com/intel-lab-lkp/linux
>     git fetch --no-tags linux-review 
> Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
> git checkout 273a24b16a40ffd6a64c6c55aecbfae00a1cd996
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 
> O=build_dir ARCH=s390 SHELL=/bin/bash drivers/pci/controller/

Maybe more user error?

  $ COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir 
ARCH=s390 SHELL=/bin/bash drivers/pci/controller/
  Compiler will be installed in /home/bjorn/0day
  make --keep-going HOSTCC=/home/bjorn/0day/clang/bin/clang 
CC=/home/bjorn/0day/clang/bin/clang OBJCOPY=/usr/s390x-linux-gnu/bin/objcopy 
AR=llvm-ar NM=llvm-nm STRIP=llvm-strip OBJDUMP=llvm-objdump OBJSIZE=llvm-size 
READELF=llvm-readelf HOSTCXX=clang++ HOSTAR=llvm-ar 
CROSS_COMPILE=s390x-linux-gnu- --jobs=16 W=1 O=build_dir ARCH=s390 
SHELL=/bin/bash drivers/pci/controller/
  make[1]: Entering directory '/home/bjorn/linux/build_dir'
    SYNC    include/config/auto.conf.cmd
    GEN     Makefile
  scripts/Kconfig.include:40: linker 's390x-linux-gnu-ld' not found
  make[3]: *** [../scripts/kconfig/Makefile:77: syncconfig] Error 1
  make[2]: *** [../Makefile:697: syncconfig] Error 2
  make[1]: *** [/home/bjorn/linux/Makefile:798: include/config/auto.conf.cmd] 
Error 2
  make[1]: Failed to remake makefile 'include/config/auto.conf.cmd'.
  make[1]: Failed to remake makefile 'include/config/auto.conf'.
GEN Makefile
  Error: kernelrelease not valid - run 'make prepare' to update it
  ../scripts/mkcompile_h: 19: s390x-linux-gnu-ld: not found
  make[1]: Target 'drivers/pci/controller/' not remade because of errors.
  make[1]: Leaving directory '/home/bjorn/linux/build_dir'
  make: *** [Makefile:231: __sub-make] Error 2
  make: Target 'drivers/pci/controller/' not remade because of errors.



Re: [PATCH] PCI: Remove unnecessary of_irq.h includes

2022-10-20 Thread Bjorn Helgaas
On Thu, Oct 20, 2022 at 08:41:01AM -0500, Bjorn Helgaas wrote:
> On Thu, Oct 20, 2022 at 04:09:37PM +0800, kernel test robot wrote:
> > Hi Bjorn,
> > 
> > I love your patch! Yet something to improve:
> > 
> > [auto build test ERROR on helgaas-pci/next]
> > [also build test ERROR on xilinx-xlnx/master rockchip/for-next linus/master 
> > v6.1-rc1 next-20221020]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> > 
> > url:
> > https://github.com/intel-lab-lkp/linux/commits/Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
> > base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
> > patch link:
> > https://lore.kernel.org/r/20221019195452.37606-1-helgaas%40kernel.org
> > patch subject: [PATCH] PCI: Remove unnecessary of_irq.h includes
> > config: ia64-randconfig-r026-20221020
> > compiler: ia64-linux-gcc (GCC) 12.1.0
> > reproduce (this is a W=1 build):
> > wget 
> > https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> > ~/bin/make.cross
> > chmod +x ~/bin/make.cross
> > # 
> > https://github.com/intel-lab-lkp/linux/commit/273a24b16a40ffd6a64c6c55aecbfae00a1cd996
> > git remote add linux-review https://github.com/intel-lab-lkp/linux
> > git fetch --no-tags linux-review 
> > Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
> > git checkout 273a24b16a40ffd6a64c6c55aecbfae00a1cd996
> > # save the config file
> > mkdir build_dir && cp config build_dir/.config
> > COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 
> > O=build_dir ARCH=ia64 SHELL=/bin/bash drivers/pci/controller/
> 
> FYI, the instructions above didn't work for me.  Missing "config".

Sorry, my fault, "config" was a MIME attachment.  Possibly update the
instructions:

  - # save the config file
  + # save the config file from the MIME attachment


Re: [PATCH] [perf/core: Update sample_flags for raw_data in perf_output_sample

2022-10-20 Thread Peter Zijlstra
On Thu, Oct 20, 2022 at 12:36:56PM +0530, Athira Rajeev wrote:
> commit 838d9bb62d13 ("perf: Use sample_flags for raw_data")
> added check for PERF_SAMPLE_RAW in sample_flags in
> perf_prepare_sample(). But while copying the sample in memory,
> the check for sample_flags is not added in perf_output_sample().
> Fix adds the same in perf_output_sample as well.
> 
> Fixes: 838d9bb62d13 ("perf: Use sample_flags for raw_data")
> Signed-off-by: Athira Rajeev 
> ---
>  kernel/events/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 4ec3717003d5..daf387c75d33 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -7099,7 +7099,7 @@ void perf_output_sample(struct perf_output_handle 
> *handle,
>   if (sample_type & PERF_SAMPLE_RAW) {
>   struct perf_raw_record *raw = data->raw;
>  
> - if (raw) {
> + if (raw && (data->sample_flags & PERF_SAMPLE_RAW)) {
>   struct perf_raw_frag *frag = &raw->frag;
>  
>   perf_output_put(handle, raw->size);

Urgh.. something smells here. We already did a PERF_SAMPLE_RAW test.

And perf_prepare_sample() explicitly makes data->raw be NULL when not
set earlier.

So what's going wrong?
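The reasoning above can be modelled in a small userspace sketch. The structs, the flag value, and the function names (sample_data, prepare_sample, output_sample, SAMPLE_RAW) are simplified stand-ins, not the kernel's definitions. The sketch assumes, as stated above, that the prepare step clears the raw pointer whenever the flag was not set earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for PERF_SAMPLE_RAW; not taken from the kernel headers. */
#define SAMPLE_RAW (1u << 10)

struct raw_record {
	uint32_t size;
};

struct sample_data {
	uint64_t sample_flags;   /* which fields the PMU driver filled in */
	struct raw_record *raw;  /* may hold a stale pointer if never set */
};

/* Model of the prepare step: drop the raw pointer unless the driver set it. */
static void prepare_sample(struct sample_data *data, uint64_t sample_type)
{
	assert(data != NULL);
	if (sample_type & SAMPLE_RAW) {
		if (!(data->sample_flags & SAMPLE_RAW))
			data->raw = NULL;
	}
}

/* Model of the output step: returns the raw size it would emit, or 0 if none. */
static uint32_t output_sample(const struct sample_data *data, uint64_t sample_type)
{
	assert(data != NULL);
	if ((sample_type & SAMPLE_RAW) && data->raw)
		return data->raw->size;
	return 0;
}
```

If this model matched the real code paths, a stale pointer could not reach the output step, and the extra flag test there would be redundant. That is the open question in this reply.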



Re: [PATCH] PCI: Remove unnecessary of_irq.h includes

2022-10-20 Thread Conor Dooley
On Thu, Oct 20, 2022 at 08:45:47AM -0500, Bjorn Helgaas wrote:
> [+cc Pali, heads-up for trivial addition of <linux/irqdomain.h> to
> pci-mvebu.c]
> 
> On Thu, Oct 20, 2022 at 08:20:25AM +0100, Conor Dooley wrote:
> > On Thu, Oct 20, 2022 at 03:08:50PM +0800, kernel test robot wrote:
> > > Hi Bjorn,
> > > 
> > > I love your patch! Yet something to improve:
> > > 
> > > >> drivers/pci/controller/pcie-microchip-host.c:473:31: error: incomplete 
> > > >> definition of type 'struct irq_domain'
> > >struct mc_pcie *port = domain->host_data;
> > 
> > That's what I get for only visually inspecting the patch before Acking
> > it.. Un-ack I suppose.
> 
> No problem!
> 
> I think what happened is the pcie-microchip-host.c uses
> irq_domain_add_linear() so it needs <linux/irqdomain.h>, but it
> currently gets it via <linux/of_irq.h>, which it doesn't otherwise
> need.
> 
> I added a preparatory patch to include <linux/irqdomain.h> explicitly,
> but I haven't been able to cross-build either riscv or ia64 to verify
> this fix.  I'll wait a few days and post an updated series for the
> 0-day bot to test.

I saw you saying you couldn't find the config from LKP, FWIW a build
using riscv defconfig w/ CONFIG_PCIE_MICROCHIP_HOST=y fails for me
in the same way as lkp reports.
Otherwise, dump the patch in response to this and I'll give it a shot
later if you like?

HTH,
Conor.

> 
> Same situation for pcie-altera-msi.c.
> 
> pci-mvebu.c also relies on getting <linux/irqdomain.h> via
> <linux/of_irq.h>, but it actually depends on of_irq.h, so I'll just
> add an irqdomain.h include there.
> 
> Bjorn
> 


Re: [PATCH] PCI: Remove unnecessary of_irq.h includes

2022-10-20 Thread Bjorn Helgaas
[+cc Pali, heads-up for trivial addition of <linux/irqdomain.h> to
pci-mvebu.c]

On Thu, Oct 20, 2022 at 08:20:25AM +0100, Conor Dooley wrote:
> On Thu, Oct 20, 2022 at 03:08:50PM +0800, kernel test robot wrote:
> > Hi Bjorn,
> > 
> > I love your patch! Yet something to improve:
> > 
> > >> drivers/pci/controller/pcie-microchip-host.c:473:31: error: incomplete 
> > >> definition of type 'struct irq_domain'
> >struct mc_pcie *port = domain->host_data;
> 
> That's what I get for only visually inspecting the patch before Acking
> it.. Un-ack I suppose.

No problem!

I think what happened is the pcie-microchip-host.c uses
irq_domain_add_linear() so it needs <linux/irqdomain.h>, but it
currently gets it via <linux/of_irq.h>, which it doesn't otherwise
need.

I added a preparatory patch to include <linux/irqdomain.h> explicitly,
but I haven't been able to cross-build either riscv or ia64 to verify
this fix.  I'll wait a few days and post an updated series for the
0-day bot to test.

Same situation for pcie-altera-msi.c.

pci-mvebu.c also relies on getting <linux/irqdomain.h> via
<linux/of_irq.h>, but it actually depends on of_irq.h, so I'll just
add an irqdomain.h include there.

Bjorn
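The "incomplete definition of type" error in this thread comes down to the difference between a forward declaration and a full definition. Here is a minimal single-file sketch. The type demo_domain and both helpers are hypothetical and only stand in for a struct whose definition lives in a header that a file picks up indirectly.

```c
#include <assert.h>
#include <stddef.h>

/* What a file sees with only a forward declaration: the type is incomplete. */
struct demo_domain;

/* Passing the pointer around is fine while the type is still incomplete. */
static const struct demo_domain *pass_through(const struct demo_domain *d)
{
	return d;
}

/* What the defining header would provide. From here on the type is complete. */
struct demo_domain {
	void *host_data;
};

/* Member access compiles only because the full definition is visible above. */
static void *get_host_data(const struct demo_domain *d)
{
	assert(d != NULL);
	return d->host_data;
}
```

Moving get_host_data() above the struct definition would reproduce the same class of error, which is why a file that dereferences the struct should include the defining header directly instead of relying on another header to pull it in.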



Re: [PATCH] PCI: Remove unnecessary of_irq.h includes

2022-10-20 Thread Bjorn Helgaas
On Thu, Oct 20, 2022 at 04:09:37PM +0800, kernel test robot wrote:
> Hi Bjorn,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on helgaas-pci/next]
> [also build test ERROR on xilinx-xlnx/master rockchip/for-next linus/master 
> v6.1-rc1 next-20221020]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:
> https://github.com/intel-lab-lkp/linux/commits/Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git next
> patch link:
> https://lore.kernel.org/r/20221019195452.37606-1-helgaas%40kernel.org
> patch subject: [PATCH] PCI: Remove unnecessary of_irq.h includes
> config: ia64-randconfig-r026-20221020
> compiler: ia64-linux-gcc (GCC) 12.1.0
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # 
> https://github.com/intel-lab-lkp/linux/commit/273a24b16a40ffd6a64c6c55aecbfae00a1cd996
> git remote add linux-review https://github.com/intel-lab-lkp/linux
> git fetch --no-tags linux-review 
> Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
> git checkout 273a24b16a40ffd6a64c6c55aecbfae00a1cd996
> # save the config file
> mkdir build_dir && cp config build_dir/.config
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 
> O=build_dir ARCH=ia64 SHELL=/bin/bash drivers/pci/controller/

FYI, the instructions above didn't work for me.  Missing "config".

  $ git remote add linux-review https://github.com/intel-lab-lkp/linux
  $ git fetch --no-tags linux-review 
Bjorn-Helgaas/PCI-Remove-unnecessary-of_irq-h-includes/20221020-100633
  $ git checkout 273a24b16a40ffd6a64c6c55aecbfae00a1cd996
  HEAD is now at 273a24b16a40 PCI: Remove unnecessary of_irq.h includes
  $ mkdir build_dir && cp config build_dir/.config
  cp: cannot stat 'config': No such file or directory

  $ COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 
O=build_dir ARCH=ia64 SHELL=/bin/bash drivers/pci/controller/
  Compiler will be installed in /home/bjorn/0day
  Cannot find ia64-linux under https://download.01.org/0day-ci/cross-package 
check /tmp/0day-ci-crosstool-files
  Please set new url, e.g. export 
URL=https://cdn.kernel.org/pub/tools/crosstool/files/bin/x86_64
  gcc crosstool install failed
  Install gcc cross compiler failed
  setup_crosstool failed



Re: [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults

2022-10-20 Thread Shuai Xue



On 2022/10/20 at 1:08 AM, Tony Luck wrote:
> If the kernel is copying a page as the result of a copy-on-write
> fault and runs into an uncorrectable error, Linux will crash because
> it does not have recovery code for this case where poison is consumed
> by the kernel.
> 
> It is easy to set up a test case. Just inject an error into a private
> page, fork(2), and have the child process write to the page.
> 
> I wrapped that neatly into a test at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> 
> just enable ACPI error injection and run:
> 
>   # ./einj_mem-uc -f copy-on-write
> 
> Add a new copy_user_highpage_mc() function that uses copy_mc_to_kernel()
> on architectures where that is available (currently x86 and powerpc).
> When an error is detected during the page copy, return VM_FAULT_HWPOISON
> to caller of wp_page_copy(). This propagates up the call stack. Both x86
> and powerpc have code in their fault handler to deal with this code by
> sending a SIGBUS to the application.

Does it send SIGBUS only to the child process, or to both the parent and the child?

> 
> Note that this patch avoids a system crash and signals the process that
> triggered the copy-on-write action. It does not take any action for the
> memory error that is still in the shared page. To handle that a call to
> memory_failure() is needed. 

If the error page is not poisoned, should the return value of wp_page_copy()
be VM_FAULT_HWPOISON or VM_FAULT_SIGBUS? When is_hwpoison_entry(entry) or
PageHWPoison(page) is true, do_swap_page() returns VM_FAULT_HWPOISON to the caller.
And when is_swapin_error_entry() is true, do_swap_page() returns VM_FAULT_SIGBUS.

Thanks.

Best Regards,
Shuai
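For reference while reading the questions above, the return convention used by the quoted patch can be modelled in userspace. This is only a sketch: demo_copy_mc() is a hypothetical stand-in for a machine-check-safe copy that reports the number of bytes it could not copy, the poison offset parameter is invented for the simulation, and the constants are defined locally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096u
#define DEMO_EHWPOISON 133  /* local constant for the sketch */

/*
 * Stand-in for a machine-check-safe copy: copies up to the poisoned
 * offset and returns the number of bytes NOT copied (0 on success).
 * Pass poison_off >= len to model a clean source page.
 */
static size_t demo_copy_mc(char *dst, const char *src, size_t len, size_t poison_off)
{
	size_t ok = poison_off < len ? poison_off : len;

	assert(dst != NULL && src != NULL);
	memcpy(dst, src, ok);
	return len - ok;
}

/* Mirrors the tri-state in the patch comment: -EHWPOISON, 0, or 1. */
static int demo_wp_copy(char *dst, const char *src, size_t poison_off)
{
	if (src == NULL)
		return 0;  /* placeholder for "failed for some other reason" */
	if (demo_copy_mc(dst, src, DEMO_PAGE_SIZE, poison_off))
		return -DEMO_EHWPOISON;
	return 1;
}
```

The sketch does not decide which fault code the caller should return for each outcome; that mapping is what the questions above are about.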


> But this cannot be done from wp_page_copy()
> because it holds mmap_lock(). Perhaps the architecture fault handlers
> can deal with this loose end in a subsequent patch?
> 
> On Intel/x86 this loose end will often be handled automatically because
> the memory controller provides an additional notification of the h/w
> poison in memory, the handler for this will call memory_failure(). This
> isn't a 100% solution. If there are multiple errors, not all may be
> logged in this way.
> 
> Signed-off-by: Tony Luck 
> 
> ---
> Changes in V2:
>Naoya Horiguchi:
>   1) Use -EHWPOISON error code instead of minus one.
>   2) Poison path needs also to deal with old_page
>Tony Luck:
>   Rewrote commit message
>   Added some powerpc folks to Cc: list
> ---
>  include/linux/highmem.h | 19 +++
>  mm/memory.c | 28 +++-
>  2 files changed, 38 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/highmem.h b/include/linux/highmem.h
> index e9912da5441b..5967541fbf0e 100644
> --- a/include/linux/highmem.h
> +++ b/include/linux/highmem.h
> @@ -319,6 +319,25 @@ static inline void copy_user_highpage(struct page *to, 
> struct page *from,
>  
>  #endif
>  
> +static inline int copy_user_highpage_mc(struct page *to, struct page *from,
> + unsigned long vaddr, struct 
> vm_area_struct *vma)
> +{
> + unsigned long ret = 0;
> +#ifdef copy_mc_to_kernel
> + char *vfrom, *vto;
> +
> + vfrom = kmap_local_page(from);
> + vto = kmap_local_page(to);
> + ret = copy_mc_to_kernel(vto, vfrom, PAGE_SIZE);
> + kunmap_local(vto);
> + kunmap_local(vfrom);
> +#else
> + copy_user_highpage(to, from, vaddr, vma);
> +#endif
> +
> + return ret;
> +}
> +
>  #ifndef __HAVE_ARCH_COPY_HIGHPAGE
>  
>  static inline void copy_highpage(struct page *to, struct page *from)
> diff --git a/mm/memory.c b/mm/memory.c
> index f88c351aecd4..a32556c9b689 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2848,8 +2848,14 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
>   return same;
>  }
>  
> -static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
> -struct vm_fault *vmf)
> +/*
> + * Return:
> + *   -EHWPOISON: copy failed due to hwpoison in source page
> + *   0:  copied failed (some other reason)
> + *   1:  copied succeeded
> + */
> +static inline int __wp_page_copy_user(struct page *dst, struct page *src,
> +   struct vm_fault *vmf)
>  {
>   bool ret;
>   void *kaddr;
> @@ -2860,8 +2866,9 @@ static inline bool __wp_page_copy_user(struct page 
> *dst, struct page *src,
>   unsigned long addr = vmf->address;
>  
>   if (likely(src)) {
> - copy_user_highpage(dst, src, addr, vma);
> - return true;
> + if (copy_user_highpage_mc(dst, src, addr, vma))
> + return -EHWPOISON;
> + return 1;
>   }
>  
>   /*
> @@ -2888,7 +2895,7 @@ static inline bool __wp_page_copy_user(struct page 
> *dst, struct page *src,
>* and update local tlb only
>*/
>   update_mmu_tlb(vma, addr, 

Re: [PATCH] powerpc/pseries: Use lparcfg to reconfig VAS windows for DLPAR CPU

2022-10-20 Thread Michael Ellerman
On Thu, 06 Oct 2022 22:29:59 -0700, Haren Myneni wrote:
> The hypervisor assigns VAS (Virtual Accelerator Switchboard)
> windows depending on the cores configured in the LPAR. The kernel uses
> the OF reconfig notifier to reconfigure VAS windows on a DLPAR CPU event.
> In a shared CPU mode partition, the hypervisor assigns
> VAS windows based on the CPU entitled capacity, not on vcpus.
> When the user changes the CPU entitled capacity for the partition,
> drmgr uses the /proc/ppc64/lparcfg interface to notify the kernel.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/pseries: Use lparcfg to reconfig VAS windows for DLPAR CPU
  https://git.kernel.org/powerpc/c/2147783d6bf0b7ca14c72a25527dc5135bd17f65

cheers


Re: [PATCH v3] powerpc/pseries/vas: Add VAS IRQ primary handler

2022-10-20 Thread Michael Ellerman
On Sun, 09 Oct 2022 20:41:25 -0700, Haren Myneni wrote:
> irq_default_primary_handler() can be used only with the IRQF_ONESHOT
> flag, but that flag disables the IRQ before executing the thread handler
> and enables it after the interrupt is handled. This IRQ disable
> sets the VAS IRQ OFF state in the hypervisor. If NX faults
> during this window, the hypervisor will not deliver the fault
> interrupt to the partition and user space may wait indefinitely
> for the CSB update. So use a VAS-specific IRQ handler instead of
> calling the default primary handler.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/pseries/vas: Add VAS IRQ primary handler
  https://git.kernel.org/powerpc/c/89ed0b769d6adf30364f60e6b1566961821a9893

cheers


Re: warning from change_protection in 6.1 rc1

2022-10-20 Thread Michael Ellerman
"Nicholas Piggin"  writes:
> On Wed Oct 19, 2022 at 8:21 PM AEST, Dan Horák wrote:
>> Hi,
>>
>> in my first boot with the 6.1 rc1 kernel I have received a couple of
>> warnings from change_protection on Talos II P9 system, see the details
>> below. Nothing like that was noticed in 6.0 or earlier.
>
> Thanks for the report. This is a false positive in a warning I added
> because page_savedwrite overloads the _PAGE_PRIVILEGED bit. The warning
> should be harmless and the code will do the right thing (and it will
> flush). I think this should do it as a minimal fix.
>
> I don't really like that we use that bit for this, I think it should not
> cause a hardware access issue like with KUAP because there are no RWX
> permissions, but having something like pte_user suddenly return false on
> these seems a bit fragile. I'd rather use another bit for this,
> something like _PAGE_SAO. But that shouldn't be done for this release...

There is an RFC to remove the savedwrite stuff entirely:

  https://lore.kernel.org/all/20220926152618.194810-1-da...@redhat.com/

cheers

> diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
> b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> index 67655cd60545..4b9eab0995ec 100644
> --- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
> +++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
> @@ -178,9 +178,19 @@ static inline bool __pte_flags_need_flush(unsigned long 
> oldval,
>  
>   /*
>* We do not expect kernel mappings or non-PTEs or not-present PTEs.
> +  * pte_savedwrite does use _PAGE_PRIVILEGED for user mappings, so
> +  * have to filter that out.
>*/
> - VM_WARN_ON_ONCE(oldval & _PAGE_PRIVILEGED);
> - VM_WARN_ON_ONCE(newval & _PAGE_PRIVILEGED);
> + if (!IS_ENABLED(CONFIG_NUMA_BALANCING) ||
> + ((oldval & (_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) !=
> +  (_PAGE_PRESENT | _PAGE_PTE)))
> + VM_WARN_ON_ONCE(oldval & _PAGE_PRIVILEGED);
> +
> + if (!IS_ENABLED(CONFIG_NUMA_BALANCING) ||
> + ((newval & (_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) !=
> +  (_PAGE_PRESENT | _PAGE_PTE)))
> + VM_WARN_ON_ONCE(newval & _PAGE_PRIVILEGED);
> +
>   VM_WARN_ON_ONCE(!(oldval & _PAGE_PTE));
>   VM_WARN_ON_ONCE(!(newval & _PAGE_PTE));
>   VM_WARN_ON_ONCE(!(oldval & _PAGE_PRESENT));
>>
>>
>>  Thanks,
>>
>>  Dan
>>
>> [   79.229100] [ cut here ]
>> [   79.229109] WARNING: CPU: 61 PID: 2987 at 
>> arch/powerpc/include/asm/book3s/64/tlbflush.h:183 
>> change_protection+0xfd0/0x1610
>> [   79.229125] Modules linked in: nft_reject_inet nf_reject_ipv4 
>> nf_reject_ipv6 nft_reject nft_objref nf_conntrack_tftp nft_ct kvm_hv kvm 
>> nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace 
>> sunrpc fscache netfs nf_tables ebtable_nat ebtable_broute ip6table_nat 
>> ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat 
>> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw 
>> iptable_security bridge stp llc ip_set nfnetlink rfkill ebtable_filter 
>> ebtables ip6table_filter iptable_filter binfmt_misc dm_crypt xfs 
>> snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi 
>> ftdi_sio onboard_usb_hub snd_hda_intel snd_intel_dspcfg snd_hda_codec 
>> snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer ses 
>> enclosure ofpart snd soundcore scsi_transport_sas at24 ipmi_powernv 
>> ipmi_devintf powernv_flash regmap_i2c opal_prd crct10dif_vpmsum i2c_opal 
>> ipmi_msghandler mtd rtc_opal amdgpu raid1 drm_ttm_helper ttm mfd_core gpu_sc
>>  hed vmx_crypto
>> [   79.229258]  crc32c_vpmsum drm_buddy nvme drm_display_helper nvme_core 
>> tg3 nvme_common aacraid cec ip6_tables ip_tables i2c_dev fuse
>> [   79.229283] CPU: 61 PID: 2987 Comm: lightdm-gtk-gre Not tainted 
>> 6.1.0-0.rc1.15.fc38.ppc64le #1
>> [   79.229289] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 
>> opal:skiboot-bc106a0 PowerNV
>> [   79.229291] NIP:  c0495aa0 LR: c0495608 CTR: 
>> 
>> [   79.229295] REGS: c00020001766f690 TRAP: 0700   Not tainted  
>> (6.1.0-0.rc1.15.fc38.ppc64le)
>> [   79.229299] MSR:  90029033   CR: 
>> 44242420  XER: 0156
>> [   79.229316] CFAR: c049562c IRQMASK: 0 
>>GPR00: c0495608 c00020001766f930 c1dd7100 
>> c00020001e4e3700 
>>GPR04: 00015444 c00020004eac3920 84030a73002000c0 
>> 88030a73002000c0 
>>GPR08: 0040 0001 0040 
>> 0009 
>>GPR12: c00020001795f708 c0002007be1a9700 c0002000554b 
>> 00015444 
>>GPR16: c01f9f40 fe7f c2acdbb8 
>> c2a3aef0 
>>GPR20: c00020001766fac8 ff7fefbf 08010080 
>> c00c00080013ab28 
>>GPR24: 0004 

Re: [6.1-rc1] Warning arch/powerpc/kernel/irq_64.c:285

2022-10-20 Thread Nicholas Piggin
On Thu Oct 20, 2022 at 2:55 PM AEST, Sachin Sant wrote:
> While running powerpc kselftests (mm/stress_code_patching.sh)
> on a PowerVM LPAR following warning is seen. The test passes.
> I can reliably recreate it on a Power9 server, not so easily on
> Power10.
>
> # ./stress_code_patching.sh 
> Testing for spurious faults when mapping kernel memory...
> [  175.289418] [ cut here ]
> [  175.289434] WARNING: CPU: 11 PID: 5436 at arch/powerpc/kernel/irq_64.c:285 
> arch_local_irq_restore+0x230/0x260
> [  175.289450] Modules linked in: dm_mod(E) nft_fib_inet(E) nft_fib_ipv4(E) 
> nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) 
> nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) 
> nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) bonding(E) tls(E) 
> ip_set(E) rfkill(E) nf_tables(E) libcrc32c(E) nfnetlink(E) sunrpc(E) 
> pseries_rng(E) vmx_crypto(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E) t10_pi(E) 
> crc64_rocksoft(E) crc64(E) sg(E) ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) 
> ipmi_devintf(E) ipmi_msghandler(E) fuse(E)
> [  175.289582] CPU: 11 PID: 5436 Comm: stress_code_pat Tainted: G
> E  6.1.0-rc1-00025-gaae703b02f92 #1
> [  175.289591] Hardware name: IBM,8375-42A POWER9 (raw) 0x4e0202 0xf05 
> of:IBM,FW950.50 (VL950_105) hv:phyp pSeries
> [  175.289599] NIP:  c003e9a0 LR: c00b16dc CTR: 
> a6a4
> [  175.289607] REGS: c000297b35f0 TRAP: 0700   Tainted: GE
>(6.1.0-rc1-00025-gaae703b02f92)
> [  175.289616] MSR:  8282b033   CR: 
> 4824  XER: 
> [  175.289654] CFAR: c003e7f4 IRQMASK: 1 
> [  175.289654] GPR00: c00b179c c000297b3890 c135e900 
>  
> [  175.289654] GPR04:  201760241794  
> 4009287a7705 
> [  175.289654] GPR08:  8000 cac27d80 
> 0040 
> [  175.289654] GPR12: 2000 c0001ec52700 4000 
> 000101239798 
> [  175.289654] GPR16: 000101239724 0001011d8128 000101170370 
> 00010123d568 
> [  175.289654] GPR20: 01002f8f5490 0001 0001011eaf18 
> 7fffc1696ab4 
> [  175.289654] GPR24: 7fffc1696ab0  c0080018 
> 4be3bca9 
> [  175.289654] GPR28: c2a590a0   
> c35010c0 
> [  175.289787] NIP [c003e9a0] arch_local_irq_restore+0x230/0x260
> [  175.289796] LR [c00b16dc] patch_instruction+0x26c/0x340
> [  175.289805] Call Trace:
> [  175.289810] [c000297b3890] [c2a590a0] init_mm+0x0/0x5c0 
> (unreliable)
> [  175.289824] [c000297b38c0] [c00b179c] 
> patch_instruction+0x32c/0x340
> [  175.289835] [c000297b3910] [c007ef40] 
> ftrace_make_call+0x220/0x4b0
> [  175.289846] [c000297b39a0] [c02e00a8] 
> __ftrace_replace_code+0x138/0x140
> [  175.289858] [c000297b39f0] [c02e0678] 
> ftrace_replace_code+0xa8/0x140
> [  175.289869] [c000297b3a40] [c02e095c] 
> ftrace_modify_all_code+0x11c/0x240
> [  175.289880] [c000297b3a70] [c007f918] 
> arch_ftrace_update_code+0x18/0x30
> [  175.289891] [c000297b3a90] [c02e0bc8] 
> ftrace_startup_enable+0x68/0xa0
> [  175.289902] [c000297b3ac0] [c02e6618] ftrace_startup+0xf8/0x1c0
> [  175.289913] [c000297b3b00] [c02e672c] 
> register_ftrace_function+0x4c/0xc0
> [  175.289924] [c000297b3b30] [c030c908] 
> function_trace_init+0x88/0x100
> [  175.289936] [c000297b3b60] [c030079c] 
> tracing_set_tracer+0x2ac/0x540
> [  175.289946] [c000297b3c00] [c0300ad4] 
> tracing_set_trace_write+0xa4/0x110
> [  175.289957] [c000297b3cc0] [c0553a00] vfs_write+0x100/0x460
> [  175.289968] [c000297b3d80] [c0553f3c] ksys_write+0x7c/0x140
> [  175.289979] [c000297b3dd0] [c0035160] 
> system_call_exception+0x140/0x350
> [  175.289990] [c000297b3e10] [c000c654] 
> system_call_common+0xf4/0x278
> [  175.290002] --- interrupt: c00 at 0x7fff83c50c34
> [  175.290009] NIP:  7fff83c50c34 LR: 7fff83bc7c74 CTR: 
> 
> [  175.290016] REGS: c000297b3e80 TRAP: 0c00   Tainted: GE
>(6.1.0-rc1-00025-gaae703b02f92)
> [  175.290025] MSR:  8280f033   
> CR: 2822  XER: 
> [  175.290065] IRQMASK: 0 
> [  175.290065] GPR00: 0004 7fffc1696890 7fff83d37300 
> 0001 
> [  175.290065] GPR04: 01002f8f2bb0 0009 0010 
> 6e6f6974 
> [  175.290065] GPR08:    
>  
> [  175.290065] GPR12:  7fff83e6ae60 4000 
> 000101239798 
> [  175.290065] GPR16: 000101239724 0001011d8128 000101170370 
> 00010123d568 
> [  175.290065] GPR20: 01002f8f5490 

Re: warning from change_protection in 6.1 rc1

2022-10-20 Thread Nicholas Piggin
On Wed Oct 19, 2022 at 8:21 PM AEST, Dan Horák wrote:
> Hi,
>
> in my first boot with the 6.1 rc1 kernel I have received a couple of
> warnings from change_protection on Talos II P9 system, see the details
> below. Nothing like that was noticed in 6.0 or earlier.

Thanks for the report. This is a false positive in a warning I added
because page_savedwrite overloads the _PAGE_PRIVILEGED bit. The warning
should be harmless and the code will do the right thing (and it will
flush). I think this should do it as a minimal fix.

I don't really like that we use that bit for this, I think it should not
cause a hardware access issue like with KUAP because there are no RWX
permissions, but having something like pte_user suddenly return false on
these seems a bit fragile. I'd rather use another bit for this,
something like _PAGE_SAO. But that shouldn't be done for this release...
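The filtering that the minimal fix relies on can be sketched in userspace as a pure predicate. The DEMO_ bit values are illustrative placeholders chosen for the sketch, not copied from the powerpc headers, and demo_should_warn() is a hypothetical helper that models the VM_WARN_ON_ONCE() condition for a single PTE value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit layout only; see the book3s/64 pgtable headers for real values. */
#define DEMO_PAGE_EXEC       0x1ull
#define DEMO_PAGE_WRITE      0x2ull
#define DEMO_PAGE_READ       0x4ull
#define DEMO_PAGE_RWX        (DEMO_PAGE_READ | DEMO_PAGE_WRITE | DEMO_PAGE_EXEC)
#define DEMO_PAGE_PRIVILEGED 0x8ull
#define DEMO_PAGE_PTE        (1ull << 62)
#define DEMO_PAGE_PRESENT    (1ull << 63)

static_assert((DEMO_PAGE_RWX & DEMO_PAGE_PRIVILEGED) == 0,
	      "permission bits must not overlap the privileged bit");

/*
 * True when the value is present, is a PTE, and has no R/W/X permission:
 * the shape for which the privileged bit may be an overloaded marker.
 */
static bool demo_is_no_access_pte(uint64_t val)
{
	return (val & (DEMO_PAGE_PRESENT | DEMO_PAGE_PTE | DEMO_PAGE_RWX)) ==
	       (DEMO_PAGE_PRESENT | DEMO_PAGE_PTE);
}

/* Warn on the privileged bit unless NUMA balancing may have overloaded it. */
static bool demo_should_warn(uint64_t val, bool numa_balancing)
{
	if (!numa_balancing || !demo_is_no_access_pte(val))
		return (val & DEMO_PAGE_PRIVILEGED) != 0;
	return false;
}
```

With NUMA balancing enabled, a present PTE with no RWX bits and the privileged bit set no longer trips the check, while any PTE that still carries a permission bit does.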

Thanks,
Nick


diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 67655cd60545..4b9eab0995ec 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -178,9 +178,19 @@ static inline bool __pte_flags_need_flush(unsigned long 
oldval,
 
/*
 * We do not expect kernel mappings or non-PTEs or not-present PTEs.
+* pte_savedwrite does use _PAGE_PRIVILEGED for user mappings, so
+* have to filter that out.
 */
-   VM_WARN_ON_ONCE(oldval & _PAGE_PRIVILEGED);
-   VM_WARN_ON_ONCE(newval & _PAGE_PRIVILEGED);
+   if (!IS_ENABLED(CONFIG_NUMA_BALANCING) ||
+   ((oldval & (_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) !=
+(_PAGE_PRESENT | _PAGE_PTE)))
+   VM_WARN_ON_ONCE(oldval & _PAGE_PRIVILEGED);
+
+   if (!IS_ENABLED(CONFIG_NUMA_BALANCING) ||
+   ((newval & (_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) !=
+(_PAGE_PRESENT | _PAGE_PTE)))
+   VM_WARN_ON_ONCE(newval & _PAGE_PRIVILEGED);
+
VM_WARN_ON_ONCE(!(oldval & _PAGE_PTE));
VM_WARN_ON_ONCE(!(newval & _PAGE_PTE));
VM_WARN_ON_ONCE(!(oldval & _PAGE_PRESENT));
>
>
>   Thanks,
>
>   Dan
>
> [   79.229100] [ cut here ]
> [   79.229109] WARNING: CPU: 61 PID: 2987 at 
> arch/powerpc/include/asm/book3s/64/tlbflush.h:183 
> change_protection+0xfd0/0x1610
> [   79.229125] Modules linked in: nft_reject_inet nf_reject_ipv4 
> nf_reject_ipv6 nft_reject nft_objref nf_conntrack_tftp nft_ct kvm_hv kvm 
> nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace 
> sunrpc fscache netfs nf_tables ebtable_nat ebtable_broute ip6table_nat 
> ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat 
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw 
> iptable_security bridge stp llc ip_set nfnetlink rfkill ebtable_filter 
> ebtables ip6table_filter iptable_filter binfmt_misc dm_crypt xfs 
> snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi 
> ftdi_sio onboard_usb_hub snd_hda_intel snd_intel_dspcfg snd_hda_codec 
> snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer ses enclosure 
> ofpart snd soundcore scsi_transport_sas at24 ipmi_powernv ipmi_devintf 
> powernv_flash regmap_i2c opal_prd crct10dif_vpmsum i2c_opal ipmi_msghandler 
> mtd rtc_opal amdgpu raid1 drm_ttm_helper ttm mfd_core gpu_sc
>  hed vmx_crypto
> [   79.229258]  crc32c_vpmsum drm_buddy nvme drm_display_helper nvme_core tg3 
> nvme_common aacraid cec ip6_tables ip_tables i2c_dev fuse
> [   79.229283] CPU: 61 PID: 2987 Comm: lightdm-gtk-gre Not tainted 
> 6.1.0-0.rc1.15.fc38.ppc64le #1
> [   79.229289] Hardware name: T2P9D01 REV 1.00 POWER9 0x4e1202 
> opal:skiboot-bc106a0 PowerNV
> [   79.229291] NIP:  c0495aa0 LR: c0495608 CTR: 
> 
> [   79.229295] REGS: c00020001766f690 TRAP: 0700   Not tainted  
> (6.1.0-0.rc1.15.fc38.ppc64le)
> [   79.229299] MSR:  90029033   CR: 44242420 
>  XER: 0156
> [   79.229316] CFAR: c049562c IRQMASK: 0 
>GPR00: c0495608 c00020001766f930 c1dd7100 
> c00020001e4e3700 
>GPR04: 00015444 c00020004eac3920 84030a73002000c0 
> 88030a73002000c0 
>GPR08: 0040 0001 0040 
> 0009 
>GPR12: c00020001795f708 c0002007be1a9700 c0002000554b 
> 00015444 
>GPR16: c01f9f40 fe7f c2acdbb8 
> c2a3aef0 
>GPR20: c00020001766fac8 ff7fefbf 08010080 
> c00c00080013ab28 
>GPR24: 0004 c00c00080013ab00 00015460 
> 0001549b 
>GPR28: 88030a73002000c0 c000200054354510 000d 
> c00020004eac3920 
> [   79.229377] NIP [c0495aa0] change_protection+0xfd0/0x1610
> [   79.229384] LR 

[PATCH] [perf/core: Update sample_flags for raw_data in perf_output_sample

2022-10-20 Thread Athira Rajeev
commit 838d9bb62d13 ("perf: Use sample_flags for raw_data")
added check for PERF_SAMPLE_RAW in sample_flags in
perf_prepare_sample(). But while copying the sample in memory,
the check for sample_flags is not added in perf_output_sample().
Fix adds the same in perf_output_sample as well.

Fixes: 838d9bb62d13 ("perf: Use sample_flags for raw_data")
Signed-off-by: Athira Rajeev 
---
 kernel/events/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4ec3717003d5..daf387c75d33 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7099,7 +7099,7 @@ void perf_output_sample(struct perf_output_handle *handle,
if (sample_type & PERF_SAMPLE_RAW) {
struct perf_raw_record *raw = data->raw;
 
-   if (raw) {
+   if (raw && (data->sample_flags & PERF_SAMPLE_RAW)) {
struct perf_raw_frag *frag = &raw->frag;
 
perf_output_put(handle, raw->size);
-- 
2.31.1



Re: [PATCH] powerpc/hv-gpci: Fix hv_gpci event list

2022-10-20 Thread Madhavan Srinivasan



On 10/18/22 1:35 PM, Michael Ellerman wrote:
> Kajol Jain writes:
> > Based on getPerfCountInfo v1.018 documentation, some of the
> > hv_gpci events got deprecated for platforms firmware that
> > supports counter_info_version 0x8 or above.
> > 
> > Patch fixes the hv_gpci event list by adding a new attribute
> > group called "hv_gpci_event_attrs_v6" and a "EVENT_ENABLE"
> > macro to enable these events for platform firmware
> > that supports counter_info_version 0x6 or below.
> 
> Does this handle CPUs booted in compat mode?

Nice catch. Sorry I missed that part completely during internal review.
my bad.

> ie. where the firmware is newer but the kernel is told to behave as if
> the CPU is an older version - so cpu_has_feature() doesn't necessarily
> match the underlying hardware.
> 
> Is there some reason the event list is populated based on the CPU
> features rather than by calling the hypervisor and asking what version
> is supported?

I will review the hcall doc again for that option.

maddy
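The alternative raised above, choosing the event list from the version the hypervisor reports instead of from a CPU feature bit, could look roughly like the sketch below. Everything in it is hypothetical: the event names, the two arrays, and demo_select_events() are invented for illustration, and how the version would be obtained from the hypervisor is left out. Treating every version below 0x8 as legacy is also an assumption, since the thread only names "0x6 or below" and "0x8 or above".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event names standing in for the real attribute arrays. */
static const char *const demo_events_current[] = { "event_a", NULL };
static const char *const demo_events_v6[] = { "event_a", "legacy_event_b", NULL };

/*
 * Pick the list from the reported counter_info_version. The cutoff at
 * 0x8 follows the commit message; the handling of 0x7 is an assumption.
 */
static const char *const *demo_select_events(unsigned int counter_info_version)
{
	if (counter_info_version >= 0x8)
		return demo_events_current;
	return demo_events_v6;
}

/* Count the entries in a NULL-terminated list. */
static size_t demo_count(const char *const *list)
{
	size_t n = 0;

	assert(list != NULL);
	while (list[n])
		n++;
	return n;
}
```

A selection keyed on the reported version would not depend on what cpu_has_feature() says in compat mode, which is the concern in the question above.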



> > Fixes: 97bf2640184f4 ("powerpc/perf/hv-gpci: add the remaining gpci
> > requests")
> 
> Please don't wrap the fixes tag.
> 
> cheers
> 
> > Signed-off-by: Kajol Jain
---
  arch/powerpc/perf/hv-gpci-requests.h |  4 
  arch/powerpc/perf/hv-gpci.c  |  9 +++--
  arch/powerpc/perf/hv-gpci.h  |  1 +
  arch/powerpc/perf/req-gen/perf.h | 17 +
  4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/hv-gpci-requests.h 
b/arch/powerpc/perf/hv-gpci-requests.h
index 8965b4463d43..baef3d082de9 100644
--- a/arch/powerpc/perf/hv-gpci-requests.h
+++ b/arch/powerpc/perf/hv-gpci-requests.h
@@ -79,6 +79,7 @@ REQUEST(__field(0,8,  partition_id)
  )
  #include I(REQUEST_END)
  
+#ifdef EVENT_ENABLE

  /*
   * Not available for counter_info_version >= 0x8, use
   * run_instruction_cycles_by_partition(0x100) instead.
@@ -92,6 +93,7 @@ REQUEST(__field(0,8,  partition_id)
__count(0x10,   8,  cycles)
  )
  #include I(REQUEST_END)
+#endif
  
  #define REQUEST_NAME system_performance_capabilities

  #define REQUEST_NUM 0x40
@@ -103,6 +105,7 @@ REQUEST(__field(0,  1,  perf_collect_privileged)
  )
  #include I(REQUEST_END)
  
+#ifdef EVENT_ENABLE

  #define REQUEST_NAME processor_bus_utilization_abc_links
  #define REQUEST_NUM 0x50
  #define REQUEST_IDX_KIND "hw_chip_id=?"
@@ -194,6 +197,7 @@ REQUEST(__field(0,  4,  phys_processor_idx)
__count(0x28,   8,  instructions_completed)
  )
  #include I(REQUEST_END)
+#endif
  
  /* Processor_core_power_mode (0x95) skipped, no counters */
  /* Affinity_domain_information_by_virtual_processor (0xA0) skipped,
diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
index 5eb60ed5b5e8..065a01812b3e 100644
--- a/arch/powerpc/perf/hv-gpci.c
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -70,9 +70,9 @@ static const struct attribute_group format_group = {
.attrs = format_attrs,
  };
  
-static const struct attribute_group event_group = {
+static struct attribute_group event_group = {
.name  = "events",
-   .attrs = hv_gpci_event_attrs,
+   /* .attrs is set in init */
  };
  
  #define HV_CAPS_ATTR(_name, _format)\
@@ -353,6 +353,11 @@ static int hv_gpci_init(void)
/* sampling not supported */
h_gpci_pmu.capabilities |= PERF_PMU_CAP_NO_INTERRUPT;
  
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   event_group.attrs = hv_gpci_event_attrs;
+   else
+   event_group.attrs = hv_gpci_event_attrs_v6;
+
r = perf_pmu_register(&h_gpci_pmu, h_gpci_pmu.name, -1);
if (r)
return r;
diff --git a/arch/powerpc/perf/hv-gpci.h b/arch/powerpc/perf/hv-gpci.h
index 4d108262bed7..866172c1651c 100644
--- a/arch/powerpc/perf/hv-gpci.h
+++ b/arch/powerpc/perf/hv-gpci.h
@@ -26,6 +26,7 @@ enum {
  #define REQUEST_FILE "../hv-gpci-requests.h"
  #define NAME_LOWER hv_gpci
  #define NAME_UPPER HV_GPCI
+#define EVENT_ENABLE   1
  #include "req-gen/perf.h"
  #undef REQUEST_FILE
  #undef NAME_LOWER
diff --git a/arch/powerpc/perf/req-gen/perf.h b/arch/powerpc/perf/req-gen/perf.h
index fa9bc804e67a..78d407e3fcc6 100644
--- a/arch/powerpc/perf/req-gen/perf.h
+++ b/arch/powerpc/perf/req-gen/perf.h
@@ -139,6 +139,23 @@ PMU_EVENT_ATTR_STRING( \
  #define REQUEST_(r_name, r_value, r_idx_1, r_fields)  \
r_fields
  
+/* Generate event list for platforms with counter_info_version 0x6 or below */
+static __maybe_unused struct attribute *hv_gpci_event_attrs_v6[] = {
+#include REQUEST_FILE
+   NULL
+};
+
+/*
+ * Based on getPerfCountInfo v1.018 documentation, some of the hv-gpci
+ * events got deprecated for platforms firmware that supports
+ * counter_info_version 0x8 or above.
+ * Undefining macro EVENT_ENABLE, to disable the addition of deprecated
+ * events in "hv_gpci_event_attrs" attribute group, for platforms that
+ * supports counter_info_version 0x8 or above.
+ */
+#undef EVENT_ENABLE
+
+/* Generate