Re: [PATCH v4.19 0/2] Custom backports for powerpc SLB issues

2022-04-29 Thread Greg KH
On Thu, Apr 28, 2022 at 10:41:48PM +1000, Michael Ellerman wrote:
> Hi Greg,
> 
> Here are two custom backports to v4.19 for some powerpc issues we've
> discovered. Both were fixed upstream as part of a large non-backportable
> rewrite. Other stable kernel versions are not affected.
> 
> cheers
> 
> Michael Ellerman (1):
>   powerpc/64s: Unmerge EX_LR and EX_DAR
> 
> Nicholas Piggin (1):
>   powerpc/64/interrupt: Temporarily save PPR on stack to fix register
> corruption due to SLB miss
> 
>  arch/powerpc/include/asm/exception-64s.h | 37 ++--
>  1 file changed, 22 insertions(+), 15 deletions(-)
> 
> -- 
> 2.35.1
> 

Both now queued up, thanks.

greg k-h


Patch "powerpc/64/interrupt: Temporarily save PPR on stack to fix register corruption due to SLB miss" has been added to the 4.19-stable tree

2022-04-29 Thread gregkh


This is a note to let you know that I've just added the patch titled

powerpc/64/interrupt: Temporarily save PPR on stack to fix register corruption due to SLB miss

to the 4.19-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 
powerpc-64-interrupt-temporarily-save-ppr-on-stack-to-fix-register-corruption-due-to-slb-miss.patch
and it can be found in the queue-4.19 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From foo@baz Fri Apr 29 10:56:14 AM CEST 2022
From: Michael Ellerman 
Date: Thu, 28 Apr 2022 22:41:49 +1000
Subject: powerpc/64/interrupt: Temporarily save PPR on stack to fix register corruption due to SLB miss
To: , 
Cc: , 
Message-ID: <20220428124150.375623-2-...@ellerman.id.au>

From: Nicholas Piggin 

This is a minimal stable kernel fix for the problem solved by
4c2de74cc869 ("powerpc/64: Interrupts save PPR on stack rather than
thread_struct").

Upstream kernels between 4.17-4.20 have this bug, so I propose this
patch for 4.19 stable.

Longer description from mpe:

In commit f384796c4 ("powerpc/mm: Add support for handling > 512TB
address in SLB miss") we added support for using multiple context ids
per process. Previously accessing past the first context id was a fatal
error for the process. With the new support it became non-fatal, and so
the previous "bad_addr_slb" handler was changed to be the
"large_addr_slb" handler.

That handler uses the EXCEPTION_PROLOG_COMMON() macro, which in turn
calls the SAVE_PPR() macro. At the point where SAVE_PPR() is used, the
r9-r13 register values from the original user fault are saved in
paca->exslb. It's not until later in EXCEPTION_PROLOG_COMMON_2() that
they are saved from paca->exslb onto the kernel stack.

The PPR is saved into current->thread.ppr, which is notably not on the
kernel stack the way pt_regs are. This means we can take an SLB miss on
current->thread.ppr. If that happens in the "large_addr_slb" case we
will clobber the saved user r9-r13 in paca->exslb with kernel values.
Later we will save those clobbered values into the pt_regs on the stack,
and when we return to userspace those kernel values will be restored.

Typically this appears as some sort of segfault in userspace, with an
address that looks like a kernel address. In dmesg it can appear as:

  [19117.440331] some_program[1869625]: unhandled signal 11 at cf6bda10 nip 7fff780d559c lr 7fff781ae56c code 30001

The upstream fix for this issue was to move PPR into pt_regs, on the
kernel stack, avoiding the possibility of an SLB fault when saving it.

However changing the size of pt_regs is an intrusive change, and has
side effects in other parts of the kernel. A minimal fix is to
temporarily save the PPR in an unused part of pt_regs, then save the
user register values from paca->exslb into pt_regs, and then move the
saved PPR into thread.ppr.

Fixes: f384796c40dc ("powerpc/mm: Add support for handling > 512TB address in SLB miss")
Signed-off-by: Nicholas Piggin 
Signed-off-by: Michael Ellerman 
Link: https://lore.kernel.org/r/20220316033235.903657-1-npig...@gmail.com
Signed-off-by: Greg Kroah-Hartman 
---
 arch/powerpc/include/asm/exception-64s.h |   22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -243,10 +243,22 @@
  * PPR save/restore macros used in exceptions_64s.S  
  * Used for P7 or later processors
  */
-#define SAVE_PPR(area, ra, rb) \
+#define SAVE_PPR(area, ra) \
+BEGIN_FTR_SECTION_NESTED(940)  \
+   ld  ra,area+EX_PPR(r13);/* Read PPR from paca */\
+   std ra,RESULT(r1);  /* Store PPR in RESULT for now */ \
+END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,940)
+
+/*
+ * This is called after we are finished accessing 'area', so we can now take
+ * SLB faults accessing the thread struct, which will use PACA_EXSLB area.
+ * This is required because the large_addr_slb handler uses EXSLB and it also
+ * uses the common exception macros including this PPR saving.
+ */
+#define MOVE_PPR_TO_THREAD(ra, rb) \
 BEGIN_FTR_SECTION_NESTED(940)  \
ld  ra,PACACURRENT(r13);\
-   ld  rb,area+EX_PPR(r13);/* Read PPR from paca */\
+   ld  rb,RESULT(r1);  /* Read PPR from stack */   \
std rb,TASKTHREADPPR(ra);   \
 END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,940)
 
@@ -515,9 +527,11 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 3: EXCEPTION_PROLOG_COMMON_1();   \
 

Patch "powerpc/64s: Unmerge EX_LR and EX_DAR" has been added to the 4.19-stable tree

2022-04-29 Thread gregkh


This is a note to let you know that I've just added the patch titled

powerpc/64s: Unmerge EX_LR and EX_DAR

to the 4.19-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 powerpc-64s-unmerge-ex_lr-and-ex_dar.patch
and it can be found in the queue-4.19 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


>From foo@baz Fri Apr 29 10:56:14 AM CEST 2022
From: Michael Ellerman 
Date: Thu, 28 Apr 2022 22:41:50 +1000
Subject: powerpc/64s: Unmerge EX_LR and EX_DAR
To: , 
Cc: , 
Message-ID: <20220428124150.375623-3-...@ellerman.id.au>

From: Michael Ellerman 

The SLB miss handler is not fully re-entrant; it is able to work because
we ensure that the SLB entries for the kernel text and data segment, as
well as the kernel stack, are pinned in the SLB. Accesses to kernel data
outside of those areas have to be carefully managed and can only occur in
certain parts of the code. One way we deal with that is by storing some
values in temporary slots in the paca.

In v4.13 in commit dbeea1d6b4bd ("powerpc/64s/paca: EX_LR can be merged
with EX_DAR") we merged the storage for two temporary slots for register
storage during SLB miss handling. That was safe at the time because the
two slots were never used at the same time.

Unfortunately in v4.17 in commit c2b4d8b7417a ("powerpc/mm/hash64:
Increase the VA range") we broke that condition, and introduced a case
where the two slots could be in use at the same time, leading to one
being corrupted.

Specifically in slb_miss_common() when we detect that we're handling a
fault for a large virtual address (> 512TB) we go to the "8" label,
there we store the original fault address into paca->exslb[EX_DAR],
before jumping to large_addr_slb() (using rfid).

We then use the EXCEPTION_PROLOG_COMMON and RECONCILE_IRQ_STATE macros
to do exception setup, before reloading the fault address from
paca->exslb[EX_DAR] and storing it into pt_regs->dar (Data Address
Register).

However the code generated by those macros can cause a recursive SLB
miss on a kernel address in three places.

First is the saving of the PPR (Program Priority Register), which
happens on all CPUs since Power7; the PPR is saved to the thread struct,
which can be anywhere in memory. There is also the call to
accumulate_stolen_time() if CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y and
CONFIG_PPC_SPLPAR=y, and also the call to trace_hardirqs_off() if
CONFIG_TRACE_IRQFLAGS=y. The latter two call into generic C code and can
lead to accesses anywhere in memory.

On modern 64-bit CPUs we have 1TB segments, so for any of those accesses
to cause an SLB fault they must access memory more than 1TB away from
the kernel text, data and kernel stack. That typically only happens on
machines with more than 1TB of RAM. However it is possible on multi-node
Power9 systems, because memory on the 2nd node begins at 32TB in the
linear mapping.

If we take a recursive SLB fault then we will corrupt the original fault
address with the LR (Link Register) value, because the EX_DAR and EX_LR
slots share storage. Subsequently we will think we're trying to fault
that LR address, which is the wrong address, and will also most likely
lead to a segfault because the LR address will be < 512TB and so will be
rejected by slb_miss_large_addr().

This appears as a spurious segfault to userspace, and if
show_unhandled_signals is enabled you will see a fault reported in dmesg
with the LR address, not the expected fault address, eg:

  prog[123]: segfault (11) at 128a61808 nip 128a618cc lr 128a61808 code 3 in prog[128a6+1]
  prog[123]: code: 4ba4 39200040 3ce4 7d2903a6 3c000200 78e707c6 780083e4 7d3b4b78
  prog[123]: code: 7d455378 7d7d5b78 7d9f6378 7da46b78  7d3a4b78 7d465378 7d7c5b78

Notice that the fault address == the LR, and the faulting instruction is
a simple store that should never use LR.

In upstream this was fixed in v4.20 in commit
48e7b7695745 ("powerpc/64s/hash: Convert SLB miss handlers to C"),
however that is a huge rewrite and not backportable.

The minimal fix for stable is to just unmerge the EX_LR and EX_DAR slots
again, avoiding the corruption of the DAR value. This uses an extra 8
bytes per CPU, which is negligible.

Signed-off-by: Michael Ellerman 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/powerpc/include/asm/exception-64s.h |   15 ---
 1 file changed, 4 insertions(+), 11 deletions(-)

--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -48,11 +48,12 @@
 #define EX_CCR 52
 #define EX_CFAR56
 #define EX_PPR 64
+#define EX_LR  72
 #if defined(CONFIG_RELOCATABLE)
-#define EX_CTR 72
-#define EX_SIZE10  /* size in u64 units */
+#define EX_CTR 80
+#define EX_SIZE11  /* size in u64 units */
 #else
-#define EX_SI

Re: [PATCH 17/30] tracing: Improve panic/die notifiers

2022-04-29 Thread Sergei Shtylyov
Hello!

On 4/28/22 1:49 AM, Guilherme G. Piccoli wrote:

> Currently the tracing dump_on_oops feature is implemented
> through separate notifiers, one for die/oops and the other
> for panic. With the addition of panic notifier "id", this
> patch makes use of such "id" to unify both functions.
> 
> It also comments the function and changes the priority of the
> notifier blocks, in order they run early compared to other
> notifiers, to prevent useless trace data (like the callback
> names for the other notifiers). Finally, we also removed an
> unnecessary header inclusion.
> 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  kernel/trace/trace.c | 57 +---
>  1 file changed, 32 insertions(+), 25 deletions(-)
> 
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index f4de111fa18f..c1d8a3622ccc 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
[...]
> @@ -9767,38 +9766,46 @@ static __init int tracer_init_tracefs(void)
>  
>  fs_initcall(tracer_init_tracefs);
>  
> -static int trace_panic_handler(struct notifier_block *this,
> -unsigned long event, void *unused)
> +/*
> + * The idea is to execute the following die/panic callback early, in order
> + * to avoid showing irrelevant information in the trace (like other panic
> + * notifier functions); we are the 2nd to run, after hung_task/rcu_stall
> + * warnings get disabled (to prevent potential log flooding).
> + */
> +static int trace_die_panic_handler(struct notifier_block *self,
> + unsigned long ev, void *unused)
>  {
> - if (ftrace_dump_on_oops)
> + int do_dump;

   bool?

> +
> + if (!ftrace_dump_on_oops)
> + return NOTIFY_DONE;
> +
> + switch (ev) {
> + case DIE_OOPS:
> + do_dump = 1;
> + break;
> + case PANIC_NOTIFIER:
> + do_dump = 1;
> + break;

   Why not:

case DIE_OOPS:
case PANIC_NOTIFIER:
do_dump = 1;
break;

> + default:
> + do_dump = 0;
> + break;
> + }
> +
> + if (do_dump)
>   ftrace_dump(ftrace_dump_on_oops);
> - return NOTIFY_OK;
> +
> + return NOTIFY_DONE;
>  }
[...]

MBR, Sergey


Re: [PATCH] powerpc/time: Always set decrementer in timer_interrupt()

2022-04-29 Thread Michael Ellerman
Nicholas Piggin  writes:
> Excerpts from Nicholas Piggin's message of April 21, 2022 12:07 pm:
>> Excerpts from Michal Suchánek's message of April 21, 2022 12:28 am:
>>> Hello,
>>> 
>>> On Thu, Apr 21, 2022 at 12:16:57AM +1000, Michael Ellerman wrote:
 This is a partial revert of commit 0faf20a1ad16 ("powerpc/64s/interrupt:
 Don't enable MSR[EE] in irq handlers unless perf is in use").
 
 Prior to that commit, we always set the decrementer in
 timer_interrupt(), to clear the timer interrupt. Otherwise we could end
 up continuously taking timer interrupts.
 
 When high res timers are enabled there is no problem seen with leaving
 the decrementer untouched in timer_interrupt(), because it will be
 programmed via hrtimer_interrupt() -> tick_program_event() ->
 clockevents_program_event() -> decrementer_set_next_event().
 
 However with CONFIG_HIGH_RES_TIMERS=n or booting with highres=off, we
>>> 
>>> How difficult is it to detect this condition?
>>> 
>>> Maybe detecting this could be just added?
>> 
>> Possibly not too difficult but I'd like to see if we can get this to work
>> in core timer code -
>> 
>> https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-April/242212.html
>> 
>> I'll resend as a patch and see what flamage I get...
>
> tglx merged it into his tree, so we could try again after its
> upstream.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=timers/core&id=62c1256d544747b38e77ca9b5bfe3a26f9592576
>
> I'm kind of worried the patch will explode some strange clock event 
> device in an obscure way so we may wait for a release or two first.

Hah yep, :face-with-cold-sweat:

I created an issue so hopefully we don't forget:

  https://github.com/linuxppc/issues/issues/408

cheers


Re: L2 SRAM on PowerPC e500 and Caching-inhibited bit

2022-04-29 Thread Michael Ellerman
Pali Rohár  writes:
> Hello!
>
> I started playing with PowerPC e500 architecture, it is something really
> new for me and I suspect that I found a bug in U-Boot code which
> configures L2 cache as initial SRAM (L2 with locked lines).
>
> U-Boot code for the first half of L2 cache sets Caching-inhibited
> (MAS2_I) in TLB and for second half of L2 cache it unsets this bit.
> And I think that this is a bug as it seems strange if one half of L2
> should be mapped differently than second half.
>
> I wrote an email about it to the U-Boot mailing list:
> https://lore.kernel.org/u-boot/20220413092633.gmz4rqpiha4rwecb@pali/
>
> I discussed it on the U-Boot IRC channel and developers suggested that
> I write to the Linux PowerPC mailing list, as there could be more
> skilled people there.
>
> Michael, or anybody else, could you help me with this? Do you know if L2
> SRAM entry in TLB for e500v2 / BookE architecture should have MAS2_I bit
> set or not?

Sorry I don't know those sort of low-level details for Freescale
machines.

Hopefully some former Freescale person will remember and reply here.

It's also possible that Linux ignores what U-Boot did and sets it up
itself, have you looked at the Linux code?

cheers


[PATCH 1/3] mm: change huge_ptep_clear_flush() to return the original pte

2022-04-29 Thread Baolin Wang
It is incorrect to use ptep_clear_flush() to nuke a hugetlb page
table entry when unmapping or migrating a hugetlb page, and the
following patches will change to use huge_ptep_clear_flush() instead.

So this is a preparation patch, which changes huge_ptep_clear_flush()
to return the original pte, to help nuke a hugetlb page table entry.

Signed-off-by: Baolin Wang 
---
 arch/arm64/include/asm/hugetlb.h   |  4 ++--
 arch/arm64/mm/hugetlbpage.c| 12 +---
 arch/ia64/include/asm/hugetlb.h|  4 ++--
 arch/mips/include/asm/hugetlb.h|  9 ++---
 arch/parisc/include/asm/hugetlb.h  |  4 ++--
 arch/powerpc/include/asm/hugetlb.h |  9 ++---
 arch/s390/include/asm/hugetlb.h|  6 +++---
 arch/sh/include/asm/hugetlb.h  |  4 ++--
 arch/sparc/include/asm/hugetlb.h   |  4 ++--
 include/asm-generic/hugetlb.h  |  4 ++--
 10 files changed, 32 insertions(+), 28 deletions(-)

diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index 1242f71..616b2ca 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -39,8 +39,8 @@ extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep);
 #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
-extern void huge_ptep_clear_flush(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep);
+extern pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
+  unsigned long addr, pte_t *ptep);
 #define __HAVE_ARCH_HUGE_PTE_CLEAR
 extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
   pte_t *ptep, unsigned long sz);
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index cbace1c..ca8e65c 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -486,19 +486,17 @@ void huge_ptep_set_wrprotect(struct mm_struct *mm,
set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
 }
 
-void huge_ptep_clear_flush(struct vm_area_struct *vma,
-  unsigned long addr, pte_t *ptep)
+pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
+   unsigned long addr, pte_t *ptep)
 {
size_t pgsize;
int ncontig;
 
-   if (!pte_cont(READ_ONCE(*ptep))) {
-   ptep_clear_flush(vma, addr, ptep);
-   return;
-   }
+   if (!pte_cont(READ_ONCE(*ptep)))
+   return ptep_clear_flush(vma, addr, ptep);
 
ncontig = find_num_contig(vma->vm_mm, addr, ptep, &pgsize);
-   clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
+   return get_clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
 }
 
 static int __init hugetlbpage_init(void)
diff --git a/arch/ia64/include/asm/hugetlb.h b/arch/ia64/include/asm/hugetlb.h
index 7e46ebd..65d3811 100644
--- a/arch/ia64/include/asm/hugetlb.h
+++ b/arch/ia64/include/asm/hugetlb.h
@@ -23,8 +23,8 @@ static inline int is_hugepage_only_range(struct mm_struct *mm,
 #define is_hugepage_only_range is_hugepage_only_range
 
 #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
-static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
-unsigned long addr, pte_t *ptep)
+static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
 {
 }
 
diff --git a/arch/mips/include/asm/hugetlb.h b/arch/mips/include/asm/hugetlb.h
index c214440..fd69c88 100644
--- a/arch/mips/include/asm/hugetlb.h
+++ b/arch/mips/include/asm/hugetlb.h
@@ -43,16 +43,19 @@ static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 }
 
 #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
-static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
-unsigned long addr, pte_t *ptep)
+static inline pte_t huge_ptep_clear_flush(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep)
 {
+   pte_t pte;
+
/*
 * clear the huge pte entry firstly, so that the other smp threads will
 * not get old pte entry after finishing flush_tlb_page and before
 * setting new huge pte entry
 */
-   huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
+   pte = huge_ptep_get_and_clear(vma->vm_mm, addr, ptep);
flush_tlb_page(vma, addr);
+   return pte;
 }
 
 #define __HAVE_ARCH_HUGE_PTE_NONE
diff --git a/arch/parisc/include/asm/hugetlb.h b/arch/parisc/include/asm/hugetlb.h
index a69cf9e..25bc560 100644
--- a/arch/parisc/include/asm/hugetlb.h
+++ b/arch/parisc/include/asm/hugetlb.h
@@ -28,8 +28,8 @@ static inline int prepare_hugepage_range(struct file *file,
 }
 
 #define __HAVE_ARCH_HUGE_PTEP_CLEAR_FLUSH
-static inline void huge_ptep_clear_flush(struct vm_area_struct *vma,
-

[PATCH 2/3] mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when migration

2022-04-29 Thread Baolin Wang
On some architectures (like ARM64), it is possible to support CONT-PTE/PMD
size hugetlb, which means it can support not only PMD/PUD size hugetlb
(2M and 1G), but also CONT-PTE/PMD sizes (64K and 32M) if a 4K page
size is specified.

When migrating a hugetlb page, we will get the relevant page table
entry by huge_pte_offset() only once to nuke it and remap it with
a migration pte entry. This is correct for PMD or PUD size hugetlb,
since they always contain only one pmd entry or pud entry in the
page table.

However this is incorrect for CONT-PTE and CONT-PMD size hugetlb,
since they can contain several continuous pte or pmd entries with the
same page table attributes. So we will nuke or remap only one pte
or pmd entry for this CONT-PTE/PMD size hugetlb page, which is
not expected for hugetlb migration. The problem is that we can still
continue to modify the subpages' data of a hugetlb page while
migrating it, which can cause a serious data consistency issue,
since we did not nuke the page table entries and set migration
ptes for the subpages of the hugetlb page.

To fix this issue, we should change to use huge_ptep_clear_flush()
to nuke a hugetlb page table, and remap it with set_huge_pte_at()
and set_huge_swap_pte_at() when migrating a hugetlb page, since those
APIs already handle the CONT-PTE and CONT-PMD size hugetlb.

Signed-off-by: Baolin Wang 
---
 mm/rmap.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 6fdd198..7cf2408 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1924,13 +1924,15 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
}
+
+   /* Nuke the hugetlb page table entry */
+   pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
} else {
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+   /* Nuke the page table entry. */
+   pteval = ptep_clear_flush(vma, address, pvmw.pte);
}
 
-   /* Nuke the page table entry. */
-   pteval = ptep_clear_flush(vma, address, pvmw.pte);
-
/* Set the dirty flag on the folio now the pte is gone. */
if (pte_dirty(pteval))
folio_mark_dirty(folio);
@@ -2015,7 +2017,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
pte_t swp_pte;
 
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-   set_pte_at(mm, address, pvmw.pte, pteval);
+   if (folio_test_hugetlb(folio))
+   set_huge_pte_at(mm, address, pvmw.pte, pteval);
+   else
+   set_pte_at(mm, address, pvmw.pte, pteval);
ret = false;
page_vma_mapped_walk_done(&pvmw);
break;
@@ -2024,7 +2029,10 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
   !anon_exclusive, subpage);
if (anon_exclusive &&
page_try_share_anon_rmap(subpage)) {
-   set_pte_at(mm, address, pvmw.pte, pteval);
+   if (folio_test_hugetlb(folio))
+   set_huge_pte_at(mm, address, pvmw.pte, pteval);
+   else
+   set_pte_at(mm, address, pvmw.pte, pteval);
ret = false;
page_vma_mapped_walk_done(&pvmw);
break;
@@ -2050,7 +2058,11 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
swp_pte = pte_swp_mksoft_dirty(swp_pte);
if (pte_uffd_wp(pteval))
swp_pte = pte_swp_mkuffd_wp(swp_pte);
-   set_pte_at(mm, address, pvmw.pte, swp_pte);
+   if (folio_test_hugetlb(folio))
+   set_huge_swap_pte_at(mm, address, pvmw.pte,
+swp_pte, vma_mmu_pagesize(vma));
+   else
+   set_pte_at(mm, address, pvmw.pte, swp_pte);
trace_set_migration_pte(address, pte_val(swp_pte),
compound_order(&folio->page));
/*
-- 
1.8.3.1



[PATCH 0/3] Fix CONT-PTE/PMD size hugetlb issue when unmapping or migrating

2022-04-29 Thread Baolin Wang
Hi,

Now, when migrating a hugetlb page or unmapping a poisoned hugetlb page,
we use ptep_clear_flush() and set_pte_at() to nuke the page table entry
and remap it. This is incorrect for a CONT-PTE or CONT-PMD size hugetlb
page and will cause a potential data consistency issue. This patch set
changes to use hugetlb-related APIs to fix this issue; please find
details in each patch. Thanks.

Baolin Wang (3):
  mm: change huge_ptep_clear_flush() to return the original pte
  mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when migration
  mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when unmapping

 arch/arm64/include/asm/hugetlb.h   |  4 +--
 arch/arm64/mm/hugetlbpage.c| 12 
 arch/ia64/include/asm/hugetlb.h|  4 +--
 arch/mips/include/asm/hugetlb.h|  9 --
 arch/parisc/include/asm/hugetlb.h  |  4 +--
 arch/powerpc/include/asm/hugetlb.h |  9 --
 arch/s390/include/asm/hugetlb.h|  6 ++--
 arch/sh/include/asm/hugetlb.h  |  4 +--
 arch/sparc/include/asm/hugetlb.h   |  4 +--
 include/asm-generic/hugetlb.h  |  4 +--
 mm/rmap.c  | 58 +++---
 11 files changed, 67 insertions(+), 51 deletions(-)

-- 
1.8.3.1



[PATCH 3/3] mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when unmapping

2022-04-29 Thread Baolin Wang
On some architectures (like ARM64), it is possible to support CONT-PTE/PMD
size hugetlb, which means it can support not only PMD/PUD size hugetlb
(2M and 1G), but also CONT-PTE/PMD sizes (64K and 32M) if a 4K page
size is specified.

When unmapping a hugetlb page, we will get the relevant page table
entry by huge_pte_offset() only once to nuke it. This is correct
for PMD or PUD size hugetlb, since they always contain only one
pmd entry or pud entry in the page table.

However this is incorrect for CONT-PTE and CONT-PMD size hugetlb,
since they can contain several continuous pte or pmd entries with the
same page table attributes, so we will nuke only one pte or pmd
entry for this CONT-PTE/PMD size hugetlb page.

And now we only use try_to_unmap() to unmap a poisoned hugetlb page,
which means we will unmap only one pte entry for a CONT-PTE or
CONT-PMD size poisoned hugetlb page, and we can still access the other
subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page,
which can possibly cause serious issues.

So we should change to use huge_ptep_clear_flush() to nuke the
hugetlb page table to fix this issue, since that API already handles
CONT-PTE and CONT-PMD size hugetlb.

Note we've already used set_huge_swap_pte_at() to set a poisoned
swap entry for a poisoned hugetlb page.

Signed-off-by: Baolin Wang 
---
 mm/rmap.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 7cf2408..1e168d7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1564,28 +1564,28 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
break;
}
}
+   pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
} else {
flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
-   }
-
-   /*
-* Nuke the page table entry. When having to clear
-* PageAnonExclusive(), we always have to flush.
-*/
-   if (should_defer_flush(mm, flags) && !anon_exclusive) {
/*
-* We clear the PTE but do not flush so potentially
-* a remote CPU could still be writing to the folio.
-* If the entry was previously clean then the
-* architecture must guarantee that a clear->dirty
-* transition on a cached TLB entry is written through
-* and traps if the PTE is unmapped.
+* Nuke the page table entry. When having to clear
+* PageAnonExclusive(), we always have to flush.
 */
-   pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+   if (should_defer_flush(mm, flags) && !anon_exclusive) {
+   /*
+* We clear the PTE but do not flush so potentially
+* a remote CPU could still be writing to the folio.
+* If the entry was previously clean then the
+* architecture must guarantee that a clear->dirty
+* transition on a cached TLB entry is written through
+* and traps if the PTE is unmapped.
+*/
+   pteval = ptep_get_and_clear(mm, address, pvmw.pte);
 
-   set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
-   } else {
-   pteval = ptep_clear_flush(vma, address, pvmw.pte);
+   set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
+   } else {
+   pteval = ptep_clear_flush(vma, address, pvmw.pte);
+   }
}
 
/*
-- 
1.8.3.1



Re: L2 SRAM on PowerPC e500 and Caching-inhibited bit

2022-04-29 Thread Pali Rohár
On Friday 29 April 2022 22:57:03 Michael Ellerman wrote:
> Pali Rohár  writes:
> > Hello!
> >
> > I started playing with PowerPC e500 architecture, it is something really
> > new for me and I suspect that I found a bug in U-Boot code which
> > configures L2 cache as initial SRAM (L2 with locked lines).
> >
> > U-Boot code for the first half of L2 cache sets Caching-inhibited
> > (MAS2_I) in TLB and for second half of L2 cache it unsets this bit.
> > And I think that this is a bug as it seems strange if one half of L2
> > should be mapped differently than second half.
> >
> > I wrote an email about it to the U-Boot mailing list:
> > https://lore.kernel.org/u-boot/20220413092633.gmz4rqpiha4rwecb@pali/
> >
> > I discussed it on the U-Boot IRC channel and developers suggested that
> > I write to the Linux PowerPC mailing list, as there could be more
> > skilled people there.
> >
> > Michael, or anybody else, could you help me with this? Do you know if L2
> > SRAM entry in TLB for e500v2 / BookE architecture should have MAS2_I bit
> > set or not?
> 
> Sorry I don't know those sort of low-level details for Freescale
> machines.
> 
> Hopefully some former Freescale person will remember and reply here.

Ok, so I hope that somebody with knowledge about these CPUs is still on
the list here.

> It's also possible that Linux ignores what U-Boot did and sets it up
> itself, have you looked at the Linux code?
> 
> cheers

Usage of the L2 cache as SRAM makes sense only during early boot, when
DDR is not configured yet. So Linux certainly does not use this
L2-as-SRAM setup: it has full access to DDR and uses the L2 cache as an
L2 cache.


Re: [PATCH 17/30] tracing: Improve panic/die notifiers

2022-04-29 Thread Steven Rostedt
On Fri, 29 Apr 2022 12:22:44 +0300
Sergei Shtylyov  wrote:

> > +   switch (ev) {
> > +   case DIE_OOPS:
> > +   do_dump = 1;
> > +   break;
> > +   case PANIC_NOTIFIER:
> > +   do_dump = 1;
> > +   break;  
> 
>Why not:
> 
>   case DIE_OOPS:
>   case PANIC_NOTIFIER:
>   do_dump = 1;
>   break;

Agreed.

Other than that.

Acked-by: Steven Rostedt (Google) 

-- Steve


Re: [powerpc] kernel BUG at mm/mmap.c:3164! w/ltp(mmapstress03)

2022-04-29 Thread Sachin Sant


> On 28-Apr-2022, at 10:26 PM, Sachin Sant  wrote:
> 
> While running LTP tests (mmapstress03 specifically) against
> 5.18.0-rc4-next-20220428 booted on an IBM Power server, the mentioned
> BUG is encountered.
> 
> # ./mmapstress03
> mmapstress030  TINFO  :  uname.machine=ppc64le kernel is 64bit
> mmapstress03: errno = 12: failed to fiddle with brk at the end
> mmapstress031  TFAIL  :  mmapstress03.c:212: Test failed
> [   32.396145] mmap: mmapstress03 (3023): VmData 18446744073706799104 exceed data ulimit 18446744073709551615. Update limits or use boot option ignore_rlimit_data.
> [   32.396192] [ cut here ]
> [   32.396193] kernel BUG at mm/mmap.c:3164!
> [   32.396195] Oops: Exception in kernel mode, sig: 5 [#1]

This BUG_ON was introduced with the following patch:
commit d2367e383cf5 ("mm: remove the vma linked list")


I tried removing the following three commits to see if it helps.
commit 49d281fa016f2906346f1707e5059b6f7674a948
commit e7ecf47d211aae50f3a1dd3dc75e5afd47745bb6
commit d2367e383cf5ecf93622e64a7b858f7034f3df62

Although I do not see the BUG_ON trace (as it is removed), the
test still fails:

# ./mmapstress03
mmapstress030  TINFO  :  uname.machine=ppc64le kernel is 64bit
mmapstress03: errno = 12: failed to fiddle with brk at the end
mmapstress031  TFAIL  :  mmapstress03.c:212: Test failed
[  579.877217] mmap: mmapstress03 (4287): VmData 18446744073706799104 exceed data ulimit 18446744073709551615. Update limits or use boot option ignore_rlimit_data.
[  579.970138] BUG: non-zero pgtables_bytes on freeing mm: 8192
# 

> [   32.396210] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> [   32.396213] Modules linked in: dm_mod mptcp_diag xsk_diag tcp_diag 
> udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag 
> nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 
> nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack 
> nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set bonding tls nf_tables nfnetlink 
> sunrpc binfmt_misc pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c 
> sd_mod t10_pi sr_mod crc64_rocksoft_generic cdrom crc64_rocksoft crc64 sg 
> ibmvscsi scsi_transport_srp ibmveth xts vmx_crypto fuse
> [   32.396262] CPU: 5 PID: 3023 Comm: mmapstress03 Not tainted 
> 5.18.0-rc4-next-20220428 #16
> [   32.396267] NIP:  c04c4750 LR: c04c4730 CTR: 
> c04bf5d0
> [   32.396270] REGS: c0001abeb810 TRAP: 0700   Not tainted  
> (5.18.0-rc4-next-20220428)
> [   32.396274] MSR:  80029033   CR: 22002224  
> XER: 
> [   32.396283] CFAR: c08af740 IRQMASK: 0 
> [   32.396283] GPR00: c04c4730 c0001abebab0 c2a71300 
>  
> [   32.396283] GPR04: c00079dcd000  0008 
>  
> [   32.396283] GPR08: 0008 0001  
> c00079dcd040 
> [   32.396283] GPR12: c00079dcd008 c0087fffa300  
>  
> [   32.396283] GPR16:    
>  
> [   32.396283] GPR20:    
> c2aaae85 
> [   32.396283] GPR24:   7fffaa5c1200 
> c00020de3660 
> [   32.396283] GPR28: 000c c00020de3600 000d 
>  
> [   32.396320] NIP [c04c4750] exit_mmap+0x190/0x390
> [   32.396327] LR [c04c4730] exit_mmap+0x170/0x390
> [   32.396332] Call Trace:
> [   32.396334] [c0001abebab0] [c04c4730] exit_mmap+0x170/0x390 
> (unreliable)
> [   32.396340] [c0001abebbd0] [c01700f4] __mmput+0x54/0x200
> [   32.396344] [c0001abebc10] [c017fe5c] exit_mm+0xfc/0x190
> [   32.396348] [c0001abebc50] [c018016c] do_exit+0x27c/0x5a0
> [   32.396352] [c0001abebcf0] [c018063c] do_group_exit+0x4c/0xd0
> [   32.396356] [c0001abebd30] [c01806e4] sys_exit_group+0x24/0x30
> [   32.396360] [c0001abebd50] [c0037084] 
> system_call_exception+0x254/0x550
> [   32.396364] [c0001abebe10] [c000bfe8] 
> system_call_vectored_common+0xe8/0x278
> [   32.396369] --- interrupt: 3000 at 0x7fffaa318d04
> [   32.396374] NIP:  7fffaa318d04 LR:  CTR: 
> 
> [   32.396377] REGS: c0001abebe80 TRAP: 3000   Not tainted  
> (5.18.0-rc4-next-20220428)
> [   32.396380] MSR:  8280f033   
> CR: 4200  XER: 
> [   32.396389] IRQMASK: 0 
> [   32.396389] GPR00: 00ea 7fffe43f3420 7fffaa457100 
> 0001 
> [   32.396389] GPR04:  11a602a0 7fffaa5c1200 
>  
> [   32.396389] GPR08:    
>  
> [   32.396389] GPR12:  7fffaa5ca500  
>  
> [   32.396389] GPR16: 

Re: [PATCH 17/30] tracing: Improve panic/die notifiers

2022-04-29 Thread Guilherme G. Piccoli
On 29/04/2022 10:23, Steven Rostedt wrote:
> On Fri, 29 Apr 2022 12:22:44 +0300
> Sergei Shtylyov  wrote:
> 
>>> +   switch (ev) {
>>> +   case DIE_OOPS:
>>> +   do_dump = 1;
>>> +   break;
>>> +   case PANIC_NOTIFIER:
>>> +   do_dump = 1;
>>> +   break;  
>>
>>Why not:
>>
>>  case DIE_OOPS:
>>  case PANIC_NOTIFIER:
>>  do_dump = 1;
>>  break;
> 
> Agreed.
> 
> Other than that.
> 
> Acked-by: Steven Rostedt (Google) 
> 
> -- Steve

Thanks Sergei and Steven, good idea! I thought about the switch change
you propose, but I confess I got a bit confused by the "fallthrough"
keyword - do I need to use it?

About the s/int/bool, for sure! Not sure why I didn't use bool at
first...heheh

Cheers,


Guilherme


[RFC v2 01/39] Kconfig: introduce HAS_IOPORT option and select it as necessary

2022-04-29 Thread Niklas Schnelle
We introduce a new HAS_IOPORT Kconfig option to indicate support for
I/O Port access. In a future patch HAS_IOPORT=n will disable compilation
of the I/O accessor functions inb()/outb() and friends on architectures
which can not meaningfully support legacy I/O spaces such as s390 or
where such support is optional. The "depends on" relations on HAS_IOPORT
in drivers as well as ifdefs for HAS_IOPORT specific sections will be
added in subsequent patches on a per subsystem basis.
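As a rough illustration of what the cover letter describes - not code from the series itself - a driver's legacy port-I/O path might end up conditionally compiled on the new option roughly like this; the `CONFIG_HAS_IOPORT` macro name follows the usual Kconfig convention, and `legacy_probe()` is a made-up example helper:

```c
/*
 * Hypothetical sketch: once HAS_IOPORT=n removes inb()/outb(),
 * a driver with an optional legacy I/O path could guard it like
 * this. Both the helper and the exact ifdef usage are illustrative.
 */
#ifdef CONFIG_HAS_IOPORT
static int legacy_probe(unsigned long port)
{
	/* would poke the legacy I/O space with inb()/outb() here */
	(void)port;
	return 0;
}
#else
static int legacy_probe(unsigned long port)
{
	/* no I/O port support on this architecture (e.g. s390) */
	(void)port;
	return -1; /* -ENODEV in kernel terms */
}
#endif
```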

Co-developed-by: Arnd Bergmann 
Signed-off-by: Niklas Schnelle 
---
 arch/alpha/Kconfig  | 1 +
 arch/arm/Kconfig| 1 +
 arch/arm64/Kconfig  | 1 +
 arch/ia64/Kconfig   | 1 +
 arch/m68k/Kconfig   | 1 +
 arch/microblaze/Kconfig | 1 +
 arch/mips/Kconfig   | 1 +
 arch/parisc/Kconfig | 1 +
 arch/powerpc/Kconfig| 1 +
 arch/riscv/Kconfig  | 1 +
 arch/sh/Kconfig | 1 +
 arch/sparc/Kconfig  | 1 +
 arch/x86/Kconfig| 1 +
 drivers/bus/Kconfig | 2 +-
 lib/Kconfig | 4 
 lib/Kconfig.kgdb| 1 +
 16 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 7d0d26b5b3f5..2b9cf1b0bdb8 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -27,6 +27,7 @@ config ALPHA
select AUDIT_ARCH
select GENERIC_CPU_VULNERABILITIES
select GENERIC_SMP_IDLE_THREAD
+   select HAS_IOPORT
select HAVE_ARCH_AUDITSYSCALL
select HAVE_MOD_ARCH_SPECIFIC
select MODULES_USE_ELF_RELA
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 2e8091e2d8a8..603ce00033a5 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -69,6 +69,7 @@ config ARM
select GENERIC_SCHED_CLOCK
select GENERIC_SMP_IDLE_THREAD
select HARDIRQS_SW_RESEND
+   select HAS_IOPORT
select HAVE_ARCH_AUDITSYSCALL if AEABI && !OABI_COMPAT
select HAVE_ARCH_BITREVERSE if (CPU_32v7M || CPU_32v7) && !CPU_32v6
select HAVE_ARCH_JUMP_LABEL if !XIP_KERNEL && !CPU_ENDIAN_BE32 && MMU
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 20ea89d9ac2f..234dc89a7654 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -136,6 +136,7 @@ config ARM64
select GENERIC_GETTIMEOFDAY
select GENERIC_VDSO_TIME_NS
select HARDIRQS_SW_RESEND
+   select HAS_IOPORT
select HAVE_MOVE_PMD
select HAVE_MOVE_PUD
select HAVE_PCI
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index cb93769a9f2a..0fffe5130a80 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -25,6 +25,7 @@ config IA64
select PCI_DOMAINS if PCI
select PCI_MSI
select PCI_SYSCALL if PCI
+   select HAS_IOPORT
select HAVE_ASM_MODVERSIONS
select HAVE_UNSTABLE_SCHED_CLOCK
select HAVE_EXIT_THREAD
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 936cce42ae9a..54bf0a40c2f0 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -18,6 +18,7 @@ config M68K
select GENERIC_CPU_DEVICES
select GENERIC_IOMAP
select GENERIC_IRQ_SHOW
+   select HAS_IOPORT if PCI || ISA || ATARI_ROM_ISA
select HAVE_ASM_MODVERSIONS
select HAVE_DEBUG_BUGVERBOSE
select HAVE_EFFICIENT_UNALIGNED_ACCESS if !CPU_HAS_NO_UNALIGNED
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index 8cf429ad1c84..966a6682f1fc 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -21,6 +21,7 @@ config MICROBLAZE
select GENERIC_IRQ_SHOW
select GENERIC_PCI_IOMAP
select GENERIC_SCHED_CLOCK
+   select HAS_IOPORT if PCI
select HAVE_ARCH_HASH
select HAVE_ARCH_KGDB
select HAVE_ARCH_SECCOMP
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index de3b32a507d2..4c55df08d6f1 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -47,6 +47,7 @@ config MIPS
select GENERIC_SMP_IDLE_THREAD
select GENERIC_TIME_VSYSCALL
select GUP_GET_PTE_LOW_HIGH if CPU_MIPS32 && PHYS_ADDR_T_64BIT
+   select HAS_IOPORT
select HAVE_ARCH_COMPILER_H
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_KGDB if MIPS_FP_SUPPORT
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 52e550b45692..741c5c64c173 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -46,6 +46,7 @@ config PARISC
select MODULES_USE_ELF_RELA
select CLONE_BACKWARDS
select TTY # Needed for pdc_cons.c
+   select HAS_IOPORT if PCI || EISA
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HASH
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 174edabb74fa..7133cc35b777 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -182,6 +182,7 @@ config PPC
select GENERIC_SMP_IDLE_THREAD
select GENERIC_TIME_VSYSCALL
select GENERIC_VDSO_TIME_NS
+   select HAS_IOPORT   if PCI
select HAVE_ARCH_AUDITSYSCAL

Re: [PATCH 17/30] tracing: Improve panic/die notifiers

2022-04-29 Thread Steven Rostedt
On Fri, 29 Apr 2022 10:46:35 -0300
"Guilherme G. Piccoli"  wrote:

> Thanks Sergei and Steven, good idea! I thought about the switch change
> you propose, but I confess I got a bit confused by the "fallthrough"
> keyword - do I need to use it?

No. The fallthrough keyword is only needed when there's code between case
labels. As it is very common to list multiple cases for the same code path.
That is:

case DIE_OOPS:
case PANIC_NOTIFIER:
do_dump = 1;
break;

Does not need a fall through label, as there's no code between the DIE_OOPS
and the PANIC_NOTIFIER. But if you had:

case DIE_OOPS:
x = true;
case PANIC_NOTIFIER:
do_dump = 1;
break;

Then you do.
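A compilable toy version of the rule above (the enum values and the helper are invented for the example; in kernel code the `fallthrough` pseudo-keyword from compiler_attributes.h would be used instead of the raw attribute):

```c
#include <stdbool.h>

/* Illustrative only: adjacent case labels need no annotation,
 * but code followed by an intentional fall-through does. */
enum ev { DIE_OOPS = 1, PANIC_NOTIFIER = 2 };

static int classify(enum ev e, bool *x)
{
	int do_dump = 0;

	switch (e) {
	case DIE_OOPS:
		*x = true;
		/* intentional fall-through into the next case */
		__attribute__((fallthrough));
	case PANIC_NOTIFIER:
		do_dump = 1;
		break;
	}
	return do_dump;
}
```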

-- Steve


Re: [PATCH 12/30] parisc: Replace regular spinlock with spin_trylock on panic path

2022-04-29 Thread Guilherme G. Piccoli
On 28/04/2022 13:55, Helge Deller wrote:
> [...]
> You may add:
> Acked-by: Helge Deller  # parisc
> 
> Helge

Thanks Helge, added!
Cheers,


Guilherme

> 
> 
>> ---
>>  arch/parisc/include/asm/pdc.h |  1 +
>>  arch/parisc/kernel/firmware.c | 27 +++
>>  drivers/parisc/power.c| 17 ++---
>>  3 files changed, 34 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/parisc/include/asm/pdc.h b/arch/parisc/include/asm/pdc.h
>> index b643092d4b98..7a106008e258 100644
>> --- a/arch/parisc/include/asm/pdc.h
>> +++ b/arch/parisc/include/asm/pdc.h
>> @@ -83,6 +83,7 @@ int pdc_do_firm_test_reset(unsigned long ftc_bitmap);
>>  int pdc_do_reset(void);
>>  int pdc_soft_power_info(unsigned long *power_reg);
>>  int pdc_soft_power_button(int sw_control);
>> +int pdc_soft_power_button_panic(int sw_control);
>>  void pdc_io_reset(void);
>>  void pdc_io_reset_devices(void);
>>  int pdc_iodc_getc(void);
>> diff --git a/arch/parisc/kernel/firmware.c b/arch/parisc/kernel/firmware.c
>> index 6a7e315bcc2e..0e2f70b592f4 100644
>> --- a/arch/parisc/kernel/firmware.c
>> +++ b/arch/parisc/kernel/firmware.c
>> @@ -1232,15 +1232,18 @@ int __init pdc_soft_power_info(unsigned long 
>> *power_reg)
>>  }
>>
>>  /*
>> - * pdc_soft_power_button - Control the soft power button behaviour
>> - * @sw_control: 0 for hardware control, 1 for software control
>> + * pdc_soft_power_button{_panic} - Control the soft power button behaviour
>> + * @sw_control: 0 for hardware control, 1 for software control
>>   *
>>   *
>>   * This PDC function places the soft power button under software or
>>   * hardware control.
>> - * Under software control the OS may control to when to allow to shut
>> - * down the system. Under hardware control pressing the power button
>> + * Under software control the OS may control to when to allow to shut
>> + * down the system. Under hardware control pressing the power button
>>   * powers off the system immediately.
>> + *
>> + * The _panic version relies in spin_trylock to prevent deadlock
>> + * on panic path.
>>   */
>>  int pdc_soft_power_button(int sw_control)
>>  {
>> @@ -1254,6 +1257,22 @@ int pdc_soft_power_button(int sw_control)
>>  return retval;
>>  }
>>
>> +int pdc_soft_power_button_panic(int sw_control)
>> +{
>> +int retval;
>> +unsigned long flags;
>> +
>> +if (!spin_trylock_irqsave(&pdc_lock, flags)) {
>> +pr_emerg("Couldn't enable soft power button\n");
>> +return -EBUSY; /* ignored by the panic notifier */
>> +}
>> +
>> +retval = mem_pdc_call(PDC_SOFT_POWER, PDC_SOFT_POWER_ENABLE, 
>> __pa(pdc_result), sw_control);
>> +spin_unlock_irqrestore(&pdc_lock, flags);
>> +
>> +return retval;
>> +}
>> +
>>  /*
>>   * pdc_io_reset - Hack to avoid overlapping range registers of Bridges 
>> devices.
>>   * Primarily a problem on T600 (which parisc-linux doesn't support) but
>> diff --git a/drivers/parisc/power.c b/drivers/parisc/power.c
>> index 456776bd8ee6..8512884de2cf 100644
>> --- a/drivers/parisc/power.c
>> +++ b/drivers/parisc/power.c
>> @@ -37,7 +37,6 @@
>>  #include 
>>  #include 
>>  #include 
>> -#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -175,16 +174,21 @@ static void powerfail_interrupt(int code, void *x)
>>
>>
>>
>> -/* parisc_panic_event() is called by the panic handler.
>> - * As soon as a panic occurs, our tasklets above will not be
>> - * executed any longer. This function then re-enables the
>> - * soft-power switch and allows the user to switch off the system
>> +/*
>> + * parisc_panic_event() is called by the panic handler.
>> + *
>> + * As soon as a panic occurs, our tasklets above will not
>> + * be executed any longer. This function then re-enables
>> + * the soft-power switch and allows the user to switch off
>> + * the system. We rely in pdc_soft_power_button_panic()
>> + * since this version spin_trylocks (instead of regular
>> + * spinlock), preventing deadlocks on panic path.
>>   */
>>  static int parisc_panic_event(struct notifier_block *this,
>>  unsigned long event, void *ptr)
>>  {
>>  /* re-enable the soft-power switch */
>> -pdc_soft_power_button(0);
>> +pdc_soft_power_button_panic(0);
>>  return NOTIFY_DONE;
>>  }
>>
>> @@ -193,7 +197,6 @@ static struct notifier_block parisc_panic_block = {
>>  .priority   = INT_MAX,
>>  };
>>
>> -
>>  static int __init power_init(void)
>>  {
>>  unsigned long ret;
> 


Re: [PATCH 17/30] tracing: Improve panic/die notifiers

2022-04-29 Thread Guilherme G. Piccoli
On 29/04/2022 10:56, Steven Rostedt wrote:
> [...]
> No. The fallthrough keyword is only needed when there's code between case
> labels. As it is very common to list multiple cases for the same code path.
> That is:
> 
>   case DIE_OOPS:
>   case PANIC_NOTIFIER:
>   do_dump = 1;
>   break;
> 
> Does not need a fall through label, as there's no code between the DIE_OOPS
> and the PANIC_NOTIFIER. But if you had:
> 
>   case DIE_OOPS:
>   x = true;
>   case PANIC_NOTIFIER:
>   do_dump = 1;
>   break;
> 
> Then you do.
> 
> -- Steve

Thanks a bunch for the clarification, changed that for V2 =)


[next] powerpc: multiple definition of `____cacheline_aligned'; sound/core/oss/pcm_oss.o:(.bss+0x40): first defined here

2022-04-29 Thread Naresh Kamboju
Following powerpc builds failed on Linux next-20220428 and next-20220429.

Regressions found on powerpc:
   - gcc-11-ppc64e_defconfig
   - gcc-10-ppc64e_defconfig
   - gcc-9-ppc64e_defconfig
   - gcc-8-ppc64e_defconfig
   - clang-14-ppc64e_defconfig
   - clang-nightly-ppc64e_defconfig
   - clang-13-ppc64e_defconfig


Build error:
-
Error: Section .bss not empty in prom_init.c
make[3]: *** [arch/powerpc/kernel/Makefile:191:
arch/powerpc/kernel/prom_init_check] Error 1
make[3]: Target '__build' not remade because of errors.
make[2]: *** [scripts/Makefile.build:595: arch/powerpc/kernel] Error 2
make[2]: Target '__build' not remade because of errors.
make[1]: *** [Makefile:1996: arch/powerpc] Error 2
powerpc64le-linux-gnu-ld: sound/core/oss/pcm_plugin.o:(.bss+0x0):
multiple definition of `____cacheline_aligned';
sound/core/oss/pcm_oss.o:(.bss+0x40): first defined here
make[4]: *** [scripts/Makefile.build:530: sound/core/oss/snd-pcm-oss.o] Error 1
make[4]: Target '__build' not remade because of errors.
make[3]: *** [scripts/Makefile.build:595: sound/core/oss] Error 2
powerpc64le-linux-gnu-ld: sound/core/seq/seq_clientmgr.o:(.bss+0x900):
multiple definition of `____cacheline_aligned';
sound/core/seq/seq_lock.o:(.bss+0x0): first defined here
powerpc64le-linux-gnu-ld: sound/core/seq/seq_memory.o:(.bss+0x0):
multiple definition of `____cacheline_aligned';
sound/core/seq/seq_lock.o:(.bss+0x0): first defined here
powerpc64le-linux-gnu-ld: sound/core/seq/seq_queue.o:(.bss+0x140):
multiple definition of `____cacheline_aligned';
sound/core/seq/seq_lock.o:(.bss+0x0): first defined here
powerpc64le-linux-gnu-ld: sound/core/seq/seq_fifo.o:(.bss+0x0):
multiple definition of `____cacheline_aligned';
sound/core/seq/seq_lock.o:(.bss+0x0): first defined here
powerpc64le-linux-gnu-ld: sound/core/seq/seq_timer.o:(.bss+0x0):
multiple definition of `____cacheline_aligned';
sound/core/seq/seq_lock.o:(.bss+0x0): first defined here
powerpc64le-linux-gnu-ld: sound/core/seq/seq_system.o:(.bss+0x0):
multiple definition of `____cacheline_aligned';
sound/core/seq/seq_lock.o:(.bss+0x0): first defined here
powerpc64le-linux-gnu-ld: sound/core/seq/seq_ports.o:(.bss+0x0):
multiple definition of `____cacheline_aligned';
sound/core/seq/seq_lock.o:(.bss+0x0): first defined here
powerpc64le-linux-gnu-ld: sound/core/seq/seq_info.o:(.bss+0x40):
multiple definition of `____cacheline_aligned';
sound/core/seq/seq_lock.o:(.bss+0x0): first defined here
make[4]: *** [scripts/Makefile.build:530: sound/core/seq/snd-seq.o] Error 1
make[4]: Target '__build' not remade because of errors.


Reported-by: Linux Kernel Functional Testing 


steps to reproduce:
---
# To install tuxmake on your system globally:
# sudo pip3 install -U tuxmake

tuxmake --runtime podman --target-arch powerpc --toolchain gcc-11
--kconfig ppc64e_defconfig

--
Linaro LKFT
https://lkft.linaro.org

[1] https://builds.tuxbuild.com/28Sn15hB2la1PweieGMLrUdbFMQ/


Re: [PATCH 24/30] panic: Refactor the panic path

2022-04-29 Thread Guilherme G. Piccoli
On 27/04/2022 21:28, Randy Dunlap wrote:
> 
> 
> On 4/27/22 15:49, Guilherme G. Piccoli wrote:
>> +crash_kexec_post_notifiers
>> +This was DEPRECATED - users should always prefer the
> 
>   This is DEPRECATED - users should always prefer the
> 
>> +parameter "panic_notifiers_level" - check its entry
>> +in this documentation for details on how it works.
>> +Setting this parameter is exactly the same as setting
>> +"panic_notifiers_level=4".
> 

Thanks Randy for your suggestion - but I confess I couldn't understand
it properly. It's related to spaces/tabs, right? What do you suggest I
change in this formatting? Just by looking at the email I can't parse it.

Cheers,


Guilherme


Re: [PATCH v6] PCI hotplug: rpaphp: Error out on busy status from get-sensor-state

2022-04-29 Thread Bjorn Helgaas
On Thu, Apr 28, 2022 at 05:31:38PM -0500, Nathan Lynch wrote:
> Bjorn Helgaas  writes:
> > On Tue, Apr 26, 2022 at 11:07:39PM +0530, Mahesh Salgaonkar wrote:
> >> +/*
> >> + * RTAS call get-sensor-state(DR_ENTITY_SENSE) return values as per PAPR:
> >> + *-1: Hardware Error
> >> + *-2: RTAS_BUSY
> >> + *-3: Invalid sensor. RTAS Parameter Error.
> >> + * -9000: Need DR entity to be powered up and unisolated before RTAS call
> >> + * -9001: Need DR entity to be powered up, but not unisolated, before 
> >> RTAS call
> >> + * -9002: DR entity unusable
> >> + *  990x: Extended delay - where x is a number in the range of 0-5
> >> + */
> >> +#define RTAS_HARDWARE_ERROR   (-1)
> >> +#define RTAS_INVALID_SENSOR   (-3)
> >> +#define SLOT_UNISOLATED   (-9000)
> >> +#define SLOT_NOT_UNISOLATED   (-9001)

> >> +static int rtas_to_errno(int rtas_rc)
> >> +{
> >> +  int rc;
> >> +
> >> +  switch (rtas_rc) {
> >> +  case RTAS_HARDWARE_ERROR:
> >> +  rc = -EIO;
> >> +  break;
> >> +  case RTAS_INVALID_SENSOR:
> >> +  rc = -EINVAL;
> >> +  break;
> >> +  case SLOT_UNISOLATED:
> >> +  case SLOT_NOT_UNISOLATED:
> >> +  rc = -EFAULT;
> >> +  break;
> >> +  case SLOT_NOT_USABLE:
> >> +  rc = -ENODEV;
> >> +  break;
> >> +  case RTAS_BUSY:
> >> +  case RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX:
> >> +  rc = -EBUSY;
> >> +  break;
> >> +  default:
> >> +  err("%s: unexpected RTAS error %d\n", __func__, rtas_rc);
> >> +  rc = -ERANGE;
> >> +  break;
> >> +  }
> >> +  return rc;
> >
> > This basically duplicates rtas_error_rc().  Why do we need two copies?
> 
> It treats RTAS_BUSY, RTAS_EXTENDED_DELAY_MIN...RTAS_EXTENDED_DELAY_MAX
> differently, which is part of the point of this change.

I think it would reduce confusion overall to:

  - add RTAS_HARDWARE_ERROR, RTAS_INVALID_SENSOR to rtas.h

  - rename and move SLOT_UNISOLATED, etc to rtas.h; they look
analogous to function-specific things like RTAS_SUSPEND_ABORTED

  - change rtas_error_rc() to use the RTAS_HARDWARE_ERROR, etc
constants

  - use the constants (SLOT_NOT_USABLE) instead of "9902" in the
commit log and code comments

> Aside: rtas_error_rc() (from powerpc's rtas.c) is badly named. Its
> conversions make sense for only a handful of RTAS calls. RTAS error
> codes have function-specific interpretations.

Maybe there's a case for factoring out the generic error codes and
have rtas_to_errno() (which sounds like maybe it should be renamed as
well) handle the function-specific part and fall back to the generic
one otherwise:

  int rtas_to_errno(int rtas_rc)
  {
switch (rtas_rc) {
case SLOT_UNISOLATED:
case SLOT_NOT_UNISOLATED:
  return -EINVAL;
case SLOT_NOT_USABLE:
  return -ENODEV;
...
default:
  return rtas_error_rc(rtas_rc);
}
  }


Re: [PATCH v2 2/2] ftrace: recordmcount: Handle sections with no non-weak symbols

2022-04-29 Thread Naveen N. Rao

Steven Rostedt wrote:

On Thu, 28 Apr 2022 22:49:52 +0530
"Naveen N. Rao"  wrote:


But, with ppc64 elf abi v1 which only supports the old -pg flag, mcount
location can differ between the weak and non-weak variants of a
function. In such scenarios, one of the two mcount entries will be
invalid. Such architectures need to validate mcount locations by
ensuring that the instruction(s) at those locations are as expected. On
powerpc, this can be a simple check to ensure that the instruction is a
'bl'. This check can be further tightened as necessary.


I was thinking about this more, and it occurred to me that we could create
another section; perhaps __mcount_loc_weak. And place these in that
section. That way, we could check these symbols to see if there's
already a symbol for it, and if there is, then drop it.


If I'm understanding your suggestion right:
- we now create a new section in each object file: __mcount_loc_weak, 
 and capture such relocations using weak symbols there.
- we then ask the linker to put these separately between, say, 
 __start_mcount_loc_weak and __stop_mcount_loc_weak
- on ftrace init, we go through entries in this range, but discard those 
 that belong to functions that also have an entry between 
 __start_mcount_loc and __stop_mcount_loc.


The primary issue I see here is that the mcount locations within the new 
weak section will end up being offsets from a different function in 
vmlinux, since the linker does not create a symbol for the weak 
functions that were over-ridden.


As an example, in the issue described in this patch set, if powerpc 
starts over-riding kexec_arch_apply_relocations(), then the current weak 
implementation in kexec_file.o gets carried over to the final vmlinux, 
but the instructions will instead appear under the previous function in 
kexec_file.o: crash_prepare_elf64_headers(). This function may or may 
not be traced to begin with, so we won't be able to figure out if this 
is valid or not.
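(As an aside, the "simple check" for a 'bl' mentioned earlier in this thread might look something like the sketch below: a powerpc unconditional branch-and-link has primary opcode 18 with AA=0 and LK=1, so a candidate mcount site can be validated by masking the instruction word. The encodings follow the Power ISA; the helper name is illustrative, not from any posted patch.)

```c
#include <stdint.h>

#define PPC_INST_BL_MASK	0xfc000003u	/* opcode + AA + LK bits */
#define PPC_INST_BL_VAL		0x48000001u	/* opcode 18, AA=0, LK=1 */

/* Return non-zero if the 32-bit instruction word is a 'bl' */
static int is_bl_insn(uint32_t insn)
{
	return (insn & PPC_INST_BL_MASK) == PPC_INST_BL_VAL;
}
```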



- Naveen


Re: [PATCH 09/30] coresight: cpu-debug: Replace mutex with mutex_trylock on panic notifier

2022-04-29 Thread Guilherme G. Piccoli
On 28/04/2022 05:11, Suzuki K Poulose wrote:
> Hi Guilherme,
> [...] 
> How would you like to proceed with queuing this ? I am happy
> either way. In case you plan to push this as part of this
> series (I don't see any potential conflicts) :
> 
> Reviewed-by: Suzuki K Poulose 

Thanks for your review Suzuki, much appreciated!

About your question, I'm not sure yet - in case the core changes take a
while (like if the community finds them contentious, they require many
changes, etc.) I might split this series into 2 parts, the fixes part vs
the improvements per se. Either way, a V2 is going to happen for sure,
and at that moment, I'll let you know what I think is best.

But either way, any choice you prefer is fine by me as well (like if you
want to merge it now or postpone to get merged in the future), this is
not an urgent fix I think =)
Cheers,


Guilherme


Re: [PATCH 20/30] panic: Add the panic informational notifier list

2022-04-29 Thread Guilherme G. Piccoli
Thanks Paul and Suzuki for the ACKs.
Cheers,


Guilherme


Re: [PATCH 21/30] panic: Introduce the panic pre-reboot notifier list

2022-04-29 Thread Guilherme G. Piccoli
On 28/04/2022 13:26, Corey Minyard wrote:
> [...]
> 
> For the IPMI portion:
> 
> Acked-by: Corey Minyard 

Thanks Alex and Corey for the ACKs!

> 
> Note that the IPMI panic_event() should always return, but it may take
> some time, especially if the IPMI controller is no longer functional.
> So the risk of a long delay is there and it makes sense to move it very
> late.
> 

Thanks, I agree - the patch moves it to the (latest - 1) position, since
some arch code might run as the latest and effectively stop the machine.
Cheers,


Guilherme


Re: [PATCH 21/30] panic: Introduce the panic pre-reboot notifier list

2022-04-29 Thread Max Filippov
On Wed, Apr 27, 2022 at 3:55 PM Guilherme G. Piccoli
 wrote:
>
> This patch renames the panic_notifier_list to panic_pre_reboot_list;
> the idea is that a subsequent patch will refactor the panic path
> in order to better split the notifiers, running some of them very
> early, some of them not so early [but still before kmsg_dump()] and
> finally, the rest should execute late, after kdump. The latter ones
> are now in the panic pre-reboot list - the name comes from the idea
> that these notifiers execute before panic() attempts rebooting the
> machine (if that option is set).
>
> We also took the opportunity to clean-up useless header inclusions,
> improve some notifier block declarations (e.g. in ibmasm/heartbeat.c)
> and more important, change some priorities - we hereby set 2 notifiers
> to run late in the list [iss_panic_event() and the IPMI panic_event()]
> due to the risks they offer (may not return, for example).
> Proper documentation is going to be provided in a subsequent patch,
> that effectively refactors the panic path.

[...]

>  arch/xtensa/platforms/iss/setup.c |  4 ++--

For xtensa:
Acked-by: Max Filippov 

-- 
Thanks.
-- Max


RE: [PATCH 02/30] ARM: kexec: Disable IRQs/FIQs also on crash CPUs shutdown path

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> Currently the regular CPU shutdown path for ARM disables IRQs/FIQs
> in the secondary CPUs - smp_send_stop() calls ipi_cpu_stop(), which
> is responsible for that. This makes sense, since we're turning off
> such CPUs, putting them in an endless busy-wait loop.
> 
> Problem is that there is an alternative path for disabling CPUs,
> in the form of function crash_smp_send_stop(), used for kexec/panic
> paths. This functions relies in a SMP call that also triggers a

s/functions relies in/function relies on/

> busy-wait loop [at machine_crash_nonpanic_core()], but *without*
> disabling interrupts. This might lead to odd scenarios, like early
> interrupts in the boot of kexec'd kernel or even interrupts in
> other CPUs while the main one still works in the panic path and
> assumes all secondary CPUs are (really!) off.
> 
> This patch mimics the ipi_cpu_stop() interrupt disable mechanism
> in the crash CPU shutdown path, hence disabling IRQs/FIQs in all
> secondary CPUs in the kexec/panic path as well.
> 
> Cc: Marc Zyngier 
> Cc: Russell King 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  arch/arm/kernel/machine_kexec.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
> index f567032a09c0..ef788ee00519 100644
> --- a/arch/arm/kernel/machine_kexec.c
> +++ b/arch/arm/kernel/machine_kexec.c
> @@ -86,6 +86,9 @@ void machine_crash_nonpanic_core(void *unused)
>   set_cpu_online(smp_processor_id(), false);
>   atomic_dec(&waiting_for_crash_ipi);
> 
> + local_fiq_disable();
> + local_irq_disable();
> +
>   while (1) {
>   cpu_relax();
>   wfe();
> --
> 2.36.0



RE: [PATCH 18/30] notifier: Show function names on notifier routines if DEBUG_NOTIFIERS is set

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> Currently we have a debug infrastructure in the notifiers file, but
> it's very simple/limited. This patch extends it by:
> 
> (a) Showing all registered/unregistered notifiers' callback names;
> 
> (b) Adding a dynamic debug tuning to allow showing called notifiers'
> function names. Notice that this should be guarded as a tunable since
> it can flood the kernel log buffer.
> 
> Cc: Arjan van de Ven 
> Cc: Cong Wang 
> Cc: Sebastian Andrzej Siewior 
> Cc: Valentin Schneider 
> Cc: Xiaoming Ni 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> We have some design decisions that worth discussing here:
> 
> (a) First of call, using C99 helps a lot to write clear and concise code, but

s/call/all/

> due to commit 4d94f910e79a ("Kbuild: use -Wdeclaration-after-statement") we
> have a warning if mixing variable declarations with code. For this patch 
> though,
> doing that makes the code way clear, so decision was to add the debug code
> inside brackets whenever this warning pops up. We can change that, but that'll
> cause more ifdefs in the same function.
> 
> (b) In the symbol lookup helper function, we modify the parameter passed but
> even more, we return it as well! This is unusual and seems unnecessary, but 
> was
> the strategy taken to allow embedding such function in the pr_debug() call.
> 
> Not doing that would likely requiring 3 symbol_name variables to avoid
> concurrency (registering notifier A while calling notifier B) - we rely in
> local variables as a serialization mechanism.
> 
> We're open for suggestions in case this design is not appropriate;
> thanks in advance!
> 
>  kernel/notifier.c | 48 +--
>  1 file changed, 46 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/notifier.c b/kernel/notifier.c
> index ba005ebf4730..21032ebcde57 100644
> --- a/kernel/notifier.c
> +++ b/kernel/notifier.c
> @@ -7,6 +7,22 @@
>  #include 
>  #include 
> 
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> +#include 
> +
> +/*
> + *   Helper to get symbol names in case DEBUG_NOTIFIERS is set.
> + *   Return the modified parameter is a strategy used to achieve
> + *   the pr_debug() functionality - with this, function is only
> + *   executed if the dynamic debug tuning is effectively set.
> + */
> +static inline char *notifier_name(struct notifier_block *nb, char *sym_name)
> +{
> + lookup_symbol_name((unsigned long)(nb->notifier_call), sym_name);
> + return sym_name;
> +}
> +#endif
> +
>  /*
>   *   Notifier list for kernel code which wants to be called
>   *   at shutdown. This is used to stop any idling DMA operations
> @@ -34,20 +50,41 @@ static int notifier_chain_register(struct notifier_block 
> **nl,
>   }
>   n->next = *nl;
>   rcu_assign_pointer(*nl, n);
> +
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> + {
> + char sym_name[KSYM_NAME_LEN];
> +
> + pr_info("notifiers: registered %s()\n",
> + notifier_name(n, sym_name));
> + }
> +#endif
>   return 0;
>  }
> 
>  static int notifier_chain_unregister(struct notifier_block **nl,
>   struct notifier_block *n)
>  {
> + int ret = -ENOENT;
> +
>   while ((*nl) != NULL) {
>   if ((*nl) == n) {
>   rcu_assign_pointer(*nl, n->next);
> - return 0;
> + ret = 0;
> + break;
>   }
>   nl = &((*nl)->next);
>   }
> - return -ENOENT;
> +
> +#ifdef CONFIG_DEBUG_NOTIFIERS
> + if (!ret) {
> + char sym_name[KSYM_NAME_LEN];
> +
> + pr_info("notifiers: unregistered %s()\n",
> + notifier_name(n, sym_name));
> + }
> +#endif
> + return ret;
>  }
> 
>  /**
> @@ -80,6 +117,13 @@ static int notifier_call_chain(struct notifier_block **nl,
>   nb = next_nb;
>   continue;
>   }
> +
> + {
> + char sym_name[KSYM_NAME_LEN];
> +
> + pr_debug("notifiers: calling %s()\n",
> +  notifier_name(nb, sym_name));
> + }
>  #endif
>   ret = nb->notifier_call(nb, val, v);
> 
> --
> 2.36.0



RE: [PATCH 16/30] drivers/hv/vmbus, video/hyperv_fb: Untangle and refactor Hyper-V panic notifiers

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> Currently Hyper-V guests are among the most relevant users of the panic
> infrastructure, like panic notifiers, kmsg dumpers, etc. The reasons lie
> both in cleaning-up procedures (closing a hypervisor <-> guest connection,
> disabling a paravirtualized timer) as well as to data collection (sending
> panic information to the hypervisor) and framebuffer management.
> 
> The thing is: some notifiers are related to others, ordering matters, some
> functionalities are duplicated and there are lots of conditionals behind
> sending panic information to the hypervisor. This patch, as part of an
> effort to clean-up the panic notifiers mechanism and better document
> things, address some of the issues/complexities of Hyper-V panic handling
> through the following changes:
> 
> (a) We have die and panic notifiers on vmbus_drv.c and both have goals of
> sending panic information to the hypervisor, though the panic notifier is
> also responsible for a cleaning-up procedure.
> 
> This commit clears the code by splitting the panic notifier in two, one
> for closing the vmbus connection whereas the other is only for sending
> panic info to hypervisor. With that, it was possible to merge the die and
> panic notifiers in a single/well-documented function, and clear some
> conditional complexities on sending such information to the hypervisor.
> 
> (b) The new panic notifier created after (a) is only doing a single thing:
> cleaning the vmbus connection. This procedure might cause a delay (due to
> hypervisor I/O completion), so we postpone that to run late. But more
> relevant: this *same* vmbus unloading happens in the crash_shutdown()
> handler, so if kdump is set, we can safely skip this panic notifier and
> defer such clean-up to the kexec crash handler.

While the last sentence is true for Hyper-V on x86/x64, it's not true for
Hyper-V on ARM64.  x86/x64 has the 'machine_ops' data structure
with the ability to provide a custom crash_shutdown() function, which
Hyper-V does in the form of hv_machine_crash_shutdown().  But ARM64
has no mechanism to provide such a custom function that will eventually
do the needed vmbus_initiate_unload() before running kdump.

I'm not immediately sure what the best solution is for ARM64.  At this
point, I'm just pointing out the problem and will think about the tradeoffs
for various possible solutions.  Please do the same yourself. :-)

> 
> (c) There is also a Hyper-V framebuffer panic notifier, which relies in
> doing a vmbus operation that demands a valid connection. So, we must
> order this notifier with the panic notifier from vmbus_drv.c, in order to
> guarantee that the framebuffer code executes before the vmbus connection
> is unloaded.

Patch 21 of this set puts the Hyper-V FB panic notifier on the pre_reboot
notifier list, which means it won't execute before the VMbus connection
unload in the case of kdump.   This notifier is making sure that Hyper-V
is notified about the last updates made to the frame buffer before the
panic, so maybe it needs to be put on the hypervisor notifier list.  It
sends a message to Hyper-V over its existing VMbus channel, but it
does not wait for a reply.  It does, however, obtain a spin lock on the
ring buffer used to communicate with Hyper-V.   Unless someone has
a better suggestion, I'm inclined to take the risk of blocking on that
spin lock.

> 
> Also, this commit removes a useless header.
> 
> Although there is code rework and re-ordering, we expect that this change
> has no functional regressions but instead optimize the path and increase
> panic reliability on Hyper-V. This was tested on Hyper-V with success.
> 
> Fixes: 792f232d57ff ("Drivers: hv: vmbus: Fix potential crash on module unload")
> Fixes: 74347a99e73a ("x86/Hyper-V: Unload vmbus channel in hv panic callback")

The "Fixes:" tags imply that these changes should be backported to older
longterm kernel versions, which I don't think is the case.  There is a
dependency on Patch 14 of your series where PANIC_NOTIFIER is
introduced.

> Cc: Andrea Parri (Microsoft) 
> Cc: Dexuan Cui 
> Cc: Haiyang Zhang 
> Cc: "K. Y. Srinivasan" 
> Cc: Michael Kelley 
> Cc: Stephen Hemminger 
> Cc: Tianyu Lan 
> Cc: Wei Liu 
> Tested-by: Fabio A M Martins 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> Special thanks to Michael Kelley for the good information about the Hyper-V
> panic path in email threads some months ago, and to Fabio for the testing
> performed.
> 
> Michael and all Microsoft folks: a careful analysis to double-check our 
> changes
> and assumptions here is really appreciated, this code is complex and 
> intricate,
> it is possible some corner case might have been overlooked.
> 
> Thanks in advance!
> 
>  drivers/hv/vmbus_drv.c  | 109 
>  drivers/video/fbdev/hyperv_fb.c |   8 +++
>  2 files changed, 76 insertions(+), 41 deletions(-)
> 
> diff --git a/drivers/hv/vmbus_d

RE: [PATCH 19/30] panic: Add the panic hypervisor notifier list

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> The goal of this new panic notifier is to allow its users to register
> callbacks to run very early in the panic path. This aims hypervisor/FW
> notification mechanisms as well as simple LED functions, and any other
> simple and safe mechanism that should run early in the panic path; more
> dangerous callbacks should execute later.
> 
> For now, the patch is almost a no-op (although it changes a bit the
> ordering in which some panic notifiers are executed). In a subsequent
> patch, the panic path will be refactored, then the panic hypervisor
> notifiers will effectively run very early in the panic path.
> 
> We also defer documenting it all properly in the subsequent refactor
> patch. While at it, we removed some useless header inclusions and
> fixed some notifiers return too (by using the standard NOTIFY_DONE).
> 
> Cc: Alexander Gordeev 
> Cc: Andrea Parri (Microsoft) 
> Cc: Ard Biesheuvel 
> Cc: Benjamin Herrenschmidt 
> Cc: Brian Norris 
> Cc: Christian Borntraeger 
> Cc: Christophe JAILLET 
> Cc: David Gow 
> Cc: "David S. Miller" 
> Cc: Dexuan Cui 
> Cc: Doug Berger 
> Cc: Evan Green 
> Cc: Florian Fainelli 
> Cc: Haiyang Zhang 
> Cc: Hari Bathini 
> Cc: Heiko Carstens 
> Cc: Julius Werner 
> Cc: Justin Chen 
> Cc: "K. Y. Srinivasan" 
> Cc: Lee Jones 
> Cc: Markus Mayer 
> Cc: Michael Ellerman 
> Cc: Michael Kelley 
> Cc: Mihai Carabas 
> Cc: Nicholas Piggin 
> Cc: Paul Mackerras 
> Cc: Pavel Machek 
> Cc: Scott Branden 
> Cc: Sebastian Reichel 
> Cc: Shile Zhang 
> Cc: Stephen Hemminger 
> Cc: Sven Schnelle 
> Cc: Thomas Bogendoerfer 
> Cc: Tianyu Lan 
> Cc: Vasily Gorbik 
> Cc: Wang ShaoBo 
> Cc: Wei Liu 
> Cc: zhenwei pi 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  arch/mips/sgi-ip22/ip22-reset.c  | 2 +-
>  arch/mips/sgi-ip32/ip32-reset.c  | 3 +--
>  arch/powerpc/kernel/setup-common.c   | 2 +-
>  arch/sparc/kernel/sstate.c   | 3 +--
>  drivers/firmware/google/gsmi.c   | 4 ++--
>  drivers/hv/vmbus_drv.c   | 4 ++--
>  drivers/leds/trigger/ledtrig-activity.c  | 4 ++--
>  drivers/leds/trigger/ledtrig-heartbeat.c | 4 ++--
>  drivers/misc/bcm-vk/bcm_vk_dev.c | 6 +++---
>  drivers/misc/pvpanic/pvpanic.c   | 4 ++--
>  drivers/power/reset/ltc2952-poweroff.c   | 4 ++--
>  drivers/s390/char/zcore.c| 5 +++--
>  drivers/soc/bcm/brcmstb/pm/pm-arm.c  | 2 +-
>  include/linux/panic_notifier.h   | 1 +
>  kernel/panic.c   | 4 
>  15 files changed, 28 insertions(+), 24 deletions(-)

[ snip]

> 
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index f37f12d48001..901b97034308 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -1614,7 +1614,7 @@ static int vmbus_bus_init(void)
>   hv_kmsg_dump_register();
> 
>   register_die_notifier(&hyperv_die_report_block);
> - atomic_notifier_chain_register(&panic_notifier_list,
> + atomic_notifier_chain_register(&panic_hypervisor_list,
>   &hyperv_panic_report_block);
>   }
> 
> @@ -2843,7 +2843,7 @@ static void __exit vmbus_exit(void)
>   if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
>   kmsg_dump_unregister(&hv_kmsg_dumper);
>   unregister_die_notifier(&hyperv_die_report_block);
> - atomic_notifier_chain_unregister(&panic_notifier_list,
> + atomic_notifier_chain_unregister(&panic_hypervisor_list,
>   &hyperv_panic_report_block);
>   }
> 

Using the hypervisor_list here produces a bit of a mismatch.  In many cases
this notifier will do nothing, and will defer to the kmsg_dump() mechanism
to notify the hypervisor about the panic.   Running the kmsg_dump()
mechanism is linked to the info_list, so I'm thinking the Hyper-V panic report
notifier should be on the info_list as well.  That way the reporting behavior
is triggered at the same point in the panic path regardless of which
reporting mechanism is used.





Re: [PATCH 0/2] Fix session topology test for powerpc and add utility function to get cpuinfo entries

2022-04-29 Thread Disha Goel


-Original Message-
From: Athira Rajeev 
To: a...@kernel.org, jo...@kernel.org, disg...@linux.vnet.ibm.com
Cc: m...@ellerman.id.au, linux-perf-us...@vger.kernel.org, 
linuxppc-dev@lists.ozlabs.org, ma...@linux.vnet.ibm.com, 
rnsas...@linux.ibm.com, kj...@linux.ibm.com, irog...@google.com
Subject: [PATCH 0/2] Fix session topology test for powerpc and add
utility function to get cpuinfo entries
Date: Thu, 28 Apr 2022 20:38:27 +0530

The session topology test fails in powerpc pSeries platform. Test logs:
<<>>Session topology : FAILED!<<>>
This test uses cpu topology information and in powerpc, some of the
topology info is restricted in environment like virtualized platform.
Hence this test needs to be skipped in pSeries platform for powerpc. The
information about platform is available in /proc/cpuinfo.

Patch 1 adds a generic utility function in "util/header.c" to read
/proc/cpuinfo for any entry. Though the testcase fix needs the value from
the "platform" entry, making this a generic function to return the value
for any entry from the /proc/cpuinfo file which can be used commonly in
future usecases.

Patch 2 uses the newly added utility function to look for platform and
skip the test in pSeries platform for powerpc.

Athira Rajeev (2):
  tools/perf: Add utility function to read /proc/cpuinfo for any field
  tools/perf/tests: Fix session topology test to skip the test in guest
  environment

Tested the patches on powerpc and powernv, verified perf test session
topology test with the patch set.
Tested-by: Disha Goel <disg...@linux.vnet.ibm.com>

 tools/perf/tests/topology.c | 17 
 tools/perf/util/header.c    | 54 +
 tools/perf/util/header.h    |  1 +
 3 files changed, 72 insertions(+)



RE: [PATCH 24/30] panic: Refactor the panic path

2022-04-29 Thread Michael Kelley (LINUX)
From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
2022 3:49 PM
> 
> The panic() function is somewhat convoluted - a lot of changes were
> made over the years, adding comments that might be misleading/outdated
> now, it has a code structure that is a bit complex to follow, with
> lots of conditionals, for example. The panic notifier list is something
> else - a single list, with multiple callbacks of different purposes,
> that run in a non-deterministic order and may affect hardly kdump
> reliability - see the "crash_kexec_post_notifiers" workaround-ish flag.
> 
> This patch proposes a major refactor on the panic path based on Petr's
> idea [0] - basically we split the notifiers list in three, having a set
> of different call points in the panic path. Below a list of changes
> proposed in this patch, culminating in the panic notifiers level
> concept:
> 
> (a) First of all, we improved comments all over the function
> and removed useless variables / includes. Also, as part of this
> clean-up we concentrate the console flushing functions in a helper.
> 
> (b) As mentioned before, there is a split of the panic notifier list
> in three, based on the purpose of the callback. The code contains
> good documentation in form of comments, but a summary of the three
> lists follows:
> 
> - the hypervisor list aims low-risk procedures to inform hypervisors
> or firmware about the panic event, also includes LED-related functions;
> 
> - the informational list contains callbacks that provide more details,
> like kernel offset or trace dump (if enabled) and also includes the
> callbacks aimed at reducing log pollution or warns, like the RCU and
> hung task disable callbacks;
> 
> - finally, the pre_reboot list is the old notifier list renamed,
> containing the more risky callbacks that didn't fit the previous
> lists. There is also a 4th list (the post_reboot one), but it's not
> related with the original list - it contains late time architecture
> callbacks aimed at stopping the machine, for example.
> 
> The 3 notifiers lists execute in different moments, hypervisor being
> the first, followed by informational and finally the pre_reboot list.
> 
> (c) But then, there is the ordering problem of the notifiers against
> the crash_kernel() call - kdump must be as reliable as possible.
> For that, a simple binary "switch" as "crash_kexec_post_notifiers"
> is not enough, hence we introduce here concept of panic notifier
> levels: there are 5 levels, from 0 (no notifier executes before
> kdump) until 4 (all notifiers run before kdump); the default level
> is 2, in which the hypervisor and (iff we have any kmsg dumper)
> the informational notifiers execute before kdump.
> 
> The detailed documentation of the levels is present in code comments
> and in the kernel-parameters.txt file; as an analogy with the previous
> panic() implementation, the level 0 is exactly the same as the old
> behavior of notifiers, running all after kdump, and the level 4 is
> the same as "crash_kexec_post_notifiers=Y" (we kept this parameter as
> a deprecated one).
> 
> (d) Finally, an important change made here: we now use only the
> function "crash_smp_send_stop()" to shut all the secondary CPUs
> in the panic path. Before, there was a case of using the regular
> "smp_send_stop()", but the better approach is to simplify the
> code and try to use the function which was created exclusively
> for the panic path. Experiments showed that it works fine, and
> code was very simplified with that.
> 
> Functional change is expected from this refactor, since now we
> call some notifiers by default before kdump, but the goal here
> besides code clean-up is to have a better panic path, more
> reliable and deterministic, but also very customizable.
> 
> [0] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/
> 
> Suggested-by: Petr Mladek 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> Special thanks to Petr and Baoquan for the suggestion and feedback in a 
> previous
> email thread. There's some important design decisions that worth mentioning 
> and
> discussing:
> 
> * The default panic notifiers level is 2, based on Petr Mladek's suggestion,
> which makes a lot of sense. Of course, this is customizable through the
> parameter, but would be something worthwhile to have a KConfig option to set
> the default level? It would help distros that want the old behavior
> (no notifiers before kdump) as default.
> 
> * The implementation choice was to _avoid_ intricate if conditionals in the
> panic path, which would _definitely_ be present with the panic notifiers 
> levels
> idea; so, instead of lots of if conditionals, the set/clear bits approach with
> functions called in 2 points (but executing only in one of them) is much 
> easier
> to follow an was used here; the ordering helper function and the comments also
> help a lot to avoid confusion (hopefully).
> 
> * Choice was to *always* use crash_smp_send_stop() instead of sometimes making
> use of the re

Re: [PATCH v2 2/2] ftrace: recordmcount: Handle sections with no non-weak symbols

2022-04-29 Thread Steven Rostedt
On Fri, 29 Apr 2022 23:09:19 +0530
"Naveen N. Rao"  wrote:

> If I'm understanding your suggestion right:
> - we now create a new section in each object file: __mcount_loc_weak, 
>   and capture such relocations using weak symbols there.

Yes, but it would be putting the same information it puts into __mcount_loc
but also add it here too. That is, it will duplicate the data.

> - we then ask the linker to put these separately between, say, 
>   __start_mcount_loc_weak and __stop_mcount_loc_weak

Yes, but it will also go in the location between __start_mcount_loc and
__stop_mcount_loc.

> - on ftrace init, we go through entries in this range, but discard those 
>   that belong to functions that also have an entry between 
>   __start_mcount_loc and __stop_mcount loc.

But we should be able to know if it was overridden or not, by seeing if
there's another function that was called. Or at least, we can validate them
to make sure that they are correct.

> 
> The primary issue I see here is that the mcount locations within the new 
> weak section will end up being offsets from a different function in 
> vmlinux, since the linker does not create a symbol for the weak 
> functions that were over-ridden.

The point of this section is to know which functions in __mcount_loc may
have been overridden, as they would be found in the __mcount_loc_weak
section. And then we can do something "special" to them.

> 
> As an example, in the issue described in this patch set, if powerpc 
> starts over-riding kexec_arch_apply_relocations(), then the current weak 
> implementation in kexec_file.o gets carried over to the final vmlinux, 
> but the instructions will instead appear under the previous function in 
> kexec_file.o: crash_prepare_elf64_headers(). This function may or may 
> not be traced to begin with, so we won't be able to figure out if this 
> is valid or not.

If it was overridden, then there would be two entries for function that
overrides the weak function in the __mcount_loc section, right? One for the
new function, and one that was overridden. Of course this could be more
complex if the new function had been marked notrace.

I was thinking of doing this just so that we know what functions are weak
and perhaps need extra processing.

Another issue with weak functions and ftrace just came up here:

  https://lore.kernel.org/all/20220428095803.66c17...@gandalf.local.home/


-- Steve


Re: [PATCH v2 2/2] ftrace: recordmcount: Handle sections with no non-weak symbols

2022-04-29 Thread Naveen N. Rao

Steven Rostedt wrote:
> On Fri, 29 Apr 2022 23:09:19 +0530
> "Naveen N. Rao"  wrote:
> 
>> If I'm understanding your suggestion right:
>> - we now create a new section in each object file: __mcount_loc_weak, 
>>   and capture such relocations using weak symbols there.
> 
> Yes, but it would be putting the same information it puts into __mcount_loc
> but also add it here too. That is, it will duplicate the data.
> 
>> - we then ask the linker to put these separately between, say, 
>>   __start_mcount_loc_weak and __stop_mcount_loc_weak
> 
> Yes, but it will also go in the location between __start_mcount_loc and
> __stop_mcount_loc.
> 
>> - on ftrace init, we go through entries in this range, but discard those 
>>   that belong to functions that also have an entry between 
>>   __start_mcount_loc and __stop_mcount loc.
> 
> But we should be able to know if it was overridden or not, by seeing if
> there's another function that was called. Or at least, we can validate them
> to make sure that they are correct.
> 
>> The primary issue I see here is that the mcount locations within the new 
>> weak section will end up being offsets from a different function in 
>> vmlinux, since the linker does not create a symbol for the weak 
>> functions that were over-ridden.
> 
> The point of this section is to know which functions in __mcount_loc may
> have been overridden, as they would be found in the __mcount_loc_weak
> section. And then we can do something "special" to them.

I'm not sure I follow that. How are you intending to figure out which 
functions were overridden by looking at entries in the __mcount_loc_weak 
section?

In the final vmlinux image, we only get offsets into .text for all 
mcount locations, but no symbol information. The only hint is the fact 
that a single kallsym symbol has multiple mcount locations within it. 
Even then, the symbol with duplicate mcount entries won't be the 
function that was overridden.

We could do a kallsyms_lookup() on each entry and consult the 
__mcount_loc_weak section to identify duplicates, but that looks to be 
very expensive.

Did you have a different approach in mind?

>> As an example, in the issue described in this patch set, if powerpc 
>> starts over-riding kexec_arch_apply_relocations(), then the current weak 
>> implementation in kexec_file.o gets carried over to the final vmlinux, 
>> but the instructions will instead appear under the previous function in 
>> kexec_file.o: crash_prepare_elf64_headers(). This function may or may 
>> not be traced to begin with, so we won't be able to figure out if this 
>> is valid or not.
> 
> If it was overridden, then there would be two entries for function that
> overrides the weak function in the __mcount_loc section, right? One for the
> new function, and one that was overridden.

In the final vmlinux, we will have two entries: one pointing at the 
correct function, while the other will point to some other function 
name. So, at least from kallsym perspective, duplicate mcount entries 
won't be for the function that was overridden, but some arbitrary 
function that came before the weak function in the object file.

> Of course this could be more
> complex if the new function had been marked notrace.
> 
> I was thinking of doing this just so that we know what functions are weak
> and perhaps need extra processing.
> 
> Another issue with weak functions and ftrace just came up here:
> 
>   https://lore.kernel.org/all/20220428095803.66c17...@gandalf.local.home/

I noticed this just yesterday:

  # cat available_filter_functions | sort | uniq -d | wc -l
  430

I'm fairly certain that some of those are due to weak functions -- I 
just wasn't sure if all of those were.

I suppose this will now also be a problem with ftrace_location(), given 
that it was recently changed to look at an entire function for mcount 
locations?

Thanks,
Naveen


Re: [PATCH v2 2/2] ftrace: recordmcount: Handle sections with no non-weak symbols

2022-04-29 Thread Steven Rostedt
On Sat, 30 Apr 2022 01:03:01 +0530
"Naveen N. Rao"  wrote:

> > The point of this section is to know which functions in __mcount_loc may
> > have been overridden, as they would be found in the __mcount_loc_weak
> > section. And then we can do something "special" to them.  
> 
> I'm not sure I follow that. How are you intending to figure out which 
> functions were overridden by looking at entries in the __mcount_loc_weak 
> section?

If there's duplicates (or something strange with the offset) then we could
delete the one that has the match in the weak section.

> 
> In the final vmlinux image, we only get offsets into .text for all 
> mcount locations, but no symbol information. The only hint is the fact 
> that a single kallsym symbol has multiple mcount locations within it. 
> Even then, the symbol with duplicate mcount entries won't be the 
> function that was overridden.
> 
> We could do a kallsyms_lookup() on each entry and consult the 
> __mcount_loc_weak section to identify duplicates, but that looks to be 
> very expensive.
> 
> Did you have a different approach in mind?

We only need to look at the ones in the weak section. How many that is may
determine if that's feasible or not.

> 
> >   
> >> 
> >> As an example, in the issue described in this patch set, if powerpc 
> >> starts over-riding kexec_arch_apply_relocations(), then the current weak 
> >> implementation in kexec_file.o gets carried over to the final vmlinux, 
> >> but the instructions will instead appear under the previous function in 
> >> kexec_file.o: crash_prepare_elf64_headers(). This function may or may 
> >> not be traced to begin with, so we won't be able to figure out if this 
> >> is valid or not.  
> > 
> > If it was overridden, then there would be two entries for function that
> > overrides the weak function in the __mcount_loc section, right? One for the
> > new function, and one that was overridden.  
> 
> In the final vmlinux, we will have two entries: one pointing at the 
> correct function, while the other will point to some other function 
> name. So, at least from kallsym perspective, duplicate mcount entries 
> won't be for the function that was overridden, but some arbitrary 
> function that came before the weak function in the object file.

Right, and we should be able to find them.

> 
> > Of course this could be more
> > complex if the new function had been marked notrace.
> > 
> > I was thinking of doing this just so that we know what functions are weak
> > and perhaps need extra processing.
> > 
> > Another issue with weak functions and ftrace just came up here:
> > 
> >   https://lore.kernel.org/all/20220428095803.66c17...@gandalf.local.home/  
> 
> I noticed this just yesterday:
> 
>   # cat available_filter_functions | sort | uniq -d | wc -l
>   430
> 
> I'm fairly certain that some of those are due to weak functions -- I 
> just wasn't sure if all of those were.

Probably :-/

> 
> I suppose this will now also be a problem with ftrace_location(), given 
> that it was recently changed to look at an entire function for mcount 
> locations?
> 

Maybe. I have to focus on other things at the moment, but I'll try to find
some time to look at this deeper.

-- Steve


Re: [PATCH 3/3] mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when unmapping

2022-04-29 Thread Gerald Schaefer
On Fri, 29 Apr 2022 16:14:43 +0800
Baolin Wang  wrote:

> On some architectures (like ARM64), it can support CONT-PTE/PMD size
> hugetlb, which means it can support not only PMD/PUD size hugetlb:
> 2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page
> size specified.
> 
> When unmapping a hugetlb page, we will get the relevant page table
> entry by huge_pte_offset() only once to nuke it. This is correct
> for PMD or PUD size hugetlb, since they always contain only one
> pmd entry or pud entry in the page table.
> 
> However this is incorrect for CONT-PTE and CONT-PMD size hugetlb,
> since they can contain several continuous pte or pmd entry with
> same page table attributes, so we will nuke only one pte or pmd
> entry for this CONT-PTE/PMD size hugetlb page.
> 
> And now we only use try_to_unmap() to unmap a poisoned hugetlb page,
> which means now we will unmap only one pte entry for a CONT-PTE or
> CONT-PMD size poisoned hugetlb page, and we can still access other
> subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page,
> which will cause serious issues possibly.
> 
> So we should change to use huge_ptep_clear_flush() to nuke the
> hugetlb page table to fix this issue, which already considered
> CONT-PTE and CONT-PMD size hugetlb.
> 
> Note we've already used set_huge_swap_pte_at() to set a poisoned
> swap entry for a poisoned hugetlb page.
> 
> Signed-off-by: Baolin Wang 
> ---
>  mm/rmap.c | 34 +-
>  1 file changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7cf2408..1e168d7 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1564,28 +1564,28 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   break;
>   }
>   }
> + pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);

Unlike in your patch 2/3, I do not see that this (huge) pteval would later
be used again with set_huge_pte_at() instead of set_pte_at(). Not sure if
this (huge) pteval could end up at a set_pte_at() later, but if yes, then
this would be broken on s390, and you'd need to use set_huge_pte_at()
instead of set_pte_at() like in your patch 2/3.

Please note that huge_ptep_get functions do not return valid PTEs on s390,
and such PTEs must never be set directly with set_pte_at(), but only with
set_huge_pte_at().

Background is that, for hugetlb pages, we are of course not really dealing
with PTEs at this level, but rather PMDs or PUDs, depending on hugetlb size.
On s390, the layout is quite different for PTEs and PMDs / PUDs, and
unfortunately the hugetlb code is not properly reflecting this by using
PMD or PUD types, like the THP code does.

So, as work-around, on s390, the huge_ptep_xxx functions will return
only fake PTEs, which must be converted again to a proper PMD or PUD,
before writing them to the page table, which is what happens in
set_huge_pte_at(), but not in set_pte_at().


Re: [PATCH 24/30] panic: Refactor the panic path

2022-04-29 Thread Guilherme G. Piccoli
On 29/04/2022 14:53, Michael Kelley (LINUX) wrote:
> From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
> 2022 3:49 PM
>> [...]
>> +panic_notifiers_level=
>> +[KNL] Set the panic notifiers execution order.
>> +Format: 
>> +We currently have 4 lists of panic notifiers; based
>> +on the functionality and risk (for panic success) the
>> +callbacks are added in a given list. The lists are:
>> +- hypervisor/FW notification list (low risk);
>> +- informational list (low/medium risk);
>> +- pre_reboot list (higher risk);
>> +- post_reboot list (only run late in panic and after
>> +kdump, not configurable for now).
>> +This parameter defines the ordering of the first 3
>> +lists with regards to kdump; the levels determine
>> +which set of notifiers execute before kdump. The
>> +accepted levels are:
>> +0: kdump is the first thing to run, NO list is
>> +executed before kdump.
>> +1: only the hypervisor list is executed before kdump.
>> +2 (default level): the hypervisor list and (*if*
>> +there's any kmsg_dumper defined) the informational
>> +list are executed before kdump.
>> +3: both the hypervisor and the informational lists
>> +(always) execute before kdump.
> 
> I'm not clear on why level 2 exists.  What is the scenario where
> execution of the info list before kdump should be conditional on the
> existence of a kmsg_dumper?   Maybe the scenario is described
> somewhere in the patch set and I just missed it.
> 

Hi Michael, thanks for your review/consideration. So, this idea started
kind of some time ago. It all started with a need of exposing more
information on kernel log *before* kdump and *before* pstore -
specifically, we're talking about panic_print. But this cause some
reactions, Baoquan was very concerned with that [0]. Soon after, I've
proposed a panic notifiers filter (orthogonal) approach, to which Petr
suggested instead doing a major refactor [1] - it finally is alive in
the form of this series.

The theory behind the level 2 is to allow a scenario of kdump with the
minimum amount of notifiers - what is the point in printing more
information if the user doesn't care, since it's going to kdump? Now, if
there is a kmsg dumper, it means that there is likely some interest in
collecting information, and that might as well be required before the
potential kdump (which is my case, hence the proposal on [0]).

Instead of forcing one of the two behaviors (level 1 or level 3), we
have a middle-term/compromise: if there's interest in collecting such
data (in the form of a kmsg dumper), we then execute the informational
notifiers before kdump. If not, why to increase (even slightly) the risk
for kdump?

I'm OK in removing the level 2 if people prefer, but I don't feel it's a
burden, quite opposite - seems a good way to accommodate the somewhat
antagonistic ideas (jump to kdump ASAP vs collecting more info in the
panicked kernel log).

[0] https://lore.kernel.org/lkml/20220126052246.GC2086@MiWiFi-R3L-srv/

[1] https://lore.kernel.org/lkml/YfPxvzSzDLjO5ldp@alley/


>[...]
>> + * Based on the level configured (smaller than 4), we clear the
>> + * proper bits in "panic_notifiers_bits". Notice that this bitfield
>> + * is initialized with all notifiers set.
>> + */
>> +switch (panic_notifiers_level) {
>> +case 3:
>> +clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> +break;
>> +case 2:
>> +clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> +
>> +if (!kmsg_has_dumpers())
>> +clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
>> +break;
>> +case 1:
>> +clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> +clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
>> +break;
>> +case 0:
>> +clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
>> +clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
>> +clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
>> +break;
>> +}
> 
> I think the above switch statement could be done as follows:
> 
> if (panic_notifiers_level <= 3)
>   clear_bit(PN_PRE_REBOOT_BIT, &panic_notifiers_bits);
> if (panic_notifiers_level <= 2)
>   if (!kmsg_has_dumpers())
>   clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> if (panic_notifiers_level <= 1)
>   clear_bit(PN_INFO_BIT, &panic_notifiers_bits);
> if (panic_notifiers_level == 0)
>   clear_bit(PN_HYPERVISOR_BIT, &panic_notifiers_bits);
> 
> That's about half the lines of code.  It's

Re: [PATCH 02/30] ARM: kexec: Disable IRQs/FIQs also on crash CPUs shutdown path

2022-04-29 Thread Guilherme G. Piccoli
Thanks Marc and Michael for the review/discussion.

On 29/04/2022 15:20, Marc Zyngier wrote:
> [...]

> My expectations would be that, since we're getting here using an IPI,
> interrupts are already masked. So what reenabled them the first place?
> 
> Thanks,
> 
>   M.
> 

Marc, I did some investigation in the code (and tried/failed in the ARM
documentation as well heh), but this is still not 100% clear to me.

You're saying IPI calls disable IRQs/FIQs by default in the target
CPUs? Where does it happen? I'm a bit confused if this is a processor
mechanism, or it's in code.

Looking at smp_send_stop() in arch/arm/, it does IPI the CPUs with the
flag IPI_CPU_STOP, eventually calling ipi_cpu_stop(), and the latter
does disable IRQs/FIQs in code - that's where I stole my code from.

But crash_smp_send_stop() is different, it seems to IPI the other CPUs
with the flag IPI_CALL_FUNC, which leads to calling
generic_smp_call_function_interrupt() - does it disable interrupts/FIQs
as well? I couldn't find it.

Appreciate your clarifications about that, thanks again.
Cheers,


Guilherme


Re: serial hang in qemu-system-ppc64 -M pseries

2022-04-29 Thread Fabiano Rosas
Rob Landley  writes:

> On 4/27/22 10:27, Thomas Huth wrote:
>> On 26/04/2022 12.26, Rob Landley wrote:
>>> When I cut and paste 80-ish characters of text into the Linux serial 
>>> console, it
>>> reads 16 characters and stops. When I hit space, it reads another 16 
>>> characters,
>>> and if I keep at it will eventually catch up without losing data. If I type,
>>> every character shows up immediately.
>> 
>> That "16" certainly comes from VTERM_BUFSIZE in hw/char/spapr_vty.c in the 
>> QEMU sources, I think.
>> 
>>> (On other qemu targets and kernels I can cut and paste an entire uuencoded
>>> binary and it goes through just fine in one go, but this target hangs with 
>>> big
>>> pastes until I hit keys.)
>>> 
>>> Is this a qemu-side bug, or a kernel-side bug?
>>> 
>>> Kernel config attached (linux 5.18-rc3 or thereabouts), qemu invocation is:
>>> 
>>> qemu-system-ppc64 -M pseries -vga none -nographic -no-reboot -m 256 -kernel
>>> vmlinux -initrd powerpc64leroot.cpio.gz -append "panic=1 HOST=powerpc64le
>>> console=hvc0"
>> 
>> Which version of QEMU are you using?
>
> $ qemu-system-ppc64 --version
> QEMU emulator version 6.2.92 (v6.2.0-rc2)
> Copyright (c) 2003-2021 Fabrice Bellard and the QEMU Project developers
>
> From november. I can pull and rebuild but it'll take a bit. (Hopefully
> rebuilding would fix the need to echo -e '\e[?7h' afterwards to undo the "bash
> command line history marches up the screen because qemu's x86 bios disabled 
> line
> wrap and then left it that way" issue...)
>
>> Which frontend (GTK or terminal?) ... 
>
> The above command line has -nographic, forcing terminal. Running ldd on the
> binary doesn't pull up anything gtk. (It pulls up libncursesw though.)
>
> If you want to reproduce my test locally:
>
> wget 
> https://landley.net/toybox/downloads/binaries/mkroot/0.8.5/powerpc64le.tgz
> tar xvzf powerpc64le.tgz
> cd powerpc64le
> ./qemu-*.sh
>
> Then paste something longer than 16 characters at the eventual command prompt
> once the kernel finishes booting.

I suspect this is due to how the tty driver (n_tty.c) interacts with
the console (hvc_console.c) driver's buffer size.

This is the stack:

#0 hvc_push  <-- calls into KVM/QEMU to write up to 16 bytes
#1 hvc_write
#2 tty_put_char
#3 do_output_char
#4 __process_echoes  <-- writes up to tty_write_room() chars
#5 flush_echoes  <-- returns early if ~ECHO && ~ECHONL
#6 n_tty_receive_buf_common  <-- buffers more than 16 bytes
#7 tty_ldisc_receive_buf
#8 tty_port_default_receive_buf
#9 receive_buf
#10 process_one_work

In my busybox instance which does not have this issue I can see that
termios.c_lflag = 0x447, so in the stack above #4 is not called (ECHO
is 0x8, ECHONL is 0x10).

In the bug scenario: termios.c_lflag = 0x5cf, so we go into #4 which
is supposed to write (say) 17 bytes, but ends up writing only 16
because that is what tty_write_room() returns.

What I think is causing this issue is that the hvc_vio.c driver is
configured with hp->outbuf_size = 16. That number comes from the
H_PUT_TERM_CHAR hypercall spec which reads two registers at a
time. However, the hvc_write function supports writes larger than 16
bytes so it seems we're needlessly propagating the 16 byte limitation
to the upper layer.

The driver is also not buffering the write, so if it gets called to
write one char at a time (like __process_echoes does) there should be no
limit to how much it can write.

I think if we increase hp->outbuf_size to a value that is larger than
N_TTY_BUF_SIZE=4096 the echo buffer would drain before reaching this new
hvc driver limit.

I tested that and it seems to work, but I'm not sure if it's the right
fix, there are some things I couldn't figure out:

   i) if a driver actually runs out of buffer space, what
  __process_echoes should do about it;

  ii) why the bug sometimes happens only at the 32-character boundary
  (or other multiples of 16);

 iii) why the ECHO flag differs from the working case.

> If you want to reproduce it all from source:
>
> git clone https://github.com/landley/toybox
> cd toybox && mkdir ccc && cd ccc
> wget
> https://landley.net/toybox/downloads/binaries/toolchains/latest/powerpc64le-linux-musl-cross.tar.xz
> -O - | tar xv
> cd ..
> CROSS=powerpc64le LINUX=~/linux scripts/mkroot.sh
> cd root/powerpc64le
> ./qemu-*.sh
>
> This assumes your linux kernel source directory is in ~/linux of course, and
> that qemu-system-ppc64 is in the $PATH...
>
>> this rings a distant bell, but I thought we had fixed these issues long ago 
>> in the past... e.g.:
>>
>> https://yhbt.net/lore/all/1380113886-16845-10-git-send-email-mdr...@linux.vnet.ibm.com/
>> 
>> https://git.qemu.org/?p=qemu.git;a=commitdiff;h=8a273cbe53221d28
>
> The qemu I'm running is newer than 2016. :)
>
> Most targets are fine with this: I cut and paste entire uuencoded binaries 
> into
> the serial console as an easy way to insert a file into an initramfs. It can
> usually take multiple megabytes without

Re: [PATCH 02/30] ARM: kexec: Disable IRQs/FIQs also on crash CPUs shutdown path

2022-04-29 Thread Russell King (Oracle)
On Fri, Apr 29, 2022 at 06:38:19PM -0300, Guilherme G. Piccoli wrote:
> Thanks Marc and Michael for the review/discussion.
> 
> On 29/04/2022 15:20, Marc Zyngier wrote:
> > [...]
> 
> > My expectations would be that, since we're getting here using an IPI,
> > interrupts are already masked. So what reenabled them the first place?
> > 
> > Thanks,
> > 
> > M.
> > 
> 
> Marc, I did some investigation in the code (and tried/failed in the ARM
> documentation as well heh), but this is still not 100% clear for me.
> 
> You're saying IPI calls disable IRQs/FIQs by default in the target
> CPUs? Where does it happen? I'm a bit confused if this is a processor
> mechanism, or it's in code.

When we take an IRQ, IRQs will be masked, FIQs will not. IPIs are
themselves interrupts, so IRQs will be masked while the IPI is being
processed. Therefore, there should be no need to re-disable the
already disabled interrupts.

> But crash_smp_send_stop() is different, it seems to IPI the other CPUs
> with the flag IPI_CALL_FUNC, which leads to calling
> generic_smp_call_function_interrupt() - does it disable interrupts/FIQs
> as well? I couldn't find it.

It's buried in the architecture behaviour. When the CPU takes an
interrupt and jumps to the interrupt vector in the vectors page, it is
architecturally defined that interrupts will be disabled. If they
weren't architecturally disabled at this point, then as soon as the
first instruction is processed (at the interrupt vector, likely a
branch) the CPU would immediately take another jump to the interrupt
vector, and this process would continue indefinitely, making interrupt
handling utterly useless.

So, you won't find an explicit instruction in the code path from the
vectors to the IPI handler that disables interrupts - because it's
written into the architecture that this is what must happen.

IRQs are a lower priority than FIQs, so FIQs remain unmasked.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [PATCH 02/30] ARM: kexec: Disable IRQs/FIQs also on crash CPUs shutdown path

2022-04-29 Thread Guilherme G. Piccoli
On 29/04/2022 18:45, Russell King (Oracle) wrote:
> [...]
>> Marc, I did some investigation in the code (and tried/failed in the ARM
>> documentation as well heh), but this is still not 100% clear for me.
>>
>> You're saying IPI calls disable IRQs/FIQs by default in the target
>> CPUs? Where does it happen? I'm a bit confused if this is a processor
>> mechanism, or it's in code.
> 
> When we take an IRQ, IRQs will be masked, FIQs will not. IPIs are
> themselves interrupts, so IRQs will be masked while the IPI is being
> processed. Therefore, there should be no need to re-disable the
> already disabled interrupts.
> 
>> But crash_smp_send_stop() is different, it seems to IPI the other CPUs
>> with the flag IPI_CALL_FUNC, which leads to calling
>> generic_smp_call_function_interrupt() - does it disable interrupts/FIQs
>> as well? I couldn't find it.
> 
> It's buried in the architecture behaviour. When the CPU takes an
> interrupt and jumps to the interrupt vector in the vectors page, it is
> architecturally defined that interrupts will be disabled. If they
> weren't architecturally disabled at this point, then as soon as the
> first instruction is processed (at the interrupt vector, likely a
> branch) the CPU would immediately take another jump to the interrupt
> vector, and this process would continue indefinitely, making interrupt
> handling utterly useless.
> 
> So, you won't find an explicit instruction in the code path from the
> vectors to the IPI handler that disables interrupts - because it's
> written into the architecture that this is what must happen.
> 
> IRQs are a lower priority than FIQs, so FIQs remain unmasked.
> 

Thanks a lot for the *great* explanation Russell, much appreciated.
So, this leads to the both following questions:

a) Shall we then change the patch to only disable FIQs, since it's the
panic path and we don't want secondary CPUs getting interrupted, but
only spinning quietly "forever"?

b) How about cleaning up ipi_cpu_stop() then, by dropping the call to
local_irq_disable() there, to avoid the double IRQ disabling?

Thanks,


Guilherme


Re: [PATCH 02/30] ARM: kexec: Disable IRQs/FIQs also on crash CPUs shutdown path

2022-04-29 Thread Marc Zyngier
On Fri, 29 Apr 2022 22:45:14 +0100,
"Russell King (Oracle)"  wrote:
> 
> On Fri, Apr 29, 2022 at 06:38:19PM -0300, Guilherme G. Piccoli wrote:
> > Thanks Marc and Michael for the review/discussion.
> > 
> > On 29/04/2022 15:20, Marc Zyngier wrote:
> > > [...]
> > 
> > > My expectations would be that, since we're getting here using an IPI,
> > > interrupts are already masked. So what reenabled them the first place?
> > > 
> > > Thanks,
> > > 
> > >   M.
> > > 
> > 
> > Marc, I did some investigation in the code (and tried/failed in the ARM
> > documentation as well heh), but this is still not 100% clear for me.
> > 
> > You're saying IPI calls disable IRQs/FIQs by default in the target
> > CPUs? Where does it happen? I'm a bit confused if this is a processor
> > mechanism, or it's in code.
> 
> When we take an IRQ, IRQs will be masked, FIQs will not. IPIs are
> themselves interrupts, so IRQs will be masked while the IPI is being
> processed. Therefore, there should be no need to re-disable the
> already disabled interrupts.
> 
> > But crash_smp_send_stop() is different, it seems to IPI the other CPUs
> > with the flag IPI_CALL_FUNC, which leads to calling
> > generic_smp_call_function_interrupt() - does it disable interrupts/FIQs
> > as well? I couldn't find it.
> 
> It's buried in the architecture behaviour. When the CPU takes an
> interrupt and jumps to the interrupt vector in the vectors page, it is
> architecturally defined that interrupts will be disabled. If they
> weren't architecturally disabled at this point, then as soon as the
> first instruction is processed (at the interrupt vector, likely a
> branch) the CPU would immediately take another jump to the interrupt
> vector, and this process would continue indefinitely, making interrupt
> handling utterly useless.
> 
> So, you won't find an explicit instruction in the code path from the
> vectors to the IPI handler that disables interrupts - because it's
> written into the architecture that this is what must happen.
> 
> IRQs are a lower priority than FIQs, so FIQs remain unmasked.

Ah, you're of course right. That's one of the huge differences between
AArch32 and AArch64, where the former has per target mode masking
rules, and the latter masks everything on entry...

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 19/30] panic: Add the panic hypervisor notifier list

2022-04-29 Thread Guilherme G. Piccoli
On 29/04/2022 14:30, Michael Kelley (LINUX) wrote:
> From: Guilherme G. Piccoli  Sent: Wednesday, April 27, 
> 2022 3:49 PM
>> [...]
>>
>> @@ -2843,7 +2843,7 @@ static void __exit vmbus_exit(void)
>>  if (ms_hyperv.misc_features & HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE) {
>>  kmsg_dump_unregister(&hv_kmsg_dumper);
>>  unregister_die_notifier(&hyperv_die_report_block);
>> -atomic_notifier_chain_unregister(&panic_notifier_list,
>> +atomic_notifier_chain_unregister(&panic_hypervisor_list,
>>  &hyperv_panic_report_block);
>>  }
>>
> 
> Using the hypervisor_list here produces a bit of a mismatch.  In many cases
> this notifier will do nothing, and will defer to the kmsg_dump() mechanism
> to notify the hypervisor about the panic.   Running the kmsg_dump()
> mechanism is linked to the info_list, so I'm thinking the Hyper-V panic report
> notifier should be on the info_list as well.  That way the reporting behavior
> is triggered at the same point in the panic path regardless of which
> reporting mechanism is used.
> 

Hi Michael, thanks for your feedback! I agree that your idea could work,
but...there is one downside: imagine the kmsg_dump() approach is not set
in some Hyper-V guest; then we would rely on the regular notification
mechanism [hv_die_panic_notify_crash()], right?
But... you then want to run this notifier in the informational list,
which...won't execute *by default* before kdump if no kmsg_dump() is
set. So, this logic is convoluted when you mix it with the default level
concept + kdump.

May I suggest something? If possible, take a run with this patch set +
DEBUG_NOTIFIERS=y, in *both* cases (with and without the kmsg_dump()
set). I did that and they run almost at the same time...I've checked the
notifiers called, it's like almost nothing runs in-between.

I feel the panic notification mechanism really does fit a
hypervisor list - it's a good match with the nature of the list, which
aims at informing the hypervisor/FW of the panic.
Of course we can modify it if you prefer...but please take into account
the kdump case and how it complicates the logic.

Let me know your considerations, in case you can experiment with the
patch set as-is.
Cheers,


Guilherme


Re: [PATCH 02/30] ARM: kexec: Disable IRQs/FIQs also on crash CPUs shutdown path

2022-04-29 Thread Marc Zyngier
On Wed, 27 Apr 2022 23:48:56 +0100,
"Guilherme G. Piccoli"  wrote:
> 
> Currently the regular CPU shutdown path for ARM disables IRQs/FIQs
> in the secondary CPUs - smp_send_stop() calls ipi_cpu_stop(), which
> is responsible for that. This makes sense, since we're turning off
> such CPUs, putting them in an endless busy-wait loop.
> 
> Problem is that there is an alternative path for disabling CPUs,
> in the form of function crash_smp_send_stop(), used for kexec/panic
> paths. This function relies on an SMP call that also triggers a
> busy-wait loop [at machine_crash_nonpanic_core()], but *without*
> disabling interrupts. This might lead to odd scenarios, like early
> interrupts in the boot of the kexec'd kernel, or even interrupts in
> other CPUs while the main one still works in the panic path and
> assumes all secondary CPUs are (really!) off.
> 
> This patch mimics the ipi_cpu_stop() interrupt disable mechanism
> in the crash CPU shutdown path, hence disabling IRQs/FIQs in all
> secondary CPUs in the kexec/panic path as well.
> 
> Cc: Marc Zyngier 
> Cc: Russell King 
> Signed-off-by: Guilherme G. Piccoli 
> ---
>  arch/arm/kernel/machine_kexec.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm/kernel/machine_kexec.c b/arch/arm/kernel/machine_kexec.c
> index f567032a09c0..ef788ee00519 100644
> --- a/arch/arm/kernel/machine_kexec.c
> +++ b/arch/arm/kernel/machine_kexec.c
> @@ -86,6 +86,9 @@ void machine_crash_nonpanic_core(void *unused)
>   set_cpu_online(smp_processor_id(), false);
>   atomic_dec(&waiting_for_crash_ipi);
>  
> + local_fiq_disable();
> + local_irq_disable();
> +

My expectations would be that, since we're getting here using an IPI,
interrupts are already masked. So what reenabled them the first place?

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH 13/30] s390/consoles: Improve panic notifiers reliability

2022-04-29 Thread Heiko Carstens
On Wed, Apr 27, 2022 at 07:49:07PM -0300, Guilherme G. Piccoli wrote:
> Currently many console drivers for s390 rely on panic/reboot notifiers
> to invoke callbacks on these events. The panic() function disables local
> IRQs, secondary CPUs and preemption, so callbacks invoked on panic are
> effectively running in atomic context.
> 
> It happens that most of these console callbacks from s390 don't take the
> proper care with regards to atomic context, like taking spinlocks that
> might be taken in another function/CPU and hence will cause a lockup
> situation.
> 
> The goal for this patch is to improve the notifiers reliability, acting
> on 4 console drivers, as detailed below:
> 
> (1) con3215: changed a regular spinlock to the trylock alternative.
> 
> (2) con3270: also changed a regular spinlock to its trylock counterpart,
> but here we also have another problem: raw3270_activate_view() takes a
> different spinlock. So, we added a helper to validate if this other lock
> is safe to acquire, and if so, raw3270_activate_view() should be safe.
> 
> Notice though that there is a functional change here: it's now possible
> to continue the notifier code [reaching con3270_wait_write() and
> con3270_rebuild_update()] without executing raw3270_activate_view().
> 
> (3) sclp: a global lock is used heavily in the functions called from
> the notifier, so we added a check here - if the lock is taken already,
> we just bail-out, preventing the lockup.
> 
> (4) sclp_vt220: same as (3), a lock validation was added to prevent the
> potential lockup problem.
> 
> Besides (1)-(4), we also removed useless void functions, adding the
> code called from the notifier inside its own body, and changed the
> priority of such notifiers to execute late, since they are "heavyweight"
> for the panic environment, so we aim to reduce risks here.
> Changed return values to NOTIFY_DONE as well, the standard one.
> 
> Cc: Alexander Gordeev 
> Cc: Christian Borntraeger 
> Cc: Heiko Carstens 
> Cc: Sven Schnelle 
> Cc: Vasily Gorbik 
> Signed-off-by: Guilherme G. Piccoli 
> ---
> 
> As a design choice, the option used here to verify whether a given
> spinlock is taken was the function "spin_is_locked()" - but we noticed
> that it is not often used.
> An alternative would be to take the lock with spin_trylock() and, if it
> succeeds, just release the spinlock and continue the code. But that
> seemed weird...
> 
> Also, we'd like to ask for a good validation of the potential
> functionality change in case (2) from the s390 console experts - far
> from an expert here, and in our naive code observation, that seems
> fine, but that analysis might be missing some corner case.
> 
> Thanks in advance!
> 
>  drivers/s390/char/con3215.c| 36 +++--
>  drivers/s390/char/con3270.c| 34 +++
>  drivers/s390/char/raw3270.c| 18 +++
>  drivers/s390/char/raw3270.h|  1 +
>  drivers/s390/char/sclp_con.c   | 28 +--
>  drivers/s390/char/sclp_vt220.c | 42 +++---
>  6 files changed, 96 insertions(+), 63 deletions(-)

Code looks good, and everything still seems to work. I applied this
internally for the time being, and if it passes testing, I'll schedule
it for the next merge window.

Thanks!


Re: [PATCH 13/30] s390/consoles: Improve panic notifiers reliability

2022-04-29 Thread Guilherme G. Piccoli
On 29/04/2022 15:46, Heiko Carstens wrote:
> [...]
> 
> Code looks good, and everything still seems to work. I applied this
> internally for the time being, and if it passes testing, I'll schedule
> it for the next merge window.
> 
> Thanks!

Perfect Heiko, thanks a bunch for your review and tests!
Let me know if anything breaks heh
Cheers,


Guilherme


Re: [PATCH 21/30] panic: Introduce the panic pre-reboot notifier list

2022-04-29 Thread Guilherme G. Piccoli
On 29/04/2022 13:04, Max Filippov wrote:
> [...]
>>  arch/xtensa/platforms/iss/setup.c |  4 ++--
> 
> For xtensa:
> Acked-by: Max Filippov 
> 

Perfect, thanks Max!
Cheers,


Guilherme


Re: [PATCH 18/30] notifier: Show function names on notifier routines if DEBUG_NOTIFIERS is set

2022-04-29 Thread Guilherme G. Piccoli
On 27/04/2022 22:01, Xiaoming Ni wrote:
> [...]
> Duplicate Code.
> 
> Is it better to use __func__ and %pS?
> 
> pr_info("%s: %pS\n", __func__, n->notifier_call);
> 
> 

This is a great suggestion Xiaoming, much appreciated!
I feel like I was reinventing the wheel here - with your idea, the code
got super clear and concise, very nice suggestion!!

The only two things that diverge from your idea: I'm using '%ps' (not
showing offsets), and I also kept the wording "(un)registered/calling",
not using __func__ - I feel it's a bit odd in the output.
OK for you?

I'm definitely using your idea in V2 heh
Cheers,


Guilherme


Re: [PATCH 16/30] drivers/hv/vmbus, video/hyperv_fb: Untangle and refactor Hyper-V panic notifiers

2022-04-29 Thread Guilherme G. Piccoli
Hi Michael, first of all thanks for the great review, much appreciated.
Some comments inline below:

On 29/04/2022 14:16, Michael Kelley (LINUX) wrote:
> [...]
>> hypervisor I/O completion), so we postpone that to run late. But more
>> relevant: this *same* vmbus unloading happens in the crash_shutdown()
>> handler, so if kdump is set, we can safely skip this panic notifier and
>> defer such clean-up to the kexec crash handler.
> 
> While the last sentence is true for Hyper-V on x86/x64, it's not true for
> Hyper-V on ARM64.  x86/x64 has the 'machine_ops' data structure
> with the ability to provide a custom crash_shutdown() function, which
> Hyper-V does in the form of hv_machine_crash_shutdown().  But ARM64
> has no mechanism to provide such a custom function that will eventually
> do the needed vmbus_initiate_unload() before running kdump.
> 
> I'm not immediately sure what the best solution is for ARM64.  At this
> point, I'm just pointing out the problem and will think about the tradeoffs
> for various possible solutions.  Please do the same yourself. :-)
> 

Oh, you're totally right! I just assumed ARM64 would be the same, my
bad. Just to propose some alternatives, so you/others can also discuss
here and we can reach a consensus about the trade-offs:

(a) We could forget about this change, and always do the clean-up here,
not relying on machine_crash_shutdown().
Pro: really simple, behaves the same as it is doing currently.
Con: less elegant/concise, doesn't allow arm64 customization.

(b) Add a way to allow ARM64 customization of shutdown crash handler.
Pro: matches x86, more customizable, improves arm64 arch code.
Con: A tad more complex.

Also, a question that came up: if ARM64 has no way of calling a special
crash shutdown handler, how can you execute hv_stimer_cleanup() and
hv_synic_disable_regs() there? Or are they not required on ARM64?


>>
>> (c) There is also a Hyper-V framebuffer panic notifier, which relies on
>> doing a vmbus operation that demands a valid connection. So, we must
>> order this notifier with the panic notifier from vmbus_drv.c, in order to
>> guarantee that the framebuffer code executes before the vmbus connection
>> is unloaded.
> 
> Patch 21 of this set puts the Hyper-V FB panic notifier on the pre_reboot
> notifier list, which means it won't execute before the VMbus connection
> unload in the case of kdump.   This notifier is making sure that Hyper-V
> is notified about the last updates made to the frame buffer before the
> panic, so maybe it needs to be put on the hypervisor notifier list.  It
> sends a message to Hyper-V over its existing VMbus channel, but it
> does not wait for a reply.  It does, however, obtain a spin lock on the
> ring buffer used to communicate with Hyper-V.   Unless someone has
> a better suggestion, I'm inclined to take the risk of blocking on that
> spin lock.

The logic behind that was: when kdump is set, we'd skip the vmbus
disconnect in the notifiers, deferring that to crash_shutdown() - a
logic refuted in the above discussion on ARM64 (one more Pro argument
for the idea of refactoring aarch64 code to allow a custom crash
shutdown handler heh). But you're right: for the default level 2, we
skip the pre_reboot notifiers on kdump, effectively skipping this
notifier.

Some ideas of what we can do here:

I) we could change the framebuffer notifier to rely on trylocks, instead
of risking a lockup scenario, and with that, we can execute it before
the vmbus disconnect in the hypervisor list;

II) we ignore the hypervisor notifier in case of kdump _by default_, and
if the users don't want that, they can always set the panic notifier
level to 4 and run all notifiers prior to kdump; would that be terrible,
you think? Kdump users might not care about the framebuffer...

III) we go with approach (b) above and refactor arm64 code to allow the
custom crash handler on kdump time, then [with point (I) above] the
logic proposed in this series is still valid - seems more and more the
most correct/complete solution.

In any case, I guess we should avoid workarounds if possible and do the
things the best way we can, to encompass all (or almost all) the
possible scenarios and don't force things on users (like enforcing panic
notifier level 4 for Hyper-V or something like this...)

More feedback from you / Hyper-V folks is pretty welcome about this.


> 
>> [...]
> The "Fixes:" tags imply that these changes should be backported to older
> longterm kernel versions, which I don't think is the case.  There is a
> dependency on Patch 14 of your series where PANIC_NOTIFIER is
> introduced.
> 

Oh, this was more related to kernel archeology. When I'm
investigating stuff, I really want to understand why code was added, and
that usually requires some time git-blaming stuff, so having that pronto
in the commit message is a bonus.

But of course we don't need to use the Fixes tag for that, easy to only
mention it in the text. A secondary benefit by using this tag

Re: [PATCH 3/3] mm: rmap: Fix CONT-PTE/PMD size hugetlb issue when unmapping

2022-04-29 Thread Baolin Wang

On 4/30/2022 4:02 AM, Gerald Schaefer wrote:

On Fri, 29 Apr 2022 16:14:43 +0800
Baolin Wang  wrote:


On some architectures (like ARM64), it can support CONT-PTE/PMD size
hugetlb, which means it can support not only PMD/PUD size hugetlb:
2M and 1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page
size size is specified.

When unmapping a hugetlb page, we will get the relevant page table
entry by huge_pte_offset() only once to nuke it. This is correct
for PMD or PUD size hugetlb, since they always contain only one
pmd entry or pud entry in the page table.

However this is incorrect for CONT-PTE and CONT-PMD size hugetlb,
since they can contain several contiguous pte or pmd entries with the
same page table attributes, so we will nuke only one pte or pmd
entry for this CONT-PTE/PMD size hugetlb page.

And now we only use try_to_unmap() to unmap a poisoned hugetlb page,
which means now we will unmap only one pte entry for a CONT-PTE or
CONT-PMD size poisoned hugetlb page, and we can still access other
subpages of a CONT-PTE or CONT-PMD size poisoned hugetlb page,
which may cause serious issues.

So we should change to use huge_ptep_clear_flush() to nuke the
hugetlb page table to fix this issue, which already considered
CONT-PTE and CONT-PMD size hugetlb.

Note we've already used set_huge_swap_pte_at() to set a poisoned
swap entry for a poisoned hugetlb page.

Signed-off-by: Baolin Wang 
---
  mm/rmap.c | 34 +-
  1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 7cf2408..1e168d7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1564,28 +1564,28 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				break;
 			}
 		}
+		pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);


Unlike in your patch 2/3, I do not see that this (huge) pteval would later
be used again with set_huge_pte_at() instead of set_pte_at(). Not sure if
this (huge) pteval could end up at a set_pte_at() later, but if yes, then
this would be broken on s390, and you'd need to use set_huge_pte_at()
instead of set_pte_at() like in your patch 2/3.


IIUC, as I said in the commit message, we will only unmap a poisoned 
hugetlb page by try_to_unmap(), and the poisoned hugetlb page will be 
remapped with a poisoned entry by set_huge_swap_pte_at() in 
try_to_unmap_one(). So I think there is no need to change to use 
set_huge_pte_at() instead of set_pte_at() for other cases, since the 
hugetlb page will not hit those other cases.


if (PageHWPoison(subpage) && !(flags & TTU_IGNORE_HWPOISON)) {
	pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
	if (folio_test_hugetlb(folio)) {
		hugetlb_count_sub(folio_nr_pages(folio), mm);
		set_huge_swap_pte_at(mm, address, pvmw.pte, pteval,
				     vma_mmu_pagesize(vma));
	} else {
		dec_mm_counter(mm, mm_counter(&folio->page));
		set_pte_at(mm, address, pvmw.pte, pteval);
	}
}



Please note that huge_ptep_get functions do not return valid PTEs on s390,
and such PTEs must never be set directly with set_pte_at(), but only with
set_huge_pte_at().

Background is that, for hugetlb pages, we are of course not really dealing
with PTEs at this level, but rather PMDs or PUDs, depending on hugetlb size.
On s390, the layout is quite different for PTEs and PMDs / PUDs, and
unfortunately the hugetlb code is not properly reflecting this by using
PMD or PUD types, like the THP code does.

So, as work-around, on s390, the huge_ptep_xxx functions will return
only fake PTEs, which must be converted again to a proper PMD or PUD,
before writing them to the page table, which is what happens in
set_huge_pte_at(), but not in set_pte_at().


Thanks for your explanation. As I said above, I think we've already 
handled the hugetlb case with set_huge_swap_pte_at() in try_to_unmap_one().


Re: [PATCH V12 00/20] riscv: Add COMPAT mode support for 64BIT

2022-04-29 Thread Guo Ren
On Fri, Apr 29, 2022 at 2:22 AM Palmer Dabbelt  wrote:
>
> On Thu, 28 Apr 2022 05:25:19 PDT (-0700), guo...@kernel.org wrote:
> > Hi Palmer,
> >
> > I see you have taken v12 into your riscv-compat branch and added
> > asm/signal32.h. Do you need me help put compat_sigcontext &
> > compat_ucontext & compat_rt_sigframe into signal32.h? And could we
> > rename signal32.h to compat_signal.h to match compat_signal.c?
> >
> > In the end, thx for taking care of compat patch series.
>
> No problem.  I was just trying to get something clean through all the
> autobuilders before making it look good, I think it didn't fail this
> time so I'll do a bit more refactoring.  Shouldn't be too much longer at
> this point.
Cool, thank you. I hope it can make it into v5.19.

>
> >
> >
> > On Tue, Apr 5, 2022 at 3:13 PM  wrote:
> >>
> >> From: Guo Ren 
> >>
> >> Currently, most 64-bit architectures (x86, parisc, powerpc, arm64,
> >> s390, mips, sparc) already support COMPAT mode, but they all have
> >> historical issues and can't use the standard Linux unistd.h. RISC-V
> >> would be the first standard __SYSCALL_COMPAT user of
> >> include/uapi/asm-generic/unistd.h.
> >>
> >> The patchset is based on v5.18-rc1. You can compare rv64-compat 
> >> v.s. rv32-native in qemu with following steps:
> >>
> >>  - Prepare rv32 rootfs & fw_jump.bin by buildroot.org
> >>$ git clone git://git.busybox.net/buildroot
> >>$ cd buildroot
> >>$ make qemu_riscv32_virt_defconfig O=qemu_riscv32_virt_defconfig
> >>$ make -C qemu_riscv32_virt_defconfig
> >>$ make qemu_riscv64_virt_defconfig O=qemu_riscv64_virt_defconfig
> >>$ make -C qemu_riscv64_virt_defconfig
> >>(Got fw_jump.bin & rootfs.ext2 in qemu_riscvXX_virt_defconfig/images)
> >>
> >>  - Prepare Linux rv32 & rv64 Image
> >>$ git clone g...@github.com:c-sky/csky-linux.git -b riscv_compat_v12 
> >> linux
> >>$ cd linux
> >>$ echo "CONFIG_STRICT_KERNEL_RWX=n" >> arch/riscv/configs/defconfig
> >>$ echo "CONFIG_STRICT_MODULE_RWX=n" >> arch/riscv/configs/defconfig
> >>$ make ARCH=riscv CROSS_COMPILE=riscv32-buildroot-linux-gnu- 
> >> O=../build-rv32/ rv32_defconfig
> >>$ make ARCH=riscv CROSS_COMPILE=riscv32-buildroot-linux-gnu- 
> >> O=../build-rv32/ Image
> >>$ make ARCH=riscv CROSS_COMPILE=riscv64-buildroot-linux-gnu- 
> >> O=../build-rv64/ defconfig
> >>$ make ARCH=riscv CROSS_COMPILE=riscv64-buildroot-linux-gnu- 
> >> O=../build-rv64/ Image
> >>
> >>  - Prepare Qemu:
> >>$ git clone https://gitlab.com/qemu-project/qemu.git -b master linux
> >>$ cd qemu
> >>$ ./configure --target-list="riscv64-softmmu riscv32-softmmu"
> >>$ make
> >>
> >> Now let's compare rv64-compat with rv32-native memory footprint with 
> >> almost the same
> >> defconfig, rootfs, opensbi in one qemu.
> >>
> >>  - Run rv64 with rv32 rootfs in compat mode:
> >>$ ./build/qemu-system-riscv64 -cpu rv64 -M virt -m 64m -nographic -bios 
> >> qemu_riscv64_virt_defconfig/images/fw_jump.bin -kernel build-rv64/Image 
> >> -drive file=qemu_riscv32_virt_defconfig/images/rootfs.ext2,format=raw,id=hd0 -device 
> >> virtio-blk-device,drive=hd0 -append "rootwait root=/dev/vda ro 
> >> console=ttyS0 earlycon=sbi" -netdev user,id=net0 -device 
> >> virtio-net-device,netdev=net0
> >>
> >> QEMU emulator version 6.2.50 (v6.2.0-29-g196d7182c8)
> >> OpenSBI v0.9
> >> [0.00] Linux version 5.16.0-rc6-00017-g750f87086bdd-dirty 
> >> (guoren@guoren-Z87-HD3) (riscv64-unknown-linux-gnu-gcc (GCC) 10.2.0, GNU 
> >> ld (GNU Binutils) 2.37) #96 SMP Tue Dec 28 21:01:55 CST 2021
> >> [0.00] OF: fdt: Ignoring memory range 0x8000 - 0x8020
> >> [0.00] Machine model: riscv-virtio,qemu
> >> [0.00] earlycon: sbi0 at I/O port 0x0 (options '')
> >> [0.00] printk: bootconsole [sbi0] enabled
> >> [0.00] efi: UEFI not found.
> >> [0.00] Zone ranges:
> >> [0.00]   DMA32[mem 0x8020-0x83ff]
> >> [0.00]   Normal   empty
> >> [0.00] Movable zone start for each node
> >> [0.00] Early memory node ranges
> >> [0.00]   node   0: [mem 0x8020-0x83ff]
> >> [0.00] Initmem setup node 0 [mem 
> >> 0x8020-0x83ff]
> >> [0.00] SBI specification v0.2 detected
> >> [0.00] SBI implementation ID=0x1 Version=0x9
> >> [0.00] SBI TIME extension detected
> >> [0.00] SBI IPI extension detected
> >> [0.00] SBI RFENCE extension detected
> >> [0.00] SBI v0.2 HSM extension detected
> >> [0.00] riscv: ISA extensions acdfhimsu
> >> [0.00] riscv: ELF capabilities acdfim
> >> [0.00] percpu: Embedded 17 pages/cpu s30696 r8192 d30744 u69632
> >> [0.00] Built 1 zonelists, mobility grouping on.  Total pages: 15655
> >> [0.00] Kernel command line: rootwait root=/dev/vda ro 
> >> console=ttyS0 earlycon=sbi
> >> [0.00] Dentry cache hash table entries: 8192 (order: 4