Re: [PATCH v4 5/7] kexec_file, ricv: print out debugging message if required

2023-12-19 Thread Baoquan He
On 12/19/23 at 02:44pm, Conor Dooley wrote:
> On Wed, Dec 13, 2023 at 01:57:45PM +0800, Baoquan He wrote:
> > Then, when '-d' is specified for the kexec_file_load interface, the loaded
> > locations of kernel/initrd/cmdline etc. can be printed out to help debugging.
> > 
> > Here, replace pr_debug() with the newly added kexec_dprintk() in the
> > kexec_file loading related code.
> > 
> > Also replace pr_notice() with kexec_dprintk() in elf_kexec_load(), because
> > the loaded locations of the purgatory and device tree are only printed for
> > debugging; it doesn't make sense to always print them.
> > 
> > Also remove kexec_image_info(), because its content is already printed
> > out in generic code.
> > 
> > Signed-off-by: Baoquan He 
> 
> I'm sorry - I meant to look at this several days ago but I forgot.
> Apart from the typo that crept back into $subject, this version explains
> the rationale behind what you're changing a lot better, thanks.

Thanks for the careful checking. I forgot the typo fix you pointed
out when reviewing v3.

Hi Andrew,

Could you help fix the typo in subject?

[PATCH v4 5/7] kexec_file, ricv: print out debugging message if required
   ~~~ s/ricv/riscv/



Re: [PATCH 09/13] mm/gup: Cache *pudp in follow_pud_mask()

2023-12-19 Thread Peter Xu
On Tue, Dec 19, 2023 at 11:28:54AM -0500, James Houghton wrote:
> On Tue, Dec 19, 2023 at 2:57 AM  wrote:
> >
> > From: Peter Xu 
> >
> > Introduce "pud_t pud" in the function, so the code won't dereference *pudp
> > multiple time.  Not only because that looks less straightforward, but also
> > because if the dereference really happened, it's not clear whether there
> > can be race to see different *pudp values if it's being modified at the
> > same time.
> >
> > Signed-off-by: Peter Xu 
> > ---
> >  mm/gup.c | 17 +
> >  1 file changed, 9 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 6c0d82fa8cc7..97e87b7a15c3 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -753,26 +753,27 @@ static struct page *follow_pud_mask(struct 
> > vm_area_struct *vma,
> > unsigned int flags,
> > struct follow_page_context *ctx)
> >  {
> > -   pud_t *pud;
> > +   pud_t *pudp, pud;
> > spinlock_t *ptl;
> > struct page *page;
> > struct mm_struct *mm = vma->vm_mm;
> >
> > -   pud = pud_offset(p4dp, address);
> > -   if (pud_none(*pud))
> > +   pudp = pud_offset(p4dp, address);
> > +   pud = *pudp;
> 
> I think you might want a READ_ONCE() on this so that the compiler
> doesn't actually read the pud multiple times.

Makes sense.  I probably only did the "split" part which Christoph
requested, without thinking further than that. :)
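
For reference, a minimal sketch of what that would look like in
follow_pud_mask() (context taken from this patch, not the final code):

	pudp = pud_offset(p4dp, address);
	/* Snapshot the entry once; READ_ONCE() keeps the compiler from
	 * re-reading *pudp when "pud" is used later. */
	pud = READ_ONCE(*pudp);
	if (pud_none(pud))
		return no_page_table(vma, flags, address);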

> 
> > +   if (pud_none(pud))
> > return no_page_table(vma, flags, address);
> > -   if (pud_devmap(*pud)) {
> > -   ptl = pud_lock(mm, pud);
> > -   page = follow_devmap_pud(vma, address, pud, flags, 
> > >pgmap);
> > +   if (pud_devmap(pud)) {
> > +   ptl = pud_lock(mm, pudp);
> > +   page = follow_devmap_pud(vma, address, pudp, flags, 
> > >pgmap);
> > spin_unlock(ptl);
> > if (page)
> > return page;
> > return no_page_table(vma, flags, address);
> > }
> > -   if (unlikely(pud_bad(*pud)))
> > +   if (unlikely(pud_bad(pud)))
> > return no_page_table(vma, flags, address);
> 
> Not your change, but reading this, it's not clear to me that
> `pud_present(*pudp)` (and non-leaf) would necessarily be true at this
> point -- like, I would prefer to see `!pud_present(pud)` instead of
> `pud_bad()`. Thank you for adding that in the next patch. :)

I think the assumption here is that it is expected to be a directory entry
when we reach here, and for a valid directory entry pud_present() should
always return true (a side note: pud_present() may not mean "the PRESENT bit
is set"; see m68k's implementation, for example).

Yeah I added that in the next patch, my intention was to check
!pud_present() for all cases without the need to take pgtable lock, though.
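
For context, what the next patch in the series ends up doing in
follow_pud_mask() is roughly the following, checking the present bit before
taking the pgtable lock (a sketch; see patch 10/13 for the real code):

	pud = *pudp;
	if (pud_none(pud) || !pud_present(pud))
		return no_page_table(vma, flags, address);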

> 
> Feel free to add:
> 
> Acked-by: James Houghton 

Thanks,

-- 
Peter Xu



Re: [PATCH 09/13] mm/gup: Cache *pudp in follow_pud_mask()

2023-12-19 Thread James Houghton
On Tue, Dec 19, 2023 at 2:57 AM  wrote:
>
> From: Peter Xu 
>
> Introduce "pud_t pud" in the function, so the code won't dereference *pudp
> multiple time.  Not only because that looks less straightforward, but also
> because if the dereference really happened, it's not clear whether there
> can be race to see different *pudp values if it's being modified at the
> same time.
>
> Signed-off-by: Peter Xu 
> ---
>  mm/gup.c | 17 +
>  1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 6c0d82fa8cc7..97e87b7a15c3 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -753,26 +753,27 @@ static struct page *follow_pud_mask(struct 
> vm_area_struct *vma,
> unsigned int flags,
> struct follow_page_context *ctx)
>  {
> -   pud_t *pud;
> +   pud_t *pudp, pud;
> spinlock_t *ptl;
> struct page *page;
> struct mm_struct *mm = vma->vm_mm;
>
> -   pud = pud_offset(p4dp, address);
> -   if (pud_none(*pud))
> +   pudp = pud_offset(p4dp, address);
> +   pud = *pudp;

I think you might want a READ_ONCE() on this so that the compiler
doesn't actually read the pud multiple times.

> +   if (pud_none(pud))
> return no_page_table(vma, flags, address);
> -   if (pud_devmap(*pud)) {
> -   ptl = pud_lock(mm, pud);
> -   page = follow_devmap_pud(vma, address, pud, flags, 
> >pgmap);
> +   if (pud_devmap(pud)) {
> +   ptl = pud_lock(mm, pudp);
> +   page = follow_devmap_pud(vma, address, pudp, flags, 
> >pgmap);
> spin_unlock(ptl);
> if (page)
> return page;
> return no_page_table(vma, flags, address);
> }
> -   if (unlikely(pud_bad(*pud)))
> +   if (unlikely(pud_bad(pud)))
> return no_page_table(vma, flags, address);

Not your change, but reading this, it's not clear to me that
`pud_present(*pudp)` (and non-leaf) would necessarily be true at this
point -- like, I would prefer to see `!pud_present(pud)` instead of
`pud_bad()`. Thank you for adding that in the next patch. :)

Feel free to add:

Acked-by: James Houghton 


Re: [PATCH v2] i2c: cpm: Remove linux,i2c-index conversion from be32

2023-12-19 Thread Wolfram Sang
On Wed, Dec 06, 2023 at 11:24:03PM +0100, Christophe Leroy wrote:
> sparse reports an error on some data that gets converted from be32.
> 
> That's because that data is typed u32 instead of __be32.
> 
> The type is correct, the be32_to_cpu() conversion is not.
> 
> Remove the conversion.
> 
> Reported-by: kernel test robot 
> Closes: 
> https://lore.kernel.org/oe-kbuild-all/202312042210.ql4da8av-...@intel.com/
> Signed-off-by: Christophe Leroy 

Applied to for-next, thanks!



signature.asc
Description: PGP signature


Re: [PATCH v2] ASoC: fsl_sai: Fix channel swap issue on i.MX8MP

2023-12-19 Thread Mark Brown
On Tue, 19 Dec 2023 10:30:57 +0800, Shengjiu Wang wrote:
> When the mclk_with_tere and mclk_direction_output flags are enabled,
> the SAI transmitter or receiver is enabled at a very early stage.
> If FSL_SAI_xMR was set by a previous case (for example, the previous
> case used one channel and the current case uses two channels), the
> current case starts with a wrong xMR and a channel swap happens.
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl_sai: Fix channel swap issue on i.MX8MP
  commit: 8f0f01647550daf9cd8752c1656dcb0136d79ce1

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark



Re: [PATCH v4 5/7] kexec_file, ricv: print out debugging message if required

2023-12-19 Thread Conor Dooley
On Wed, Dec 13, 2023 at 01:57:45PM +0800, Baoquan He wrote:
> Then, when '-d' is specified for the kexec_file_load interface, the loaded
> locations of kernel/initrd/cmdline etc. can be printed out to help debugging.
> 
> Here, replace pr_debug() with the newly added kexec_dprintk() in the
> kexec_file loading related code.
> 
> Also replace pr_notice() with kexec_dprintk() in elf_kexec_load(), because
> the loaded locations of the purgatory and device tree are only printed for
> debugging; it doesn't make sense to always print them.
> 
> Also remove kexec_image_info(), because its content is already printed
> out in generic code.
> 
> Signed-off-by: Baoquan He 

I'm sorry - I meant to look at this several days ago but I forgot.
Apart from the typo that crept back into $subject, this version explains
the rationale behind what you're changing a lot better, thanks.

Reviewed-by: Conor Dooley 

Cheers,
Conor.


signature.asc
Description: PGP signature


Re: [kvm-unit-tests PATCH v5 29/29] powerpc: Add timebase tests

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

This has a known failure on QEMU TCG machines where the decrementer
interrupt is not lowered when the DEC wraps from -ve to +ve.

Signed-off-by: Nicholas Piggin 
---
  lib/powerpc/asm/ppc_asm.h   |   1 +
  lib/powerpc/asm/processor.h |  22 +++
  powerpc/Makefile.common |   1 +
  powerpc/smp.c   |  22 ---
  powerpc/timebase.c  | 328 
  powerpc/unittests.cfg   |   8 +
  6 files changed, 360 insertions(+), 22 deletions(-)
  create mode 100644 powerpc/timebase.c

...

diff --git a/powerpc/timebase.c b/powerpc/timebase.c
new file mode 100644
index ..4d80ea09
--- /dev/null
+++ b/powerpc/timebase.c
@@ -0,0 +1,328 @@
+/*
+ * Test Timebase
+ *
+ * Copyright 2017  Thomas Huth, Red Hat Inc.


No, not really. Please update ;-)

 Thomas



Re: [kvm-unit-tests PATCH v5 25/29] powerpc: Add rtas stop-self support

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

In preparation for improved SMP support, add stop-self support to the
harness. This is non-trivial because it requires an unlocked rtas
call: a CPU can't be holding a spin lock when it goes offline or it
will deadlock other CPUs. rtas permits stop-self to be called without
serialising all other rtas operations.

Signed-off-by: Nicholas Piggin 
---
  lib/powerpc/asm/rtas.h |  2 ++
  lib/powerpc/rtas.c | 78 +-
  2 files changed, 64 insertions(+), 16 deletions(-)

...

+void rtas_stop_self(void)
+{
+   struct rtas_args args;
+   uint32_t token;
+   int ret;
+
+   ret = rtas_token("stop-self", &token);
+   if (ret) {
+   puts("RTAS stop-self not available\n");
+   return;
+   }
+
+   ret = rtas_call_unlocked(&args, token, 0, 1, NULL);
+   printf("RTAS stop-self returnd %d\n", ret);


s/returnd/returned/


+}


With the typo fixed:

Reviewed-by: Thomas Huth 



Re: [kvm-unit-tests PATCH v5 24/29] powerpc: interrupt tests

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

Add basic testing of various kinds of interrupts, machine check,
page fault, illegal, decrementer, trace, syscall, etc.

This has a known failure on QEMU TCG pseries machines where MSR[ME]
can be incorrectly set to 0.

Signed-off-by: Nicholas Piggin 
---
  lib/powerpc/asm/ppc_asm.h |  21 +-
  powerpc/Makefile.common   |   3 +-
  powerpc/interrupts.c  | 422 ++
  powerpc/unittests.cfg |   3 +
  4 files changed, 445 insertions(+), 4 deletions(-)
  create mode 100644 powerpc/interrupts.c

diff --git a/lib/powerpc/asm/ppc_asm.h b/lib/powerpc/asm/ppc_asm.h
index ef2d91dd..778e78ee 100644
--- a/lib/powerpc/asm/ppc_asm.h
+++ b/lib/powerpc/asm/ppc_asm.h
@@ -35,17 +35,32 @@
  
  #endif /* __BYTE_ORDER__ */
  
+#define SPR_DSISR	0x012

+#define SPR_DAR    0x013
+#define SPR_DEC    0x016
+#define SPR_SRR0   0x01A
+#define SPR_SRR1   0x01B
+#define SPR_FSCR   0x099
+#define   FSCR_PREFIX  0x2000
+#define SPR_HDEC   0x136
  #define SPR_HSRR0 0x13A
  #define SPR_HSRR1 0x13B
+#define SPR_LPCR   0x13E
+#define   LPCR_HDICE   0x1UL
+#define SPR_HEIR   0x153
+#define SPR_SIAR   0x31C
  
  /* Machine State Register definitions: */

  #define MSR_LE_BIT 0
  #define MSR_EE_BIT 15  /* External Interrupts Enable */
  #define MSR_HV_BIT 60  /* Hypervisor mode */
  #define MSR_SF_BIT 63  /* 64-bit mode */
-#define MSR_ME 0x1000ULL
  
-#define SPR_HSRR0	0x13A

-#define SPR_HSRR1  0x13B
+#define MSR_DR 0x0010ULL
+#define MSR_IR 0x0020ULL
+#define MSR_BE 0x0200ULL   /* Branch Trace Enable */
+#define MSR_SE 0x0400ULL   /* Single Step Enable */
+#define MSR_EE 0x8000ULL
+#define MSR_ME 0x1000ULL
  
  #endif /* _ASMPOWERPC_PPC_ASM_H */

diff --git a/powerpc/Makefile.common b/powerpc/Makefile.common
index a7af225b..b340a53b 100644
--- a/powerpc/Makefile.common
+++ b/powerpc/Makefile.common
@@ -11,7 +11,8 @@ tests-common = \
$(TEST_DIR)/rtas.elf \
$(TEST_DIR)/emulator.elf \
$(TEST_DIR)/tm.elf \
-   $(TEST_DIR)/sprs.elf
+   $(TEST_DIR)/sprs.elf \
+   $(TEST_DIR)/interrupts.elf
  
  tests-all = $(tests-common) $(tests)

  all: directories $(TEST_DIR)/boot_rom.bin $(tests-all)
diff --git a/powerpc/interrupts.c b/powerpc/interrupts.c
new file mode 100644
index ..3217b15e
--- /dev/null
+++ b/powerpc/interrupts.c
@@ -0,0 +1,422 @@
+/*
+ * Test interrupts
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define SPR_LPCR   0x13E
+#define LPCR_HDICE 0x1UL
+#define SPR_DEC    0x016
+#define SPR_HDEC   0x136
+
+#define MSR_DR 0x0010ULL
+#define MSR_IR 0x0020ULL
+#define MSR_EE 0x8000ULL
+#define MSR_ME 0x1000ULL


Why don't you use the definitions from ppc_asm.h above?


+static bool cpu_has_heir(void)
+{
+   uint32_t pvr = mfspr(287);  /* Processor Version Register */
+
+   if (!machine_is_powernv())
+   return false;
+
+   /* POWER6 has HEIR, but QEMU powernv support does not go that far */
+   switch (pvr >> 16) {
+   case 0x4b:  /* POWER8E */
+   case 0x4c:  /* POWER8NVL */
+   case 0x4d:  /* POWER8 */
+   case 0x4e:  /* POWER9 */
+   case 0x80:  /* POWER10 */


I'd suggest to introduce some #defines for those PVR values instead of using 
magic numbers all over the place?
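
Something along these lines, for example (the PVR_VER_* names here are only
illustrative, not taken from the tree):

	/* Hypothetical PVR version defines, e.g. in a shared header */
	#define PVR_VER_POWER8E		0x004b
	#define PVR_VER_POWER8NVL	0x004c
	#define PVR_VER_POWER8		0x004d
	#define PVR_VER_POWER9		0x004e
	#define PVR_VER_POWER10		0x0080

	switch (pvr >> 16) {
	case PVR_VER_POWER8E:
	case PVR_VER_POWER8NVL:
	case PVR_VER_POWER8:
	case PVR_VER_POWER9:
	case PVR_VER_POWER10:
		return true;
	default:
		return false;
	}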



+   return true;
+   default:
+   return false;
+   }
+}
+
+static bool cpu_has_prefix(void)
+{
+   uint32_t pvr = mfspr(287);  /* Processor Version Register */
+   switch (pvr >> 16) {
+   case 0x80:  /* POWER10 */
+   return true;
+   default:
+   return false;
+   }
+}
+
+static bool cpu_has_lev_in_srr1(void)
+{
+   uint32_t pvr = mfspr(287);  /* Processor Version Register */
+   switch (pvr >> 16) {
+   case 0x80:  /* POWER10 */
+   return true;
+   default:
+   return false;
+   }
+}
+
+static bool regs_is_prefix(volatile struct pt_regs *regs)
+{
+   return (regs->msr >> (63-34)) & 1;


You introduced a bunch of new #define MSR_xx statements ... why not for this 
one, too?
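
I.e. something like this (the SRR1_PREFIXED name is just an example, pick
whatever matches the existing convention):

	/* SRR1 bit 34 (IBM numbering) indicates the interrupted instruction
	 * was prefixed, which is what regs_is_prefix() tests for. */
	#define SRR1_PREFIXED	(1ULL << (63 - 34))

	static bool regs_is_prefix(volatile struct pt_regs *regs)
	{
		return regs->msr & SRR1_PREFIXED;
	}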



+}
+
+static void regs_advance_insn(struct pt_regs *regs)
+{
+   if (regs_is_prefix(regs))
+   regs->nip += 8;
+   else
+   regs->nip += 4;
+}
+
+static volatile bool got_interrupt;
+static volatile struct pt_regs recorded_regs;
+
+static void mce_handler(struct pt_regs *regs, void *opaque)
+{
+   got_interrupt = true;
+   

Re: powerpc: several early boot regressions on MPC52xx

2023-12-19 Thread Matthias Schiffer
On Mon, 2023-12-18 at 19:48 +, Christophe Leroy wrote:
> 
> Hi Matthias,
> 
> Le 18/12/2023 à 14:48, Matthias Schiffer a écrit :
> > Hi all,
> > 
> > I'm currently in the process of porting our ancient TQM5200 SoM to a modern 
> > kernel, and I've
> > identified a number of regressions that cause early boot failures (before 
> > the UART console has been
> > initialized) with arch/powerpc/configs/52xx/tqm5200_defconfig.
> 
> "modern" kernel ==> which version ?

Hi Christophe,

I was testing with torvalds/master as of yesterday, and bisected everything 
from 4.14 to identify
the commits related to the issues. For my current project, 6.1.y or 6.6.y will 
likely be our kernel
of choice, but I'd also like to get mainline to a working state again if 
possible.

> 
> > 
> > Issue 1) Boot fails with CONFIG_PPC_KUAP enabled (enabled by default since 
> > 9f5bd8f1471d7
> > "powerpc/32s: Activate KUAP and KUEP by default"). The reason is a number 
> > of of_iomap() calls in
> > arch/powerpc/platforms/52xx that should be early_ioremap().
> 
> Can you give more details and what leads you to that conclusion ?
> 
> There should be no relation between KUAP and of_iomap()/early_ioremap(). 
> Problem is likely somewhere else.

You are entirely right, the warnings about early_ioremap() were a red herring.
I can't reproduce any difference in boot behavior anymore that I thought I was
seeing when changing the of_iomap() to early_ioremap(). I assume I got confused
by testing too many variables at once (kernel version + 2 Kconfig settings).


> 
> > 
> > I can fix this up easy enough for mpc5200_simple by changing 
> > mpc5200_setup_xlb_arbiter() to use
> > early_ioremap() and moving mpc52xx_map_common_devices() from the setup_arch 
> > to the init hook (one
> > side effect is that mpc52xx_restart() only works a bit later, as it 
> > requires the mpc52xx_wdt mapping
> > from mpc52xx_map_common_devices(); I'm not sure if there is a better 
> > solution).
> > 
> > For the other 52xx platforms (efika, lite5200, media5200) things are a bit 
> > more chaotic, and they
> > create several more temporary mappings from setup_arch. Either they should 
> > all be moved to the init
> > hook as well, or be converted to early_ioremap(), but I can't tell which is 
> > more appropriate. As a
> > first step, I would propose a patch that fixes this for the simple 
> > platforms and leaves the other
> > ones unchanged.
> > 
> > (Side note: I also found that before 16132529cee58 ("powerpc/32s: Rework 
> > Kernel Userspace Access
> > Protection"), boot would succeed even with KUAP enabled without changing 
> > the incorrect of_iomap(); I
> > guess the old implementation was more lenient about the incorrect calls 
> > that the kernel warns
> > about?)
> 
> Interesting.
> Again, there shouldn't be any impact of those incorrect calls. They are
> correct calls, it is just a historical method that we want to get rid
> of one day.
> Could you then provide the dmesg of what/how it works here ? And then 
> I'd also be interested in a dump of /sys/kernel/debug/kernel_page_tables 
> and /sys/kernel/debug/powerpc/block_address_translation
> and /sys/kernel/debug/powerpc/segment_registers
> 
> For that you'll need CONFIG_PTDUMP_DEBUGFS

As it turns out, whatever issue existed with KUAP at the time when it was 
changed to enabled by
default for 32s (which was in 5.14) has been resolved in current mainline. 
Current torvalds/master
boots fine with KUAP enabled, and only CONFIG_STRICT_KERNEL_RWX breaks the boot.

> 
> > 
> > Issue 2) Boot fails with CONFIG_STRICT_KERNEL_RWX enabled, which is also 
> > the default nowadays.
> > 
> > I have not found the cause of this boot failure yet; is there any way to 
> > debug this if the failure
> > happens before the UART is available and I currently don't have JTAG for 
> > this hardware?
> 
> Shouldn't happen before UART is available, strict enforcement is 
> performed by mark_readonly() and free_initmem() in the middle of 
> kernel_init(). UART should be ON long before that.
> 
> So it must be something in the setup that collides with CONFIG_KUAP and 
> CONFIG_STRICT_KERNEL_RWX.
> 
> Could you send dmesg of when it works (ie without 
> CONFIG_KUAP/CONFIG_STRICT_KERNEL_RWX) and when it doesn't work if you 
> get some initial stuff ?

Here's the UART output of a working boot (CONFIG_STRICT_KERNEL_RWX disabled; I 
have slightly
extended tqm5200.dts to enable UART output of the cuImage wrapper):

Memory <- <0x0 0x800> (128MB)
CPU clock-frequency <- 0x179a7b00 (396MHz)

Re: [kvm-unit-tests PATCH v5 22/29] powerpc: Fix emulator illegal instruction test for powernv

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

Illegal instructions cause 0xe40 (HEAI) interrupts rather
than program interrupts.

Signed-off-by: Nicholas Piggin 
---
  powerpc/emulator.c | 19 ++-
  1 file changed, 18 insertions(+), 1 deletion(-)


Acked-by: Thomas Huth 



Re: [PATCHv3 RESEND 00/10] ASoC: fsl-asoc-card: compatibility integration of a generic codec use case for use with S/PDIF controller

2023-12-19 Thread Elinor Montmasson
On Monday, 18 December, 2023 14:54:03, Daniel Baluta wrote
>
> > * fsl-asoc-card.txt currently follows the old dt-bindings format. 
> > Should we update it to DT schema format in this patch series 
> > before adding my new properties? 
> 
> I know this is extra-work but we would greatly appreciate if you first 
> convert fsl-asoc-card.txt 
> to yml format and then add your new properties. 

I will take some time next week to do the conversion, then I'll send
it in a v4 patch series.

Best regards,
Elinor Montmasson


Re: [kvm-unit-tests PATCH v5 20/29] scripts: Accommodate powerpc powernv machine differences

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

The QEMU powerpc powernv machine has minor differences that must be
accommodated in output parsing:

- Summary parsing must search more lines of output for the summary
   line, to accommodate the OPAL message on shutdown.
- Premature failure testing must tolerate case differences in the kernel
   load error message.

Signed-off-by: Nicholas Piggin 
---
  scripts/runtime.bash | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)


Acked-by: Thomas Huth 



Re: [PATCH v2 00/10] Don't let i2c adapters declare I2C_CLASS_SPD support if they support I2C_CLASS_HWMON

2023-12-19 Thread Wolfram Sang

> This series and my other series are sitting idle in patchwork
> for 3 weeks now. AFAICS they have the needed ack's.
> Anything missing before they can be applied?

Applied to for-next, thanks!



signature.asc
Description: PGP signature


Re: [PATCH] powerpc/64s: Increase default stack size to 32KB

2023-12-19 Thread Michael Ellerman
Michael Ellerman  writes:
> There are reports of kernels crashing due to stack overflow while
> running OpenShift (Kubernetes). The primary contributor to the stack
> usage seems to be openvswitch, which is used by OVN-Kubernetes (based on
> OVN (Open Virtual Network)), but NFS also contributes in some stack
> traces.

For the archives here's an example trace.

This comes from the openshift CI:
  
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-ovn-remote-libvirt-ppc64le/1703597644732960768

Which links through to the kdump.tar:
  
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-ovn-remote-libvirt-ppc64le/1703597644732960768/artifacts/ocp-e2e-ovn-remote-libvirt-ppc64le/ipi-conf-debug-kdump-gather-logs/artifacts/kdump.tar

Which contains vmcore-dmesg.txt, which includes this trace:

[ 1805.324030] do_IRQ: stack overflow: 1808
[ 1805.324179] CPU: 0 PID: 263384 Comm: mount.nfs Kdump: loaded Not tainted 
5.14.0-284.32.1.el9_2.ppc64le #1
[ 1805.324184] Call Trace:
[ 1805.324186] [c0037d4806d0] [c08427d0] dump_stack_lvl+0x74/0xa8 
(unreliable)
[ 1805.324199] [c0037d480710] [c0016bbc] __do_IRQ+0x11c/0x130
[ 1805.324205] [c0037d4807a0] [c0016c10] do_IRQ+0x40/0xa0
[ 1805.324210] [c0037d4807d0] [c0009080] 
hardware_interrupt_common_virt+0x210/0x220
[ 1805.324215] --- interrupt: 500 at slab_pre_alloc_hook.constprop.0+0x7c/0x340
[ 1805.324221] NIP:  c04feb3c LR: c04feb24 CTR: c092b770
[ 1805.324223] REGS: c0037d480840 TRAP: 0500   Not tainted  
(5.14.0-284.32.1.el9_2.ppc64le)
[ 1805.324226] MSR:  8280b033   CR: 
24424442  XER: 
[ 1805.324240] CFAR: c045ef8c IRQMASK: 0 
GPR00: c04feb24 c0037d480ae0 c2b12700  
GPR04: 0a20 c0037d480b60 0001 0a20 
GPR08: c133ca80  0028 4000 
GPR12: c092b770 c2ea   
GPR16: 0005 0040 012e c000566930e0 
GPR20: 0008  c000566930e0  
GPR24: c092bac4 c3010400 c0037d480b60 0001 
GPR28:  0a20  c3010400 
[ 1805.324284] NIP [c04feb3c] slab_pre_alloc_hook.constprop.0+0x7c/0x340
[ 1805.324288] LR [c04feb24] slab_pre_alloc_hook.constprop.0+0x64/0x340
[ 1805.324291] --- interrupt: 500
[ 1805.324292] [c0037d480ae0] [] 0x0 (unreliable)
[ 1805.324298] [c0037d480b40] [c050560c] __kmalloc+0x8c/0x5e0
[ 1805.324302] [c0037d480bc0] [c092bac4] 
virtqueue_add_outbuf+0x354/0xac0
[ 1805.324307] [c0037d480cc0] [c008011b3a84] xmit_skb+0x1dc/0x350 
[virtio_net]
[ 1805.324317] [c0037d480d50] [c008011b3ccc] start_xmit+0xd4/0x3b0 
[virtio_net]
[ 1805.324321] [c0037d480e00] [c0c4baac] 
dev_hard_start_xmit+0x11c/0x280
[ 1805.324327] [c0037d480e80] [c0cf1c8c] sch_direct_xmit+0xec/0x330
[ 1805.324332] [c0037d480f20] [c0c4a03c] __dev_xmit_skb+0x41c/0xa80
[ 1805.324336] [c0037d480f90] [c0c4c194] 
__dev_queue_xmit+0x414/0x950
[ 1805.324340] [c0037d481070] [c00802abdfdc] ovs_vport_send+0xb4/0x210 
[openvswitch]
[ 1805.324351] [c0037d4810f0] [c00802aa14a4] do_output+0x7c/0x200 
[openvswitch]
[ 1805.324359] [c0037d481140] [c00802aa33b0] 
do_execute_actions+0xe48/0xeb0 [openvswitch]
[ 1805.324366] [c0037d481300] [c00802aa3800] 
ovs_execute_actions+0x78/0x1f0 [openvswitch]
[ 1805.324373] [c0037d481380] [c00802aa970c] 
ovs_dp_process_packet+0xb4/0x2e0 [openvswitch]
[ 1805.324380] [c0037d481450] [c00802abde84] 
ovs_vport_receive+0x8c/0x130 [openvswitch]
[ 1805.324388] [c0037d481660] [c00802abe638] 
internal_dev_xmit+0x40/0xd0 [openvswitch]
[ 1805.324396] [c0037d481690] [c0c4baac] 
dev_hard_start_xmit+0x11c/0x280
[ 1805.324401] [c0037d481710] [c0c4c3b4] 
__dev_queue_xmit+0x634/0x950
[ 1805.324405] [c0037d4817f0] [c0d50810] neigh_hh_output+0xd0/0x180
[ 1805.324410] [c0037d481840] [c0d516ec] 
ip_finish_output2+0x31c/0x5c0
[ 1805.324415] [c0037d4818e0] [c0d53f94] ip_local_out+0x64/0x90
[ 1805.324419] [c0037d481920] [c0dd83e4] iptunnel_xmit+0x194/0x290
[ 1805.324423] [c0037d4819c0] [c00803160408] 
udp_tunnel_xmit_skb+0x100/0x140 [udp_tunnel]
[ 1805.324429] [c0037d481a80] [c00803203a54] 
geneve_xmit_skb+0x34c/0x610 [geneve]
[ 1805.324434] [c0037d481bb0] [c0080320596c] geneve_xmit+0x94/0x1e8 
[geneve]
[ 1805.324438] [c0037d481c30] [c0c4baac] 
dev_hard_start_xmit+0x11c/0x280
[ 1805.324442] [c0037d481cb0] [c0c4c3b4] 
__dev_queue_xmit+0x634/0x950
[ 

Re: [kvm-unit-tests PATCH v5 19/29] scripts: allow machine option to be specified in unittests.cfg

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

This allows different machines with different requirements to be
supported by run_tests.sh, similarly to how different accelerators
are handled.

Signed-off-by: Nicholas Piggin 
---
  scripts/common.bash  |  8 ++--
  scripts/runtime.bash | 16 
  2 files changed, 18 insertions(+), 6 deletions(-)


Acked-by: Thomas Huth 



Re: [kvm-unit-tests PATCH v5 18/29] powerpc: Fix stack backtrace termination

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

The backtrace handler terminates when it sees a NULL caller address,
but the powerpc stack setup does not keep such a NULL caller frame
at the start of the stack.

This happens to work on pseries because the memory at 0 is mapped and
it contains 0 at the location of the return address pointer if it
were a stack frame. But this is fragile, and does not work with powernv
where address 0 contains firmware instructions.

Use the existing dummy frame on stack as the NULL caller, and create a
new frame on stack for the entry code.

Signed-off-by: Nicholas Piggin 
---
  powerpc/cstart64.S | 12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/powerpc/cstart64.S b/powerpc/cstart64.S
index e18ae9a2..14ab0c6c 100644
--- a/powerpc/cstart64.S
+++ b/powerpc/cstart64.S
@@ -46,8 +46,16 @@ start:
add r1, r1, r31
add r2, r2, r31
  
+	/* Zero backpointers in initial stack frame so backtrace() stops */

+   li  r0,0
+   std r0,0(r1)
+   std r0,16(r1)
+
+   /* Create entry frame */
+   stdur1,-INT_FRAME_SIZE(r1)


Shouldn't that rather be STACK_FRAME_OVERHEAD instead of INT_FRAME_SIZE...


/* save DTB pointer */
-   std r3, 56(r1)
+   SAVE_GPR(3,r1)


... since SAVE_GPR uses STACK_FRAME_OVERHEAD (via GPR0), too?

 Thomas



Re: [kvm-unit-tests PATCH v5 16/29] powerpc: Set .got section alignment to 256 bytes

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

Modern powerpc64 toolchains require the .got section have alignment of
256 bytes. Incorrect alignment ends up causing the .data section ELF
load address to move by 8 bytes from its file offset, relative to
previous sections. This is not a problem for the QEMU bios loader used
by the pseries machine, but it is a problem for the powernv machine
using skiboot as the bios and the test programs as a kernel, because the
skiboot ELF loader is crippled:

   * Note that we execute the kernel in-place, we don't actually
   * obey the load informations in the headers. This is expected
   * to work for the Linux Kernel because it's a fairly dumb ELF
   * but it will not work for any ELF binary.

This causes all references to data to be incorrect. Aligning the .got
to 256 bytes prevents this offset skew and allows the skiboot "flat"
loader to work. [I don't know why the .got alignment can cause this
difference in linking.]

Signed-off-by: Nicholas Piggin 
---
  powerpc/flat.lds | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/powerpc/flat.lds b/powerpc/flat.lds
index 5eed368d..e07b91c1 100644
--- a/powerpc/flat.lds
+++ b/powerpc/flat.lds
@@ -41,8 +41,7 @@ SECTIONS
  /*
   * tocptr is tocbase + 32K, allowing toc offsets to be +-32K
   */
-tocptr = . + 32K;
-.got : { *(.toc) *(.got) }
+.got : ALIGN(256) { tocptr = . + 32K; *(.toc .got) }
  . = ALIGN(64K);
  edata = .;
  . += 64K;


Acked-by: Thomas Huth 



Re: [kvm-unit-tests PATCH v5 14/29] powerpc: Expand exception handler vector granularity

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

Exception handlers are currently indexed in units of 0x100, but
powerpc can have vectors that are aligned to as little as 0x20
bytes. Increase granularity of the handler functions before
adding support for those vectors.

Signed-off-by: Nicholas Piggin 
---
  lib/powerpc/processor.c | 16 +++-
  1 file changed, 11 insertions(+), 5 deletions(-)


Reviewed-by: Thomas Huth 



Re: [kvm-unit-tests PATCH v5 13/29] powerpc: Make interrupt handler error more readable

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

Installing the same handler twice reports a shifted trap vector
address, which is hard to decipher. Print the unshifted address.

Signed-off-by: Nicholas Piggin 
---
  lib/powerpc/processor.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/powerpc/processor.c b/lib/powerpc/processor.c
index aaf45b68..b4cd5b4c 100644
--- a/lib/powerpc/processor.c
+++ b/lib/powerpc/processor.c
@@ -26,7 +26,7 @@ void handle_exception(int trap, void (*func)(struct pt_regs 
*, void *),
trap >>= 8;


You only change this to >>= 5 in the next patch...


if (func && handlers[trap].func) {
-   printf("exception handler installed twice %#x\n", trap);
+   printf("exception handler installed twice %#x\n", trap << 5);


... so I think you should move this patch here after the next one.

 Thomas



abort();
}
handlers[trap].func = func;




Re: [kvm-unit-tests PATCH v5 12/29] powerpc/sprs: Avoid taking async interrupts caused by register fuzzing

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

Storing certain values in some registers can cause asynchronous
interrupts that can crash the test case, for example decrementer
or PMU interrupts.

Change the msleep to mdelay which does not enable MSR[EE] and so
avoids the problem. This allows removing some of the SPR special
casing.

Signed-off-by: Nicholas Piggin 
---
  powerpc/sprs.c | 14 ++
  1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/powerpc/sprs.c b/powerpc/sprs.c
index 01041912..313698e0 100644
--- a/powerpc/sprs.c
+++ b/powerpc/sprs.c
@@ -481,12 +481,7 @@ static void set_sprs(uint64_t val)
continue;
if (sprs[i].type & SPR_HARNESS)
continue;
-   if (!strcmp(sprs[i].name, "MMCR0")) {
-   /* XXX: could use a comment or better abstraction! */
-   __mtspr(i, (val & 0xfbab3fffULL) | 0xfa0b2070);
-   } else {
-   __mtspr(i, val);
-   }
+   __mtspr(i, val);
}
  }
  
@@ -536,12 +531,7 @@ int main(int argc, char **argv)

if (pause) {
migrate_once();
} else {
-   msleep(2000);
-
-   /* Taking a dec updates SRR0, SRR1, SPRG1, so don't fail. */
-   sprs[26].type |= SPR_ASYNC;
-   sprs[27].type |= SPR_ASYNC;
-   sprs[273].type |= SPR_ASYNC;
+   mdelay(2000);
}


IIRC I used the H_CEDE stuff here on purpose to increase the possibility 
that the guest gets rescheduled onto another CPU core on the host, and thus 
that it uncovers sprs that are not saved and restored on the host more 
easily. So I'd rather keep the msleep() here.


 Thomas




Re: [kvm-unit-tests PATCH v5 11/29] powerpc/sprs: Don't fail changed SPRs that are used by the test harness

2023-12-19 Thread Thomas Huth

On 16/12/2023 14.42, Nicholas Piggin wrote:

SPRs annotated with SPR_HARNESS can change between consecutive reads
because the test harness code has changed them. Avoid failing the
test in this case.

[XER was observed to change after the next changeset to use mdelay.]

Signed-off-by: Nicholas Piggin 
---
  powerpc/sprs.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/powerpc/sprs.c b/powerpc/sprs.c
index cd8b472d..01041912 100644
--- a/powerpc/sprs.c
+++ b/powerpc/sprs.c
@@ -557,7 +557,7 @@ int main(int argc, char **argv)
if (before[i] >> 32)
pass = false;
}
-   if (!(sprs[i].type & SPR_ASYNC) && (before[i] != after[i]))
+   if (!(sprs[i].type & (SPR_HARNESS|SPR_ASYNC)) && (before[i] != 
after[i]))
pass = false;


I guess you could even squash this into the previous patch.

Anyway:
Reviewed-by: Thomas Huth 



Re: [PATCH v14 5/6] powerpc: add crash CPU hotplug support

2023-12-19 Thread Hari Bathini

Hi Sourabh

On 11/12/23 2:00 pm, Sourabh Jain wrote:

Due to CPU/Memory hotplug or online/offline events the elfcorehdr
(which describes the CPUs and memory of the crashed kernel) and FDT
(Flattened Device Tree) of kdump image becomes outdated. Consequently,
attempting dump collection with an outdated elfcorehdr or FDT can lead
to failed or inaccurate dump collection.

Going forward CPU hotplug or online/offlice events are referred as


s/offlice/offline/


CPU/Memory add/remvoe events.


s/remvoe/remove/


The current solution to address the above issue involves monitoring the
CPU/memory add/remove events in userspace using udev rules and whenever
there are changes in CPU and memory resources, the entire kdump image
is loaded again. The kdump image includes kernel, initrd, elfcorehdr,
FDT, purgatory. Given that only elfcorehdr and FDT get outdated due to
CPU/Memory add/remove events, reloading the entire kdump image is
inefficient. More importantly, kdump remains inactive for a substantial
amount of time until the kdump reload completes.

To address the aforementioned issue, commit 247262756121 ("crash: add
generic infrastructure for crash hotplug support") added a generic
infrastructure that allows architectures to selectively update the kdump
image component during CPU or memory add/remove events within the kernel
itself.

In the event of a CPU or memory add/remove event, the generic crash
hotplug event handler, `crash_handle_hotplug_event()`, is triggered. It
then acquires the necessary locks to update the kdump image and invokes
the architecture-specific crash hotplug handler,
`arch_crash_handle_hotplug_event()`, to update the required kdump image
components.

This patch adds crash hotplug handler for PowerPC and enable support to
update the kdump image on CPU add/remove events. Support for memory
add/remove events is added in a subsequent patch with the title
"powerpc: add crash memory hotplug support."

As mentioned earlier, only the elfcorehdr and FDT kdump image components
need to be updated in the event of CPU or memory add/remove events.
However, the PowerPC architecture crash hotplug handler only updates the
FDT to enable crash hotplug support for CPU add/remove events. Here's
why.

The Elfcorehdr on PowerPC is built with possible CPUs, and thus, it does
not need an update on CPU add/remove events. On the other hand, the FDT
needs to be updated on CPU add events to include the newly added CPU. If
the FDT is not updated and the kernel crashes on a newly added CPU, the
kdump kernel will fail to boot due to the unavailability of the crashing
CPU in the FDT. During the early boot, it is expected that the boot CPU
must be a part of the FDT; otherwise, the kernel will raise a BUG and
fail to boot. For more information, refer to commit 36ae37e3436b0
("powerpc: Make boot_cpuid common between 32 and 64-bit"). Since it is
okay to have an offline CPU in the kdump FDT, no action is taken in case
of CPU removal.

There are two system calls, `kexec_file_load` and `kexec_load`, used to
load the kdump image. Few changes have been made to ensure kernel can
safely update the kdump FDT for both system calls.

For kexec_file_load syscall the kdump image is prepared in kernel. So to
support an increasing number of CPUs, the FDT is constructed with extra
buffer space to ensure it can accommodate a possible number of CPU
nodes. Additionally, a call to fdt_pack (which trims the unused space
once the FDT is prepared) is avoided for kdump image loading if this
feature is enabled.
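
A rough illustration of the idea (a sketch only; the buffer sizing constant
and allocation details are assumptions, not the actual patch):

	/* Reserve headroom in the kdump FDT so CPU nodes added later by the
	 * crash hotplug handler fit without reloading the whole kdump image.
	 * CPU_NODE_SIZE_ESTIMATE is an assumed per-CPU-node size. */
	unsigned int extra = num_possible_cpus() * CPU_NODE_SIZE_ESTIMATE;
	unsigned int fdt_size = fdt_totalsize(initial_boot_params) + extra;
	void *kdump_fdt = kvmalloc(fdt_size, GFP_KERNEL);
	int ret;

	if (!kdump_fdt)
		return -ENOMEM;
	ret = fdt_open_into(initial_boot_params, kdump_fdt, fdt_size);
	if (ret)
		return ret;
	/* ... add/patch nodes for the kdump kernel ... */
	/* fdt_pack(kdump_fdt) is intentionally skipped so the spare space
	 * survives for later CPU add/remove updates. */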

For the kexec_load syscall, the FDT is updated only if both the
KEXEC_UPDATE_FDT and KEXEC_UPDATE_ELFCOREHDR kexec flags are passed to
the kernel by the kexec tool. Passing these flags to the kernel
indicates that the FDT is built to accommodate possible CPUs, and the
FDT segment is not considered for SHA calculation, making it safe to
update the FDT.

Commit 88a6f8994421 ("crash: memory and CPU hotplug sysfs attributes")
added a sysfs interface to indicate userspace (kdump udev rule) that
kernel will update the kdump image on CPU hotplug events, so kdump
reload can be avoided. Implement arch specific function
`arch_crash_hotplug_cpu_support()` to correctly advertise kernel
capability to update kdump image.

This feature is advertised to userspace when the following conditions
are met:

1. Kdump image is loaded using kexec_file_load system call.
2. Kdump image is loaded using kexec_load system and both
KEXEC_UPDATE_ELFCOREHDR and KEXEC_UPDATE_FDT kexec flags are
passed to kernel.

The changes related to this feature are kept under the CRASH_HOTPLUG
config, and it is enabled by default.

Signed-off-by: Sourabh Jain 
Cc: Akhil Raj 
Cc: Andrew Morton 
Cc: Aneesh Kumar K.V 
Cc: Baoquan He 
Cc: Borislav Petkov (AMD) 
Cc: Boris Ostrovsky 
Cc: Christophe Leroy 
Cc: Dave Hansen 
Cc: Dave Young 
Cc: David Hildenbrand 
Cc: Eric DeVolder 
Cc: Greg Kroah-Hartman 
Cc: Hari Bathini 
Cc: Laurent Dufour 
Cc: Mahesh Salgaonkar 
Cc: Michael Ellerman 

[PATCH] powerpc/hvcall: Reorder Nestedv2 hcall opcodes

2023-12-19 Thread Vaibhav Jain
This trivial patch reorders the newly introduced hcall opcodes for Nestedv2
to follow the increasing-opcode-number convention used in 'hvcall.h'. The
patch also updates the value of MAX_HCALL_OPCODE, which is used at various
places in arch code for range checking.

Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests")
Suggested-by: Michael Ellerman 
Signed-off-by: Vaibhav Jain
---
 arch/powerpc/include/asm/hvcall.h | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index ddb99e982917..605ed2b58aff 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -349,7 +349,17 @@
 #define H_GET_ENERGY_SCALE_INFO0x450
 #define H_PKS_SIGNED_UPDATE0x454
 #define H_WATCHDOG 0x45C
-#define MAX_HCALL_OPCODE   H_WATCHDOG
+#define H_WATCHDOG 0x45C
+#define H_GUEST_GET_CAPABILITIES 0x460
+#define H_GUEST_SET_CAPABILITIES 0x464
+#define H_GUEST_CREATE 0x470
+#define H_GUEST_CREATE_VCPU0x474
+#define H_GUEST_GET_STATE  0x478
+#define H_GUEST_SET_STATE  0x47C
+#define H_GUEST_RUN_VCPU   0x480
+#define H_GUEST_COPY_MEMORY0x484
+#define H_GUEST_DELETE 0x488
+#define MAX_HCALL_OPCODE   H_GUEST_DELETE
 
 /* Scope args for H_SCM_UNBIND_ALL */
 #define H_UNBIND_SCOPE_ALL (0x1)
@@ -393,15 +403,6 @@
 #define H_ENTER_NESTED 0xF804
 #define H_TLB_INVALIDATE   0xF808
 #define H_COPY_TOFROM_GUEST0xF80C
-#define H_GUEST_GET_CAPABILITIES 0x460
-#define H_GUEST_SET_CAPABILITIES 0x464
-#define H_GUEST_CREATE 0x470
-#define H_GUEST_CREATE_VCPU0x474
-#define H_GUEST_GET_STATE  0x478
-#define H_GUEST_SET_STATE  0x47C
-#define H_GUEST_RUN_VCPU   0x480
-#define H_GUEST_COPY_MEMORY0x484
-#define H_GUEST_DELETE 0x488
 
 /* Flags for H_SVM_PAGE_IN */
 #define H_PAGE_IN_SHARED0x1
-- 
2.43.0



Re: [PATCH v2] ASoC: fsl_sai: Fix channel swap issue on i.MX8MP

2023-12-19 Thread Daniel Baluta
On Tue, Dec 19, 2023 at 5:12 AM Shengjiu Wang  wrote:
>
> When the mclk_with_tere and mclk_direction_output flags are enabled,
> the SAI transmitter or receiver is enabled at a very early stage.
> If FSL_SAI_xMR was set by a previous case (for example, the previous
> case used one channel and the current case uses two channels), the
> current case starts with a wrong xMR and a channel swap happens.
>
> The patch clears xMR in hw_free() to avoid such a channel swap
> issue.
>
> Fixes: 3e4a82612998 ("ASoC: fsl_sai: MCLK bind with TX/RX enable bit")
> Signed-off-by: Shengjiu Wang 

Reviewed-by: Daniel Baluta 


[PATCH 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2023-12-19 Thread peterx
From: Peter Xu 

Now follow_page() is ready to handle hugetlb pages in whatever form, and
over all architectures.  Switch to the generic code path.

Time to retire hugetlb_follow_page_mask(), following the previous
retirement of follow_hugetlb_page() in 4849807114b8.

There may be a slight difference in how the loops run when processing slow
GUP over a large hugetlb range on cont_pte/cont_pmd supported archs: each
loop of __get_user_pages() will resolve one pgtable entry with the patch
applied, rather than relying on the size of the hugetlb hstate; the latter
may cover multiple entries in one loop.

A quick performance test on an aarch64 VM on an M1 chip shows a 15%
degradation over a tight loop of slow gup after the path switch.  That
shouldn't be a problem because slow gup should not be a hot path for GUP in
general: when the page is commonly present, fast gup will already succeed,
while when the page is indeed missing and requires a follow-up page fault,
the slow-gup degradation will probably be buried in the fault paths anyway.
It also explains why slow gup for THP used to be very slow before
57edfcfd3419 ("mm/gup: accelerate thp gup even for "pages != NULL"") landed;
the latter is not part of a performance analysis but a side benefit.  If the
performance becomes a concern, we can consider handling CONT_PTE in
follow_page().

Before that is justified to be necessary, keep everything clean and simple.

Signed-off-by: Peter Xu 
---
 include/linux/hugetlb.h |  7 
 mm/gup.c| 15 +++--
 mm/hugetlb.c| 71 -
 3 files changed, 5 insertions(+), 88 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index f8c5c174c8a6..8a352d577ebf 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -332,13 +332,6 @@ static inline void hugetlb_zap_end(
 {
 }
 
-static inline struct page *hugetlb_follow_page_mask(
-struct vm_area_struct *vma, unsigned long address, unsigned int flags,
-unsigned int *page_mask)
-{
-   BUILD_BUG(); /* should never be compiled in if !CONFIG_HUGETLB_PAGE*/
-}
-
 static inline int copy_hugetlb_page_range(struct mm_struct *dst,
  struct mm_struct *src,
  struct vm_area_struct *dst_vma,
diff --git a/mm/gup.c b/mm/gup.c
index 14a7d13e7bd6..f34c0a912311 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -997,18 +997,11 @@ static struct page *follow_page_mask(struct 
vm_area_struct *vma,
 {
pgd_t *pgd, pgdval;
struct mm_struct *mm = vma->vm_mm;
+   struct page *page;
 
-   ctx->page_mask = 0;
-
-   /*
-* Call hugetlb_follow_page_mask for hugetlb vmas as it will use
-* special hugetlb page table walking code.  This eliminates the
-* need to check for hugetlb entries in the general walking code.
-*/
-   if (is_vm_hugetlb_page(vma))
-   return hugetlb_follow_page_mask(vma, address, flags,
-   >page_mask);
+   vma_pgtable_walk_begin(vma);
 
+   ctx->page_mask = 0;
pgd = pgd_offset(mm, address);
pgdval = *pgd;
 
@@ -1020,6 +1013,8 @@ static struct page *follow_page_mask(struct 
vm_area_struct *vma,
else
page = follow_p4d_mask(vma, address, pgd, flags, ctx);
 
+   vma_pgtable_walk_end(vma);
+
return page;
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 29705e5c6f40..3013122a739f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6783,77 +6783,6 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 }
 #endif /* CONFIG_USERFAULTFD */
 
-struct page *hugetlb_follow_page_mask(struct vm_area_struct *vma,
- unsigned long address, unsigned int flags,
- unsigned int *page_mask)
-{
-   struct hstate *h = hstate_vma(vma);
-   struct mm_struct *mm = vma->vm_mm;
-   unsigned long haddr = address & huge_page_mask(h);
-   struct page *page = NULL;
-   spinlock_t *ptl;
-   pte_t *pte, entry;
-   int ret;
-
-   hugetlb_vma_lock_read(vma);
-   pte = hugetlb_walk(vma, haddr, huge_page_size(h));
-   if (!pte)
-   goto out_unlock;
-
-   ptl = huge_pte_lock(h, mm, pte);
-   entry = huge_ptep_get(pte);
-   if (pte_present(entry)) {
-   page = pte_page(entry);
-
-   if (!huge_pte_write(entry)) {
-   if (flags & FOLL_WRITE) {
-   page = NULL;
-   goto out;
-   }
-
-   if (gup_must_unshare(vma, flags, page)) {
-   /* Tell the caller to do unsharing */
-   page = ERR_PTR(-EMLINK);
-   goto out;
-   }
-   }
-
-   page = nth_page(page, ((address & ~huge_page_mask(h)) >> 
PAGE_SHIFT));
-
-   

[PATCH 12/13] mm/gup: Handle hugepd for follow_page()

2023-12-19 Thread peterx
From: Peter Xu 

Hugepd is only used on PowerPC so far, on 4K page size kernels where the
hash MMU is used.  follow_page_mask() used to leverage hugetlb APIs to
access hugepd entries.  Teach follow_page_mask() itself how to handle
hugepd.

With the previous refactors of the fast-gup gup_huge_pd(), most of the code
can be easily leveraged.  There are some things not needed for follow_page();
for example, gup_hugepte() tries to detect pgtable entry changes, which will
never happen with slow gup (which holds the pgtable lock), but that extra
check is not a problem to keep.

Since follow_page() always fetches only one page, setting the end to
"address + PAGE_SIZE" should suffice.  We will still do the pgtable walk once
for each hugetlb page by setting ctx->page_mask properly.

One thing worth mentioning is that some levels of the pgtable _bad() helpers
will report is_hugepd() entries as TRUE on Power8 hash MMUs.  I think it at
least applies to PUD on Power8 with 4K pgsize.  It means feeding a hugepd
entry to pud_bad() will report a false positive.  Let's leave that for now
because it can be arch-specific, and I am a bit reluctant to touch it.  In
this patch it's not a problem as long as hugepd is detected before any bad
pgtable entries.

Signed-off-by: Peter Xu 
---
 mm/gup.c | 78 +---
 1 file changed, 69 insertions(+), 9 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 080dff79b650..14a7d13e7bd6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -30,6 +30,11 @@ struct follow_page_context {
unsigned int page_mask;
 };
 
+static struct page *follow_hugepd(struct vm_area_struct *vma, hugepd_t hugepd,
+ unsigned long addr, unsigned int pdshift,
+ unsigned int flags,
+ struct follow_page_context *ctx);
+
 static inline void sanity_check_pinned_pages(struct page **pages,
 unsigned long npages)
 {
@@ -871,6 +876,9 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
return no_page_table(vma, flags, address);
if (!pmd_present(pmdval))
return no_page_table(vma, flags, address);
+   if (unlikely(is_hugepd(__hugepd(pmd_val(pmdval)
+   return follow_hugepd(vma, __hugepd(pmd_val(pmdval)),
+address, PMD_SHIFT, flags, ctx);
if (pmd_devmap(pmdval)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags, >pgmap);
@@ -921,6 +929,9 @@ static struct page *follow_pud_mask(struct vm_area_struct 
*vma,
pud = *pudp;
if (pud_none(pud) || !pud_present(pud))
return no_page_table(vma, flags, address);
+   if (unlikely(is_hugepd(__hugepd(pud_val(pud)
+   return follow_hugepd(vma, __hugepd(pud_val(pud)),
+address, PUD_SHIFT, flags, ctx);
if (pud_huge(pud)) {
ptl = pud_lock(mm, pudp);
page = follow_huge_pud(vma, address, pudp, flags, ctx);
@@ -940,13 +951,17 @@ static struct page *follow_p4d_mask(struct vm_area_struct 
*vma,
unsigned int flags,
struct follow_page_context *ctx)
 {
-   p4d_t *p4d;
+   p4d_t *p4d, p4dval;
 
p4d = p4d_offset(pgdp, address);
-   if (p4d_none(*p4d))
-   return no_page_table(vma, flags, address);
-   BUILD_BUG_ON(p4d_huge(*p4d));
-   if (unlikely(p4d_bad(*p4d)))
+   p4dval = *p4d;
+   BUILD_BUG_ON(p4d_huge(p4dval));
+
+   if (unlikely(is_hugepd(__hugepd(p4d_val(p4dval)
+   return follow_hugepd(vma, __hugepd(p4d_val(p4dval)),
+address, P4D_SHIFT, flags, ctx);
+
+   if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
return no_page_table(vma, flags, address);
 
return follow_pud_mask(vma, address, p4d, flags, ctx);
@@ -980,7 +995,7 @@ static struct page *follow_page_mask(struct vm_area_struct 
*vma,
  unsigned long address, unsigned int flags,
  struct follow_page_context *ctx)
 {
-   pgd_t *pgd;
+   pgd_t *pgd, pgdval;
struct mm_struct *mm = vma->vm_mm;
 
ctx->page_mask = 0;
@@ -995,11 +1010,17 @@ static struct page *follow_page_mask(struct 
vm_area_struct *vma,
>page_mask);
 
pgd = pgd_offset(mm, address);
+   pgdval = *pgd;
 
-   if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
-   return no_page_table(vma, flags, address);
+   if (unlikely(is_hugepd(__hugepd(pgd_val(pgdval)
+   page = follow_hugepd(vma, __hugepd(pgd_val(pgdval)),
+address, PGDIR_SHIFT, flags, ctx);
+   else if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+   page = no_page_table(vma, flags, 

[PATCH 11/13] mm/gup: Handle huge pmd for follow_pmd_mask()

2023-12-19 Thread peterx
From: Peter Xu 

Replace pmd_trans_huge() with pmd_thp_or_huge() to also cover pmd_huge(),
as long as it is enabled.

FOLL_TOUCH and FOLL_SPLIT_PMD only apply to THP, not yet to huge pages.

Since follow_trans_huge_pmd() can now process hugetlb pages, rename it to
follow_huge_pmd() to match what it does.  Move it into gup.c so it does not
depend on CONFIG_THP.

While at it, move the ctx->page_mask setup into follow_huge_pmd() and only
set it when the page is valid.  It was not a bug to set it before even if
GUP failed (page==NULL), because follow_page_mask() callers always ignore
page_mask in that case.  But doing it this way makes the code cleaner.

Signed-off-by: Peter Xu 
---
 mm/gup.c | 107 ---
 mm/huge_memory.c |  86 +
 mm/internal.h|   5 +--
 3 files changed, 105 insertions(+), 93 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 5b14f91d2f6b..080dff79b650 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -580,6 +580,93 @@ static struct page *follow_huge_pud(struct vm_area_struct 
*vma,
 
return page;
 }
+
+/* FOLL_FORCE can write to even unwritable PMDs in COW mappings. */
+static inline bool can_follow_write_pmd(pmd_t pmd, struct page *page,
+   struct vm_area_struct *vma,
+   unsigned int flags)
+{
+   /* If the pmd is writable, we can write to the page. */
+   if (pmd_write(pmd))
+   return true;
+
+   /* Maybe FOLL_FORCE is set to override it? */
+   if (!(flags & FOLL_FORCE))
+   return false;
+
+   /* But FOLL_FORCE has no effect on shared mappings */
+   if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+   return false;
+
+   /* ... or read-only private ones */
+   if (!(vma->vm_flags & VM_MAYWRITE))
+   return false;
+
+   /* ... or already writable ones that just need to take a write fault */
+   if (vma->vm_flags & VM_WRITE)
+   return false;
+
+   /*
+* See can_change_pte_writable(): we broke COW and could map the page
+* writable if we have an exclusive anonymous page ...
+*/
+   if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+   return false;
+
+   /* ... and a write-fault isn't required for other reasons. */
+   if (vma_soft_dirty_enabled(vma) && !pmd_soft_dirty(pmd))
+   return false;
+   return !userfaultfd_huge_pmd_wp(vma, pmd);
+}
+
+static struct page *follow_huge_pmd(struct vm_area_struct *vma,
+   unsigned long addr, pmd_t *pmd,
+   unsigned int flags,
+   struct follow_page_context *ctx)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   pmd_t pmdval = *pmd;
+   struct page *page;
+   int ret;
+
+   assert_spin_locked(pmd_lockptr(mm, pmd));
+
+   page = pmd_page(pmdval);
+   VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
+
+   if ((flags & FOLL_WRITE) &&
+   !can_follow_write_pmd(pmdval, page, vma, flags))
+   return NULL;
+
+   /* Avoid dumping huge zero page */
+   if ((flags & FOLL_DUMP) && is_huge_zero_pmd(pmdval))
+   return ERR_PTR(-EFAULT);
+
+   if (pmd_protnone(*pmd) && !gup_can_follow_protnone(vma, flags))
+   return NULL;
+
+   if (!pmd_write(pmdval) && gup_must_unshare(vma, flags, page))
+   return ERR_PTR(-EMLINK);
+
+   VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
+   !PageAnonExclusive(page), page);
+
+   ret = try_grab_page(page, flags);
+   if (ret)
+   return ERR_PTR(ret);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   if (pmd_trans_huge(pmdval) && (flags & FOLL_TOUCH))
+   touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+   page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
+   ctx->page_mask = HPAGE_PMD_NR - 1;
+   VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
+
+   return page;
+}
+
 #else  /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 static struct page *follow_huge_pud(struct vm_area_struct *vma,
unsigned long addr, pud_t *pudp,
@@ -587,6 +674,14 @@ static struct page *follow_huge_pud(struct vm_area_struct 
*vma,
 {
return NULL;
 }
+
+static struct page *follow_huge_pmd(struct vm_area_struct *vma,
+   unsigned long addr, pmd_t *pmd,
+   unsigned int flags,
+   struct follow_page_context *ctx)
+{
+   return NULL;
+}
 #endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
 
 static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
@@ -784,31 +879,31 @@ static struct page *follow_pmd_mask(struct vm_area_struct 
*vma,
return page;
   

[PATCH 10/13] mm/gup: Handle huge pud for follow_pud_mask()

2023-12-19 Thread peterx
From: Peter Xu 

Teach follow_pud_mask() to be able to handle normal PUD pages like hugetlb.

Rename follow_devmap_pud() to follow_huge_pud() so that it can process
either huge devmap or hugetlb. Move it out of TRANSPARENT_HUGEPAGE_PUD and
huge_memory.c (which relies on CONFIG_THP).

In the new follow_huge_pud(), take care of possible CoR for hugetlb if
necessary.  touch_pud() needs to be moved out of huge_memory.c to be
accessible from gup.c even if !THP.

While at it, optimize the non-present check by adding a pud_present() early
check before taking the pgtable lock, failing the follow_page() early if the
PUD is not present: that is required by both devmap and hugetlb.  Use
pud_huge() to also cover the pud_devmap() case.

One more trivial thing to mention is that "pud_t pud" is introduced in the
code paths along the way, so the code doesn't dereference *pudp multiple
times.  Not only because that looks less straightforward, but also because
if the dereference really happened, it's not clear whether there can be a
race to see different *pudp values when it's being modified at the same
time.

Setting ctx->page_mask properly for a PUD entry.  As a side effect, this
patch should also be able to optimize devmap GUP on PUD to be able to jump
over the whole PUD range, but not yet verified.  Hugetlb already can do so
prior to this patch.

Signed-off-by: Peter Xu 
---
 include/linux/huge_mm.h |  8 -
 mm/gup.c| 70 +++--
 mm/huge_memory.c| 47 ++-
 mm/internal.h   |  2 ++
 4 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d335130e145f..80f181d76f94 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -346,8 +346,6 @@ static inline bool folio_test_pmd_mappable(struct folio *folio)
 
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags, struct dev_pagemap **pgmap);
-struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
-   pud_t *pud, int flags, struct dev_pagemap **pgmap);
 
 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf);
 
@@ -503,12 +501,6 @@ static inline struct page *follow_devmap_pmd(struct vm_area_struct *vma,
return NULL;
 }
 
-static inline struct page *follow_devmap_pud(struct vm_area_struct *vma,
-   unsigned long addr, pud_t *pud, int flags, struct dev_pagemap **pgmap)
-{
-   return NULL;
-}
-
 static inline bool thp_migration_supported(void)
 {
return false;
diff --git a/mm/gup.c b/mm/gup.c
index 97e87b7a15c3..5b14f91d2f6b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -525,6 +525,70 @@ static struct page *no_page_table(struct vm_area_struct *vma,
return NULL;
 }
 
+#ifdef CONFIG_PGTABLE_HAS_HUGE_LEAVES
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+   unsigned long addr, pud_t *pudp,
+   int flags, struct follow_page_context *ctx)
+{
+   struct mm_struct *mm = vma->vm_mm;
+   struct page *page;
+   pud_t pud = *pudp;
+   unsigned long pfn = pud_pfn(pud);
+   int ret;
+
+   assert_spin_locked(pud_lockptr(mm, pudp));
+
+   if ((flags & FOLL_WRITE) && !pud_write(pud))
+   return NULL;
+
+   if (!pud_present(pud))
+   return NULL;
+
+   pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
+
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+   if (pud_devmap(pud)) {
+   /*
+* device mapped pages can only be returned if the caller
+* will manage the page reference count.
+*
+* At least one of FOLL_GET | FOLL_PIN must be set, so
+* assert that here:
+*/
+   if (!(flags & (FOLL_GET | FOLL_PIN)))
+   return ERR_PTR(-EEXIST);
+
+   if (flags & FOLL_TOUCH)
+   touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
+
+   ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
+   if (!ctx->pgmap)
+   return ERR_PTR(-EFAULT);
+   }
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+   page = pfn_to_page(pfn);
+
+   if (!pud_devmap(pud) && !pud_write(pud) &&
+   gup_must_unshare(vma, flags, page))
+   return ERR_PTR(-EMLINK);
+
+   ret = try_grab_page(page, flags);
+   if (ret)
+   page = ERR_PTR(ret);
+   else
+   ctx->page_mask = HPAGE_PUD_NR - 1;
+
+   return page;
+}
+#else  /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+static struct page *follow_huge_pud(struct vm_area_struct *vma,
+   unsigned long addr, pud_t *pudp,
+   int flags, struct follow_page_context *ctx)
+{
+   return NULL;
+}
+#endif /* CONFIG_PGTABLE_HAS_HUGE_LEAVES */
+
 static 

[PATCH 09/13] mm/gup: Cache *pudp in follow_pud_mask()

2023-12-19 Thread peterx
From: Peter Xu 

Introduce "pud_t pud" in the function, so the code won't dereference *pudp
multiple time.  Not only because that looks less straightforward, but also
because if the dereference really happened, it's not clear whether there
can be race to see different *pudp values if it's being modified at the
same time.
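
To illustrate the concern (a contrived sketch, not code from this patch):
with two separate dereferences, nothing guarantees that both checks observe
the same entry if it is being cleared concurrently:

        if (pud_devmap(*pudp))          /* read #1 sees a devmap leaf */
                pfn = pud_pfn(*pudp);   /* read #2 may see a cleared entry */

Reading the entry once into a local "pud_t pud" and testing only that copy
avoids the question entirely.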

Signed-off-by: Peter Xu 
---
 mm/gup.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 6c0d82fa8cc7..97e87b7a15c3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -753,26 +753,27 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
unsigned int flags,
struct follow_page_context *ctx)
 {
-   pud_t *pud;
+   pud_t *pudp, pud;
spinlock_t *ptl;
struct page *page;
struct mm_struct *mm = vma->vm_mm;
 
-   pud = pud_offset(p4dp, address);
-   if (pud_none(*pud))
+   pudp = pud_offset(p4dp, address);
+   pud = *pudp;
+   if (pud_none(pud))
return no_page_table(vma, flags, address);
-   if (pud_devmap(*pud)) {
-   ptl = pud_lock(mm, pud);
-   page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
+   if (pud_devmap(pud)) {
+   ptl = pud_lock(mm, pudp);
+   page = follow_devmap_pud(vma, address, pudp, flags, &ctx->pgmap);
spin_unlock(ptl);
if (page)
return page;
return no_page_table(vma, flags, address);
}
-   if (unlikely(pud_bad(*pud)))
+   if (unlikely(pud_bad(pud)))
return no_page_table(vma, flags, address);
 
-   return follow_pmd_mask(vma, address, pud, flags, ctx);
+   return follow_pmd_mask(vma, address, pudp, flags, ctx);
 }
 
 static struct page *follow_p4d_mask(struct vm_area_struct *vma,
-- 
2.41.0



[PATCH 08/13] mm/gup: Handle hugetlb for no_page_table()

2023-12-19 Thread peterx
From: Peter Xu 

no_page_table() is not yet used for hugetlb code paths.  Prepare it for that.

The major difference here is that hugetlb will return -EFAULT as long as the
page cache does not exist, even for VM_SHARED.  See hugetlb_follow_page_mask().

Pass "address" into no_page_table() too, as hugetlb will need it.

Reviewed-by: Christoph Hellwig 
Signed-off-by: Peter Xu 
---
 mm/gup.c | 44 ++--
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 82d28d517d0d..6c0d82fa8cc7 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -501,19 +501,27 @@ static inline void mm_set_has_pinned_flag(unsigned long *mm_flags)
 
 #ifdef CONFIG_MMU
 static struct page *no_page_table(struct vm_area_struct *vma,
-   unsigned int flags)
+ unsigned int flags, unsigned long address)
 {
+   if (!(flags & FOLL_DUMP))
+   return NULL;
+
/*
-* When core dumping an enormous anonymous area that nobody
-* has touched so far, we don't want to allocate unnecessary pages or
+* When core dumping, we don't want to allocate unnecessary pages or
 * page tables.  Return error instead of NULL to skip handle_mm_fault,
 * then get_dump_page() will return NULL to leave a hole in the dump.
 * But we can only make this optimization where a hole would surely
 * be zero-filled if handle_mm_fault() actually did handle it.
 */
-   if ((flags & FOLL_DUMP) &&
-   (vma_is_anonymous(vma) || !vma->vm_ops->fault))
+   if (is_vm_hugetlb_page(vma)) {
+   struct hstate *h = hstate_vma(vma);
+
+   if (!hugetlbfs_pagecache_present(h, vma, address))
+   return ERR_PTR(-EFAULT);
+   } else if ((vma_is_anonymous(vma) || !vma->vm_ops->fault)) {
return ERR_PTR(-EFAULT);
+   }
+
return NULL;
 }
 
@@ -593,7 +601,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!ptep)
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
pte = ptep_get(ptep);
if (!pte_present(pte))
goto no_page;
@@ -685,7 +693,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
if (!pte_none(pte))
return NULL;
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
 }
 
 static struct page *follow_pmd_mask(struct vm_area_struct *vma,
@@ -701,27 +709,27 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
pmd = pmd_offset(pudp, address);
pmdval = pmdp_get_lockless(pmd);
if (pmd_none(pmdval))
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
if (!pmd_present(pmdval))
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
if (pmd_devmap(pmdval)) {
ptl = pmd_lock(mm, pmd);
page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
spin_unlock(ptl);
if (page)
return page;
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
}
if (likely(!pmd_trans_huge(pmdval)))
return follow_page_pte(vma, address, pmd, flags, >pgmap);
 
if (pmd_protnone(pmdval) && !gup_can_follow_protnone(vma, flags))
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
 
ptl = pmd_lock(mm, pmd);
if (unlikely(!pmd_present(*pmd))) {
spin_unlock(ptl);
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
}
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(ptl);
@@ -752,17 +760,17 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 
pud = pud_offset(p4dp, address);
if (pud_none(*pud))
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
if (pud_devmap(*pud)) {
ptl = pud_lock(mm, pud);
page = follow_devmap_pud(vma, address, pud, flags, &ctx->pgmap);
spin_unlock(ptl);
if (page)
return page;
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
}
if (unlikely(pud_bad(*pud)))
-   return no_page_table(vma, flags);
+   return no_page_table(vma, flags, address);
 
return follow_pmd_mask(vma, address, pud, flags, ctx);
 }
@@ -776,10 +784,10 @@ static struct page 

[PATCH 07/13] mm/gup: Refactor record_subpages() to find 1st small page

2023-12-19 Thread peterx
From: Peter Xu 

All the fast-gup functions take a tail page to operate on, and always need
to do page-mask calculations before feeding it into record_subpages().

Merge that logic into record_subpages(), so that it does the nth_page()
calculation itself.
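
A quick worked example of the merged calculation (illustrative only): for a
PMD-sized folio with sz = PMD_SIZE and a GUP range starting 5 pages into the
huge page, (addr & (sz - 1)) >> PAGE_SHIFT is 5, so record_subpages() now
does internally what every caller used to open-code:

        start_page = nth_page(page, 5);
        pages[0] = nth_page(start_page, 0);     /* head page + 5 */
        pages[1] = nth_page(start_page, 1);     /* head page + 6 */
        /* ... one entry per PAGE_SIZE step until addr reaches end */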

Signed-off-by: Peter Xu 
---
 mm/gup.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index bb5b7134f10b..82d28d517d0d 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2767,13 +2767,16 @@ static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
 }
 #endif
 
-static int record_subpages(struct page *page, unsigned long addr,
-  unsigned long end, struct page **pages)
+static int record_subpages(struct page *page, unsigned long sz,
+  unsigned long addr, unsigned long end,
+  struct page **pages)
 {
+   struct page *start_page;
int nr;
 
+   start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT);
for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
-   pages[nr] = nth_page(page, nr);
+   pages[nr] = nth_page(start_page, nr);
 
return nr;
 }
@@ -2808,8 +2811,8 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
/* hugepages are never "special" */
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 
-   page = nth_page(pte_page(pte), (addr & (sz - 1)) >> PAGE_SHIFT);
-   refs = record_subpages(page, addr, end, pages + *nr);
+   page = pte_page(pte);
+   refs = record_subpages(page, sz, addr, end, pages + *nr);
 
folio = try_grab_folio(page, refs, flags);
if (!folio)
@@ -2882,8 +2885,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 pages, nr);
}
 
-   page = nth_page(pmd_page(orig), (addr & ~PMD_MASK) >> PAGE_SHIFT);
-   refs = record_subpages(page, addr, end, pages + *nr);
+   page = pmd_page(orig);
+   refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
 
folio = try_grab_folio(page, refs, flags);
if (!folio)
@@ -2926,8 +2929,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 pages, nr);
}
 
-   page = nth_page(pud_page(orig), (addr & ~PUD_MASK) >> PAGE_SHIFT);
-   refs = record_subpages(page, addr, end, pages + *nr);
+   page = pud_page(orig);
+   refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);
 
folio = try_grab_folio(page, refs, flags);
if (!folio)
@@ -2966,8 +2969,8 @@ static int gup_huge_pgd(pgd_t orig, pgd_t *pgdp, unsigned long addr,
 
BUILD_BUG_ON(pgd_devmap(orig));
 
-   page = nth_page(pgd_page(orig), (addr & ~PGDIR_MASK) >> PAGE_SHIFT);
-   refs = record_subpages(page, addr, end, pages + *nr);
+   page = pgd_page(orig);
+   refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
 
folio = try_grab_folio(page, refs, flags);
if (!folio)
-- 
2.41.0



[PATCH 06/13] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-12-19 Thread peterx
From: Peter Xu 

The hugepd format for GUP is only used on PowerPC with hugetlbfs.  There
are some kernel usages of hugepd (see hugepd_populate_kernel() for
PPC_8XX), but those pages are not candidates for GUP.

Commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM GUP-fast writing to
file-backed mappings") added a check to fail gup-fast if there's a
potential risk of violating GUP over writeback file systems.  That should
never apply to hugepd.  Considering that hugepd is an old (and even
software-only) format, there's no plan to extend hugepd to other
file-backed memory types that are prone to the same issue.

Drop that check, not only because it'll never be true for hugepd per any
known plan, but also because it paves the way for reusing the function
outside fast-gup.

To make sure we still remember this issue in case hugepd is ever extended
to support non-hugetlbfs memory, add a rich comment above gup_huge_pd()
explaining the issue with proper references.

Cc: Christoph Hellwig 
Cc: Lorenzo Stoakes 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Peter Xu 
---
 mm/gup.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index efc9847e58fb..bb5b7134f10b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2820,11 +2820,6 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 0;
}
 
-   if (!folio_fast_pin_allowed(folio, flags)) {
-   gup_put_folio(folio, refs, flags);
-   return 0;
-   }
-
if (!pte_write(pte) && gup_must_unshare(NULL, flags, &folio->page)) {
gup_put_folio(folio, refs, flags);
return 0;
@@ -2835,6 +2830,14 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
return 1;
 }
 
+/*
+ * NOTE: currently GUP for a hugepd is only possible on hugetlbfs file
+ * systems on Power, which does not have issue with folio writeback against
+ * GUP updates.  When hugepd will be extended to support non-hugetlbfs or
+ * even anonymous memory, we need to do extra check as what we do with most
+ * of the other folios. See writable_file_mapping_allowed() and
+ * folio_fast_pin_allowed() for more information.
+ */
 static int gup_huge_pd(hugepd_t hugepd, unsigned long addr,
unsigned int pdshift, unsigned long end, unsigned int flags,
struct page **pages, int *nr)
-- 
2.41.0



[PATCH 05/13] mm: Introduce vma_pgtable_walk_{begin|end}()

2023-12-19 Thread peterx
From: Peter Xu 

Introduce per-vma begin()/end() helpers for pgtable walks.  This is
preparation work for merging the hugetlb pgtable walkers with generic mm.

The helpers need to be called before and after a pgtable walk; they will
start to be needed once the pgtable walker code supports hugetlb pages.
It's a hook point for any type of VMA, but for now only hugetlb uses it,
to keep the pgtable pages stable against going away (due to possible pmd
unsharing).
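
A minimal usage sketch (hypothetical caller, not part of this patch):

        vma_pgtable_walk_begin(vma);
        /* ... walk pgd/p4d/pud/pmd/pte for the range covered by vma ... */
        vma_pgtable_walk_end(vma);

For a hugetlb VMA this takes and drops the vma lock in read mode, which
keeps PMD-shared page tables from being unshared and freed under the
walker; for any other VMA the two calls are no-ops.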

Reviewed-by: Christoph Hellwig 
Signed-off-by: Peter Xu 
---
 include/linux/mm.h |  3 +++
 mm/memory.c        | 12 
 2 files changed, 15 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b72bf25a45cf..85e43775824b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4181,4 +4181,7 @@ static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
 }
 
+void vma_pgtable_walk_begin(struct vm_area_struct *vma);
+void vma_pgtable_walk_end(struct vm_area_struct *vma);
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index 1795aba53cf5..9ac6a9db971e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6270,3 +6270,15 @@ void ptlock_free(struct ptdesc *ptdesc)
kmem_cache_free(page_ptl_cachep, ptdesc->ptl);
 }
 #endif
+
+void vma_pgtable_walk_begin(struct vm_area_struct *vma)
+{
+   if (is_vm_hugetlb_page(vma))
+   hugetlb_vma_lock_read(vma);
+}
+
+void vma_pgtable_walk_end(struct vm_area_struct *vma)
+{
+   if (is_vm_hugetlb_page(vma))
+   hugetlb_vma_unlock_read(vma);
+}
-- 
2.41.0