date:20240416

Re: [kvm-unit-tests PATCH v8 09/35] powerpc: Fix stack backtrace termination

2024-04-16 Thread Thomas Huth


On 05/04/2024 10.35, Nicholas Piggin wrote:

The backtrace handler terminates when it sees a NULL caller address,
but the powerpc stack setup does not keep such a NULL caller frame
at the start of the stack.

This happens to work on pseries because the memory at 0 is mapped and
it contains 0 at the location of the return address pointer if it
were a stack frame. But this is fragile, and does not work with powernv
where address 0 contains firmware instructions.

Use the existing dummy frame on stack as the NULL caller, and create a
new frame on stack for the entry code.

Signed-off-by: Nicholas Piggin 
---
  powerpc/cstart64.S | 15 +++
  1 file changed, 15 insertions(+)


Reviewed-by: Thomas Huth

Re: [kvm-unit-tests PATCH v8 06/35] gitlab-ci: Run migration selftest on s390x and powerpc

2024-04-16 Thread Thomas Huth


On 11/04/2024 21.22, Thomas Huth wrote:

On 08/04/2024 18.06, Nico Boehr wrote:

Quoting Nicholas Piggin (2024-04-05 10:35:07)

The migration harness is complicated and easy to break so CI will
be helpful.

Signed-off-by: Nicholas Piggin 
---
  .gitlab-ci.yml  | 32 +++-
  s390x/unittests.cfg |  8 
  2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index ff34b1f50..60b3cdfd2 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml

[...]

@@ -135,7 +147,7 @@ build-riscv64:
  build-s390x:
   extends: .outoftree_template
   script:
- - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu
+ - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu nmap-ncat
   - mkdir build
   - cd build
   - ../configure --arch=s390x --cross-prefix=s390x-linux-gnu-
@@ -161,6 +173,8 @@ build-s390x:
    sclp-1g
    sclp-3g
    selftest-setup
+  selftest-migration-kvm


We're running under TCG in the Gitlab CI. I'm a little bit confused why
we're running a KVM-only test here.


The build-s390x job is TCG, indeed, but we have the "s390x-kvm" job that 
runs on a KVM-capable s390x host, so it could be added there?


I now gave it a try and it seems to work, so I updated this patch and pushed 
it to the repository now.


 Thomas

Re: [PATCH v4 05/15] mm: introduce execmem_alloc() and execmem_free()

2024-04-16 Thread Mike Rapoport

On Mon, Apr 15, 2024 at 06:36:39PM +0100, Mark Rutland wrote:
> On Mon, Apr 15, 2024 at 09:52:41AM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 11, 2024 at 07:00:41PM +0300, Mike Rapoport wrote:
> > > +/**
> > > + * enum execmem_type - types of executable memory ranges
> > > + *
> > > + * There are several subsystems that allocate executable memory.
> > > + * Architectures define different restrictions on placement,
> > > + * permissions, alignment and other parameters for memory that can be 
> > > used
> > > + * by these subsystems.
> > > + * Types in this enum identify subsystems that allocate executable memory
> > > + * and let architectures define parameters for ranges suitable for
> > > + * allocations by each subsystem.
> > > + *
> > > + * @EXECMEM_DEFAULT: default parameters that would be used for types that
> > > + * are not explcitly defined.
> > > + * @EXECMEM_MODULE_TEXT: parameters for module text sections
> > > + * @EXECMEM_KPROBES: parameters for kprobes
> > > + * @EXECMEM_FTRACE: parameters for ftrace
> > > + * @EXECMEM_BPF: parameters for BPF
> > > + * @EXECMEM_TYPE_MAX:
> > > + */
> > > +enum execmem_type {
> > > + EXECMEM_DEFAULT,
> > > + EXECMEM_MODULE_TEXT = EXECMEM_DEFAULT,
> > > + EXECMEM_KPROBES,
> > > + EXECMEM_FTRACE,
> > > + EXECMEM_BPF,
> > > + EXECMEM_TYPE_MAX,
> > > +};
> > 
> > Can we please get a break-down of how all these types are actually
> > different from one another?
> > 
> > I'm thinking some platforms have a tiny immediate space (arm64 comes to
> > mind) and has less strict placement constraints for some of them?
> 
> Yeah, and really I'd *much* rather deal with that in arch code, as I have said
> several times.
> 
> For arm64 we have two bsaic restrictions: 
> 
> 1) Direct branches can go +/-128M
>We can expand this range by having direct branches go to PLTs, at a
>performance cost.
> 
> 2) PREL32 relocations can go +/-2G
>We cannot expand this further.
> 
> * We don't need to allocate memory for ftrace. We do not use trampolines.
> 
> * Kprobes XOL areas don't care about either of those; we don't place any
>   PC-relative instructions in those. Maybe we want to in future.
> 
> * Modules care about both; we'd *prefer* to place them within +/-128M of all
>   other kernel/module code, but if there's no space we can use PLTs and expand
>   that to +/-2G. Since modules can refreence other modules, that ends up
>   actually being halved, and modules have to fit within some 2G window that
>   also covers the kernel.
> 
> * I'm not sure about BPF's requirements; it seems happy doing the same as
>   modules.

BPF are happy with vmalloc().
 
> So if we *must* use a common execmem allocator, what we'd reall want is our 
> own
> types, e.g.
> 
>   EXECMEM_ANYWHERE
>   EXECMEM_NOPLT
>   EXECMEM_PREL32
> 
> ... and then we use those in arch code to implement module_alloc() and 
> friends.

I'm looking at execmem_types more as definition of the consumers, maybe I
should have named the enum execmem_consumer at the first place.

And the arch constrains defined in struct execmem_range describe how memory
should be allocated for each consumer.

These constraints are defined early at boot and remain static, so
initializing them once and letting a common allocator use them makes
perfect sense to me.

I agree that fallback_{start,end} are not ideal, but we have 3
architectures that have preferred and secondary range for modules. And arm
and powerpc use the same logic for kprobes as well, and I don't see why this
code should be duplicated.

And, for instance, if you decide to place PC-relative instructions if
kprobes XOL areas, you'd only need to update execmem_range for kprobes to
be more like the range for modules.

With central allocator it's easier to deal with the things like
VM_FLUSH_RESET_PERMS and caching of ROX memory and I think it will be more
maintainable that module_alloc(), alloc_insn_page() and
bpf_jit_alloc_exec() spread all over the place.
 
> Mark.

-- 
Sincerely yours,
Mike.

Re: [kvm-unit-tests PATCH v8 10/35] powerpc: interrupt stack backtracing

2024-04-16 Thread Thomas Huth


On 05/04/2024 10.35, Nicholas Piggin wrote:

Add support for backtracing across interrupt stacks, and add
interrupt frame backtrace for unhandled interrupts.

This requires a back-chain created from initial interrupt stack
frame to the r1 value of the interrupted context. A label is
added at the return location of the exception handler call, so
the unwinder can recognize the initial interrupt frame.

The additional cstart entry-frame is no longer required because
the unwinder now looks for frame == 0 as well as address == 0.

Signed-off-by: Nicholas Piggin 
---


Acked-by: Thomas Huth

Re: [RFC PATCH 6/7] execmem: add support for cache of large ROX pages

2024-04-16 Thread Peter Zijlstra

On Mon, Apr 15, 2024 at 08:00:26PM +0300, Mike Rapoport wrote:
> On Mon, Apr 15, 2024 at 12:47:50PM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 11, 2024 at 07:05:25PM +0300, Mike Rapoport wrote:
> > 
> > > To populate the cache, a writable large page is allocated from vmalloc 
> > > with
> > > VM_ALLOW_HUGE_VMAP, filled with invalid instructions and then remapped as
> > > ROX.
> > 
> > > +static void execmem_invalidate(void *ptr, size_t size, bool writable)
> > > +{
> > > + if (execmem_info->invalidate)
> > > + execmem_info->invalidate(ptr, size, writable);
> > > + else
> > > + memset(ptr, 0, size);
> > > +}
> > 
> > +static void execmem_invalidate(void *ptr, size_t size, bool writeable)
> > +{
> > +   /* fill memory with INT3 instructions */
> > +   if (writeable)
> > +   memset(ptr, 0xcc, size);
> > +   else
> > +   text_poke_set(ptr, 0xcc, size);
> > +}
> > 
> > Thing is, 0xcc (aka INT3_INSN_OPCODE) is not an invalid instruction.
> > It raises #BP not #UD.
> 
> Do you mean that _invalidate is a poor name choice or that it's necessary
> to use an instruction that raises #UD?

Poor naming, mostly. #BP handler will still scream bloody murder if the
site is otherwise unclaimed.

It just isn't an invalid instruction.

Re: [kvm-unit-tests PATCH v8 06/35] gitlab-ci: Run migration selftest on s390x and powerpc

2024-04-16 Thread Thomas Huth


On 16/04/2024 09.18, Thomas Huth wrote:

On 11/04/2024 21.22, Thomas Huth wrote:

On 08/04/2024 18.06, Nico Boehr wrote:

Quoting Nicholas Piggin (2024-04-05 10:35:07)

The migration harness is complicated and easy to break so CI will
be helpful.

Signed-off-by: Nicholas Piggin 
---
  .gitlab-ci.yml  | 32 +++-
  s390x/unittests.cfg |  8 
  2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index ff34b1f50..60b3cdfd2 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml

[...]

@@ -135,7 +147,7 @@ build-riscv64:
  build-s390x:
   extends: .outoftree_template
   script:
- - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu
+ - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu nmap-ncat
   - mkdir build
   - cd build
   - ../configure --arch=s390x --cross-prefix=s390x-linux-gnu-
@@ -161,6 +173,8 @@ build-s390x:
    sclp-1g
    sclp-3g
    selftest-setup
+  selftest-migration-kvm


We're running under TCG in the Gitlab CI. I'm a little bit confused why
we're running a KVM-only test here.


The build-s390x job is TCG, indeed, but we have the "s390x-kvm" job that 
runs on a KVM-capable s390x host, so it could be added there?


I now gave it a try and it seems to work, so I updated this patch and pushed 
it to the repository now.


Hmm, "selftest-migration" now was failing once here:

 https://gitlab.com/kvm-unit-tests/kvm-unit-tests/-/jobs/6633865591

Let's keep an eye on it, and if it is not stable enough, we might need to 
disable it in the CI again...


 Thomas

[PATCH v9 2/3] powerpc/fadump: add hotplug_ready sysfs interface

2024-04-16 Thread Sourabh Jain

The elfcorehdr describes the CPUs and memory of the crashed kernel to
the kernel that captures the dump, known as the second or fadump kernel.
The elfcorehdr needs to be updated if the system's memory changes due to
memory hotplug or online/offline events.

Currently, memory hotplug events are monitored in userspace by udev
rules, and fadump is re-registered, which recreates the elfcorehdr with
the latest available memory in the system.

However, the previous patch ("powerpc: make fadump resilient with memory
add/remove events") moved the creation of elfcorehdr to the second or
fadump kernel. This eliminates the need to regenerate the elfcorehdr
during memory hotplug or online/offline events.

Create a sysfs entry at /sys/kernel/fadump/hotplug_ready to let
userspace know that fadump re-registration is not required for memory
add/remove events.

Signed-off-by: Sourabh Jain 
Cc: Aditya Gupta 
Cc: "Aneesh Kumar K.V" 
Cc: Hari Bathini 
Cc: Mahesh Salgaonkar 
Cc: Michael Ellerman 
Cc: Naveen N Rao 
---
 Documentation/ABI/testing/sysfs-kernel-fadump | 11 +++
 arch/powerpc/kernel/fadump.c  | 14 ++
 2 files changed, 25 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-kernel-fadump 
b/Documentation/ABI/testing/sysfs-kernel-fadump
index 8f7a64a81783..c586054657d6 100644
--- a/Documentation/ABI/testing/sysfs-kernel-fadump
+++ b/Documentation/ABI/testing/sysfs-kernel-fadump
@@ -38,3 +38,14 @@ Contact: linuxppc-dev@lists.ozlabs.org
 Description:   read only
Provide information about the amount of memory reserved by
FADump to save the crash dump in bytes.
+
+What:  /sys/kernel/fadump/hotplug_ready
+Date:  Apr 2024
+Contact:   linuxppc-dev@lists.ozlabs.org
+Description:   read only
+   Kdump udev rule re-registers fadump on memory add/remove events,
+   primarily to update the elfcorehdr. This sysfs indicates the
+   kdump udev rule that fadump re-registration is not required on
+   memory add/remove events because elfcorehdr is now prepared in
+   the second/fadump kernel.
+User:  kexec-tools
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index e816725a11c0..131bf0d2d45d 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1442,6 +1442,18 @@ static ssize_t enabled_show(struct kobject *kobj,
return sprintf(buf, "%d\n", fw_dump.fadump_enabled);
 }
 
+/*
+ * /sys/kernel/fadump/hotplug_ready sysfs node returns 1, which inidcates
+ * to usersapce that fadump re-registration is not required on memory
+ * hotplug events.
+ */
+static ssize_t hotplug_ready_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *buf)
+{
+   return sprintf(buf, "%d\n", 1);
+}
+
 static ssize_t mem_reserved_show(struct kobject *kobj,
 struct kobj_attribute *attr,
 char *buf)
@@ -1514,11 +1526,13 @@ static struct kobj_attribute release_attr = 
__ATTR_WO(release_mem);
 static struct kobj_attribute enable_attr = __ATTR_RO(enabled);
 static struct kobj_attribute register_attr = __ATTR_RW(registered);
 static struct kobj_attribute mem_reserved_attr = __ATTR_RO(mem_reserved);
+static struct kobj_attribute hotplug_ready_attr = __ATTR_RO(hotplug_ready);
 
 static struct attribute *fadump_attrs[] = {
&enable_attr.attr,
®ister_attr.attr,
&mem_reserved_attr.attr,
+   &hotplug_ready_attr.attr,
NULL,
 };
 
-- 
2.43.0

[PATCH v9 1/3] powerpc: make fadump resilient with memory add/remove events

2024-04-16 Thread Sourabh Jain

Due to changes in memory resources caused by either memory hotplug or
online/offline events, the elfcorehdr, which describes the CPUs and
memory of the crashed kernel to the kernel that collects the dump (known
as second/fadump kernel), becomes outdated. Consequently, attempting
dump collection with an outdated elfcorehdr can lead to failed or
inaccurate dump collection.

Memory hotplug or online/offline events is referred as memory add/remove
events in reset of the commit message.

The current solution to address the aforementioned issue is as follows:
Monitor memory add/remove events in userspace using udev rules, and
re-register fadump whenever there are changes in memory resources. This
leads to the creation of a new elfcorehdr with updated system memory
information.

There are several notable issues associated with re-registering fadump
for every memory add/remove events.

1. Bulk memory add/remove events with udev-based fadump re-registration
   can lead to race conditions and, more importantly, it creates a wide
   window during which fadump is inactive until all memory add/remove
   events are settled.
2. Re-registering fadump for every memory add/remove event is
   inefficient.
3. The memory for elfcorehdr is allocated based on the memblock regions
   available during early boot and remains fixed thereafter. However, if
   elfcorehdr is later recreated with additional memblock regions, its
   size will increase, potentially leading to memory corruption.

Address the aforementioned challenges by shifting the creation of
elfcorehdr from the first kernel (also referred as the crashed kernel),
where it was created and frequently recreated for every memory
add/remove event, to the fadump kernel. As a result, the elfcorehdr only
needs to be created once, thus eliminating the necessity to re-register
fadump during memory add/remove events.

At present, the first kernel prepares fadump header and stores it in the
fadump reserved area. The fadump header includes the start address of
the elfcorehdr, crashing CPU details, and other relevant information. In
the event of a crash in the first kernel, the second/fadump boots and
accesses the fadump header prepared by the first kernel. It then
performs the following steps in a platform-specific function
[rtas|opal]_fadump_process:

1. Sanity check for fadump header
2. Update CPU notes in elfcorehdr

Along with the above, update the setup_fadump()/fadump.c to create
elfcorehdr and set its address to the global variable elfcorehdr_addr
for the vmcore module to process it in the second/fadump kernel.

Section below outlines the information required to create the elfcorehdr
and the changes made to make it available to the fadump kernel if it's
not already.

To create elfcorehdr, the following crashed kernel information is
required: CPU notes, vmcoreinfo, and memory ranges.

At present, the CPU notes are already prepared in the fadump kernel, so
no changes are needed in that regard. The fadump kernel has access to
all crashed kernel memory regions, including boot memory regions that
are relocated by firmware to fadump reserved areas, so no changes for
that either. However, it is necessary to add new members to the fadump
header, i.e., the 'fadump_crash_info_header' structure, in order to pass
the crashed kernel's vmcoreinfo address and its size to fadump kernel.

In addition to the vmcoreinfo address and size, there are a few other
attributes also added to the fadump_crash_info_header structure.

1. version:
   It stores the fadump header version, which is currently set to 1.
   This provides flexibility to update the fadump crash info header in
   the future without changing the magic number. For each change in the
   fadump header, the version will be increased. This will help the
   updated kernel determine how to handle kernel dumps from older
   kernels. The magic number remains relevant for checking fadump header
   corruption.

2. pt_regs_sz/cpu_mask_sz:
   Store size of pt_regs and cpu_mask structure of first kernel. These
   attributes are used to prevent dump processing if the sizes of
   pt_regs or cpu_mask structure differ between the first and fadump
   kernels.

Note: if either first/crashed kernel or second/fadump kernel do not have
the changes introduced here then kernel fail to collect the dump and
prints relevant error message on the console.

Signed-off-by: Sourabh Jain 
Cc: Aditya Gupta 
Cc: "Aneesh Kumar K.V" 
Cc: Hari Bathini 
Cc: Mahesh Salgaonkar 
Cc: Michael Ellerman 
Cc: Naveen N Rao 
---
 arch/powerpc/include/asm/fadump-internal.h   |  31 +-
 arch/powerpc/kernel/fadump.c | 361 +++
 arch/powerpc/platforms/powernv/opal-fadump.c |  22 +-
 arch/powerpc/platforms/pseries/rtas-fadump.c |  34 +-
 4 files changed, 242 insertions(+), 206 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump-internal.h 
b/arch/powerpc/include/asm/fadump-internal.h
index 27f9e11eda28..5d706a7acc8a 100644
--- a/arch/powerpc/include/a

[PATCH v9 3/3] Documentation/powerpc: update fadump implementation details

2024-04-16 Thread Sourabh Jain

The patch titled ("powerpc: make fadump resilient with memory add/remove
events") has made significant changes to the implementation of fadump,
particularly on elfcorehdr creation and fadump crash info header
structure. Therefore, updating the fadump implementation documentation
to reflect those changes.

Following updates are done to firmware assisted dump documentation:

1. The elfcorehdr is no longer stored after fadump HDR in the reserved
   dump area. Instead, the second kernel dynamically allocates memory
   for the elfcorehdr within the address range from 0 to the boot memory
   size. Therefore, update figures 1 and 2 of Memory Reservation during
   the first and second kernels to reflect this change.

2. A version field has been added to the fadump header to manage the
   future changes to fadump crash info header structure without changing
   the fadump header magic number in the future. Therefore, remove the
   corresponding TODO from the document.

Signed-off-by: Sourabh Jain 
Cc: Aditya Gupta 
Cc: Aneesh Kumar K.V 
Cc: Hari Bathini 
Cc: Mahesh Salgaonkar 
Cc: Michael Ellerman 
Cc: Naveen N Rao 
---
 .../arch/powerpc/firmware-assisted-dump.rst   | 91 +--
 1 file changed, 42 insertions(+), 49 deletions(-)

diff --git a/Documentation/arch/powerpc/firmware-assisted-dump.rst 
b/Documentation/arch/powerpc/firmware-assisted-dump.rst
index e363fc48529a..7e37aadd1f77 100644
--- a/Documentation/arch/powerpc/firmware-assisted-dump.rst
+++ b/Documentation/arch/powerpc/firmware-assisted-dump.rst
@@ -134,12 +134,12 @@ that are run. If there is dump data, then the
 memory is held.
 
 If there is no waiting dump data, then only the memory required to
-hold CPU state, HPTE region, boot memory dump, FADump header and
-elfcore header, is usually reserved at an offset greater than boot
-memory size (see Fig. 1). This area is *not* released: this region
-will be kept permanently reserved, so that it can act as a receptacle
-for a copy of the boot memory content in addition to CPU state and
-HPTE region, in the case a crash does occur.
+hold CPU state, HPTE region, boot memory dump, and FADump header is
+usually reserved at an offset greater than boot memory size (see Fig. 1).
+This area is *not* released: this region will be kept permanently
+reserved, so that it can act as a receptacle for a copy of the boot
+memory content in addition to CPU state and HPTE region, in the case
+a crash does occur.
 
 Since this reserved memory area is used only after the system crash,
 there is no point in blocking this significant chunk of memory from
@@ -153,22 +153,22 @@ that were present in CMA region::
 
   o Memory Reservation during first kernel
 
-  Low memory Top of memory
-  0boot memory size   |<--- Reserved dump area --->|   |
-  |   |   |Permanent Reservation   |   |
-  V   V   ||   V
-  +---+-/ /---+---++---+-+-++--+
-  |   |   |///||  DUMP | HDR | ELF ||  |
-  +---+-/ /---+---++---+-+-++--+
-|   ^^ ^  ^   ^
-|   || |  |   |
-\  CPU  HPTE   /  |   |
- --   |   |
-  Boot memory content gets transferred|   |
-  to reserved area by firmware at the |   |
-  time of crash.  |   |
-  FADump Header   |
-   (meta area)|
+  Low memory  Top of memory
+  0boot memory size   |<-- Reserved dump area ->| |
+  |   |   |  Permanent Reservation  | |
+  V   V   | | V
+  +---+-/ /---+---++---+---++-+
+  |   |   |///||DUMP   |  HDR  || |
+  +---+-/ /---+---++---+---++-+
+|   ^^   ^ ^  ^
+|   ||   | |  |
+\  CPU  HPTE / |  |
+   |  |
+  Boot memory content gets transferred |  |
+  to reserved area by firmware at the  |  |
+  time of crash.   |  |
+   FADump Header  |
+(meta area)   |
   |
   |
   Metadata: This area holds a metadata structure whose
@@ -186,13 +186,20 @@ that were present in CMA

[PATCH v9 0/3] powerpc: make fadump resilient with memory add/remove events

2024-04-16 Thread Sourabh Jain

Problem:

Due to changes in memory resources caused by either memory hotplug or
online/offline events, the elfcorehdr, which describes the cpus and
memory of the crashed kernel to the kernel that collects the dump (known
as second/fadump kernel), becomes outdated. Consequently, attempting
dump collection with an outdated elfcorehdr can lead to failed or
inaccurate dump collection.

Memory hotplug or online/offline events is referred as memory add/remove
events in reset of the patch series.

Existing solution:
==
Monitor memory add/remove events in userspace using udev rules, and
re-register fadump whenever there are changes in memory resources. This
leads to the creation of a new elfcorehdr with updated system memory
information.

Challenges with existing solution:
==
1. Performing bulk memory add/remove with udev-based fadump
   re-registration can lead to race conditions and, more importantly,
   it creates a large wide window during which fadump is inactive until
   all memory add/remove events are settled.
2. Re-registering fadump for every memory add/remove event is
   inefficient.
3. Memory for elfcorehdr is allocated based on the memblock regions
   available during first kernel early boot and it remains fixed
   thereafter. However, if the elfcorehdr is later recreated with
   additional memblock regions, its size will increase, potentially
   leading to memory corruption.

Proposed solution:
==
Address the aforementioned challenges by shifting the creation of
elfcorehdr from the first kernel (also referred as the crashed kernel),
where it was created and frequently recreated for every memory
add/remove event, to the fadump kernel. As a result, the elfcorehdr only
needs to be created once, thus eliminating the necessity to re-register
fadump during memory add/remove events.

To know more about elfcorehdr creation in the fadump kernel, refer to
the first patch in this series.

The second patch includes a new sysfs interface that tells userspace
that fadump re-registration isn't needed for memory add/remove events. 
note that userspace changes do not need to be in sync with kernel
changes; they can roll out independently.

Since there are significant changes in the fadump implementation, the
third patch updates the fadump documentation to reflect the changes made
in this patch series.

Kernel tree rebased on 6.9.0-rc3 with patch series applied:
===
https://github.com/sourabhjains/linux/tree/fadump-mem-hotplug-v9

Userspace changes:
==
To realize this feature, one must update the kdump udev rules to prevent
fadump re-registration during memory add/remove events.

On rhel apply the following changes to file
/usr/lib/udev/rules.d/98-kexec.rules

-RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; 
/usr/bin/systemd-run --quiet --no-block /usr/lib/udev/kdump-udev-throttler'"
+# don't re-register fadump if the value of the node
+# /sys/kernel/fadump/hotplug_ready is 1.
+
+RUN+="/bin/sh -c '/usr/bin/systemctl is-active kdump.service || exit 0; ! test 
-f /sys/kernel/fadump_enabled || cat /sys/kernel/fadump_enabled | grep 0  || ! 
test -f /sys/kernel/fadump/hotplug_ready || cat 
/sys/kernel/fadump/hotplug_ready | grep 0 || exit 0; /usr/bin/systemd-run 
--quiet --no-block /usr/lib/udev/kdump-udev-throttler'"

Changelog:
==
v9: 16 Apr 2024
  - Set the physical address of elfcorehdr to elfcorehdr_addr. 1/3
  - Set elfcorehdr_addr to ELFCORE_ADDR_ERR before freeing the elfcorehdr. 1/3
  - Mark the populate_elf_pt_load function as __init. 1/3
  - Rename a function from process_fadump to fadump_process. 1/3
  - Make minor changes to the commit message and a couple of comments in 1/3.
  - Update date of introduction of /sys/kernel/fadump/hotplug_ready sysfs. 2/3
  - Rebase it to 6.9-rc3.

v8: 16 Feb 2024
  https://lore.kernel.org/all/20240217072004.148293-1-sourabhj...@linux.ibm.com/
  - Move `elfcorehdr_addr` and `elfcorehdr_size` struct attributes
from `struct fadump_crash_info_header` to `struct fw_dump`.
  - Make minor changes in commit message 1/3.
  - Rebase it to 6.8-rc4.

v7: 11 Jan 2024
  https://lore.kernel.org/all/2024040943.297501-1-sourabhj...@linux.ibm.com/
  - Rebase it to 6.7

v6: 8 Dec 2023
  https://lore.kernel.org/all/20231208115159.82236-1-sourabhj...@linux.ibm.com/
  - Add size fields for `pt_regs` and `cpumask` in the fadump header
structure
  - Don't process the dump if the size of `pt_regs` and `cpu_mask` is
not same in the crashed and fadump kernel
  - Include an additional check for endianness mismatch when the magic
number doesn't match, to print the relevant error message
  - Don't process the dump if the fadump header contains an old magic number
  - Rebased it to 6.7.0-rc4

v5: 29 Oct 2023 
  https://lore.kernel.org/all/20231029124548.12198-1-sourabhj...@linux.ibm.com/
  - Fix a comment on the fi

powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]

2024-04-16 Thread Naresh Kamboju

The Powerpc clang builds failed due to following warnings / errors on the
Linux next-20240416 tag.

Powerpc:
 - tqm8xx_defconfig + clang-17 - Failed
 - allnoconfig + clang-17 - Failed
 - tinyconfig + clang-17 - Failed

Reported-by: Linux Kernel Functional Testing 

Build log:

In file included from arch/powerpc/kernel/ptrace/ptrace.c:19:
In file included from include/linux/syscalls.h:93:
In file included from include/trace/syscall.h:7:
In file included from include/linux/trace_events.h:9:
In file included from include/linux/hardirq.h:11:
In file included from arch/powerpc/include/asm/hardirq.h:6:
In file included from include/linux/irq.h:20:
In file included from include/linux/io.h:14:
In file included from arch/powerpc/include/asm/io.h:672:
arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer
arithmetic on a null pointer has undefined behavior
[-Werror,-Wnull-pointer-arithmetic]
   43 | DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c),
  | ^~~
   44 |  (p, b, c), pio, p)
  |  ~~

Links:
 - 
https://qa-reports.linaro.org/lkft/linux-next-master/build/next-20240416/testrun/23495171/suite/build/test/clang-17-tqm8xx_defconfig/details/
 - 
https://storage.tuxsuite.com/public/linaro/lkft/builds/2fAvI3mKJ0dTHcazPrLl2zNF9JO/

--
Linaro LKFT
https://lkft.linaro.org

Re: [RFC PATCH 3/7] module: prepare to handle ROX allocations for text

2024-04-16 Thread Nadav Amit




> On 11 Apr 2024, at 19:05, Mike Rapoport  wrote:
> 
> @@ -2440,7 +2479,24 @@ static int post_relocation(struct module *mod, const 
> struct load_info *info)
>   add_kallsyms(mod, info);
> 
>   /* Arch-specific module finalizing. */
> - return module_finalize(info->hdr, info->sechdrs, mod);
> + ret = module_finalize(info->hdr, info->sechdrs, mod);
> + if (ret)
> + return ret;
> +
> + for_each_mod_mem_type(type) {
> + struct module_memory *mem = &mod->mem[type];
> +
> + if (mem->is_rox) {
> + if (!execmem_update_copy(mem->base, mem->rw_copy,
> +  mem->size))
> + return -ENOMEM;
> +
> + vfree(mem->rw_copy);
> + mem->rw_copy = NULL;
> + }
> + }
> +
> + return 0;
> }

I might be missing something, but it seems a bit racy.

IIUC, module_finalize() calls alternatives_smp_module_add(). At this
point, since you don’t hold the text_mutex, some might do text_poke(),
e.g., by enabling/disabling static-key, and the update would be
overwritten. No?

Re: [kvm-unit-tests PATCH v8 11/35] powerpc/sprs: Specify SPRs with data rather than code

2024-04-16 Thread Thomas Huth


On 05/04/2024 10.35, Nicholas Piggin wrote:

A significant rework that builds an array of 'struct spr', where each
element describes an SPR. This makes various metadata about the SPR
like name and access type easier to carry and use.

Hypervisor privileged registers are described despite not being used
at the moment for completeness, but also the code might one day be
reused for a hypervisor-privileged test.

Acked-by: Thomas Huth 
Signed-off-by: Nicholas Piggin 
---
  lib/powerpc/asm/reg.h |   2 +
  powerpc/sprs.c| 647 +-
  2 files changed, 457 insertions(+), 192 deletions(-)


I tried to merge this patch, but it seems to break running the tests on 
Ubuntu Focal with Clang 11 in the Travis-CI:


 https://app.travis-ci.com/github/huth/kvm-unit-tests/jobs/620577246

Could you please have a look?

 Thanks,
  Thomas

Re: [kvm-unit-tests PATCH v8 06/35] gitlab-ci: Run migration selftest on s390x and powerpc

2024-04-16 Thread Thomas Huth


On 16/04/2024 09.55, Thomas Huth wrote:

On 16/04/2024 09.18, Thomas Huth wrote:

On 11/04/2024 21.22, Thomas Huth wrote:

On 08/04/2024 18.06, Nico Boehr wrote:

Quoting Nicholas Piggin (2024-04-05 10:35:07)

The migration harness is complicated and easy to break so CI will
be helpful.

Signed-off-by: Nicholas Piggin 
---
  .gitlab-ci.yml  | 32 +++-
  s390x/unittests.cfg |  8 
  2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
index ff34b1f50..60b3cdfd2 100644
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml

[...]

@@ -135,7 +147,7 @@ build-riscv64:
  build-s390x:
   extends: .outoftree_template
   script:
- - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu
+ - dnf install -y qemu-system-s390x gcc-s390x-linux-gnu nmap-ncat
   - mkdir build
   - cd build
   - ../configure --arch=s390x --cross-prefix=s390x-linux-gnu-
@@ -161,6 +173,8 @@ build-s390x:
    sclp-1g
    sclp-3g
    selftest-setup
+  selftest-migration-kvm


We're running under TCG in the Gitlab CI. I'm a little bit confused why
we're running a KVM-only test here.


The build-s390x job is TCG, indeed, but we have the "s390x-kvm" job that 
runs on a KVM-capable s390x host, so it could be added there?


I now gave it a try and it seems to work, so I updated this patch and 
pushed it to the repository now.


Hmm, "selftest-migration" now was failing once here:

  https://gitlab.com/kvm-unit-tests/kvm-unit-tests/-/jobs/6633865591

Let's keep an eye on it, and if it is not stable enough, we might need to 
disable it in the CI again...


And it just failed again:

 https://gitlab.com/kvm-unit-tests/kvm-unit-tests/-/jobs/6635395811

... not sure whether this is due to the slow CI machine or whether there is 
still a bug lurking around somewhere, but anyway, I disabled it now for the 
CI again to avoid that other MRs get affected.


 Thomas

Re: [PATCH 0/2] ASoC: fsl-asoc-card: add wm8904 codec support

2024-04-16 Thread Mark Brown

On Mon, 15 Apr 2024 15:17:34 +0800, Shengjiu Wang wrote:
> add wm8904 codec support in fsl-asoc-card.
> 
> Shengjiu Wang (2):
>   ASoC: fsl-asoc-card: add wm8904 codec support
>   ASoC: dt-bindings: fsl-asoc-card: Add compatbile string for wm8904
> codec
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/2] ASoC: fsl-asoc-card: add wm8904 codec support
  commit: 51f67862ea6ea83f9fa4cda2e59d7bfd4387f63b
[2/2] ASoC: dt-bindings: fsl-asoc-card: Add compatbile string for wm8904 codec
  commit: 62c48dd33b4f2e037554d1322ae4f9f60e9461ef

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]

2024-04-16 Thread Segher Boessenkool

On Tue, Apr 16, 2024 at 03:02:52PM +0530, Naresh Kamboju wrote:
> In file included from arch/powerpc/include/asm/io.h:672:
> arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer
> arithmetic on a null pointer has undefined behavior
> [-Werror,-Wnull-pointer-arithmetic]

It is not UB, but just undefined: the program is meaningless.

It is not a null pointer but even a null pointer constant here.  It
matters in places, including here.

It would help if the warnings were more correct :-(

Segher

Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-04-16 Thread Christophe Leroy



Le 15/04/2024 à 21:12, Christophe Leroy a écrit :
> 
> 
> Le 12/04/2024 à 16:30, Peter Xu a écrit :
>> On Fri, Apr 12, 2024 at 02:08:03PM +, Christophe Leroy wrote:
>>>
>>>
>>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
 On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>> This series reimplements hugepages with hugepd on powerpc 8xx.
>>
>> Unlike most architectures, powerpc 8xx HW requires a two-level
>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>> is not feasible as such.
>>
>> Possible sizes are 4k, 16k, 512k and 8M.
>>
>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD 
>> entries
>> must point to a single entry level-2 page table. Until now that was
>> done using hugepd. This series changes it to use standard page tables
>> where the entry is replicated 1024 times on each of the two 
>> pagetables
>> refered by the two associated PMD entries for that 8M page.
>>
>> At the moment it has to look into each helper to know if the
>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
>> a lower size. I hope this can me handled by core-mm in the future.
>>
>> There are probably several ways to implement stuff, so feedback is
>> very welcome.
>
> I thought it looks pretty good!

 I second it.

 I saw the discussions in patch 1.  Christophe, I suppose you're 
 exploring
 the big hammer over hugepd, and perhaps went already with the 32bit pmd
 solution for nohash/32bit challenge you mentioned?

 I'm trying to position my next step; it seems like at least I should 
 not
 adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD 
 checks,
 or you're going to have an RFC soon then I can base on top?
>>>
>>> Depends on what you expect by "soon".
>>>
>>> I sure won't be able to send any RFC before end of April.
>>>
>>> Should be possible to have something during May.
>>
>> That's good enough, thanks.  I'll see what is the best I can do.
>>
>> Then do you think I can leave p4d/pgd leaves alone?  Please check the 
>> other
>> email where I'm not sure whether pgd leaves ever existed for any of
>> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
>> recognize pud and lower for all leaves.  Then if Power can switch from
>> hugepd to this it should just work.
> 
> Well, if I understand correctly, something with no PMD will include 
>  and will therefore only have  pmd 
> entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
> pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
> (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that 
> case.
> 
> And therefore including  means  you 
> have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
> checked.
> 
> 
>>
>> Even if pgd exists (then something I overlooked..), I'm wondering whether
>> we can push that downwards to be either pud/pmd (and looks like we all
>> agree p4d is never used on Power).  That may involve some pgtable
>> operations moving from pgd level to lower, e.g. my pure imagination would
>> look like starting with:
> 
> Yes I think there is no doubt that p4d is never used:
> 
> arch/powerpc/include/asm/book3s/32/pgtable.h:#include 
> 
> arch/powerpc/include/asm/book3s/64/pgtable.h:#include 
> 
> arch/powerpc/include/asm/nohash/32/pgtable.h:#include 
> 
> arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include 
> 
> 
> But that means that PPC32 have pmd entries and PPC64 have p4d entries ...
> 
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    0
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    (32 - PGDIR_SHIFT)
>>
>> To:
>>
>> #define PTE_INDEX_SIZE    PTE_SHIFT
>> #define PMD_INDEX_SIZE    (32 - PMD_SHIFT)
>> #define PUD_INDEX_SIZE    0
>> #define PGD_INDEX_SIZE    0
> 
> But then you can't anymore have #define PTRS_PER_PMD 1 from 
> 
> 
>>
>> And the rest will need care too.  I hope moving downward is easier
>> (e.g. the walker should always exist for lower levels but not always for
>> higher levels), but I actually have little idea on whether there's any
>> other implications, so please bare with me on stupid mistakes.
>>
>> I just hope pgd leaves don't exist already, then I think it'll be 
>> simpler.
>>
>> Thanks,
>>

Digging into asm-generic/pgtable-nopmd.h, I see a definition of 
pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5 
("riscv/mm: fix two page table check related issues")

So should asm-generic/pgtable-nopud.h contain the same for p4d_leaf() 
and asm-generic/pgtable-nop4d.h contain the same for pgd_leaf() ?

Christophe

Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]

2024-04-16 Thread Arnd Bergmann

On Tue, Apr 16, 2024, at 11:32, Naresh Kamboju wrote:
> The Powerpc clang builds failed due to following warnings / errors on the
> Linux next-20240416 tag.
>
> Powerpc:
>  - tqm8xx_defconfig + clang-17 - Failed
>  - allnoconfig + clang-17 - Failed
>  - tinyconfig + clang-17 - Failed
>
> Reported-by: Linux Kernel Functional Testing 

I'm not sure why this showed up now, but there is a series from
in progress that will avoid this in the future, as the same
issue is present on a couple of other architectures.

The broken definitions are in the !CONFIG_PCI path of

#ifndef CONFIG_PCI
#define _IO_BASE0
#define _ISA_MEM_BASE   0
#define PCI_DRAM_OFFSET 0
#elif defined(CONFIG_PPC32)
#define _IO_BASEisa_io_base
#define _ISA_MEM_BASE   isa_mem_base
#define PCI_DRAM_OFFSET pci_dram_offset
#else
#define _IO_BASEpci_io_base
#define _ISA_MEM_BASE   isa_mem_base
#define PCI_DRAM_OFFSET 0
#endif

Once the series is merged, the !PCI case can disable
CONFIG_HAS_IOPORT and move all references to it into #ifdef
sections, something like the (incomplete) patch below.

It looks like regardless of this, powerpc can also just set
_IO_BASE to ISA_IO_BASE unconditionally, but I could be missing
something there.

 Arnd

---
diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 08c550ed49be..29e002b9316c 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -36,11 +36,8 @@ extern struct pci_dev *isa_bridge_pcidev;
  * bases. Most of this file only uses _IO_BASE though which we
  * define properly based on the platform
  */
-#ifndef CONFIG_PCI
-#define _IO_BASE   0
-#define _ISA_MEM_BASE  0
-#define PCI_DRAM_OFFSET 0
-#elif defined(CONFIG_PPC32)
+#ifdef CONFIG_HAS_IOPORT
+#ifdef CONFIG_PPC32
 #define _IO_BASE   isa_io_base
 #define _ISA_MEM_BASE  isa_mem_base
 #define PCI_DRAM_OFFSETpci_dram_offset
@@ -486,8 +483,7 @@ static inline u64 __raw_rm_readq(volatile void __iomem 
*paddr)
  * to port it over
  */
 
-#ifdef CONFIG_PPC32
-
+#if defined(CONFIG_PPC32) && defined(CONFIG_HAS_IOPORT)
 #define __do_in_asm(name, op)  \
 static inline unsigned int name(unsigned int port) \
 {  \
@@ -534,7 +530,7 @@ __do_out_asm(_rec_outb, "stbx")
 __do_out_asm(_rec_outw, "sthbrx")
 __do_out_asm(_rec_outl, "stwbrx")
 
-#endif /* CONFIG_PPC32 */
+#endif /* CONFIG_PPC32 && CONFIG_HAS_IOPORT */
 
 /* The "__do_*" operations below provide the actual "base" implementation
  * for each of the defined accessors. Some of them use the out_* functions
@@ -577,6 +573,7 @@ __do_out_asm(_rec_outl, "stwbrx")
 #define __do_readq_be(addr)in_be64(PCI_FIX_ADDR(addr))
 #endif /* !defined(CONFIG_EEH) */
 
+#ifdef CONFIG_HAS_IOPORT
 #ifdef CONFIG_PPC32
 #define __do_outb(val, port)   _rec_outb(val, port)
 #define __do_outw(val, port)   _rec_outw(val, port)
@@ -592,6 +589,7 @@ __do_out_asm(_rec_outl, "stwbrx")
 #define __do_inw(port) readw((PCI_IO_ADDR)_IO_BASE + port);
 #define __do_inl(port) readl((PCI_IO_ADDR)_IO_BASE + port);
 #endif /* !CONFIG_PPC32 */
+#endif
 
 #ifdef CONFIG_EEH
 #define __do_readsb(a, b, n)   eeh_readsb(PCI_FIX_ADDR(a), (b), (n))
@@ -606,12 +604,14 @@ __do_out_asm(_rec_outl, "stwbrx")
 #define __do_writesw(a, b, n)  _outsw(PCI_FIX_ADDR(a),(b),(n))
 #define __do_writesl(a, b, n)  _outsl(PCI_FIX_ADDR(a),(b),(n))
 
+#ifdef CONFIG_HAS_IOPORT
 #define __do_insb(p, b, n) readsb((PCI_IO_ADDR)_IO_BASE+(p), (b), (n))
 #define __do_insw(p, b, n) readsw((PCI_IO_ADDR)_IO_BASE+(p), (b), (n))
 #define __do_insl(p, b, n) readsl((PCI_IO_ADDR)_IO_BASE+(p), (b), (n))
 #define __do_outsb(p, b, n)writesb((PCI_IO_ADDR)_IO_BASE+(p),(b),(n))
 #define __do_outsw(p, b, n)writesw((PCI_IO_ADDR)_IO_BASE+(p),(b),(n))
 #define __do_outsl(p, b, n)writesl((PCI_IO_ADDR)_IO_BASE+(p),(b),(n))
+#endif
 
 #define __do_memset_io(addr, c, n) \
_memset_io(PCI_FIX_ADDR(addr), c, n)
@@ -689,6 +689,8 @@ static inline void name at  
\
 #define writesb writesb
 #define writesw writesw
 #define writesl writesl
+
+#ifdef CONFIG_HAS_IOPORT
 #define inb inb
 #define inw inw
 #define inl inl
@@ -701,6 +703,8 @@ static inline void name at  
\
 #define outsb outsb
 #define outsw outsw
 #define outsl outsl
+#endif
+
 #ifdef __powerpc64__
 #define readq  readq
 #define writeq writeq

Re: [PATCH 1/3] x86/cpu: Actually turn off mitigations by default for SPECULATION_MITIGATIONS=n

2024-04-16 Thread Michael Ellerman

Sean Christopherson  writes:
> On Mon, Apr 15, 2024, Geert Uytterhoeven wrote:
>> On Sat, Apr 13, 2024 at 11:38 AM Michael Ellerman  
>> wrote:
>> > Michael Ellerman  writes:
>> > > Stephen Rothwell  writes:
>> > ...
>> > >> On Tue,  9 Apr 2024 10:51:05 -0700 Sean Christopherson 
>> > >>  wrote:
>> > ...
>> > >>> diff --git a/kernel/cpu.c b/kernel/cpu.c
>> > >>> index 8f6affd051f7..07ad53b7f119 100644
>> > >>> --- a/kernel/cpu.c
>> > >>> +++ b/kernel/cpu.c
>> > >>> @@ -3207,7 +3207,8 @@ enum cpu_mitigations {
>> > >>>  };
>> > >>>
>> > >>>  static enum cpu_mitigations cpu_mitigations __ro_after_init =
>> > >>> -   CPU_MITIGATIONS_AUTO;
>> > >>> +   IS_ENABLED(CONFIG_SPECULATION_MITIGATIONS) ? CPU_MITIGATIONS_AUTO :
>> > >>> +CPU_MITIGATIONS_OFF;
>> > >>>
>> > >>>  static int __init mitigations_parse_cmdline(char *arg)
>> > >>>  {
>> >
>> > I think a minimal workaround/fix would be:
>> >
>> > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
>> > index 2b8fd6bb7da0..290be2f9e909 100644
>> > --- a/drivers/base/Kconfig
>> > +++ b/drivers/base/Kconfig
>> > @@ -191,6 +191,10 @@ config GENERIC_CPU_AUTOPROBE
>> >  config GENERIC_CPU_VULNERABILITIES
>> > bool
>> >
>> > +config SPECULATION_MITIGATIONS
>> > +   def_bool y
>> > +   depends on !X86
>> > +
>> >  config SOC_BUS
>> > bool
>> > select GLOB
>> 
>> Thanks, that works for me (on arm64), so
>> Tested-by: Geert Uytterhoeven 
>
> Oof.  I completely missed that "cpu_mitigations" wasn't x86-only.  I can't 
> think
> of better solution than an on-by-default generic Kconfig, though can't that it
> more simply be:
>
> diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig
> index 2b8fd6bb7da0..5930cb56ee29 100644
> --- a/drivers/base/Kconfig
> +++ b/drivers/base/Kconfig
> @@ -191,6 +191,9 @@ config GENERIC_CPU_AUTOPROBE
>  config GENERIC_CPU_VULNERABILITIES
> bool
>  
> +config SPECULATION_MITIGATIONS
> +   def_bool !X86
> +

Yeah that works too.

cheers

RE: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]

2024-04-16 Thread David Laight

From: Segher Boessenkool
> Sent: 16 April 2024 11:38
> 
> On Tue, Apr 16, 2024 at 03:02:52PM +0530, Naresh Kamboju wrote:
> > In file included from arch/powerpc/include/asm/io.h:672:
> > arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer
> > arithmetic on a null pointer has undefined behavior
> > [-Werror,-Wnull-pointer-arithmetic]
> 
> It is not UB, but just undefined: the program is meaningless.
> 
> It is not a null pointer but even a null pointer constant here.  It
> matters in places, including here.
> 
> It would help if the warnings were more correct :-(

Isn't it only a problem because the NULL pointer isn't required to
be the all-zero bit pattern?

So when do we get a warning from using memset() on a structure
that contains pointers? Since it is equally buggy.

Has anyone ever seen a system where NULL wasn't 'all zeros'?
I've used a system where the 'native' invalid pointer was 'all ones',
but even there the C code used 'all zeros'.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)

Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations

2024-04-16 Thread Michael Ellerman

Markus Elfring  writes:
>> A few update suggestions were taken into account
>> from static source code analysis.
>>
>> Markus Elfring (2):
>
> I would appreciate a bit more information about the reasons
> why this patch series was rejected.
>
>
>>   One function call less in vas_window_alloc() after error detection
>
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1f1c21cf-c34c-418c-b00c-8e6474f12...@web.de/

It introduced a new goto and label to avoid a kfree(NULL) call, but
kfree() explicitly accepts NULL and handles it. So it complicates the
source code for no gain.

>>   Return directly after a failed kasprintf() in map_paste_region()
>
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/f46f04bc-613c-4e98-b602-4c5120556...@web.de/

Basically the same reasoning. And it also changes the function from
having two return paths (success and error), to three.

cheers

Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations

2024-04-16 Thread Christophe Leroy



Le 16/04/2024 à 13:11, Michael Ellerman a écrit :
> Markus Elfring  writes:
>>> A few update suggestions were taken into account
>>> from static source code analysis.
>>>
>>> Markus Elfring (2):
>>
>> I would appreciate a bit more information about the reasons
>> why this patch series was rejected.
>>
>>
>>>One function call less in vas_window_alloc() after error detection
>>
>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1f1c21cf-c34c-418c-b00c-8e6474f12...@web.de/
> 
> It introduced a new goto and label to avoid a kfree(NULL) call, but
> kfree() explicitly accepts NULL and handles it. So it complicates the
> source code for no gain.

This is explicit in Kernel documentation:

/**
  * kfree - free previously allocated memory
  * @object: pointer returned by kmalloc() or kmem_cache_alloc()
  *
  * If @object is NULL, no operation is performed.
  */

That's exactly the same behaviour as free() in libc.

So Coccinelle should be fixed if it reports an error for that.

> 
>>>Return directly after a failed kasprintf() in map_paste_region()
>>
>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/f46f04bc-613c-4e98-b602-4c5120556...@web.de/
> 
> Basically the same reasoning. And it also changes the function from
> having two return paths (success and error), to three.
> 

Looking at that function, I however see a missing region release when 
ioremap_cache() fails.

Christophe

Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]

2024-04-16 Thread Arnd Bergmann

On Tue, Apr 16, 2024, at 13:01, Arnd Bergmann wrote:
> On Tue, Apr 16, 2024, at 11:32, Naresh Kamboju wrote:
>> The Powerpc clang builds failed due to following warnings / errors on the
>> Linux next-20240416 tag.
>>
>> Powerpc:
>>  - tqm8xx_defconfig + clang-17 - Failed
>>  - allnoconfig + clang-17 - Failed
>>  - tinyconfig + clang-17 - Failed
>>
>> Reported-by: Linux Kernel Functional Testing 
>
> I'm not sure why this showed up now, but there is a series from
> in progress that will avoid this in the future, as the same
> issue is present on a couple of other architectures.
>

I see now, it was introduced by my patch to turn on -Wextra
by default. I had tested that patch on all architectures
with allmodconfig and defconfig, but I did not test any
powerpc configs with PCI disabled.

 Arnd

Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations

2024-04-16 Thread Markus Elfring

>>>   One function call less in vas_window_alloc() after error detection
>>
>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1f1c21cf-c34c-418c-b00c-8e6474f12...@web.de/
>
> It introduced a new goto and label to avoid a kfree(NULL) call, but
> kfree() explicitly accepts NULL and handles it. So it complicates the
> source code for no gain.

I proposed to avoid such a redundant function call for the affected
exception handling.


>>>   Return directly after a failed kasprintf() in map_paste_region()
>>
>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/f46f04bc-613c-4e98-b602-4c5120556...@web.de/
>
> Basically the same reasoning. And it also changes the function from
> having two return paths (success and error), to three.

See also a corresponding advice once more from the section “7) Centralized 
exiting of functions”:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/coding-style.rst?h=v6.9-rc4#n532

Regards,
Markus

Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations

2024-04-16 Thread Christophe Leroy



Le 16/04/2024 à 14:14, Markus Elfring a écrit :
>> This is explicit in Kernel documentation:
>>
>> /**
>>* kfree - free previously allocated memory
>>* @object: pointer returned by kmalloc() or kmem_cache_alloc()
>>*
>>* If @object is NULL, no operation is performed.
>>*/
>>
>> That's exactly the same behaviour as free() in libc.
>>
>> So Coccinelle should be fixed if it reports an error for that.
> 
> Redundant function calls can occasionally be avoided accordingly,
> can't they?

Sure they can, but is that worth it here ?

Christophe

Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations

2024-04-16 Thread Markus Elfring

> This is explicit in Kernel documentation:
>
> /**
>   * kfree - free previously allocated memory
>   * @object: pointer returned by kmalloc() or kmem_cache_alloc()
>   *
>   * If @object is NULL, no operation is performed.
>   */
>
> That's exactly the same behaviour as free() in libc.
>
> So Coccinelle should be fixed if it reports an error for that.

Redundant function calls can occasionally be avoided accordingly,
can't they?

Regards,
Markus

Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]

2024-04-16 Thread Dan Carpenter

On Tue, Apr 16, 2024 at 01:55:57PM +0200, Arnd Bergmann wrote:
> On Tue, Apr 16, 2024, at 13:01, Arnd Bergmann wrote:
> > On Tue, Apr 16, 2024, at 11:32, Naresh Kamboju wrote:
> >> The Powerpc clang builds failed due to following warnings / errors on the
> >> Linux next-20240416 tag.
> >>
> >> Powerpc:
> >>  - tqm8xx_defconfig + clang-17 - Failed
> >>  - allnoconfig + clang-17 - Failed
> >>  - tinyconfig + clang-17 - Failed
> >>
> >> Reported-by: Linux Kernel Functional Testing 
> >
> > I'm not sure why this showed up now, but there is a series from
> > in progress that will avoid this in the future, as the same
> > issue is present on a couple of other architectures.
> >
> 
> I see now, it was introduced by my patch to turn on -Wextra
> by default. I had tested that patch on all architectures
> with allmodconfig and defconfig, but I did not test any
> powerpc configs with PCI disabled.

I think this warning is clang specific as well...  (Maybe clang was
included in all architectures but I'm not sure).

regards,
dan carpenter

Re: powerpc: io-defs.h:43:1: error: performing pointer arithmetic on a null pointer has undefined behavior [-Werror,-Wnull-pointer-arithmetic]

2024-04-16 Thread Arnd Bergmann

On Tue, Apr 16, 2024, at 14:42, Dan Carpenter wrote:
> On Tue, Apr 16, 2024 at 01:55:57PM +0200, Arnd Bergmann wrote:
>> On Tue, Apr 16, 2024, at 13:01, Arnd Bergmann wrote:
>> > On Tue, Apr 16, 2024, at 11:32, Naresh Kamboju wrote:
>> >> The Powerpc clang builds failed due to following warnings / errors on the
>> >> Linux next-20240416 tag.
>> >>
>> >> Powerpc:
>> >>  - tqm8xx_defconfig + clang-17 - Failed
>> >>  - allnoconfig + clang-17 - Failed
>> >>  - tinyconfig + clang-17 - Failed
>> >>
>> >> Reported-by: Linux Kernel Functional Testing 
>> >
>> > I'm not sure why this showed up now, but there is a series from
>> > in progress that will avoid this in the future, as the same
>> > issue is present on a couple of other architectures.
>> >
>> 
>> I see now, it was introduced by my patch to turn on -Wextra
>> by default. I had tested that patch on all architectures
>> with allmodconfig and defconfig, but I did not test any
>> powerpc configs with PCI disabled.
>
> I think this warning is clang specific as well...  (Maybe clang was
> included in all architectures but I'm not sure).

Yes, I did test with both gcc and clang where supported.

 Arnd

Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations

2024-04-16 Thread Julia Lawall



On Tue, 16 Apr 2024, Christophe Leroy wrote:

>
>
> Le 16/04/2024 à 14:14, Markus Elfring a écrit :
> >> This is explicit in Kernel documentation:
> >>
> >> /**
> >>* kfree - free previously allocated memory
> >>* @object: pointer returned by kmalloc() or kmem_cache_alloc()
> >>*
> >>* If @object is NULL, no operation is performed.
> >>*/
> >>
> >> That's exactly the same behaviour as free() in libc.
> >>
> >> So Coccinelle should be fixed if it reports an error for that.
> >
> > Redundant function calls can occasionally be avoided accordingly,
> > can't they?
>
> Sure they can, but is that worth it here ?

Coccinelle does what the developer of the semantic patch tells it to do.
It doesn't spontaneously report errors for anything.

julia

Re: [0/2] powerpc/powernv/vas: Adjustments for two function implementations

2024-04-16 Thread Markus Elfring

 So Coccinelle should be fixed if it reports an error for that.
>>>
>>> Redundant function calls can occasionally be avoided accordingly,
>>> can't they?
>>
>> Sure they can, but is that worth it here ?
>
> Coccinelle does what the developer of the semantic patch tells it to do.
> It doesn't spontaneously report errors for anything.

Some special source code search and transformation patterns are occasionally 
applied.
The corresponding change acceptance can often be challenging.

Regards,
Markus

Re: [0/2] powerpc/powernv/vas: Adjustments for map_paste_region()?

2024-04-16 Thread Markus Elfring

> Looking at that function, I however see a missing region release when
> ioremap_cache() fails.

Under which circumstances will the exception handling be completed also for
the implementation of the function “map_paste_region”?
https://elixir.bootlin.com/linux/v6.9-rc4/source/arch/powerpc/platforms/powernv/vas-window.c#L67

Regards,
Markus

Re: [PATCH] i2c: mpc: Removal of of_node_put with __free for auto cleanup

2024-04-16 Thread Andi Shyti

Hi Chris,

On Tue, Apr 16, 2024 at 03:59:13AM +, Chris Packham wrote:
> On 16/04/24 08:54, Andi Shyti wrote:
> >>/* Enable I2C interrupts for mpc5121 */
> >> -  node_ctrl = of_find_compatible_node(NULL, NULL,
> >> -  "fsl,mpc5121-i2c-ctrl");
> >> +  struct device_node *node_ctrl __free(device_node) =
> > How have you tested this?
> 
> I'm not sure I know anyone that still has a mpc5121. Maybe someone on 
> linuxppc-dev?

that's why I was asking, this would be the first driver in i2c
using Rob's __free(device_node).

> I did try to take the patch for a spin on my T2081RDB but I'm having 
> some userland issues on it for some reason (unrelated to this change). 
> The kernel boot does discover a few peripherals hanging of the I2C 
> interface but I'm not in a position to offer up a Tested-by and I've run 
> out of time to debug why my board is unhappy.

Thanks for giving it a try.

Andi

Re: [PATCH] i2c: mpc: Removal of of_node_put with __free for auto cleanup

2024-04-16 Thread Andi Shyti

On Tue, Apr 16, 2024 at 04:07:48PM +0200, Andi Shyti wrote:
> On Tue, Apr 16, 2024 at 03:59:13AM +, Chris Packham wrote:
> > On 16/04/24 08:54, Andi Shyti wrote:
> > >>  /* Enable I2C interrupts for mpc5121 */
> > >> -node_ctrl = of_find_compatible_node(NULL, NULL,
> > >> -"fsl,mpc5121-i2c-ctrl");
> > >> +struct device_node *node_ctrl __free(device_node) =
> > > How have you tested this?
> > 
> > I'm not sure I know anyone that still has a mpc5121. Maybe someone on 
> > linuxppc-dev?
> 
> that's why I was asking, this would be the first driver in i2c
> using Rob's __free(device_node).

Jonathan's, of course :-)

Andi

[PATCH] powerpc: drop port I/O helpers for CONFIG_HAS_IOPORT=n

2024-04-16 Thread Arnd Bergmann

From: Arnd Bergmann 

Calling inb()/outb() on powerpc when CONFIG_PCI is disabled causes
a NULL pointer dereference, which is bad for a number of reasons.

After my patch to turn on -Werror in linux-next, this caused a
compiler-time warning with clang:

In file included from arch/powerpc/include/asm/io.h:672:
arch/powerpc/include/asm/io-defs.h:43:1: error: performing pointer
arithmetic on a null pointer has undefined behavior
[-Werror,-Wnull-pointer-arithmetic]
   43 | DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c),
  | ^~~
   44 |  (p, b, c), pio, p)
  |  ~~

In this configuration, CONFIG_HAS_IOPORT is already disabled, and all
drivers that use inb()/outb() should now depend on that (some patches are
still in the process of getting marged).

Hide all references to inb()/outb() in the powerpc code and the definitions
when HAS_IOPORT is disabled to remove the possible NULL pointer access.
The same should happin in asm-generic in the near future, but for now
the empty inb() macros are still defined to ensure the generic version
does not get pulled in.

Signed-off-by: Arnd Bergmann 
Reported-by: Naresh Kamboju 
--

Cc: linux-ker...@vger.kernel.org>
Cc: linuxppc-dev 
Cc: Aneesh Kumar K.V 
Cc: Anders Roxell 
Cc: Kees Cook 
Cc: Niklas Schnelle 
Cc: clang-built-linux 
Cc: Nick Desaulniers 
Cc: Nathan Chancellor 
Cc: Jeff Xu 
Cc: Naveen N. Rao 
Cc: Dan Carpenter 
---
 arch/powerpc/include/asm/dma.h | 12 
 arch/powerpc/include/asm/io-defs.h |  4 
 arch/powerpc/include/asm/io.h  | 19 +++
 arch/powerpc/kernel/iomap.c|  4 
 arch/powerpc/kernel/traps.c|  2 +-
 5 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/dma.h b/arch/powerpc/include/asm/dma.h
index d97c66d9ae34..004a868f82c9 100644
--- a/arch/powerpc/include/asm/dma.h
+++ b/arch/powerpc/include/asm/dma.h
@@ -3,6 +3,12 @@
 #define _ASM_POWERPC_DMA_H
 #ifdef __KERNEL__
 
+/* The maximum address that we can perform a DMA transfer to on this platform 
*/
+/* Doesn't really apply... */
+#define MAX_DMA_ADDRESS(~0UL)
+
+#ifdef CONFIG_HAS_IOPORT
+
 /*
  * Defines for using and allocating dma channels.
  * Written by Hennus Bergman, 1992.
@@ -26,10 +32,6 @@
 #define MAX_DMA_CHANNELS   8
 #endif
 
-/* The maximum address that we can perform a DMA transfer to on this platform 
*/
-/* Doesn't really apply... */
-#define MAX_DMA_ADDRESS(~0UL)
-
 #ifdef HAVE_REALLY_SLOW_DMA_CONTROLLER
 #define dma_outb   outb_p
 #else
@@ -340,5 +342,7 @@ extern int request_dma(unsigned int dmanr, const char 
*device_id);
 /* release it again */
 extern void free_dma(unsigned int dmanr);
 
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_DMA_H */
diff --git a/arch/powerpc/include/asm/io-defs.h 
b/arch/powerpc/include/asm/io-defs.h
index faf8617cc574..8d2209af7759 100644
--- a/arch/powerpc/include/asm/io-defs.h
+++ b/arch/powerpc/include/asm/io-defs.h
@@ -20,12 +20,14 @@ DEF_PCI_AC_NORET(writeq, (u64 val, PCI_IO_ADDR addr), (val, 
addr), mem, addr)
 DEF_PCI_AC_NORET(writeq_be, (u64 val, PCI_IO_ADDR addr), (val, addr), mem, 
addr)
 #endif /* __powerpc64__ */
 
+#ifdef CONFIG_HAS_IOPORT
 DEF_PCI_AC_RET(inb, u8, (unsigned long port), (port), pio, port)
 DEF_PCI_AC_RET(inw, u16, (unsigned long port), (port), pio, port)
 DEF_PCI_AC_RET(inl, u32, (unsigned long port), (port), pio, port)
 DEF_PCI_AC_NORET(outb, (u8 val, unsigned long port), (val, port), pio, port)
 DEF_PCI_AC_NORET(outw, (u16 val, unsigned long port), (val, port), pio, port)
 DEF_PCI_AC_NORET(outl, (u32 val, unsigned long port), (val, port), pio, port)
+#endif
 
 DEF_PCI_AC_NORET(readsb, (const PCI_IO_ADDR a, void *b, unsigned long c),
 (a, b, c), mem, a)
@@ -40,6 +42,7 @@ DEF_PCI_AC_NORET(writesw, (PCI_IO_ADDR a, const void *b, 
unsigned long c),
 DEF_PCI_AC_NORET(writesl, (PCI_IO_ADDR a, const void *b, unsigned long c),
 (a, b, c), mem, a)
 
+#ifdef CONFIG_HAS_IOPORT
 DEF_PCI_AC_NORET(insb, (unsigned long p, void *b, unsigned long c),
 (p, b, c), pio, p)
 DEF_PCI_AC_NORET(insw, (unsigned long p, void *b, unsigned long c),
@@ -52,6 +55,7 @@ DEF_PCI_AC_NORET(outsw, (unsigned long p, const void *b, 
unsigned long c),
 (p, b, c), pio, p)
 DEF_PCI_AC_NORET(outsl, (unsigned long p, const void *b, unsigned long c),
 (p, b, c), pio, p)
+#endif
 
 DEF_PCI_AC_NORET(memset_io, (PCI_IO_ADDR a, int c, unsigned long n),
 (a, c, n), mem, a)
diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h
index 08c550ed49be..86c212fcbc0c 100644
--- a/arch/powerpc/include/asm/io.h
+++ b/arch/powerpc/include/asm/io.h
@@ -37,7 +37,6 @@ extern struct pci_dev *isa_bridge_pcidev;
  * define properly based on the platform
  */
 #ifndef CONFIG_PCI
-#define _IO_BASE

[PATCH 0/5 RESEND] mm: code and data partitioning improvements

2024-04-16 Thread Maxwell Bland

Managing allocations to ensure code and data pages are not interleaved
is not possible prior to this patch, as ASLR requires programming a
dynamic _text offset while the vmalloc infrastructure maintains static
VMALLOC_START and VMALLOC_END constants.

In systems where code and data are interleaved at a PTE granularity,
kernel improvements targeting the prevention of exploit stages which
modify page tables are inefficient and less effective as individual PTE
updates occur at high frequency and cannot be coarsely grouped at the
PMD level or greater.

This patch adds minimal arch-specific callbacks to the initialization of
vmalloc and when deciding whether to use a specific virtual memory area
to satisfy a vmalloc request to provide the capability to prevent the
allocation of specific virtual addresses under specific system states.
By default these hooks are unimplemented.

To further support the practical use of these callbacks, this patch also
adds a virtual address parameter to pmd_populate_kernel, so that this
interface matches the equivalent pte-level interface and architectures
are not required to perform a reverse page table lookup to determine the
vaddr being allocated during pmd creation.

To demonstrate the impact and value of these changes, this patch
implements support for dynamic PXNTable under aarch64 in 71 lines of
code (a single "if" check during memory allocation), by checking the
virtual address of a given vmalloc call to determine whether it is code
or data. From experience in trying to implement kernel page table
immutability and protections in KVM to prevent recent CVEs, e.g.
CVE-2024-1086, this is a necessary first step.

To better help maintainers and future developers, this patch expands
ptdump.c so that non-leaf page table descriptors can be more easily
noted in debug output by setting a note_non_leaf bool in the ptdump
state.

Signed-off-by: Maxwell Bland 

---

Zero-eth, apologies for the triple mail of these patches. I am in the process
of setting up a new SMTP/mail server for Motorola, but until then I've needed
to script the raw SMTP in order to send appropriately formatted patch emails.

First, thank you to a number of maintainers (Mark Rutland, Greg KH,
Christoph Hellwig, Christophe Leroy, David Hildenbrand, Conor Dooley)
for their feedback on

<20240220203256.31153-1-mbl...@motorola.com>
and


This patch is a further refinement and overhaul of these prior two
attempts. Also, apologies for the roughly two months delay between patch
submissions! I had Motorola work to do.

In support of testing this patch (but not included in this patch), I set
note_non_leaf to true under arch/arm64/mm/ptdump.c and added
PMD_TABLE_PXN to pte_bits to print out whether the PXNTable bit was set.
The txt files under the following directory can be diff'ed to see the
result:

github.com/maxwell-bland/linux-patch-data/tree/main/code_data_parting/ptdump

I also created a script to fetch and cross-compile the kernel for each
of the 21 subarchitectures which required fixes to provide a virtual
address to pmd_populate_kernel. I have no idea if it is useful and maybe
one already exists, but it worked well for me over some alternatives
(xcross, buildroot):

github.com/maxwell-bland/x-linux

As with the last patchset, I also measured performance using Torvald's 
test-tlb program on an aarch64 QEMU instance, with results here:

github.com/maxwell-bland/linux-patch-data/tree/main/code_data_parting/tlbperf

As all changes to other arches are effectively no-ops, performance
impacts in those domains are negligible. 

Maxwell Bland (5):
  mm: allow arch refinement/skip for vmap alloc
  arm64: mm: code and data partitioning for aslr
  mm: add vaddr param to pmd_populate_kernel
  arm64: dynamic enforcement of PXNTable
  ptdump: add state parameter for non-leaf callback

 arch/alpha/include/asm/pgalloc.h |  5 +-
 arch/arc/include/asm/pgalloc.h   |  3 +-
 arch/arc/mm/highmem.c|  2 +-
 arch/arm/include/asm/kfence.h|  2 +-
 arch/arm/include/asm/pgalloc.h   |  3 +-
 arch/arm/mm/kasan_init.c |  2 +-
 arch/arm/mm/mmu.c|  2 +-
 arch/arm64/include/asm/module.h  | 12 
 arch/arm64/include/asm/pgalloc.h | 15 -
 arch/arm64/include/asm/vmalloc.h | 17 -
 arch/arm64/kernel/Makefile   |  2 +-
 arch/arm64/kernel/module.c   |  7 +-
 arch/arm64/kernel/probes/kprobes.c   |  7 +-
 arch/arm64/kernel/setup.c|  4 ++
 arch/arm64/kernel/vmalloc.c  | 71 
 arch/arm64/mm/ptdump.c   | 10 +--
 arch/arm64/mm/trans_pgd.c|  2 +-
 arch/arm64/net/bpf_jit_comp.c|  8 ++-
 arch/csky/include/asm/pgalloc.h  |  2 +-
 arch/hexagon/include/asm/pgalloc.h   |  2 +-
 arch/loongarch/include/asm/pgalloc.h |  3 +-
 arch/l

[PATCH 5/5 RESEND] ptdump: add state parameter for non-leaf callback

2024-04-16 Thread Maxwell Bland

ptdump can now note non-leaf descriptor entries, a useful addition for
debugging table descriptor permissions when working on related code

Signed-off-by: Maxwell Bland 
---
 arch/arm64/mm/ptdump.c  |  6 --
 arch/powerpc/mm/ptdump/ptdump.c |  2 ++
 arch/riscv/mm/ptdump.c  |  6 --
 arch/s390/mm/dump_pagetables.c  |  6 --
 arch/x86/mm/dump_pagetables.c   |  3 ++-
 include/linux/ptdump.h  |  1 +
 mm/ptdump.c | 13 +
 7 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
index 796231a4fd63..1a6f4a3513e5 100644
--- a/arch/arm64/mm/ptdump.c
+++ b/arch/arm64/mm/ptdump.c
@@ -299,7 +299,8 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info 
*info)
.range = (struct ptdump_range[]){
{info->base_addr, end},
{0, 0}
-   }
+   },
+   .note_non_leaf = false
}
};
 
@@ -335,7 +336,8 @@ bool ptdump_check_wx(void)
.range = (struct ptdump_range[]) {
{_PAGE_OFFSET(vabits_actual), ~0UL},
{0, 0}
-   }
+   },
+   .note_non_leaf = false
}
};
 
diff --git a/arch/powerpc/mm/ptdump/ptdump.c b/arch/powerpc/mm/ptdump/ptdump.c
index 9dc239967b77..89e673f5fd3d 100644
--- a/arch/powerpc/mm/ptdump/ptdump.c
+++ b/arch/powerpc/mm/ptdump/ptdump.c
@@ -307,6 +307,7 @@ static int ptdump_show(struct seq_file *m, void *v)
.ptdump = {
.note_page = note_page,
.range = ptdump_range,
+   .note_non_leaf = false
}
};
 
@@ -340,6 +341,7 @@ bool ptdump_check_wx(void)
.ptdump = {
.note_page = note_page,
.range = ptdump_range,
+   .note_non_leaf = false
}
};
 
diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 1289cc6d3700..b355633afcaf 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -328,7 +328,8 @@ static void ptdump_walk(struct seq_file *s, struct 
ptd_mm_info *pinfo)
.range = (struct ptdump_range[]) {
{pinfo->base_addr, pinfo->end},
{0, 0}
-   }
+   },
+   .note_non_leaf = false
}
};
 
@@ -350,7 +351,8 @@ bool ptdump_check_wx(void)
.range = (struct ptdump_range[]) {
{KERN_VIRT_START, ULONG_MAX},
{0, 0}
-   }
+   },
+   .note_non_leaf = false
}
};
 
diff --git a/arch/s390/mm/dump_pagetables.c b/arch/s390/mm/dump_pagetables.c
index ffd07ed7b4af..6468cfd53e2a 100644
--- a/arch/s390/mm/dump_pagetables.c
+++ b/arch/s390/mm/dump_pagetables.c
@@ -200,7 +200,8 @@ bool ptdump_check_wx(void)
.range = (struct ptdump_range[]) {
{.start = 0, .end = max_addr},
{.start = 0, .end = 0},
-   }
+   },
+   .note_non_leaf = false
},
.seq = NULL,
.level = -1,
@@ -239,7 +240,8 @@ static int ptdump_show(struct seq_file *m, void *v)
.range = (struct ptdump_range[]) {
{.start = 0, .end = max_addr},
{.start = 0, .end = 0},
-   }
+   },
+   .note_non_leaf = false
},
.seq = m,
.level = -1,
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index 89079ea73e65..43f00dfb955f 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -380,7 +380,8 @@ bool ptdump_walk_pgd_level_core(struct seq_file *m,
.ptdump = {
.note_page  = note_page,
.effective_prot = effective_prot,
-   .range  = ptdump_ranges
+   .range  = ptdump_ranges,
+   .note_non_leaf  = false
},
.level = -1,
.to_dmesg   = dmesg,
diff --git a/include/linux/ptdump.h b/include/linux/ptdump.h
index 8dbd51ea8626..b3e793a5c77f 100644
--- a/include/linux/ptdump.h
+++ b/include/linux/ptdump.h
@@ -16,6 +16,7 @@ struct ptdump_state {
  int level, u64 val);
void (*effective_prot)(struct ptdump_state *st, int level, u64 val);

Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

2024-04-16 Thread Peter Xu

On Tue, Apr 16, 2024 at 10:58:33AM +, Christophe Leroy wrote:
> 
> 
> Le 15/04/2024 à 21:12, Christophe Leroy a écrit :
> > 
> > 
> > Le 12/04/2024 à 16:30, Peter Xu a écrit :
> >> On Fri, Apr 12, 2024 at 02:08:03PM +, Christophe Leroy wrote:
> >>>
> >>>
> >>> Le 11/04/2024 à 18:15, Peter Xu a écrit :
>  On Mon, Mar 25, 2024 at 01:38:40PM -0300, Jason Gunthorpe wrote:
> > On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
> >> This series reimplements hugepages with hugepd on powerpc 8xx.
> >>
> >> Unlike most architectures, powerpc 8xx HW requires a two-level
> >> pagetable topology for all page sizes. So a leaf PMD-contig approach
> >> is not feasible as such.
> >>
> >> Possible sizes are 4k, 16k, 512k and 8M.
> >>
> >> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD 
> >> entries
> >> must point to a single entry level-2 page table. Until now that was
> >> done using hugepd. This series changes it to use standard page tables
> >> where the entry is replicated 1024 times on each of the two 
> >> pagetables
> >> refered by the two associated PMD entries for that 8M page.
> >>
> >> At the moment it has to look into each helper to know if the
> >> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
> >> a lower size. I hope this can me handled by core-mm in the future.
> >>
> >> There are probably several ways to implement stuff, so feedback is
> >> very welcome.
> >
> > I thought it looks pretty good!
> 
>  I second it.
> 
>  I saw the discussions in patch 1.  Christophe, I suppose you're 
>  exploring
>  the big hammer over hugepd, and perhaps went already with the 32bit pmd
>  solution for nohash/32bit challenge you mentioned?
> 
>  I'm trying to position my next step; it seems like at least I should 
>  not
>  adding any more hugepd code, then should I go with ARCH_HAS_HUGEPD 
>  checks,
>  or you're going to have an RFC soon then I can base on top?
> >>>
> >>> Depends on what you expect by "soon".
> >>>
> >>> I sure won't be able to send any RFC before end of April.
> >>>
> >>> Should be possible to have something during May.
> >>
> >> That's good enough, thanks.  I'll see what is the best I can do.
> >>
> >> Then do you think I can leave p4d/pgd leaves alone?  Please check the 
> >> other
> >> email where I'm not sure whether pgd leaves ever existed for any of
> >> PowerPC.  That's so far what I plan to do, on teaching pgtable walkers
> >> recognize pud and lower for all leaves.  Then if Power can switch from
> >> hugepd to this it should just work.
> > 
> > Well, if I understand correctly, something with no PMD will include 
> >  and will therefore only have  pmd 
> > entries (hence no pgd/p4d/pud entries). Looks odd but that's what it is. 
> > pgd_populate(), p4d_populate(), pud_populate() are all "do { } while 
> > (0)" and only pmd_populate exists. So only pmd_leaf() will exist in that 
> > case.
> > 
> > And therefore including  means  you 
> > have p4d entries. Doesn't mean you have p4d_leaf() but that needs to be 
> > checked.
> > 
> > 
> >>
> >> Even if pgd exists (then something I overlooked..), I'm wondering whether
> >> we can push that downwards to be either pud/pmd (and looks like we all
> >> agree p4d is never used on Power).  That may involve some pgtable
> >> operations moving from pgd level to lower, e.g. my pure imagination would
> >> look like starting with:
> > 
> > Yes I think there is no doubt that p4d is never used:
> > 
> > arch/powerpc/include/asm/book3s/32/pgtable.h:#include 
> > 
> > arch/powerpc/include/asm/book3s/64/pgtable.h:#include 
> > 
> > arch/powerpc/include/asm/nohash/32/pgtable.h:#include 
> > 
> > arch/powerpc/include/asm/nohash/64/pgtable-4k.h:#include 
> > 
> > 
> > But that means that PPC32 have pmd entries and PPC64 have p4d entries ...
> > 
> >>
> >> #define PTE_INDEX_SIZE    PTE_SHIFT
> >> #define PMD_INDEX_SIZE    0
> >> #define PUD_INDEX_SIZE    0
> >> #define PGD_INDEX_SIZE    (32 - PGDIR_SHIFT)
> >>
> >> To:
> >>
> >> #define PTE_INDEX_SIZE    PTE_SHIFT
> >> #define PMD_INDEX_SIZE    (32 - PMD_SHIFT)
> >> #define PUD_INDEX_SIZE    0
> >> #define PGD_INDEX_SIZE    0
> > 
> > But then you can't anymore have #define PTRS_PER_PMD 1 from 
> > 
> > 
> >>
> >> And the rest will need care too.  I hope moving downward is easier
> >> (e.g. the walker should always exist for lower levels but not always for
> >> higher levels), but I actually have little idea on whether there's any
> >> other implications, so please bare with me on stupid mistakes.
> >>
> >> I just hope pgd leaves don't exist already, then I think it'll be 
> >> simpler.
> >>
> >> Thanks,
> >>
> 
> Digging into asm-generic/pgtable-nopmd.h, I see a definition of 
> pud_leaf() always returning 0, introduced by commit 2c8a81dc0cc5 
> ("riscv/mm: fix two page table check related issues"

Re: [PATCH 5/5] ptdump: add state parameter for non-leaf callback

2024-04-16 Thread Andrew Morton

On Mon, 15 Apr 2024 14:51:32 -0500 Maxwell Bland  wrote:

> ptdump can now note non-leaf descriptor entries, a useful addition for
> debugging table descriptor permissions when working on related code
> 
> Signed-off-by: Maxwell Bland 
> ---
>  arch/arm64/mm/ptdump.c  |  6 --
>  arch/powerpc/mm/ptdump/ptdump.c |  2 ++
>  arch/riscv/mm/ptdump.c  |  6 --
>  arch/s390/mm/dump_pagetables.c  |  6 --
>  arch/x86/mm/dump_pagetables.c   |  3 ++-
>  include/linux/ptdump.h  |  1 +
>  mm/ptdump.c | 13 +
>  7 files changed, 30 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> index 796231a4fd63..1a6f4a3513e5 100644
> --- a/arch/arm64/mm/ptdump.c
> +++ b/arch/arm64/mm/ptdump.c
> @@ -299,7 +299,8 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info 
> *info)
>   .range = (struct ptdump_range[]){
>   {info->base_addr, end},
>   {0, 0}
> - }
> + },
> + .note_non_leaf = false
>   }

It would be acceptable to omit all of these and rely upon the runtime
zeroing which the compiler will emit.

Documentation/arch/arm64/ptdump.rst might need updating.

Please include sample output in the changelog so we can better
understand the user's view of this change.

[PATCH] powerpc/pseries/iommu: LPAR panics when rebooted with a frozen PE

2024-04-16 Thread Gaurav Batra

At the time of LPAR reboot, partition firmware provides Open Firmware
property ibm,dma-window for the PE. This property is provided on the PCI
bus the PE is attached to.

There are execptions where the partition firmware might not provide this
property for the PE at the time of LPAR reboot. One of the scenario is
where the firmware has frozen the PE due to some error conditions. This
PE is frozen for 24 hours or unless the whole system is reinitialized.

Within this time frame, if the LPAR is rebooted, the frozen PE will be
presented to the LPAR but ibm,dma-window property could be missing.

Today, under these circumstances, the LPAR oopses with NULL pointer
dereference, when configuring the PCI bus the PE is attached to.

BUG: Kernel NULL pointer dereference on read at 0x00c8
Faulting instruction address: 0xc01024c0
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
Supported: Yes
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.4.0-150600.9-default #1
Hardware name: IBM,9043-MRX POWER10 (raw) 0x800200 0xf06 of:IBM,FW1060.00 
(NM1060_023) hv:phyp pSeries
NIP:  c01024c0 LR: c01024b0 CTR: c0102450
REGS: c37db5c0 TRAP: 0300   Not tainted  (6.4.0-150600.9-default)
MSR:  82009033   CR: 28000822  XER: 
CFAR: c010254c DAR: 00c8 DSISR: 0008 IRQMASK: 0
...
NIP [c01024c0] pci_dma_bus_setup_pSeriesLP+0x70/0x2a0
LR [c01024b0] pci_dma_bus_setup_pSeriesLP+0x60/0x2a0
Call Trace:
pci_dma_bus_setup_pSeriesLP+0x60/0x2a0 (unreliable)
pcibios_setup_bus_self+0x1c0/0x370
__of_scan_bus+0x2f8/0x330
pcibios_scan_phb+0x280/0x3d0
pcibios_init+0x88/0x12c
do_one_initcall+0x60/0x320
kernel_init_freeable+0x344/0x3e4
kernel_init+0x34/0x1d0
ret_from_kernel_user_thread+0x14/0x1c

Fixes: b1fc44eaa9ba ("pseries/iommu/ddw: Fix kdump to work in absence of 
ibm,dma-window")
Signed-off-by: Gaurav Batra 
---
 arch/powerpc/platforms/pseries/iommu.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index e8c4129697b1..e808d5b1fa49 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -786,8 +786,16 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
*bus)
 * parent bus. During reboot, there will be ibm,dma-window property to
 * define DMA window. For kdump, there will at least be default window 
or DDW
 * or both.
+* There is an exception to the above. In case the PE goes into frozen
+* state, firmware may not provide ibm,dma-window property at the time
+* of LPAR reboot.
 */
 
+   if (!pdn) {
+   pr_debug("  no ibm,dma-window property !\n");
+   return;
+   }
+
ppci = PCI_DN(pdn);
 
pr_debug("  parent is %pOF, iommu_table: 0x%p\n",

base-commit: 2c71fdf02a95b3dd425b42f28fd47fb2b1d22702
-- 
2.39.3 (Apple Git-146)

Re: [PATCH 5/5] ptdump: add state parameter for non-leaf callback

2024-04-16 Thread Maxwell Bland

> On Tuesday, April 16, 2024 3:11 PM, Andrew Morton wrote:
> On Mon, 15 Apr 2024 14:51:32 -0500 Maxwell Bland 
> wrote:
>
> >  arch/arm64/mm/ptdump.c  |  6 --
> >  arch/powerpc/mm/ptdump/ptdump.c |  2 ++
> >  arch/riscv/mm/ptdump.c  |  6 --
> >  arch/s390/mm/dump_pagetables.c  |  6 --
> >  arch/x86/mm/dump_pagetables.c   |  3 ++-
> >  include/linux/ptdump.h  |  1 +
> >  mm/ptdump.c | 13 +
> >  7 files changed, 30 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/mm/ptdump.c b/arch/arm64/mm/ptdump.c
> > index 796231a4fd63..1a6f4a3513e5 100644
> > --- a/arch/arm64/mm/ptdump.c
> > +++ b/arch/arm64/mm/ptdump.c
> > @@ -299,7 +299,8 @@ void ptdump_walk(struct seq_file *s, struct ptdump_info
> *info)
> > .range = (struct ptdump_range[]){
> > {info->base_addr, end},
> > {0, 0}
> > -   }
> > +   },
> > +   .note_non_leaf = false
> > }
>
> It would be acceptable to omit all of these and rely upon the runtime
> zeroing which the compiler will emit.

Ah, thank you for the pointer to C99 6.7.8.21. I had always figured since
structs are stack allocated they are potentially non-initialized!

> Documentation/arch/arm64/ptdump.rst might need updating.
>
> Please include sample output in the changelog so we can better
> understand the user's view of this change.

Thanks, I will do both in the next few days everything permitting! Right now
this patch results in no change until note_non_leaf = true is adopted for each
arch.

My plan: I will polish then include output of my personal fixes for arm64.
Specifically, printing expanded PMD flags and tab indenting the layout
according to each level.

Hopefully just adding arm64 support for now is OK, unless maybe we want to
default this to true on all arches? IMO default true would be sweet, but I
wasn't sure everyone would agree.

BRs,
Maxwell Bland

[PATCH 3/5] mm: add vaddr param to pmd_populate_kernel

2024-04-16 Thread Maxwell Bland

This patch affords each architecture the ability to condition the
population of page middle directory entries on the virtual address being
allocated, matching existing PTE infrastructure, easing the necessity of
performing a reverse page table walk in cases where the population
context is not readily accessible, i.e. dynamic vmalloc calls on arm64.

To achieve this goal, it modifies every call and implementation of the
pmd_populate_kernel function across architectures, ensuring uniform
adoption across all kernel deployments.

Signed-off-by: Maxwell Bland 

---

Hi all,

Thank you for taking the time to review this change. This effects many
subarchitectures so the maintainers list is large. Apologies in advance
if there is a specific maintainer I should have spoken with directly for
deployment across subprojects.

The reason for such a sweeping change is from 
lore.kernel.org/all/cf5409c3-254a-459b-8969-429db2ec6...@redhat.com

It is my understanding as well that some subarchitectures may have
separate "next" or development branches ahead of the main upstream
linux. Please let me know if a cherry-pick to that branch is desired and
I will do my best to check out and deploy it as possible.

 arch/alpha/include/asm/pgalloc.h |  5 +++--
 arch/arc/include/asm/pgalloc.h   |  3 ++-
 arch/arc/mm/highmem.c|  2 +-
 arch/arm/include/asm/kfence.h|  2 +-
 arch/arm/include/asm/pgalloc.h   |  3 ++-
 arch/arm/mm/kasan_init.c |  2 +-
 arch/arm/mm/mmu.c|  2 +-
 arch/arm64/include/asm/pgalloc.h |  3 ++-
 arch/arm64/mm/trans_pgd.c|  2 +-
 arch/csky/include/asm/pgalloc.h  |  2 +-
 arch/hexagon/include/asm/pgalloc.h   |  2 +-
 arch/loongarch/include/asm/pgalloc.h |  3 ++-
 arch/loongarch/mm/init.c |  2 +-
 arch/loongarch/mm/kasan_init.c   |  2 +-
 arch/m68k/include/asm/mcf_pgalloc.h  |  2 +-
 arch/m68k/include/asm/motorola_pgalloc.h |  3 ++-
 arch/m68k/include/asm/sun3_pgalloc.h |  3 ++-
 arch/microblaze/include/asm/pgalloc.h|  2 +-
 arch/mips/include/asm/pgalloc.h  |  2 +-
 arch/mips/kvm/mmu.c  |  2 +-
 arch/nios2/include/asm/pgalloc.h |  2 +-
 arch/openrisc/include/asm/pgalloc.h  |  2 +-
 arch/parisc/include/asm/pgalloc.h|  5 +++--
 arch/parisc/mm/init.c|  6 +++---
 arch/powerpc/include/asm/book3s/32/pgalloc.h |  2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  2 +-
 arch/powerpc/include/asm/nohash/64/pgalloc.h |  2 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c |  2 +-
 arch/powerpc/mm/kasan/init_32.c  |  4 ++--
 arch/powerpc/mm/kasan/init_book3e_64.c   |  9 ++---
 arch/powerpc/mm/kasan/init_book3s_64.c   |  7 +--
 arch/powerpc/mm/nohash/book3e_pgtable.c  |  2 +-
 arch/powerpc/mm/pgtable_32.c |  4 ++--
 arch/riscv/include/asm/pgalloc.h |  2 +-
 arch/riscv/kernel/hibernate.c|  2 +-
 arch/s390/include/asm/pgalloc.h  |  2 +-
 arch/sh/include/asm/pgalloc.h|  2 +-
 arch/sh/mm/init.c|  2 +-
 arch/sparc/include/asm/pgalloc_32.h  |  3 ++-
 arch/sparc/include/asm/pgalloc_64.h  |  4 ++--
 arch/sparc/mm/init_64.c  |  8 
 arch/um/include/asm/pgalloc.h|  4 ++--
 arch/x86/include/asm/pgalloc.h   |  3 ++-
 arch/x86/mm/init_64.c| 14 +++---
 arch/x86/mm/ioremap.c|  2 +-
 arch/x86/mm/kasan_init_64.c  |  2 +-
 arch/xtensa/include/asm/pgalloc.h|  2 +-
 include/linux/mm.h   |  4 ++--
 mm/hugetlb_vmemmap.c |  4 ++--
 mm/kasan/init.c  | 14 +-
 mm/memory.c  |  4 ++--
 mm/percpu.c  |  2 +-
 mm/pgalloc-track.h   |  3 ++-
 mm/sparse-vmemmap.c  |  2 +-
 55 files changed, 107 insertions(+), 78 deletions(-)

diff --git a/arch/alpha/include/asm/pgalloc.h b/arch/alpha/include/asm/pgalloc.h
index 68be7adbfe58..1d3d86cad3cc 100644
--- a/arch/alpha/include/asm/pgalloc.h
+++ b/arch/alpha/include/asm/pgalloc.h
@@ -7,7 +7,7 @@
 
 #include 
 
-/*  
+/*
  * Allocate and free page tables. The xxx_kernel() versions are
  * used to allocate a kernel page table - this turns on ASN bits
  * if any.
@@ -20,7 +20,8 @@ pmd_populate(struct mm_struct *mm, pmd_t *pmd, pgtable_t pte)
 }
 
 static inline void
-pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
+pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte,
+   unsigned long vaddr)
 {
pmd_set(pmd, p

[no subject]

2024-04-16 Thread mbland

linux-ker...@vger.kernel.org,
linux-arm-ker...@lists.infradead.org,
linux-ri...@lists.infradead.org,
linuxppc-dev@lists.ozlabs.org,
Mark Rutland ,
Greg Kroah-Hartman ,
Christoph Hellwig ,
Christophe Leroy ,
David Hildenbrand ,
Conor Dooley 
From: Maxwell Bland 
Date: Mon, 15 Apr 2024 15:16:08 -0500
Subject: [PATCH 0/5] mm: code and data partitioning improvements

Managing allocations to ensure code and data pages are not interleaved
is not possible prior to this patch, as ASLR requires programming a
dynamic _text offset while the vmalloc infrastructure maintains static
VMALLOC_START and VMALLOC_END constants.

In systems where code and data are interleaved at a PTE granularity,
kernel improvements targeting the prevention of exploit stages which
modify page tables are inefficient and less effective as individual PTE
updates occur at high frequency and cannot be coarsely grouped at the
PMD level or greater.

This patch adds minimal arch-specific callbacks to the initialization of
vmalloc and when deciding whether to use a specific virtual memory area
to satisfy a vmalloc request to provide the capability to prevent the
allocation of specific virtual addresses under specific system states.
By default these hooks are unimplemented.

To further support the practical use of these callbacks, this patch also
adds a virtual address parameter to pmd_populate_kernel, so that this
interface matches the equivalent pte-level interface and architectures
are not required to perform a reverse page table lookup to determine the
vaddr being allocated during pmd creation.

To demonstrate the impact and value of these changes, this patch
implements support for dynamic PXNTable under aarch64 in 71 lines of
code (a single "if" check during memory allocation), by checking the
virtual address of a given vmalloc call to determine whether it is code
or data. From experience in trying to implement kernel page table
immutability and protections in KVM to prevent recent CVEs, e.g.
CVE-2024-1086, this is a necessary first step.

To better help maintainers and future developers, this patch expands
ptdump.c so that non-leaf page table descriptors can be more easily
noted in debug output by setting a note_non_leaf bool in the ptdump
state.

Signed-off-by: Maxwell Bland 

---

First, thank you to a number of maintainers (Mark Rutland, Greg KH,
Christoph Hellwig, Christophe Leroy, David Hildenbrand, Conor Dooley)
for their feedback on

<20240220203256.31153-1-mbl...@motorola.com>
and


This patch is a further refinement and overhaul of these prior two
attempts. Also, apologies for the roughly two months delay between patch
submissions! I had Motorola work to do.

In support of testing this patch (but not included in this patch), I set
note_non_leaf to true under arch/arm64/mm/ptdump.c and added
PMD_TABLE_PXN to pte_bits to print out whether the PXNTable bit was set.
The txt files under the following directory can be diff'ed to see the
result:

github.com/maxwell-bland/linux-patch-data/tree/main/code_data_parting/ptdump

I also created a script to fetch and cross-compile the kernel for each
of the 21 subarchitectures which required fixes to provide a virtual
address to pmd_populate_kernel. I have no idea if it is useful and maybe
one already exists, but it worked well for me over some alternatives
(xcross, buildroot):

github.com/maxwell-bland/x-linux

As with the last patchset, I also measured performance using Torvald's 
test-tlb program on an aarch64 QEMU instance, with results here:

github.com/maxwell-bland/linux-patch-data/tree/main/code_data_parting/tlbperf

As all changes to other arches are effectively no-ops, performance
impacts in those domains are negligible. 

Maxwell Bland (5):
  mm: allow arch refinement/skip for vmap alloc
  arm64: mm: code and data partitioning for aslr
  mm: add vaddr param to pmd_populate_kernel
  arm64: dynamic enforcement of PXNTable
  ptdump: add state parameter for non-leaf callback

 arch/alpha/include/asm/pgalloc.h |  5 +-
 arch/arc/include/asm/pgalloc.h   |  3 +-
 arch/arc/mm/highmem.c|  2 +-
 arch/arm/include/asm/kfence.h|  2 +-
 arch/arm/include/asm/pgalloc.h   |  3 +-
 arch/arm/mm/kasan_init.c |  2 +-
 arch/arm/mm/mmu.c|  2 +-
 arch/arm64/include/asm/module.h  | 12 
 arch/arm64/include/asm/pgalloc.h | 15 -
 arch/arm64/include/asm/vmalloc.h | 17 -
 arch/arm64/kernel/Makefile   |  2 +-
 arch/arm64/kernel/module.c   |  7 +-
 arch/arm64/kernel/probes/kprobes.c   |  7 +-
 arch/arm64/kernel/setup.c|  4 ++
 arch/arm64/kernel/vmalloc.c  | 71 
 arch/arm64/mm/ptdump.c   | 10 +--
 arch/arm64/mm/trans_pgd.c|  2 +-
 arch/arm64/net/bpf_jit_comp.c|  8 ++-
 arch/csky/include/asm/pgalloc.h

42 matches

Mail list logo