Re: [PATCH v2 08/23] parisc: add pte_unmap() to balance get_ptep()

2023-06-18 Thread Helge Deller

On 6/8/23 21:18, Hugh Dickins wrote:

To keep balance in future, remember to pte_unmap() after a successful
get_ptep().  And act as if flush_cache_pages() really needs a map there,
to read the pfn before "unmapping", to be sure page table is not removed.

Signed-off-by: Hugh Dickins 


For the parisc parts:

Acked-by: Helge Deller  # parisc

Helge



---
  arch/parisc/kernel/cache.c | 26 +-
  1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index ca4a302d4365..501160250bb7 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -426,10 +426,15 @@ void flush_dcache_page(struct page *page)
offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
addr = mpnt->vm_start + offset;
if (parisc_requires_coherency()) {
+   bool needs_flush = false;
pte_t *ptep;

ptep = get_ptep(mpnt->vm_mm, addr);
-   if (ptep && pte_needs_flush(*ptep))
+   if (ptep) {
+   needs_flush = pte_needs_flush(*ptep);
+   pte_unmap(ptep);
+   }
+   if (needs_flush)
flush_user_cache_page(mpnt, addr);
} else {
/*
@@ -561,14 +566,20 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
  static void flush_cache_page_if_present(struct vm_area_struct *vma,
unsigned long vmaddr, unsigned long pfn)
  {
-   pte_t *ptep = get_ptep(vma->vm_mm, vmaddr);
+   bool needs_flush = false;
+   pte_t *ptep;

/*
 * The pte check is racy and sometimes the flush will trigger
 * a non-access TLB miss. Hopefully, the page has already been
 * flushed.
 */
-   if (ptep && pte_needs_flush(*ptep))
+   ptep = get_ptep(vma->vm_mm, vmaddr);
+   if (ptep) {
+   needs_flush = pte_needs_flush(*ptep);
+   pte_unmap(ptep);
+   }
+   if (needs_flush)
flush_cache_page(vma, vmaddr, pfn);
  }

@@ -635,17 +646,22 @@ static void flush_cache_pages(struct vm_area_struct *vma, unsigned long start, u
pte_t *ptep;

for (addr = start; addr < end; addr += PAGE_SIZE) {
+   bool needs_flush = false;
/*
 * The vma can contain pages that aren't present. Although
 * the pte search is expensive, we need the pte to find the
 * page pfn and to check whether the page should be flushed.
 */
ptep = get_ptep(vma->vm_mm, addr);
-   if (ptep && pte_needs_flush(*ptep)) {
+   if (ptep) {
+   needs_flush = pte_needs_flush(*ptep);
+   pfn = pte_pfn(*ptep);
+   pte_unmap(ptep);
+   }
+   if (needs_flush) {
if (parisc_requires_coherency()) {
flush_user_cache_page(vma, addr);
} else {
-   pfn = pte_pfn(*ptep);
if (WARN_ON(!pfn_valid(pfn)))
return;
__flush_cache_page(vma, addr, PFN_PHYS(pfn));
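
Stated as a standalone idiom, the balance this patch enforces looks like
the following (a sketch only, not the actual parisc code; mm, vma and
addr stand in for the caller's context):

    pte_t *ptep = get_ptep(mm, addr);       /* may map a page table */
    bool needs_flush = false;
    unsigned long pfn = 0;

    if (ptep) {
            pte_t pte = *ptep;              /* copy while still mapped */

            needs_flush = pte_needs_flush(pte);
            pfn = pte_pfn(pte);
            pte_unmap(ptep);                /* balance the map */
    }
    if (needs_flush)
            __flush_cache_page(vma, addr, PFN_PHYS(pfn));

Everything needed from the PTE is copied out while it is mapped, so
nothing dereferences the page table after pte_unmap(), which is what
keeps the "page table is not removed" guarantee intact.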




Re: [RFC PATCH v2 1/1] powerpc: update ppc_save_regs to save current r1 in pt_regs

2023-06-18 Thread Aditya Gupta

On 15/06/23 17:40, Nicholas Piggin wrote:

On Thu Jun 15, 2023 at 7:10 PM AEST, Aditya Gupta wrote:

ppc_save_regs() skips one stack frame while saving the CPU register states.
Instead of saving current R1, it pulls the previous stack frame pointer.

...

So this now saves regs as though it was an interrupt taken in the
caller, at the instruction after the call to ppc_save_regs, whereas
previously the NIP was there, but R1 came from the caller's caller
and that mismatch is what causes gdb's dwarf unwinder to go haywire.

Nice catch, and I think I follow the fix and I think I agree with it.
Before the bug was introduced, NIP also came from the grandparent.
Which is what xmon presumably wanted, but since then I think maybe it
makes more sense to just have the parent caller.

Reviewed-by: Nicholas Piggin 
Fixes: d16a58f8854b1 ("powerpc: Improve ppc_save_regs()")


Thanks for reviewing this, and providing a Fixes tag too.

Thanks
- Aditya



Re: [PATCH] KVM: ppc64: Enable ring-based dirty memory tracking

2023-06-18 Thread Kautuk Consul
Hi Nick/Gavin/Everyone,

On 2023-06-08 08:34:48, Kautuk Consul wrote:
> - Enable CONFIG_HAVE_KVM_DIRTY_RING_ACQ_REL as ppc64 is weakly
>   ordered.
> - Enable CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP because the
>   kvmppc_xive_native_set_attr is called in the context of an ioctl
>   syscall and will call kvmppc_xive_native_eq_sync for setting the
>   KVM_DEV_XIVE_EQ_SYNC attribute which will call mark_dirty_page()
>   when there isn't a running vcpu. Implemented the
>   kvm_arch_allow_write_without_running_vcpu to always return true
>   to allow mark_page_dirty_in_slot to mark the page dirty in the
>   memslot->dirty_bitmap in this case.
> - Set KVM_DIRTY_LOG_PAGE_OFFSET for the ring buffer's physical page
>   offset.
> - Implement the kvm_arch_mmu_enable_log_dirty_pt_masked function required
>   for the generic KVM code to call.
> - Add a check to kvmppc_vcpu_run_hv for checking whether the dirty
>   ring is soft full.
> - Implement the kvm_arch_flush_remote_tlbs_memslot function to support
>   the CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT config option.
> 
> On testing with live migration it was found that there is around
> 150-180 ms improvement in overall migration time with this patch.
> 
> Signed-off-by: Kautuk Consul 
> ---
>  Documentation/virt/kvm/api.rst  |  2 +-
>  arch/powerpc/include/uapi/asm/kvm.h |  2 ++
>  arch/powerpc/kvm/Kconfig|  2 ++
>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 42 +
>  arch/powerpc/kvm/book3s_hv.c|  3 +++
>  include/linux/kvm_dirty_ring.h  |  5 
>  6 files changed, 55 insertions(+), 1 deletion(-)
> 
Any review comments on this?
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index add067793b90..ce1ebc513bae 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8114,7 +8114,7 @@ regardless of what has actually been exposed through 
> the CPUID leaf.
>  8.29 KVM_CAP_DIRTY_LOG_RING/KVM_CAP_DIRTY_LOG_RING_ACQ_REL
>  --
> 
> -:Architectures: x86, arm64
> +:Architectures: x86, arm64, ppc64
>  :Parameters: args[0] - size of the dirty log ring
> 
>  KVM is capable of tracking dirty memory using ring buffers that are
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h 
> b/arch/powerpc/include/uapi/asm/kvm.h
> index 9f18fa090f1f..f722309ed7fb 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -33,6 +33,8 @@
>  /* Not always available, but if it is, this is the correct offset.  */
>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
> 
> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
> +
>  struct kvm_regs {
>   __u64 pc;
>   __u64 cr;
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index 902611954200..c93354ec3bd5 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -26,6 +26,8 @@ config KVM
>   select IRQ_BYPASS_MANAGER
>   select HAVE_KVM_IRQ_BYPASS
>   select INTERVAL_TREE
> + select HAVE_KVM_DIRTY_RING_ACQ_REL
> + select NEED_KVM_DIRTY_RING_WITH_BITMAP
> 
>  config KVM_BOOK3S_HANDLER
>   bool
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 7f765d5ad436..c92e8022e017 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -2147,3 +2147,45 @@ void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu)
> 
>   vcpu->arch.hflags |= BOOK3S_HFLAG_SLB;
>  }
> +
> +/*
> + * kvm_arch_mmu_enable_log_dirty_pt_masked - enable dirty logging for 
> selected
> + * dirty pages.
> + *
> + * It write protects selected pages to enable dirty logging for them.
> + */
> +void kvm_arch_mmu_enable_log_dirty_pt_masked(struct kvm *kvm,
> +  struct kvm_memory_slot *slot,
> +  gfn_t gfn_offset,
> +  unsigned long mask)
> +{
> + phys_addr_t base_gfn = slot->base_gfn + gfn_offset;
> + phys_addr_t start = (base_gfn +  __ffs(mask)) << PAGE_SHIFT;
> + phys_addr_t end = (base_gfn + __fls(mask) + 1) << PAGE_SHIFT;
> +
> + while (start < end) {
> + pte_t *ptep;
> + unsigned int shift;
> +
> + ptep = find_kvm_secondary_pte(kvm, start, &shift);
> +
> + *ptep = __pte(pte_val(*ptep) & ~(_PAGE_WRITE));
> +
> + start += PAGE_SIZE;
> + }
> +}
> +
> +#ifdef CONFIG_NEED_KVM_DIRTY_RING_WITH_BITMAP
> +bool kvm_arch_allow_write_without_running_vcpu(struct kvm *kvm)
> +{
> + return true;
> +}
> +#endif
> +
> +#ifdef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT
> +void kvm_arch_flush_remote_tlbs_memslot(struct kvm *kvm,
> + const struct kvm_memory_slot *memslot)
> +{
> + kvm_flush_remote_tlbs(kvm);
> +}
> +#endif
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 
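
For readers unfamiliar with the interface being enabled above: once the
ring is set up, a userspace VMM drains a vcpu's dirty ring roughly as
sketched below. This is a hedged sketch based on the generic dirty-ring
description in Documentation/virt/kvm/api.rst (ring setup, wrap-around
and error handling are simplified); the acquire/release ordering is the
reason HAVE_KVM_DIRTY_RING_ACQ_REL matters on weakly ordered ppc64.

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* ring: mmap()ed from the vcpu fd at page offset
     * KVM_DIRTY_LOG_PAGE_OFFSET; size: number of ring entries. */
    static void harvest_dirty_ring(int vm_fd, struct kvm_dirty_gfn *ring,
                                   unsigned int size, unsigned int *fetch)
    {
            for (;;) {
                    struct kvm_dirty_gfn *e = &ring[*fetch % size];
                    __u32 flags = __atomic_load_n(&e->flags, __ATOMIC_ACQUIRE);

                    if (!(flags & KVM_DIRTY_GFN_F_DIRTY))
                            break;          /* ring drained */
                    /* e->slot and e->offset identify the dirty page */
                    __atomic_store_n(&e->flags, flags | KVM_DIRTY_GFN_F_RESET,
                                     __ATOMIC_RELEASE);
                    (*fetch)++;
            }
            ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0); /* vm-wide reset */
    }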

[PATCH v11 0/4] PowerPC: In-kernel handling of CPU/Memory hotplug/online/offline events for kdump kernel

2023-06-18 Thread Sourabh Jain
The Problem:

After CPU/Memory hot plug/unplug and online/offline events occur, the
kdump kernel often retains outdated system information. This presents
a significant challenge when attempting to perform a dump collection
using an outdated or stale kdump kernel. In such situations, there
are two potential outcomes that pose risks: either the dump collection
fails to capture the required data entirely, leading to a failed dump,
or the collected dump data is inaccurate, thereby compromising its
reliability for analysis and troubleshooting purposes.

Existing solution:
==
The existing solution to keep the kdump kernel up-to-date involves
monitoring CPU/Memory hotplug/online/offline events via a udev rule.
This approach triggers a full kdump kernel reload for each hotplug event,
ensuring that the kdump kernel is always synchronized with the latest
system resource changes.

Shortcomings of existing solution:
==
- Leaves a window where kernel crash might not lead to a successful dump
  collection.
- Reloading all kexec segments for each hotplug is inefficient.
- udev rules are prone to races if hotplug events are frequent.

Further information regarding the problems associated with the current
solution can be found here:
 - https://lore.kernel.org/lkml/b04ed259-dc5f-7f30-6661-c26f92d90...@oracle.com/
 - https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-February/240254.html

Proposed Solution:
==
To address the limitations of the current approach, a proposed solution
focuses on implementing a more targeted update strategy. Instead of
performing a full reload of all kexec segments for every CPU/Memory hot
plug/unplug and online/offline events, the proposed solution aims to update
only the relevant kexec segment. After loading the kexec segments into the
reserved area, a newly introduced hotplug handler will be responsible for
updating the specific kexec segment based on the type of hotplug event.
This selective update approach enhances overall efficiency by minimizing
unnecessary overhead and significantly reduces the chances of a kernel
crash leading to a failed or inaccurate dump collection.

Series Dependencies:

The implementation of the crash hotplug handler on PowerPC is included in
this patch series. The introduction of the generic crash hotplug handler
is done through the patch series available at
https://lore.kernel.org/all/20230612210712.683175-1-eric.devol...@oracle.com/

Git tree for testing:
=
The following Git tree incorporates this patch series applied on top of
the dependent patch series.
https://github.com/sourabhjains/linux/tree/e23-s11-with-kexec-config

In order to enable this feature, it is necessary to disable the udev rule
responsible for reloading the kdump service. To do this, you can make the
following additions to the file "/usr/lib/udev/rules.d/98-kexec.rules" on RHEL:

Add the following two lines at top:

   SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
   SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

The changes mentioned above ensure that the kdump reload process is skipped
for CPU/Memory hot plug/unplug events when the path
"/sys/devices/system/[cpu|memory]/crash_hotplug" exists.

Note: only the kexec_file_load syscall will work. For kexec_load, minor changes
are required in the kexec tool.

---
Changelog:

v11:
  - Rebase to v6.4-rc6
  - The patch that introduced CONFIG_CRASH_HOTPLUG for PowerPC has been removed.
The config is now part of common configuration:
https://lore.kernel.org/all/87ilbpflsk.fsf@mail.lhotse/

v10:
  - Drop the patch that adds the fdt_index attribute to struct kimage_arch.
Find the fdt segment index when needed.
  - Added more details to commit messages.
  - Rebased onto 6.3.0-rc5

v9:
  - Removed patch to prepare elfcorehdr crash notes for possible CPUs.
The patch is moved to generic patch series that introduces generic
infrastructure for in kernel crash update.
  - Removed patch to pass the hotplug action type to the arch crash
hotplug handler function. The generic patch series has introduced
the hotplug action type in kimage struct.
  - Added detailed commit messages for better understanding.

v8:
  - Restrict fdt_index initialization to machine_kexec_post_load so
    it works for both kexec_load and kexec_file_load. [3/8] Laurent Dufour

  - Updated the logic to find the number of offline cores. [6/8]

  - Changed the logic to find the elfcore program header to accommodate
future memory ranges due to memory hotplug events. [8/8]

v7
  - added a new config to configure this feature
  - pass hotplug action type to arch specific handler

v6
  - Added crash memory hotplug support

v5:
  - Replace CONFIG_CRASH_HOTPLUG with CONFIG_HOTPLUG_CPU.
  - Move fdt segment identification for kexec_load case to load path
instead of crash hotplug handler
  - Keep new attribute defined under 

[PATCH v11 2/4] powerpc/crash: add crash CPU hotplug support

2023-06-18 Thread Sourabh Jain
Introduce powerpc crash hotplug handler to update the necessary kexec
segments in the kernel on CPU/Memory hotplug events. Currently, these
updates are done by monitoring CPU/Memory hotplug events in userspace.

In the generic infrastructure, a shared crash hotplug handler is
triggered for both CPU and Memory hotplug events. However, in this
particular patch, only CPU hotplug events are handled for crash updates.
Support for crash updates during memory hotplug events will be
introduced in subsequent patches.

The elfcorehdr segment facilitates the exchange of CPU and other
relevant dump information between kernels. Ideally, this segment should
be regenerated during CPU hotplug events to reflect any changes.
However, on powerpc systems, the elfcorehdr is initially constructed
with all possible CPUs, rendering it unnecessary to update or recreate
it when CPU hotplug events occur.

Additionally, on powerpc, there is another segment called the FDT
(Flattened Device Tree) that holds CPU data. During the boot of the
kdump kernel, it is crucial for the crashing CPU to be present in the
FDT. Failure to have the crashing CPU in the FDT would result in a
boot failure of the kdump kernel.

Therefore, the only action required on powerpc to handle a crash CPU
hotplug event is to add the hot-added CPUs to the kdump FDT segment,
ensuring that the kdump kernel boots successfully. However, there is
no need to remove the hot-unplugged CPUs from the FDT segment, so no
action is taken for a CPU hot removal event.

In order to support an increasing number of CPUs, the FDT is constructed
with extra buffer space to ensure it can accommodate the possible number
of CPU nodes. Also, the FDT is not packed to shrink its size once prepared.

The changes introduced here will also work for the kexec_load system
call, considering that the kexec tool constructs the FDT segment with
extra space to accommodate the possible number of CPU nodes.

Since memory crash hotplug support is not there yet, the crash hotplug
handler simply warns the user and returns.

The changes related to the in-kernel update of kdump kernel kexec
segments on CPU/Memory hotplug and online/offline events are kept under
the CRASH_HOTPLUG config, which is enabled by default on PowerPC.

Signed-off-by: Sourabh Jain 
---
 arch/powerpc/Kconfig  |  3 ++
 arch/powerpc/include/asm/kexec.h  | 10 +
 arch/powerpc/kexec/core_64.c  | 61 +++
 arch/powerpc/kexec/elf_64.c   | 12 +-
 arch/powerpc/kexec/file_load_64.c | 14 +++
 5 files changed, 99 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 36f2fe0cc8a5..87f1b569c682 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -665,6 +665,9 @@ config RELOCATABLE_TEST
 config ARCH_HAS_CRASH_DUMP
def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
 
+config ARCH_HAS_CRASH_HOTPLUG
+   def_bool y
+
 config ARCH_SUPPORTS_CRASH_DUMP
def_bool y
depends on CRASH_DUMP
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 8090ad7d97d9..154759bf0759 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -103,6 +103,16 @@ void kexec_copy_flush(struct kimage *image);
 struct crash_mem;
 int update_cpus_node(void *fdt);
 int get_crash_memory_ranges(struct crash_mem **mem_ranges);
+
+#ifdef CONFIG_CRASH_HOTPLUG
+void arch_crash_handle_hotplug_event(struct kimage *image);
+#define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
+
+#ifdef CONFIG_HOTPLUG_CPU
+static inline int crash_hotplug_cpu_support(void) { return 1; }
+#define crash_hotplug_cpu_support crash_hotplug_cpu_support
+#endif
+#endif
 #endif
 
 #if defined(CONFIG_CRASH_DUMP) && defined(CONFIG_PPC_RTAS)
diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index 0b292f93a74c..cb38da1c6dbe 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -543,6 +543,67 @@ int update_cpus_node(void *fdt)
return ret;
 }
 
+#ifdef CONFIG_CRASH_HOTPLUG
+#undef pr_fmt
+#define pr_fmt(fmt) "crash hp: " fmt
+
+/**
+ * arch_crash_handle_hotplug_event - Handle crash CPU/Memory hotplug events to
+ *  update the necessary kexec segments based on the hotplug event.
+ * @image: the active struct kimage
+ *
+ * Update FDT segment to include newly added CPU. No action for CPU remove case.
+ */
+void arch_crash_handle_hotplug_event(struct kimage *image)
+{
+   void *fdt, *ptr;
+   unsigned long mem;
+   int i, fdt_index = -1;
+   unsigned int hp_action = image->hp_action;
+
+   /*
+* Since the hot-unplugged CPU is already part of crash FDT,
+* no action is needed for CPU remove case.
+*/
+   if (hp_action == KEXEC_CRASH_HP_REMOVE_CPU)
+   return;
+
+   /* crash update on memory hotplug events is not supported yet */
+   if (hp_action == 
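
As an aside, the "extra buffer space, no packing" approach from the
commit message can be pictured with plain libfdt. This is only a sketch
under assumptions: the helper name and the per-CPU size estimate are
invented for illustration and are not the kernel's actual code.

    #include <stdlib.h>
    #include <libfdt.h>

    /* Copy an FDT into a larger buffer so CPU nodes can be added in
     * place later; deliberately skip fdt_pack() so the headroom stays. */
    static void *fdt_with_cpu_headroom(const void *src, unsigned int possible_cpus)
    {
            /* rough per-CPU-node estimate -- an assumption, not a kernel value */
            int bufsize = fdt_totalsize(src) + possible_cpus * 1024;
            void *fdt = malloc(bufsize);

            if (fdt && fdt_open_into(src, fdt, bufsize) != 0) {
                    free(fdt);
                    fdt = NULL;
            }
            return fdt;
    }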

[PATCH v11 4/4] powerpc/crash: add crash memory hotplug support

2023-06-18 Thread Sourabh Jain
Extend PowerPC arch crash hotplug handler to support memory hotplug
events. Since the elfcorehdr is used to exchange memory info between the
kernels, it needs to be recreated to reflect the changes due to memory
hotplug events.

Due to the way memory hotplug events are handled on PowerPC and the
notifier call chain used in generic code to trigger the arch crash
handler, the process to recreate the elfcorehdr differs between the
memory add and remove cases.

For the memory remove case, the memory change notifier call chain is
triggered first and then the memblock regions are updated. Whereas for
the memory hot add case, memblock regions are updated before invoking
the memory change notifier call chain.

On PowerPC, the memblock regions list is used to prepare the elfcorehdr.
In the case of memory hot remove, the memblock regions are updated after
the arch crash hotplug handler is triggered, hence an additional step is
taken to ensure that the memory ranges used to prepare the elfcorehdr do
not include hot removed memory.

When memory is hot removed, it is possible that the memory region count
increases, since removing a span from the middle of an existing region
splits it in two. So to accommodate a growing number of memory regions,
the elfcorehdr kexec segment is built with additional buffer space.

The changes done here will also work for the kexec_load system call,
given that the kexec tool builds the elfcorehdr with additional space to
accommodate future memory regions, as is done for the kexec_file_load
system call in the kernel.

Signed-off-by: Sourabh Jain 
---
 arch/powerpc/include/asm/kexec.h|  6 ++
 arch/powerpc/include/asm/kexec_ranges.h |  1 +
 arch/powerpc/kexec/core_64.c| 77 +-
 arch/powerpc/kexec/file_load_64.c   | 36 ++-
 arch/powerpc/kexec/ranges.c | 85 +
 5 files changed, 201 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index d3ff481aa9f8..10017880571c 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -112,6 +112,12 @@ void arch_crash_handle_hotplug_event(struct kimage *image, void *arg);
 static inline int crash_hotplug_cpu_support(void) { return 1; }
 #define crash_hotplug_cpu_support crash_hotplug_cpu_support
 #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static inline int crash_hotplug_memory_support(void) { return 1; }
+#define crash_hotplug_memory_support crash_hotplug_memory_support
+#endif
+
 #endif
 #endif
 
diff --git a/arch/powerpc/include/asm/kexec_ranges.h b/arch/powerpc/include/asm/kexec_ranges.h
index f83866a19e87..802abf580cf0 100644
--- a/arch/powerpc/include/asm/kexec_ranges.h
+++ b/arch/powerpc/include/asm/kexec_ranges.h
@@ -7,6 +7,7 @@
 void sort_memory_ranges(struct crash_mem *mrngs, bool merge);
 struct crash_mem *realloc_mem_ranges(struct crash_mem **mem_ranges);
 int add_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size);
+int remove_mem_range(struct crash_mem **mem_ranges, u64 base, u64 size);
 int add_tce_mem_ranges(struct crash_mem **mem_ranges);
 int add_initrd_mem_range(struct crash_mem **mem_ranges);
 #ifdef CONFIG_PPC_64S_HASH_MMU
diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index 4d1c53cc9a90..e5038f2769bb 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -547,6 +548,76 @@ int update_cpus_node(void *fdt)
 #undef pr_fmt
 #define pr_fmt(fmt) "crash hp: " fmt
 
+/**
+ * update_crash_elfcorehdr() - Recreate the elfcorehdr and replace the old
+ *elfcorehdr with it in the kexec segment array.
+ * @image: the active struct kimage
+ * @mn: struct memory_notify data handler
+ */
+static void update_crash_elfcorehdr(struct kimage *image, struct memory_notify *mn)
+{
+   int ret;
+   struct crash_mem *cmem = NULL;
+   struct kexec_segment *ksegment;
+   void *ptr, *mem, *elfbuf = NULL;
+   unsigned long elfsz, memsz, base_addr, size;
+
+   ksegment = &image->segment[image->elfcorehdr_index];
+   mem = (void *) ksegment->mem;
+   memsz = ksegment->memsz;
+
+   ret = get_crash_memory_ranges(&cmem);
+   if (ret) {
+   pr_err("Failed to get crash mem range\n");
+   return;
+   }
+
+   /*
+* The hot unplugged memory is part of crash memory ranges,
+* remove it here.
+*/
+   if (image->hp_action == KEXEC_CRASH_HP_REMOVE_MEMORY) {
+   base_addr = PFN_PHYS(mn->start_pfn);
+   size = mn->nr_pages * PAGE_SIZE;
+   ret = remove_mem_range(&cmem, base_addr, size);
+   if (ret) {
+   pr_err("Failed to remove hot-unplugged memory from crash memory ranges.\n");
+   return;
+   }
+   }
+
+   ret = crash_prepare_elf64_headers(cmem, false, &elfbuf, &elfsz);
+   if (ret) {
+   pr_err("Failed to prepare elf header\n");
+   return;
+   }
+
+
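
As background for remove_mem_range(): clipping a hot-removed span out of
a sorted range list has four cases, and the "split" case is what can
grow the region count, as noted in the cover text. A self-contained
sketch of the idea (not the patch's actual implementation):

    #include <string.h>

    struct range { unsigned long long base, size; };

    /* Remove [base, base + size) from a sorted, non-overlapping list of
     * n ranges with room for max entries; returns the new count, or -1
     * when a split would overflow the buffer (caller must grow it). */
    static int range_list_remove(struct range *r, int n, int max,
                                 unsigned long long base, unsigned long long size)
    {
            unsigned long long end = base + size;

            for (int i = 0; i < n; i++) {
                    unsigned long long rend = r[i].base + r[i].size;

                    if (end <= r[i].base || base >= rend)
                            continue;                       /* no overlap */
                    if (base <= r[i].base && end >= rend) { /* swallowed whole */
                            memmove(&r[i], &r[i + 1], (n - 1 - i) * sizeof(*r));
                            n--;
                            i--;
                    } else if (base > r[i].base && end < rend) { /* split in two */
                            if (n == max)
                                    return -1;
                            memmove(&r[i + 2], &r[i + 1], (n - 1 - i) * sizeof(*r));
                            r[i + 1].base = end;
                            r[i + 1].size = rend - end;
                            r[i].size = base - r[i].base;
                            return n + 1;                   /* count grew */
                    } else if (base <= r[i].base) {         /* clip the head */
                            r[i].size = rend - end;
                            r[i].base = end;
                    } else {                                /* clip the tail */
                            r[i].size = base - r[i].base;
                    }
            }
            return n;
    }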

[PATCH v11 1/4] powerpc/kexec: turn some static helper functions public

2023-06-18 Thread Sourabh Jain
Move update_cpus_node and get_crash_memory_ranges functions from
kexec/file_load_64.c to kexec/core_64.c to make these functions usable
by other kexec components.

Later in the series, these functions are utilized to do in-kernel update
to kexec segments on CPU/Memory hot plug/unplug or online/offline events
for both kexec_load and kexec_file_load syscalls.

No functional change intended.

Signed-off-by: Sourabh Jain 
Reviewed-by: Laurent Dufour 
---
 arch/powerpc/include/asm/kexec.h  |   6 ++
 arch/powerpc/kexec/core_64.c  | 166 ++
 arch/powerpc/kexec/file_load_64.c | 162 -
 3 files changed, 172 insertions(+), 162 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index a1ddba01e7d1..8090ad7d97d9 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -99,6 +99,12 @@ void relocate_new_kernel(unsigned long indirection_page, unsigned long reboot_co
 
 void kexec_copy_flush(struct kimage *image);
 
+#ifdef CONFIG_PPC64
+struct crash_mem;
+int update_cpus_node(void *fdt);
+int get_crash_memory_ranges(struct crash_mem **mem_ranges);
+#endif
+
 #if defined(CONFIG_CRASH_DUMP) && defined(CONFIG_PPC_RTAS)
 void crash_free_reserved_phys_range(unsigned long begin, unsigned long end);
 #define crash_free_reserved_phys_range crash_free_reserved_phys_range
diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index a79e28c91e2b..0b292f93a74c 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -30,6 +32,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 int machine_kexec_prepare(struct kimage *image)
 {
@@ -377,6 +381,168 @@ void default_machine_kexec(struct kimage *image)
/* NOTREACHED */
 }
 
+/**
+ * get_crash_memory_ranges - Get crash memory ranges. This list includes
+ *   first/crashing kernel's memory regions that
+ *   would be exported via an elfcore.
+ * @mem_ranges:  Range list to add the memory ranges to.
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+int get_crash_memory_ranges(struct crash_mem **mem_ranges)
+{
+   phys_addr_t base, end;
+   struct crash_mem *tmem;
+   u64 i;
+   int ret;
+
+   for_each_mem_range(i, &base, &end) {
+   u64 size = end - base;
+
+   /* Skip backup memory region, which needs a separate entry */
+   if (base == BACKUP_SRC_START) {
+   if (size > BACKUP_SRC_SIZE) {
+   base = BACKUP_SRC_END + 1;
+   size -= BACKUP_SRC_SIZE;
+   } else
+   continue;
+   }
+
+   ret = add_mem_range(mem_ranges, base, size);
+   if (ret)
+   goto out;
+
+   /* Try merging adjacent ranges before reallocation attempt */
+   if ((*mem_ranges)->nr_ranges == (*mem_ranges)->max_nr_ranges)
+   sort_memory_ranges(*mem_ranges, true);
+   }
+
+   /* Reallocate memory ranges if there is no space to split ranges */
+   tmem = *mem_ranges;
+   if (tmem && (tmem->nr_ranges == tmem->max_nr_ranges)) {
+   tmem = realloc_mem_ranges(mem_ranges);
+   if (!tmem)
+   goto out;
+   }
+
+   /* Exclude crashkernel region */
+   ret = crash_exclude_mem_range(tmem, crashk_res.start, crashk_res.end);
+   if (ret)
+   goto out;
+
+   /*
+* FIXME: For now, stay in parity with kexec-tools but if RTAS/OPAL
+*regions are exported to save their context at the time of
+*crash, they should actually be backed up just like the
+*first 64K bytes of memory.
+*/
+   ret = add_rtas_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   ret = add_opal_mem_range(mem_ranges);
+   if (ret)
+   goto out;
+
+   /* create a separate program header for the backup region */
+   ret = add_mem_range(mem_ranges, BACKUP_SRC_START, BACKUP_SRC_SIZE);
+   if (ret)
+   goto out;
+
+   sort_memory_ranges(*mem_ranges, false);
+out:
+   if (ret)
+   pr_err("Failed to setup crash memory ranges\n");
+   return ret;
+}
+
+/**
+ * add_node_props - Reads node properties from device node structure and add
+ *  them to fdt.
+ * @fdt:Flattened device tree of the kernel
+ * @node_offset:offset of the node to add a property at
+ * @dn: device node pointer
+ *
+ * Returns 0 on success, negative errno on error.
+ */
+static int add_node_props(void *fdt, int node_offset, const struct device_node *dn)
+{
+   int ret = 0;
+   struct property *pp;
+

[PATCH v11 3/4] crash: forward memory_notify args to arch crash hotplug handler

2023-06-18 Thread Sourabh Jain
On PowerPC, the memblock regions list is used to prepare the elfcorehdr. This
elfcorehdr describes the memory regions of the running kernel to the
kdump kernel. However, a challenge arises as the notifier for the memory
hotplug crash handler is triggered before the memblock region update
takes place. Consequently, the newly prepared elfcorehdr still retains
the outdated memory regions. If the elfcorehdr is generated with these
outdated memblock regions, it will contain inaccurate information about
the memory regions. This can result in failures or incomplete dump
collection when attempting to collect a dump using the outdated
elfcorehdr.

This challenge is specific to removing an LMB (Logical Memory Block). It
does not apply when memory is added. During memory addition, the memory
regions are updated first, and then the memory notify function calls the
arch crash hotplug handler to update the elfcorehdr.

The flow diagram below depicts the series of actions taken during
memory hot removal.

 Initiate memory hot remove
  |
  v
 offline pages
  |
  v
 initiate memory notify call
 chain for MEM_OFFLINE event
  |
  v
 Prepare new elfcorehdr and replace
 the old one with it.
  |
  V
 update memblock regions

The arch crash hotplug handler function signature is updated to pass an
additional argument, the struct memory_notify object, to architectures.

The memory_notify object contains the starting PFN (Page Frame Number)
and the number of pages in the hot removed memory. By utilizing this
information, the base address and size of the hot removed memory is
calculated and used to avoid adding the hot removed memory region to
the elfcorehdr.

Signed-off-by: Sourabh Jain 
---
 arch/powerpc/include/asm/kexec.h |  2 +-
 arch/powerpc/kexec/core_64.c |  3 ++-
 arch/x86/include/asm/kexec.h |  2 +-
 arch/x86/kernel/crash.c  |  5 +++--
 include/linux/kexec.h|  2 +-
 kernel/crash_core.c  | 14 +++---
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 154759bf0759..d3ff481aa9f8 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -105,7 +105,7 @@ int update_cpus_node(void *fdt);
 int get_crash_memory_ranges(struct crash_mem **mem_ranges);
 
 #ifdef CONFIG_CRASH_HOTPLUG
-void arch_crash_handle_hotplug_event(struct kimage *image);
+void arch_crash_handle_hotplug_event(struct kimage *image, void *arg);
 #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
 
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index cb38da1c6dbe..4d1c53cc9a90 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -551,10 +551,11 @@ int update_cpus_node(void *fdt)
  * arch_crash_handle_hotplug_event - Handle crash CPU/Memory hotplug events to
  *  update the necessary kexec segments based on the hotplug event.
  * @image: the active struct kimage
 + * @arg: struct memory_notify handler for memory hotplug case and NULL for CPU hotplug case.
  *
  * Update FDT segment to include newly added CPU. No action for CPU remove case.
  */
-void arch_crash_handle_hotplug_event(struct kimage *image)
+void arch_crash_handle_hotplug_event(struct kimage *image, void *arg)
 {
void *fdt, *ptr;
unsigned long mem;
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 3be6a98751f0..ca0bc9b12504 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -210,7 +210,7 @@ extern crash_vmclear_fn __rcu *crash_vmclear_loaded_vmcss;
 extern void kdump_nmi_shootdown_cpus(void);
 
 #ifdef CONFIG_CRASH_HOTPLUG
-void arch_crash_handle_hotplug_event(struct kimage *image);
+void arch_crash_handle_hotplug_event(struct kimage *image, void *arg);
 #define arch_crash_handle_hotplug_event arch_crash_handle_hotplug_event
 
 #ifdef CONFIG_HOTPLUG_CPU
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 18d2a18d1073..d80a63665436 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -456,11 +456,12 @@ unsigned int arch_crash_get_elfcorehdr_size(void)
 
 /**
  * arch_crash_handle_hotplug_event() - Handle hotplug elfcorehdr changes
- * @image: a pointer to kexec_crash_image
+ * @image: the active struct kimage
+ * @arg: struct memory_notify handler for memory hotplug case and NULL for CPU hotplug case.
  *
  * Prepare the new elfcorehdr and replace the existing elfcorehdr.
  */
-void arch_crash_handle_hotplug_event(struct kimage *image)
+void arch_crash_handle_hotplug_event(struct kimage *image, void *arg)
 {
void *elfbuf = NULL, *old_elfcorehdr;
unsigned long nr_mem_ranges;
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index bb0e614f2a05..b3fcee14fea4 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -502,7 

Re: [PATCH] ASoC: imx-audmix: check return value of devm_kasprintf()

2023-06-18 Thread Shengjiu Wang
On Wed, Jun 14, 2023 at 8:15 PM Claudiu Beznea wrote:

> devm_kasprintf() returns a pointer to dynamically allocated memory.
> Pointer could be NULL in case allocation fails. Check pointer validity.
> Identified with coccinelle (kmerr.cocci script).
>
> Fixes: b86ef5367761 ("ASoC: fsl: Add Audio Mixer machine driver")
> Signed-off-by: Claudiu Beznea 
>

Acked-by: Shengjiu Wang 

Best regards
Wang shengjiu

> ---
>
> Hi,
>
> This has been addressed using kmerr.cocci script proposed for update
> at [1].
>
> Thank you,
> Claudiu Beznea
>
> [1]
> https://lore.kernel.org/all/20230530074044.1603426-1-claudiu.bez...@microchip.com/
>
>  sound/soc/fsl/imx-audmix.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/sound/soc/fsl/imx-audmix.c b/sound/soc/fsl/imx-audmix.c
> index 2c57fe9d2d08..af06268ee57b 100644
> --- a/sound/soc/fsl/imx-audmix.c
> +++ b/sound/soc/fsl/imx-audmix.c
> @@ -228,6 +228,8 @@ static int imx_audmix_probe(struct platform_device
> *pdev)
>
> dai_name = devm_kasprintf(&pdev->dev, GFP_KERNEL, "%s%s",
>   fe_name_pref, args.np->full_name
> + 1);
> +   if (!dai_name)
> +   return -ENOMEM;
>
> dev_info(pdev->dev.parent, "DAI FE name:%s\n", dai_name);
>
> @@ -236,6 +238,8 @@ static int imx_audmix_probe(struct platform_device
> *pdev)
> capture_dai_name =
> devm_kasprintf(>dev, GFP_KERNEL, "%s
> %s",
>dai_name, "CPU-Capture");
> +   if (!capture_dai_name)
> +   return -ENOMEM;
> }
>
> priv->dai[i].cpus = [0];
> @@ -263,6 +267,8 @@ static int imx_audmix_probe(struct platform_device
> *pdev)
>"AUDMIX-Playback-%d", i);
> be_cp = devm_kasprintf(&pdev->dev, GFP_KERNEL,
>"AUDMIX-Capture-%d", i);
> +   if (!be_name || !be_pb || !be_cp)
> +   return -ENOMEM;
>
> priv->dai[num_dai + i].cpus = [2];
> priv->dai[num_dai + i].codecs = [3];
> @@ -287,6 +293,9 @@ static int imx_audmix_probe(struct platform_device
> *pdev)
> priv->dapm_routes[i].source =
> devm_kasprintf(>dev, GFP_KERNEL, "%s %s",
>dai_name, "CPU-Playback");
> +   if (!priv->dapm_routes[i].source)
> +   return -ENOMEM;
> +
> priv->dapm_routes[i].sink = be_pb;
> priv->dapm_routes[num_dai + i].source   = be_pb;
> priv->dapm_routes[num_dai + i].sink = be_cp;
> --
> 2.34.1
>
>


Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-18 Thread Kent Overstreet
On Mon, Jun 19, 2023 at 02:43:58AM +0200, Thomas Gleixner wrote:
> Kent!

Hi Thomas :)

> No. I am not.

Ok.

> Whether that's an internal function or not does not make any difference
> at all.

Well, at the risk of this discussion going completely off the rails, I
have to disagree with you there. External interfaces and high level
semantics are more important to get right from the outset, internal
implementation details can be cleaned up later, within reason.

And the discussion on this patchset has been more focused on those
external interfaces, which seems like the right approach to me.

> > ... I made the same mistake reviewing Song's patchset...
> 
> Songs series had rough edges, but was way more data structure driven
> and palatable than this hackery.

I liked that aspect of Song's patchset too, and I'm actually inclined to
agree with you that this patchset might get a bit cleaner with more of
that, but really, this semes like just quibbling over calling convention
for an internal helper function.


Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-18 Thread Thomas Gleixner
Kent!

On Sun, Jun 18 2023 at 19:14, Kent Overstreet wrote:
> On Mon, Jun 19, 2023 at 12:32:55AM +0200, Thomas Gleixner wrote:
>
> Thomas, you're confusing an internal interface with external

No. I am not.

Whether that's an internal function or not does not make any difference
at all.

Having a function growing _eight_ parameters where _six_ of them are
derived from a well defined data structure is a clear sign of design
fail.

It's not rocket science to do:

struct generic_allocation_info {
   ...
};

struct execmem_info {
   ...
   struct generic_allocation_info alloc_info;
};

struct execmem_param {
   ...
   struct execmem_info[NTYPES];
};

and having a function which can either operate on execmem_param and type
or on generic_allocation_info itself. It does not matter as long as the
data structure which is handed into this internal function is
describing it completely or needs a supplementary argument, i.e. flags.
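
Concretely, a fleshed-out version of that sketch could look like the
following. Every field, type and constant name below is invented for
illustration; none of it is taken from the series under review.

    struct generic_allocation_info {
            unsigned long   start;
            unsigned long   end;
            unsigned long   fallback_start;
            unsigned long   fallback_end;
            unsigned int    alignment;
            pgprot_t        pgprot;
            unsigned int    flags;          /* e.g. a KASAN-shadow bit */
    };

    enum execmem_type {
            EXECMEM_TYPE_TEXT,
            EXECMEM_TYPE_DATA,
            EXECMEM_NTYPES,
    };

    struct execmem_param {
            struct generic_allocation_info info[EXECMEM_NTYPES];
    };

    extern struct execmem_param execmem_param;

    void *execmem_alloc(const struct generic_allocation_info *info, size_t size);

    static inline void *execmem_data_alloc(size_t size)
    {
            return execmem_alloc(&execmem_param.info[EXECMEM_TYPE_DATA], size);
    }

With that shape, tomorrow's extra requirement is one new zero-initialized
member in generic_allocation_info instead of a change to every wrapper
and call site.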

Having tons of wrappers which do:

   a = generic_info.a;
   b = generic_info.b;
   ...
   n = generic_info.n;

   internal_func(a, b, ..., n);

is just hilarious and to repeat myself tasteless and therefore
disgusting.

That's CS course first semester hackery, but TBH, I can only tell from
imagination because I did not take CS courses - maybe that's the
problem...

Data structure driven design works not from the usage site down to the
internals. It's the other way round:

  1) Define a data structure which describes what the internal function
 needs to know

  2) Implement use case specific variants which describe that

  3) Hand the use case specific variant to the internal function
 eventually with some minimal supplementary information.

Object oriented basics, right?

There is absolutely nothing wrong with having

  internal_func(generic_info *, modifier);

but having

  internal_func(a, b, ... n)

is fundamentally wrong in the context of an extensible and future proof
internal function.

Guess what. Today it's sufficient to have _eight_ arguments and we are
happy to have 10 nonsensical wrappers around this internal
function. Tomorrow there happens to be a use case which needs another
argument so you end up:

  Changing 10 wrappers plus the function declaration and definition in
  one go

instead of adding

  One new naturally 0 initialized member to generic_info and be done
  with it.

Look at the evolution of execmem_alloc() in this very pathset which I
pointed out. That very patchset covers _two_ of at least _six_ cases
Song and myself identified. It already had _three_ steps of evolution
from _one_ to _five_ to _eight_ parameters.

C is not like e.g. python where you can "solve" that problem by simply
doing:

- internal_func(a, b, c):
+ internal_func(a, b, c, d=None, ..., n=None):

But that's not a solution either. That's a horrible workaround even in
python once your parameter space gets sufficiently large. The way out in
python is to have **kwargs. But that's not an option in C, and not
necessarily the best option for python either.

Even in python or any other object oriented language you get to the
point where you have to rethink your approach, go back to the drawing
board and think about data representation.

But creating a new interface based on "let's see what we need over
time and add parameters as we see fit" is simply wrong to begin with
independent of the programming language.

Even if the _eight_ parameters are the end of the range, then they are
beyond justifiable because that's way beyond the natural register
argument space of any architecture and you are offloading your laziness
to wrappers and the compiler to emit pointlessly horrible code.

There is a reason why new syscalls which need more than a few parameters
are based on 'struct DATA_WHICH_I_NEED_TO_KNOW' and 'flags'.

We've got burned on the non-extensibilty often enough. Why would a new
internal function have any different requirements especially as it is
neither implemented to the full extent nor a hotpath function?

Now you might argue that it _is_ a "hotpath" due to the BPF usage, but
then even more so as any intermediate wrapper which converts from one
data representation to another data representation is not going to
increase performance, right?

> ... I made the same mistake reviewing Song's patchset...

Songs series had rough edges, but was way more data structure driven
and palatable than this hackery.

The fact that you made a mistake while reviewing Songs series has
absolutely nothing to do with the above or my previous reply to Mike.

Thanks,

tglx


Re: [PATCH v2 2/2] powerpc/tpm: Reserve SML log when kexec'ing with kexec_file_load()

2023-06-18 Thread Stefan Berger




On 6/15/23 08:37, Michael Ellerman wrote:

The TPM code in prom_init.c creates a small buffer of memory to store
the TPM's SML (Stored Measurement Log). It's communicated to Linux via
the linux,sml-base/size device tree properties of the TPM node.

When kexec'ing that buffer can be overwritten, or when kdump'ing it may
not be mapped by the second kernel. The latter can lead to a crash when
booting the second kernel such as:

   tpm_ibmvtpm 7103: CRQ initialization completed
   BUG: Unable to handle kernel data access on read at 0xc0002ffb
   Faulting instruction address: 0xc000200a70e0
   Oops: Kernel access of bad area, sig: 11 [#1]
   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
   Modules linked in:
   CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc2-00134-g9307ce092f5d #314
   Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1200 0xf05 of:SLOF,git-5b4c5a pSeries
   NIP:  c000200a70e0 LR: c000203dd5dc CTR: 0800
   REGS: c00024543280 TRAP: 0300   Not tainted  (6.2.0-rc2-00134-g9307ce092f5d)
   MSR:  82009033   CR: 24002280  XER: 0006
   CFAR: c000200a70c8 DAR: c0002ffb DSISR: 4000 IRQMASK: 0
   ...
   NIP memcpy_power7+0x400/0x7d0
   LR  kmemdup+0x5c/0x80
   Call Trace:
 memcpy_power7+0x274/0x7d0 (unreliable)
 kmemdup+0x5c/0x80
 tpm_read_log_of+0xe8/0x1b0
 tpm_bios_log_setup+0x60/0x210
 tpm_chip_register+0x134/0x320
 tpm_ibmvtpm_probe+0x520/0x7d0
 vio_bus_probe+0x9c/0x460
 really_probe+0x104/0x420
 __driver_probe_device+0xb0/0x170
 driver_probe_device+0x58/0x180
 __driver_attach+0xd8/0x250
 bus_for_each_dev+0xb4/0x140
 driver_attach+0x34/0x50
 bus_add_driver+0x1e8/0x2d0
 driver_register+0xb4/0x1c0
 __vio_register_driver+0x74/0x9c
 ibmvtpm_module_init+0x34/0x48
 do_one_initcall+0x80/0x320
 kernel_init_freeable+0x304/0x3ac
 kernel_init+0x30/0x1a0
 ret_from_kernel_thread+0x5c/0x64

To fix the crash, add the SML region to the usable memory areas for the
kdump kernel, so that the second kernel will map the region. To avoid
corruption of the region, add the region to the reserved memory areas,


To me the 2nd paragraph and the one below seem to say that in general it does NOT 'avoid corruption of the region.'



so that the second kernel does not use the memory for something else.

Note that when loading a kdump kernel with the regular kexec_load()
syscall the SML may be overwritten by the kdump kernel, depending on
where the SML is in memory in relation to the crashkernel region. That
is a separate problem that is not solved by this patch.

Fixes: a0458284f062 ("powerpc: Add support code for kexec_file_load()")
Reported-by: Stefan Berger 
Signed-off-by: Michael Ellerman 


I agree to the code:

Reviewed-by: Stefan Berger 



[PATCH] cxl/ocxl: Possible repeated word

2023-06-18 Thread zhumao001


Delete repeated word in comment.

Signed-off-by: Zhu Mao 
---
 drivers/misc/cxl/native.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 50b0c44bb8d7..6957946a6463 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -920,7 +920,7 @@ int cxl_attach_dedicated_process_psl9(struct cxl_context *ctx, u64 wed, u64 amr)
  * Ideally we should do a wmb() here to make sure the changes to the
  * PE are visible to the card before we call afu_enable.
  * On ppc64 though all mmios are preceded by a 'sync' instruction hence
- * we dont dont need one here.
+ * we dont need one here.
  */

 result = cxl_ops->afu_reset(afu);

Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-18 Thread Kent Overstreet
On Mon, Jun 19, 2023 at 12:32:55AM +0200, Thomas Gleixner wrote:
> Mike!
> 
> Sorry for being late on this ...
> 
> On Fri, Jun 16 2023 at 11:50, Mike Rapoport wrote:
> >  
> > +void *execmem_data_alloc(size_t size)
> > +{
> > +   unsigned long start = execmem_params.modules.data.start;
> > +   unsigned long end = execmem_params.modules.data.end;
> > +   pgprot_t pgprot = execmem_params.modules.data.pgprot;
> > +   unsigned int align = execmem_params.modules.data.alignment;
> > +   unsigned long fallback_start = execmem_params.modules.data.fallback_start;
> > +   unsigned long fallback_end = execmem_params.modules.data.fallback_end;
> > +   bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;
> 
> While I know for sure that you read up on the discussion I had with Song
> about data structures, it seems you completely failed to understand it.
> 
> > +   return execmem_alloc(size, start, end, align, pgprot,
> > +fallback_start, fallback_end, kasan);
> 
> Having _seven_ intermediate variables to fill _eight_ arguments of a
> function instead of handing in @size and a proper struct pointer is
> tasteless and disgusting at best.
> 
> Six out of those seven parameters are from:
> 
> execmem_params.module.data
> 
> while the KASAN shadow part is retrieved from
> 
> execmem_params.module.flags
> 
> So what prevents you from having a uniform data structure, which is
> extensible and describes _all_ types of allocations?
> 
> Absolutely nothing. The flags part can either be in the type dependent
> part or you make the type configs an array as I had suggested originally
> and then execmem_alloc() becomes:
> 
> void *execmem_alloc(type, size)
> 
> and
> 
> static inline void *execmem_data_alloc(size_t size)
> {
> return execmem_alloc(EXECMEM_TYPE_DATA, size);
> }
> 
> which gets the type independent parts from @execmem_param.
> 
> Just read through your own series and watch the evolution of
> execmem_alloc():
> 
>   static void *execmem_alloc(size_t size)
> 
>   static void *execmem_alloc(size_t size, unsigned long start,
>  unsigned long end, unsigned int align,
>  pgprot_t pgprot)
> 
>   static void *execmem_alloc(size_t len, unsigned long start,
>  unsigned long end, unsigned int align,
>  pgprot_t pgprot,
>  unsigned long fallback_start,
>  unsigned long fallback_end,
>  bool kasan)
> 
> In a month from now this function will have _ten_ parameters and tons of
> horrible wrappers which convert an already existing data structure into
> individual function arguments.
> 
> Seriously?
> 
> If you want this function to be [ab]used outside of the exec_param
> configuration space for whatever non-sensical reasons then this still
> can be either:
> 
> void *execmem_alloc(params, type, size)
> 
> static inline void *execmem_data_alloc(size_t size)
> {
> return execmem_alloc(&execmem_param, EXECMEM_TYPE_DATA, size);
> }
> 
> or
> 
> void *execmem_alloc(type_params, size);
> 
> static inline void *execmem_data_alloc(size_t size)
> {
> return execmem_alloc(&execmem_param.data, size);
> }
> 
> which both allows you to provide alternative params, right?
> 
> Coming back to my conversation with Song:
> 
>"Bad programmers worry about the code. Good programmers worry about
> data structures and their relationships."

Thomas, you're confusing an internal interface with external, I made the
same mistake reviewing Song's patchset...


Re: [PATCH v2 06/12] mm/execmem: introduce execmem_data_alloc()

2023-06-18 Thread Thomas Gleixner
Mike!

Sorry for being late on this ...

On Fri, Jun 16 2023 at 11:50, Mike Rapoport wrote:
>  
> +void *execmem_data_alloc(size_t size)
> +{
> + unsigned long start = execmem_params.modules.data.start;
> + unsigned long end = execmem_params.modules.data.end;
> + pgprot_t pgprot = execmem_params.modules.data.pgprot;
> + unsigned int align = execmem_params.modules.data.alignment;
> + unsigned long fallback_start = execmem_params.modules.data.fallback_start;
> + unsigned long fallback_end = execmem_params.modules.data.fallback_end;
> + bool kasan = execmem_params.modules.flags & EXECMEM_KASAN_SHADOW;

While I know for sure that you read up on the discussion I had with Song
about data structures, it seems you completely failed to understand it.

> + return execmem_alloc(size, start, end, align, pgprot,
> +  fallback_start, fallback_end, kasan);

Having _seven_ intermediate variables to fill _eight_ arguments of a
function instead of handing in @size and a proper struct pointer is
tasteless and disgusting at best.

Six out of those seven parameters are from:

execmem_params.module.data

while the KASAN shadow part is retrieved from

execmem_params.module.flags

So what prevents you from having a uniform data structure, which is
extensible and describes _all_ types of allocations?

Absolutely nothing. The flags part can either be in the type dependent
part or you make the type configs an array as I had suggested originally
and then execmem_alloc() becomes:

void *execmem_alloc(type, size)

and

static inline void *execmem_data_alloc(size_t size)
{
return execmem_alloc(EXECMEM_TYPE_DATA, size);
}

which gets the type independent parts from @execmem_param.

Just read through your own series and watch the evolution of
execmem_alloc():

  static void *execmem_alloc(size_t size)

  static void *execmem_alloc(size_t size, unsigned long start,
 unsigned long end, unsigned int align,
 pgprot_t pgprot)

  static void *execmem_alloc(size_t len, unsigned long start,
 unsigned long end, unsigned int align,
 pgprot_t pgprot,
 unsigned long fallback_start,
 unsigned long fallback_end,
 bool kasan)

In a month from now this function will have _ten_ parameters and tons of
horrible wrappers which convert an already existing data structure into
individual function arguments.

Seriously?

If you want this function to be [ab]used outside of the exec_param
configuration space for whatever non-sensical reasons then this still
can be either:

void *execmem_alloc(params, type, size)

static inline void *execmem_data_alloc(size_t size)
{
return execmem_alloc(&execmem_param, EXECMEM_TYPE_DATA, size);
}

or

void *execmem_alloc(type_params, size);

static inline void *execmem_data_alloc(size_t size)
{
return execmem_alloc(&execmem_param.data, size);
}

which both allows you to provide alternative params, right?

Coming back to my conversation with Song:

   "Bad programmers worry about the code. Good programmers worry about
data structures and their relationships."

You might want to reread:

https://lore.kernel.org/r/87lenuukj0.ffs@tglx

and the subsequent thread.

The fact that my suggestions had a 'mod_' namespace prefix does not make
any of my points moot.

Song did an extremely good job in abstracting things out, but you decided
to ditch his ground work instead of building on it and keeping the good
parts. That's beyond sad.

Worse, you went the full 'not invented here' path with an outcome which is
even worse than the original hackery from Song which started the above
referenced thread.

I don't know what caused you to decide that ad hoc hackery is better
than proper data structure based design patterns. I actually don't want
to know.

As much as my voice counts:

  NAK-ed-by: Thomas Gleixner 

Thanks,

tglx


Re: [PATCH v2 07/23 replacement] mips: add pte_unmap() to balance pte_offset_map()

2023-06-18 Thread Yu Zhao
On Fri, Jun 16, 2023 at 9:54 PM Yu Zhao  wrote:
>
> On Thu, Jun 15, 2023 at 04:02:43PM -0700, Hugh Dickins wrote:
> > To keep balance in future, __update_tlb() remember to pte_unmap() after
> > pte_offset_map().  This is an odd case, since the caller has already done
> > pte_offset_map_lock(), then mips forgets the address and recalculates it;
> > but my two naive attempts to clean that up did more harm than good.
> >
> > Tested-by: Nathan Chancellor 
> > Signed-off-by: Hugh Dickins 
>
> FWIW: Tested-by: Yu Zhao 
>
> There is another problem, likely caused by khugepaged, which happened multiple
> times. But I don't think it's related to your series, just FYI.
>
>   Got mcheck at 81134ef0
>   CPU: 3 PID: 36 Comm: khugepaged Not tainted 
> 6.4.0-rc6-00049-g62d8779610bb-dirty #1

...

>   Kernel panic - not syncing: Caught Machine Check exception - caused by 
> multiple matching entries in the TLB.

In case anyone plans to try to fix this - the problem goes back to at
least 5.15 stable. My (educated) guess is that nobody complained about
it because all the testing is done in QEMU, which does NOT detect
conflicting TLBs. This means the verification of the fix would need to
be on a real piece of h/w or an updated QEMU.

In target/mips/tcg/sysemu/tlb_helper.c:

static void r4k_fill_tlb(CPUMIPSState *env, int idx)
{
r4k_tlb_t *tlb;
uint64_t mask = env->CP0_PageMask >> (TARGET_PAGE_BITS + 1);

/* XXX: detect conflicting TLBs and raise a MCHECK exception when needed */
...


Re: kvm/arm64: Spark benchmark

2023-06-18 Thread Yu Zhao
On Fri, Jun 9, 2023 at 7:04 AM Marc Zyngier  wrote:
>
> On Fri, 09 Jun 2023 01:59:35 +0100,
> Yu Zhao  wrote:
> >
> > TLDR
> > 
> > Apache Spark spent 12% less time sorting four billion random integers 
> > twenty times (in ~4 hours) after this patchset [1].
>
> Why are the 3 architectures you have considered being evaluated with 3
> different benchmarks?

I was hoping people having special interests in different archs might
try to reproduce the benchmarks that I didn't report (but did cover)
and see what happens.

> I am not suspecting you to have cherry-picked
> the best results

I'm generally very conservative when reporting *synthetic* results.
For example, the same memcached benchmark used on powerpc yielded >50%
improvement on aarch64, because the default Ubuntu Kconfig uses 64KB
base page size for powerpc but 4KB for aarch64. (Before the series,
the reclaim (swap) path takes kvm->mmu_lock for *write* on O(nr of all
pages to consider); after the series, it becomes O(actual nr of pages
to swap), which is <10% given how the benchmark was set up.)

  Ops/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency

Before  639511.40   0.09940  0.04700  0.27100   22.52700
After   974184.60   0.06471  0.04700  0.159003.75900

> but I'd really like to see a variety of benchmarks
> that exercise this stuff differently.

I'd be happy to try other synthetic workloads that people think that
are relatively representative. Also, I've backported the series and
started an A/B experiment involving ~1 million devices (real-world
workloads). We should have the preliminary results by the time I post
the next version.


Re: kvm/x86: multichase benchmark

2023-06-18 Thread Yu Zhao
On Thu, Jun 8, 2023 at 6:59 PM Yu Zhao  wrote:
>
> TLDR
> 
> Multichase in 64 microVMs achieved 6% more total samples (in ~4 hours) after 
> this patchset [1].
>
> Hardware
> 
> HOST $ lscpu
> Architecture:x86_64
>   CPU op-mode(s):32-bit, 64-bit
>   Address sizes: 43 bits physical, 48 bits virtual
>   Byte Order:Little Endian
> CPU(s):  128
>   On-line CPU(s) list:   0-127
> Vendor ID:   AuthenticAMD
>   Model name:AMD Ryzen Threadripper PRO 3995WX 64-Cores
> CPU family:  23
> Model:   49
> Thread(s) per core:  2
> Core(s) per socket:  64
> Socket(s):   1
> Stepping:0
> Frequency boost: disabled
> CPU max MHz: 4308.3979
> CPU min MHz: 2200.
> BogoMIPS:5390.20
> Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> mca cmov pat pse36 clflush mmx fxsr sse sse2
>  ...
> Virtualization features:
>   Virtualization:AMD-V
> Caches (sum of all):
>   L1d:   2 MiB (64 instances)
>   L1i:   2 MiB (64 instances)
>   L2:32 MiB (64 instances)
>   L3:256 MiB (16 instances)
> NUMA:
>   NUMA node(s):  1
>   NUMA node0 CPU(s): 0-127
> Vulnerabilities:
>   Itlb multihit: Not affected
>   L1tf:  Not affected
>   Mds:   Not affected
>   Meltdown:  Not affected
>   Mmio stale data:   Not affected
>   Retbleed:  Mitigation; untrained return thunk; SMT enabled with 
> STIBP protection
>   Spec store bypass: Mitigation; Speculative Store Bypass disabled via 
> prctl
>   Spectre v1:Mitigation; usercopy/swapgs barriers and __user 
> pointer sanitization
>   Spectre v2:Mitigation; Retpolines, IBPB conditional, STIBP 
> always-on, RSB filling, PBRSB-eIBRS Not affected
>   Srbds: Not affected
>   Tsx async abort:   Not affected
>
> HOST $ numactl -H
> available: 1 nodes (0)
> node 0 cpus: 0-127
> node 0 size: 257542 MB
> node 0 free: 224855 MB
> node distances:
> node   0
>   0:  10
>
> HOST $ cat /sys/class/nvme/nvme0/model
> INTEL SSDPF21Q800GB
>
> HOST $ cat /sys/class/nvme/nvme0/numa_node
> 0
>
> Software
> 
> HOST $ cat /etc/lsb-release
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=22.04
> DISTRIB_CODENAME=jammy
> DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
>
> HOST $ uname -a
> Linux x86 6.4.0-rc5+ #1 SMP PREEMPT_DYNAMIC Wed Jun  7 22:17:47 UTC 2023 
> x86_64 x86_64 x86_64 GNU/Linux
>
> HOST $ cat /proc/swaps
> Filename  Type Size UsedPriority
> /dev/nvme0n1p2partition4668383560   -2
>
> HOST $ cat /sys/kernel/mm/lru_gen/enabled
> 0x000f
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> HOST $ cat /sys/kernel/mm/transparent_hugepage/defrag
> always defer defer+madvise madvise [never]
>
> Procedure
> =
> HOST $ git clone https://github.com/google/multichase
>
> HOST $ 
> HOST $ 
>
> HOST $ cp multichase/multichase ./initrd/bin/
> HOST $ sed -i \
> "/^maybe_break top$/i multichase -t 2 -m 4g -n 28800; poweroff" \

I was reminded that I missed one parameter above, i.e.,

"/^maybe_break top$/i multichase -N -t 2 -m 4g -n 28800; poweroff" \
 ^^

> ./initrd/init
>
> HOST $ 
>
> HOST $ cat run_microvms.sh
> memcgs=64
>
> run() {
> path=/sys/fs/cgroup/memcg$1
>
> mkdir $path
> echo $BASHPID >$path/cgroup.procs

And one line here:

echo 4000m >$path/memory.min # or the largest size that doesn't cause OOM kills

> qemu-system-x86_64 -M microvm,accel=kvm -cpu host -smp 2 -m 6g \
> -nographic -kernel /boot/vmlinuz -initrd ./initrd.img \
> -append "console=ttyS0 loglevel=0"
> }
>
> for ((memcg = 0; memcg < $memcgs; memcg++)); do
> run $memcg &
> done
>
> wait
>
> Results
> ===
>  Before [1]AfterChange
> --
> Total samples6824  7237 +6%
>
> Notes
> =
> [1] "mm: rmap: Don't flush TLB after checking PTE young for page
> reference" was included so that the comparison is apples to
> Apples.
> https://lore.kernel.org/r/20220706112041.3831-1-21cn...@gmail.com/


Re: [PATCH 1/1] arch:hexagon/powerpc: use KSYM_NAME_LEN in array size

2023-06-18 Thread Miguel Ojeda
On Wed, May 31, 2023 at 1:14 AM Kees Cook  wrote:
>
> On Mon, May 29, 2023 at 04:50:45PM +0200, Miguel Ojeda wrote:
> > Kees: what is the current stance on `[static N]` parameters? Something like:
> >
> > const char *kallsyms_lookup(unsigned long addr,
> > unsigned long *symbolsize,
> > unsigned long *offset,
> > -   char **modname, char *namebuf);
> > +   char **modname, char namebuf[static 
> > KSYM_NAME_LEN]);
> >
> > makes the compiler complain about cases like these (even if trivial):
> >
> > arch/powerpc/xmon/xmon.c:1711:10: error: array argument is too small;
> > contains 128 elements, callee requires at least 512
> > [-Werror,-Warray-bounds]
> > name = kallsyms_lookup(pc, &size, &offset, NULL, tmpstr);
> >  ^   ~~
> > ./include/linux/kallsyms.h:86:29: note: callee declares array
> > parameter as static here
> > char **modname, char namebuf[static KSYM_NAME_LEN]);
> >  ^  ~~
>
> Wouldn't that be a good thing? (I.e. complain about the size mismatch?)

Yeah, I would say so (i.e. I meant it as a good thing).
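
For concreteness, here is a minimal standalone illustration of the
`[static N]` behaviour (generic C, nothing kernel-specific; clang
diagnoses the mismatch with -Warray-bounds, as in the error quoted
above):

    #define KSYM_NAME_LEN 512

    void lookup(char namebuf[static KSYM_NAME_LEN]);

    void caller(void)
    {
            char small[128];
            char big[KSYM_NAME_LEN];

            lookup(big);    /* ok: at least KSYM_NAME_LEN elements */
            lookup(small);  /* clang: array argument is too small */
    }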

> > But I only see 2 files in the kernel using `[static N]` (from 2020 and
> > 2021). Should something else be used instead (e.g. `__counted_by`),
> > even if constexpr-sized?.
>
> Yeah, it seems pretty uncommon. I'd say traditionally arrays aren't
> based too often, rather structs containing them.
>
> But ultimately, yeah, everything could gain __counted_by and friends in
> the future.

That would be nice!

Cheers,
Miguel


Re: [PATCH v2 00/16] Add support for DAX vmemmap optimization for ppc64

2023-06-18 Thread Sachin Sant



> On 16-Jun-2023, at 4:38 PM, Aneesh Kumar K.V  
> wrote:
> 
> This patch series implements changes required to support DAX vmemmap
> optimization for ppc64. The vmemmap optimization is only enabled with radix 
> MMU
> translation and 1GB PUD mapping with 64K page size. The patch series also 
> split
> hugetlb vmemmap optimization as a separate Kconfig variable so that
> architectures can enable DAX vmemmap optimization without enabling hugetlb
> vmemmap optimization. This should enable architectures like arm64 to enable 
> DAX
> vmemmap optimization while they can't enable hugetlb vmemmap optimization. 
> More
> details of the same are in patch "mm/vmemmap optimization: Split hugetlb and
> devdax vmemmap optimization"
> 
> Changes from V1:
> * Fix make htmldocs warning
> * Fix vmemmap allocation bugs with different alignment values.
> * Correctly check for section validity to before we free vmemmap area

Thanks for the updated series.

The previously reported WARN_ON is not seen with this series.
I also ran a few regression tests against this series and did not
see any problems.

Based on the test results

Tested-by: Sachin Sant <sach...@linux.ibm.com>

-Sachin

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-18 Thread Mike Rapoport
On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> >
> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >
> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > and puts the burden of code allocation to the modules code.
> >
> > Several architectures override module_alloc() because of various
> > constraints where the executable memory can be located and this causes
> > additional obstacles for improvements of code allocation.
> >
> > Start splitting code allocation from modules by introducing
> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >
> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > module_alloc() and execmem_free() and jit_free() are replacements of
> > module_memfree() to allow updating all call sites to use the new APIs.
> >
> > The intention semantics for new allocation APIs:
> >
> > * execmem_text_alloc() should be used to allocate memory that must reside
> >   close to the kernel image, like loadable kernel modules and generated
> >   code that is restricted by relative addressing.
> >
> > * jit_text_alloc() should be used to allocate memory for generated code
> >   when there are no restrictions for the code placement. For
> >   architectures that require that any code is within certain distance
> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >   execmem_text_alloc().
> >
> 
> Is there anything in this series to help users do the appropriate
> synchronization when the actually populate the allocated memory with
> code?  See here, for example:

This series only factors out the executable allocations from modules and
puts them in a central place.
Anything else would go on top after this lands.
 
> https://lore.kernel.org/linux-fsdevel/cb6533c6-cea0-4f04-95cf-b8240c6ab...@app.fastmail.com/T/#u

-- 
Sincerely yours,
Mike.