[PATCH v2] powernv/opal-prd: Silence memcpy() run-time false positive warnings
opal_prd_msg_notifier extracts the OPAL PRD message size from the message
header and uses it to allocate an opal_prd_msg_queue_item sized for the
message to be copied. However, when running with CONFIG_FORTIFY_SOURCE=y,
it triggers the following run-time warning:

[ 6458.234352] memcpy: detected field-spanning write (size 32) of single field "&item->msg" at arch/powerpc/platforms/powernv/opal-prd.c:355 (size 4)
[ 6458.234390] WARNING: CPU: 9 PID: 660 at arch/powerpc/platforms/powernv/opal-prd.c:355 opal_prd_msg_notifier+0x174/0x188 [opal_prd]
[...]
[ 6458.234709] NIP [c0080e0c0e6c] opal_prd_msg_notifier+0x174/0x188 [opal_prd]
[ 6458.234723] LR [c0080e0c0e68] opal_prd_msg_notifier+0x170/0x188 [opal_prd]
[ 6458.234736] Call Trace:
[ 6458.234742] [c002acb23c10] [c0080e0c0e68] opal_prd_msg_notifier+0x170/0x188 [opal_prd] (unreliable)
[ 6458.234759] [c002acb23ca0] [c019ccc0] notifier_call_chain+0xc0/0x1b0
[ 6458.234774] [c002acb23d00] [c019ceac] atomic_notifier_call_chain+0x2c/0x40
[ 6458.234788] [c002acb23d20] [c00d69b4] opal_message_notify+0xf4/0x2c0
[...]

Split the copy to avoid the false positive run-time warning.

Reported-by: Aneesh Kumar K.V
Signed-off-by: Mahesh Salgaonkar
---
Change from v1:
- Rework the memcpy to copy the message header and the rest of the message
  separately, without adding a flex array.
---
 arch/powerpc/platforms/powernv/opal-prd.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c b/arch/powerpc/platforms/powernv/opal-prd.c
index 113bdb151f687..a068cbc4d43c0 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -352,7 +352,9 @@ static int opal_prd_msg_notifier(struct notifier_block *nb,
 	if (!item)
 		return -ENOMEM;
 
-	memcpy(&item->msg, msg->params, msg_size);
+	item->msg = *hdr;
+	hdr++;
+	memcpy((char *)&item->msg + sizeof(*hdr), hdr, msg_size - sizeof(*hdr));
 
 	spin_lock_irqsave(&opal_prd_msg_queue_lock, flags);
 	list_add_tail(&item->list, &opal_prd_msg_queue);
Re: [PATCH] powerpc64/kasan: Call kasan_early_init() after PACA initialised
On 07/07/2023 at 04:20, Benjamin Gray wrote:
> The inclusion of kasan_early_init() is conditional on CONFIG_KASAN, so
>
>	if (IS_ENABLED(CONFIG_KASAN))
>		kasan_early_init();
>
> wouldn't work.

See my comment on the main patch: just do like kasan_init() and others in
arch/powerpc/include/asm/kasan.h, and define a stub for when CONFIG_KASAN
is not selected.
Re: [PATCH] powerpc64/kasan: Call kasan_early_init() after PACA initialised
On 07/07/2023 at 03:31, Benjamin Gray wrote:
> The KCOV handler __sanitizer_cov_trace_pc() uses the PACA, so initialise
> the PACA first. This fixes a hang during boot when KASAN and KCOV are
> both enabled, where the coverage tracer in kasan_early_init() tries to
> access a field of the (currently null) PACA.

mm/kasan/Makefile has

	KCOV_INSTRUMENT := n

Should we add the same in arch/powerpc/mm/kasan/Makefile to be consistent?

>
> Signed-off-by: Benjamin Gray
>
> ---
>
> I tried annotating kasan_early_init() with 'notrace', but it still
> seemed to hang. It would also be less robust, because kasan_early_init()
> may in future call generic code that should keep coverage.
> ---
>  arch/powerpc/kernel/head_64.S  | 3 ---
>  arch/powerpc/kernel/setup_64.c | 4 ++++
>  2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
> index f132d8704263..21a78a849ca8 100644
> --- a/arch/powerpc/kernel/head_64.S
> +++ b/arch/powerpc/kernel/head_64.S
> @@ -1004,9 +1004,6 @@ start_here_multiplatform:
>  	 * and SLB setup before we turn on relocation.
>  	 */
>
> -#ifdef CONFIG_KASAN
> -	bl	CFUNC(kasan_early_init)
> -#endif
>  	/* Restore parameters passed from prom_init/kexec */
>  	mr	r3,r31
>  	LOAD_REG_ADDR(r12, DOTSYM(early_setup))
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 246201d0d879..a3f5decbc041 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -369,6 +369,10 @@ void __init early_setup(unsigned long dt_ptr)
>
>  	/* printk is now safe to use --- */
>
> +#ifdef CONFIG_KASAN
> +	kasan_early_init();
> +#endif

Don't #ifdef. In arch/powerpc/include/asm/kasan.h, add a stub for
kasan_early_init() just like already done for kasan_init(),
kasan_mmu_init() and kasan_late_init().

> +
>  	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && (mfmsr() & MSR_HV))
>  		enable_machine_check();
>
Re: [PATCH] powerpc/powermac: Use early_* IO variants in via_calibrate_decr
On Thu, 2023-07-06 at 11:08 +1000, Benjamin Gray wrote:
> The issue is pre-existing, but is surfaced by commit 721255b9826b
> ("genirq: Use a maple tree for interrupt descriptor management").
> It's not clear to me why this causes it to surface.

From the thread chain in [1], it looks like the maple tree implementation
just has different allocation behaviour, which follows a pre-existing code
path to kmem_cache_alloc_bulk(), which unconditionally enables interrupts.

(thanks Jordan Niethe for finding this thread)

[1]: https://lore.kernel.org/all/87o7qdzfay.ffs@tglx/
Re: [PATCH] powerpc64/kasan: Call kasan_early_init() after PACA initialised
The inclusion of kasan_early_init() is conditional on CONFIG_KASAN, so

	if (IS_ENABLED(CONFIG_KASAN))
		kasan_early_init();

wouldn't work.
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Fri, Jul 7, 2023 at 12:58 AM Isaku Yamahata wrote:
>
> On Thu, Jul 06, 2023 at 01:52:08PM +0900, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang wrote:
> > >
> > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > > From: David Stevens
> > > >
> > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > > memory into the guest that is backed by un-refcounted struct pages - for
> > > > example, higher order non-compound pages allocated by the amdgpu driver
> > > > via ttm_pool_alloc_page.
> > >
> > > I guess you mean the tail pages of the higher order non-compound pages?
> > > And as to the head page, it is said to be set to one coincidentally[*],
> > > and shall not be considered as refcounted. IIUC, refcount of this head
> > > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > > this may not be a problem(?). But treating this head page differently,
> > > as a refcounted one (e.g., to set the A/D flags), is weird.
> > >
> > > Or maybe I missed some context, e.g., can the head page be allocated to
> > > the guest at all?
> >
> > Yes, this is to allow mapping the tail pages of higher order
> > non-compound pages - I should have been more precise in my wording.
> > The head pages can already be mapped into the guest.
> >
> > Treating the head and tail pages differently would require changing how
> > KVM behaves in a situation it supports today (rather than just adding
> > support for an unsupported situation). Currently, without this series,
> > KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> > guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> > that's actually valid behavior, nor do I know whether anyone actually
> > cares about it. But it's what KVM does today, and I would shy away
> > from modifying that behavior without good reason.
> > > >
> > > > The bulk of this change is tracking the is_refcounted_page flag so that
> > > > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > > > done by storing the flag in an unused bit in the sptes.
> > >
> > > Also, maybe we should mention this only works on x86-64.
> > >
> > > >
> > > > Signed-off-by: David Stevens
> > > > ---
> > > >  arch/x86/kvm/mmu/mmu.c          | 44 +
> > > >  arch/x86/kvm/mmu/mmu_internal.h |  1 +
> > > >  arch/x86/kvm/mmu/paging_tmpl.h  |  9 ---
> > > >  arch/x86/kvm/mmu/spte.c         |  4 ++-
> > > >  arch/x86/kvm/mmu/spte.h         | 12 -
> > > >  arch/x86/kvm/mmu/tdp_mmu.c      | 22 ++---
> > > >  6 files changed, 62 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e44ab512c3a1..b1607e314497 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > >
> > > ...
> > >
> > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > >  	bool host_writable = !fault || fault->map_writable;
> > > >  	bool prefetch = !fault || fault->prefetch;
> > > >  	bool write_fault = fault && fault->write;
> > > > +	bool is_refcounted = !fault || fault->is_refcounted_page;
> > >
> > > Just wonder, what if a non-refcounted page is prefetched? Or is it
> > > possible in practice?
> >
> > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > FOLL_GET. That's fixable, but it's not something this series currently
> > does.
>
> So if we prefetch a page, the REFCOUNTED bit is cleared unconditionally with
> this hunk. kvm_set_page_{dirty,accessed} won't be called as expected for
> prefetched sptes. If I read the patch correctly, the REFCOUNTED bit in the
> SPTE should represent whether the corresponding page is ref-countable or
> not, right?
> Because direct_pte_prefetch_many() is for the legacy KVM MMU and
> FNAME(prefetch_pte) is shadow paging, we need to test it with the legacy
> KVM MMU or shadow paging to hit the issue, though.

direct_pte_prefetch_many() and prefetch_gpte() both pass NULL for the fault
parameter, so is_refcounted will evaluate to true. So the spte's refcounted
bit will get set in that case.

-David
[PATCH] powerpc64/kasan: Call kasan_early_init() after PACA initialised
The KCOV handler __sanitizer_cov_trace_pc() uses the PACA, so initialise
the PACA first. This fixes a hang during boot when KASAN and KCOV are
both enabled, where the coverage tracer in kasan_early_init() tries to
access a field of the (currently null) PACA.

Signed-off-by: Benjamin Gray

---

I tried annotating kasan_early_init() with 'notrace', but it still
seemed to hang. It would also be less robust, because kasan_early_init()
may in future call generic code that should keep coverage.
---
 arch/powerpc/kernel/head_64.S  | 3 ---
 arch/powerpc/kernel/setup_64.c | 4 ++++
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index f132d8704263..21a78a849ca8 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -1004,9 +1004,6 @@ start_here_multiplatform:
 	 * and SLB setup before we turn on relocation.
 	 */
 
-#ifdef CONFIG_KASAN
-	bl	CFUNC(kasan_early_init)
-#endif
 	/* Restore parameters passed from prom_init/kexec */
 	mr	r3,r31
 	LOAD_REG_ADDR(r12, DOTSYM(early_setup))
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 246201d0d879..a3f5decbc041 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -369,6 +369,10 @@ void __init early_setup(unsigned long dt_ptr)
 
 	/* printk is now safe to use --- */
 
+#ifdef CONFIG_KASAN
+	kasan_early_init();
+#endif
+
 	if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && (mfmsr() & MSR_HV))
 		enable_machine_check();
 
--
2.41.0
Re: [PATCH v5 11/13] riscv/kexec: refactor for kernel/Kconfig.kexec
On Thu, 06 Jul 2023 15:20:25 PDT (-0700), eric.devol...@oracle.com wrote:
> The kexec and crash kernel options are provided in the common
> kernel/Kconfig.kexec. Utilize the common options and provide
> the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
> equivalent set of KEXEC and CRASH options.
>
> Signed-off-by: Eric DeVolder
> ---
>  arch/riscv/Kconfig | 44 +---
>  1 file changed, 13 insertions(+), 31 deletions(-)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index b49793cf34eb..8a3af850597a 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -647,48 +647,30 @@ config RISCV_BOOT_SPINWAIT
>  	  If unsure what to do here, say N.
>
> -config KEXEC
> -	bool "Kexec system call"
> -	depends on MMU
> +config ARCH_SUPPORTS_KEXEC
> +	def_bool MMU
> +
> +config ARCH_SELECTS_KEXEC
> +	def_bool y
> +	depends on KEXEC
>  	select HOTPLUG_CPU if SMP
> -	select KEXEC_CORE
> -	help
> -	  kexec is a system call that implements the ability to shutdown your
> -	  current kernel, and to start another kernel. It is like a reboot
> -	  but it is independent of the system firmware. And like a reboot
> -	  you can start any kernel with it, not just Linux.
> -
> -	  The name comes from the similarity to the exec system call.
>
> +config ARCH_SUPPORTS_KEXEC_FILE
> +	def_bool 64BIT && MMU
>
> -config KEXEC_FILE
> -	bool "kexec file based systmem call"
> -	depends on 64BIT && MMU
> +config ARCH_SELECTS_KEXEC_FILE
> +	def_bool y
> +	depends on KEXEC_FILE
>  	select HAVE_IMA_KEXEC if IMA
> -	select KEXEC_CORE
>  	select KEXEC_ELF
> -	help
> -	  This is new version of kexec system call. This system call is
> -	  file based and takes file descriptors as system call argument
> -	  for kernel and initramfs as opposed to list of segments as
> -	  accepted by previous system call.
> -
> -	  If you don't know what to do here, say Y.
>
>  config ARCH_HAS_KEXEC_PURGATORY
>  	def_bool KEXEC_FILE
>  	depends on CRYPTO=y
>  	depends on CRYPTO_SHA256=y
>
> -config CRASH_DUMP
> -	bool "Build kdump crash kernel"
> -	help
> -	  Generate crash dump after being started by kexec. This should
> -	  be normally only set in special crash dump kernels which are
> -	  loaded in the main kernel with kexec-tools into a specially
> -	  reserved region and then later executed after a crash by
> -	  kdump/kexec.
> -
> -	  For more details see Documentation/admin-guide/kdump/kdump.rst
> +config ARCH_SUPPORTS_CRASH_DUMP
> +	def_bool y
>
>  config COMPAT
>  	bool "Kernel support for 32-bit U-mode"

Acked-by: Palmer Dabbelt
[PATCH v5 13/13] sh/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Acked-by: John Paul Adrian Glaubitz
---
 arch/sh/Kconfig | 46 --
 1 file changed, 8 insertions(+), 38 deletions(-)

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 2b3ce4fd3956..1cf6603781c7 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -548,44 +548,14 @@ menu "Kernel features"
 
 source "kernel/Kconfig.hz"
 
-config KEXEC
-	bool "kexec system call (EXPERIMENTAL)"
-	depends on MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config CRASH_DUMP
-	bool "kernel crash dumps (EXPERIMENTAL)"
-	depends on BROKEN_ON_SMP
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel using
-	  PHYSICAL_START.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
-
-config KEXEC_JUMP
-	bool "kexec jump (EXPERIMENTAL)"
-	depends on KEXEC && HIBERNATION
-	help
-	  Jump between original kernel and kexeced kernel and invoke
-	  code via KEXEC
+config ARCH_SUPPORTS_KEXEC
+	def_bool MMU
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool BROKEN_ON_SMP
+
+config ARCH_SUPPORTS_KEXEC_JUMP
+	def_bool y
 
 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
--
2.31.1
[PATCH v5 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

NOTE: The original Kconfig has a KEXEC_SIG which depends on
MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT
dependency (using the strategy outlined in this series, and other
techniques) results in 'error: recursive dependency detected' on CRYPTO.

Per Alexander Gordeev: "the MODULE_SIG_FORMAT dependency was introduced
with [the git commit below] and in fact was not necessary, since s390
did/does not use mod_check_sig() anyway.

  commit c8424e776b09 ("MODSIGN: Export module signature definitions")

MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But
SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping
MODULE_SIG_FORMAT does not hurt."

Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency
from KEXEC_SIG. This still results in equivalent .config files for s390.
Signed-off-by: Eric DeVolder
Acked-by: Alexander Gordeev
---
 arch/s390/Kconfig | 69 ---
 1 file changed, 23 insertions(+), 46 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 5b39918b7042..44154246e4b7 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -213,6 +213,7 @@ config S390
 	select HAVE_VIRT_CPU_ACCOUNTING_IDLE
 	select IOMMU_HELPER		if PCI
 	select IOMMU_SUPPORT		if PCI
+	select KEXEC
 	select MMU_GATHER_MERGE_VMAS
 	select MMU_GATHER_NO_GATHER
 	select MMU_GATHER_RCU_TABLE_FREE
@@ -244,6 +245,28 @@ config PGTABLE_LEVELS
 
 source "kernel/livepatch/Kconfig"
 
+config ARCH_DEFAULT_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390
+
+config ARCH_SUPPORTS_KEXEC_SIG
+	def_bool y
+
+config ARCH_HAS_KEXEC_PURGATORY
+	def_bool KEXEC_FILE
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
+	help
+	  Refer to for more details on this.
+	  This option also enables s390 zfcpdump.
+	  See also
+
 menu "Processor type and features"
 
 config HAVE_MARCH_Z10_FEATURES
@@ -482,36 +505,6 @@ config SCHED_TOPOLOGY
 
 source "kernel/Kconfig.hz"
 
-config KEXEC
-	def_bool y
-	select KEXEC_CORE
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	depends on CRYPTO
-	depends on CRYPTO_SHA256
-	depends on CRYPTO_SHA256_S390
-	help
-	  Enable the kexec file based system call. In contrast to the normal
-	  kexec system call this system call takes file descriptors for the
-	  kernel and initramfs as arguments.
-
-config ARCH_HAS_KEXEC_PURGATORY
-	def_bool y
-	depends on KEXEC_FILE
-
-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
-	depends on KEXEC_FILE && MODULE_SIG_FORMAT
-	help
-	  This option makes kernel signature verification mandatory for
-	  the kexec_file_load() syscall.
-
-	  In addition to that option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
-
 config KERNEL_NOBP
 	def_bool n
 	prompt "Enable modified branch prediction for the kernel by default"
@@ -733,22 +726,6 @@ config VFIO_AP
 
 endmenu
 
-menu "Dump support"
-
-config CRASH_DUMP
-	bool "kernel crash dumps"
-	select KEXEC
-	help
-	  Generate crash dump after being started by kexec.
-	  Crash dump kernels are loaded in the main kernel with kexec-tools
-	  into a specially reserved region and then later executed after
-	  a crash by kdump/kexec.
-	  Refer to for more details on this.
-	  This option also enables s390 zfcpdump.
-	  See also
-
-endmenu
-
 config CCW
 	def_bool y
--
2.31.1
[PATCH v5 11/13] riscv/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/riscv/Kconfig | 44 +---
 1 file changed, 13 insertions(+), 31 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index b49793cf34eb..8a3af850597a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -647,48 +647,30 @@ config RISCV_BOOT_SPINWAIT
 	  If unsure what to do here, say N.
 
-config KEXEC
-	bool "Kexec system call"
-	depends on MMU
+config ARCH_SUPPORTS_KEXEC
+	def_bool MMU
+
+config ARCH_SELECTS_KEXEC
+	def_bool y
+	depends on KEXEC
 	select HOTPLUG_CPU if SMP
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
 
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool 64BIT && MMU
 
-config KEXEC_FILE
-	bool "kexec file based systmem call"
-	depends on 64BIT && MMU
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
 	select HAVE_IMA_KEXEC if IMA
-	select KEXEC_CORE
 	select KEXEC_ELF
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
-	  If you don't know what to do here, say Y.
 
 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE
 	depends on CRYPTO=y
 	depends on CRYPTO_SHA256=y
 
-config CRASH_DUMP
-	bool "Build kdump crash kernel"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
 config COMPAT
 	bool "Kernel support for 32-bit U-mode"
--
2.31.1
[PATCH v5 10/13] powerpc/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Reviewed-by: Sourabh Jain
---
 arch/powerpc/Kconfig | 55 ++--
 1 file changed, 17 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0b1172cbeccb..1695a71777f0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -589,41 +589,21 @@ config PPC64_SUPPORTS_MEMORY_FAILURE
 	default "y" if PPC_POWERNV
 	select ARCH_SUPPORTS_MEMORY_FAILURE
 
-config KEXEC
-	bool "kexec system call"
-	depends on PPC_BOOK3S || PPC_E500 || (44x && !SMP)
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	select KEXEC_ELF
-	depends on PPC64
-	depends on CRYPTO=y
-	depends on CRYPTO_SHA256=y
-	help
-	  This is a new version of the kexec system call. This call is
-	  file based and takes in file descriptors as system call arguments
-	  for kernel and initramfs as opposed to a list of segments as is the
-	  case for the older kexec call.
+config ARCH_SUPPORTS_KEXEC
+	def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP)
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y
 
 config ARCH_HAS_KEXEC_PURGATORY
 	def_bool KEXEC_FILE
 
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select KEXEC_ELF
+	select HAVE_IMA_KEXEC if IMA
+
 config PPC64_BIG_ENDIAN_ELF_ABI_V2
 	# Option is available to BFD, but LLD does not support ELFv1 so this is
 	# always true there.
@@ -683,14 +663,13 @@ config RELOCATABLE_TEST
 	  loaded at, which tends to be non-zero and therefore test the
 	  relocation code.
 
-config CRASH_DUMP
-	bool "Build a dump capture kernel"
-	depends on PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+
+config ARCH_SELECTS_CRASH_DUMP
+	def_bool y
+	depends on CRASH_DUMP
 	select RELOCATABLE if PPC64 || 44x || PPC_85xx
-	help
-	  Build a kernel suitable for use as a dump capture kernel.
-	  The same kernel binary can be used as production kernel and dump
-	  capture kernel.
 
 config FA_DUMP
 	bool "Firmware-assisted dump"
--
2.31.1
[PATCH v5 09/13] parisc/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/parisc/Kconfig | 34 +++---
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 4cb46d5c64a2..2ef6843aae60 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -339,29 +339,17 @@ config NR_CPUS
 	default "8" if 64BIT
 	default "16"
 
-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  shutdown, so do not be surprised if this code does not
-	  initially work for you.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select KEXEC_ELF
-	help
-	  This enables the kexec_file_load() System call. This is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
-
 endmenu
 
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool y
+
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select KEXEC_ELF
+
 source "drivers/parisc/Kconfig"
--
2.31.1
[PATCH v5 08/13] mips/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Acked-by: Thomas Bogendoerfer
---
 arch/mips/Kconfig | 32 +---
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index fc6fba925aea..bc8421859006 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2878,33 +2878,11 @@ config HZ
 
 config SCHED_HRTICK
 	def_bool HIGH_RES_TIMERS
 
-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
-
-config CRASH_DUMP
-	bool "Kernel crash dumps"
-	help
-	  Generate crash dump after being started by kexec.
-	  This should be normally only set in special crash dump kernels
-	  which are loaded in the main kernel with kexec-tools into
-	  a specially reserved region and then later executed after
-	  a crash by kdump/kexec. The crash dump kernel must be compiled
-	  to a memory address not used by the main kernel or firmware using
-	  PHYSICAL_START.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
+
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
 config PHYSICAL_START
 	hex "Physical address where the kernel is loaded"
--
2.31.1
[PATCH v5 07/13] m68k/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
Reviewed-by: Geert Uytterhoeven
Acked-by: Geert Uytterhoeven
---
 arch/m68k/Kconfig | 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index dc792b321f1e..3e318bf9504c 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -89,23 +89,8 @@ config MMU_SUN3
 	bool
 	depends on MMU && !MMU_MOTOROLA && !MMU_COLDFIRE
 
-config KEXEC
-	bool "kexec system call"
-	depends on M68KCLASSIC && MMU
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-	  The name comes from the similarity to the exec system call.
-
-	  It is an ongoing process to be certain the hardware in a machine
-	  is properly shutdown, so do not be surprised if this code does not
-	  initially work for you. As of this writing the exact hardware
-	  interface is strongly in flux, so no good recommendation can be
-	  made.
+config ARCH_SUPPORTS_KEXEC
+	def_bool M68KCLASSIC && MMU
 
 config BOOTINFO_PROC
 	bool "Export bootinfo in procfs"
--
2.31.1
[PATCH v5 06/13] loongarch/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/loongarch/Kconfig | 26 +++---
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index e55511af4c77..397203e18800 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -537,28 +537,16 @@ config CPU_HAS_PREFETCH
 	bool
 	default y
 
-config KEXEC
-	bool "Kexec system call"
-	select KEXEC_CORE
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
+config ARCH_SUPPORTS_KEXEC
+	def_bool y
 
-	  The name comes from the similarity to the exec system call.
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
-config CRASH_DUMP
-	bool "Build kdump crash kernel"
+config ARCH_SELECTS_CRASH_DUMP
+	def_bool y
+	depends on CRASH_DUMP
 	select RELOCATABLE
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
-
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
 
 config RELOCATABLE
 	bool "Relocatable kernel"
--
2.31.1
[PATCH v5 05/13] arm64/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/arm64/Kconfig | 64 --
 1 file changed, 16 insertions(+), 48 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7856c3a3e35a..3a9b06a693a9 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1459,60 +1459,28 @@ config PARAVIRT_TIME_ACCOUNTING
 	  If in doubt, say N here.
 
-config KEXEC
-	depends on PM_SLEEP_SMP
-	select KEXEC_CORE
-	bool "kexec system call"
-	help
-	  kexec is a system call that implements the ability to shutdown your
-	  current kernel, and to start another kernel. It is like a reboot
-	  but it is independent of the system firmware. And like a reboot
-	  you can start any kernel with it, not just Linux.
-
-config KEXEC_FILE
-	bool "kexec file based system call"
-	select KEXEC_CORE
-	select HAVE_IMA_KEXEC if IMA
-	help
-	  This is new version of kexec system call. This system call is
-	  file based and takes file descriptors as system call argument
-	  for kernel and initramfs as opposed to list of segments as
-	  accepted by previous system call.
+config ARCH_SUPPORTS_KEXEC
+	def_bool PM_SLEEP_SMP
 
-config KEXEC_SIG
-	bool "Verify kernel signature during kexec_file_load() syscall"
-	depends on KEXEC_FILE
-	help
-	  Select this option to verify a signature with loaded kernel
-	  image. If configured, any attempt of loading a image without
-	  valid signature will fail.
+config ARCH_SUPPORTS_KEXEC_FILE
+	def_bool y
 
-	  In addition to that option, you need to enable signature
-	  verification for the corresponding kernel image type being
-	  loaded in order for this to work.
+config ARCH_SELECTS_KEXEC_FILE
+	def_bool y
+	depends on KEXEC_FILE
+	select HAVE_IMA_KEXEC if IMA
 
-config KEXEC_IMAGE_VERIFY_SIG
-	bool "Enable Image signature verification support"
-	default y
-	depends on KEXEC_SIG
-	depends on EFI && SIGNED_PE_FILE_VERIFICATION
-	help
-	  Enable Image signature verification support.
+config ARCH_SUPPORTS_KEXEC_SIG
+	def_bool y
 
-comment "Support for PE file signature verification disabled"
-	depends on KEXEC_SIG
-	depends on !EFI || !SIGNED_PE_FILE_VERIFICATION
+config ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG
+	def_bool y
 
-config CRASH_DUMP
-	bool "Build kdump crash kernel"
-	help
-	  Generate crash dump after being started by kexec. This should
-	  be normally only set in special crash dump kernels which are
-	  loaded in the main kernel with kexec-tools into a specially
-	  reserved region and then later executed after a crash by
-	  kdump/kexec.
+config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+	def_bool y
 
-	  For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+	def_bool y
 
 config TRANS_TABLE
 	def_bool y
--
2.31.1
[PATCH v5 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
The config options for kexec and crash features are consolidated into
new file kernel/Kconfig.kexec. Under the "General Setup" submenu is a
new submenu "Kexec and crash handling". All the kexec and crash options
that were once in the arch-dependent submenu "Processor type and
features" are now consolidated in the new submenu.

The following options are impacted:
 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP.

Architectures specify support of certain KEXEC and CRASH features with
similarly named new ARCH_SUPPORTS_<option> config options.

Architectures can utilize the new ARCH_SELECTS_<option> config options
to specify additional components when <option> is enabled.

To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (ie.
select statements).

Signed-off-by: Eric DeVolder
---
 arch/Kconfig         |  13 -
 init/Kconfig         |   2 +
 kernel/Kconfig.kexec | 116 +++
 3 files changed, 118 insertions(+), 13 deletions(-)
 create mode 100644 kernel/Kconfig.kexec

diff --git a/arch/Kconfig b/arch/Kconfig
index aff2746c8af2..b2872e9d3760 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -11,19 +11,6 @@ source "arch/$(SRCARCH)/Kconfig"
 
 menu "General architecture-dependent options"
 
-config CRASH_CORE
-        bool
-
-config KEXEC_CORE
-        select CRASH_CORE
-        bool
-
-config KEXEC_ELF
-        bool
-
-config HAVE_IMA_KEXEC
-        bool
-
 config ARCH_HAS_SUBPAGE_FAULTS
         bool
         help
diff --git a/init/Kconfig b/init/Kconfig
index f7f65af4ee12..639e8a3363c3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1927,6 +1927,8 @@ config BINDGEN_VERSION_TEXT
 config TRACEPOINTS
         bool
 
+source "kernel/Kconfig.kexec"
+
 endmenu         # General setup
 
 source "arch/Kconfig"
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
new file mode 100644
index 000000000000..ff72e45cfaef
--- /dev/null
+++ b/kernel/Kconfig.kexec
@@ -0,0 +1,116 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "Kexec and crash features"
+
+config CRASH_CORE
+        bool
+
+config KEXEC_CORE
+        select CRASH_CORE
+        bool
+
+config KEXEC_ELF
+        bool
+
+config HAVE_IMA_KEXEC
+        bool
+
+config KEXEC
+        bool "Enable kexec system call"
+        default ARCH_DEFAULT_KEXEC
+        depends on ARCH_SUPPORTS_KEXEC
+        select KEXEC_CORE
+        help
+          kexec is a system call that implements the ability to shutdown your
+          current kernel, and to start another kernel. It is like a reboot
+          but it is independent of the system firmware. And like a reboot
+          you can start any kernel with it, not just Linux.
+
+          The name comes from the similarity to the exec system call.
+
+          It is an ongoing process to be certain the hardware in a machine
+          is properly shutdown, so do not be surprised if this code does not
+          initially work for you. As of this writing the exact hardware
+          interface is strongly in flux, so no good recommendation can be
+          made.
+
+config KEXEC_FILE
+        bool "Enable kexec file based system call"
+        depends on ARCH_SUPPORTS_KEXEC_FILE
+        select KEXEC_CORE
+        help
+          This is new version of kexec system call. This system call is
+          file based and takes file descriptors as system call argument
+          for kernel and initramfs as opposed to list of segments as
+          accepted by kexec system call.
+
+config KEXEC_SIG
+        bool "Verify kernel signature during kexec_file_load() syscall"
+        depends on ARCH_SUPPORTS_KEXEC_SIG
+        depends on KEXEC_FILE
+        help
+          This option makes the kexec_file_load() syscall check for a valid
+          signature of the kernel image. The image can still be loaded without
+          a valid signature unless you also enable KEXEC_SIG_FORCE, though if
+          there's a signature that we can check, then it must be valid.
+
+          In addition to this option, you need to enable signature
+          verification for the corresponding kernel image type being
+          loaded in order for this to work.
+
+config KEXEC_SIG_FORCE
+        bool "Require a valid signature in kexec_file_load() syscall"
+        depends on ARCH_SUPPORTS_KEXEC_SIG_FORCE
+        depends on KEXEC_SIG
+        help
+          This option makes kernel signature verification mandatory for
+          the kexec_file_load() syscall.
+
+config KEXEC_IMAGE_VERIFY_SIG
+        bool "Enable Image signature verification support (ARM)"
+        default ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+        depends on ARCH_SUPPORTS_KEXEC_IMAGE_VERIFY_SIG
+        depends on KEXEC_SIG
+        depends on EFI && SIGNED_PE_FILE_VERIFICATION
+        help
+          Enable Image signature verification support.
+
+config KEXEC_BZIMAGE_VERIFY_SIG
+        bool "Enable bzImage signature verification
[PATCH v5 00/13] refactor Kconfig to consolidate KEXEC and CRASH options
The Kconfig is refactored to consolidate KEXEC and CRASH options from
various arch/<arch>/Kconfig files into new file kernel/Kconfig.kexec.

The Kconfig.kexec is now a submenu titled "Kexec and crash features"
located under "General Setup".

The following options are impacted:
 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_IMAGE_VERIFY_SIG
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

Over time, these options have been copied between Kconfig files and
are very similar to one another, but with slight differences.

The following architectures are impacted by the refactor (because of
use of one or more KEXEC/CRASH options):
 - arm
 - arm64
 - ia64
 - loongarch
 - m68k
 - mips
 - parisc
 - powerpc
 - riscv
 - s390
 - sh
 - x86

More information:

In the patch series "crash: Kernel handling of CPU and memory hot
un/plug"
https://lore.kernel.org/lkml/20230503224145.7405-1-eric.devol...@oracle.com/
the new kernel feature introduces the config option CRASH_HOTPLUG.

In reviewing, Thomas Gleixner requested that the new config option not
be placed in x86 Kconfig. Rather the option needs a generic/common
home. To Thomas' point, the KEXEC and CRASH options have largely been
duplicated in the various arch/<arch>/Kconfig files, with minor
differences. This kind of proliferation is to be avoided/stopped.

https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/

To that end, I have refactored the arch Kconfigs so as to consolidate
the various KEXEC and CRASH options.

Generally speaking, this work has the following themes:

- KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec
- These items from arch/Kconfig:
    CRASH_CORE
    KEXEC_CORE
    KEXEC_ELF
    HAVE_IMA_KEXEC
- These items from arch/x86/Kconfig form the common options:
    KEXEC
    KEXEC_FILE
    KEXEC_SIG
    KEXEC_SIG_FORCE
    KEXEC_BZIMAGE_VERIFY_SIG
    KEXEC_JUMP
    CRASH_DUMP
- These items from arch/arm64/Kconfig form the common options:
    KEXEC_IMAGE_VERIFY_SIG
- The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec
- The Kconfig.kexec is now a submenu titled "Kexec and crash features"
  and is now listed in "General Setup" submenu from init/Kconfig.
- To control the common options, each has a new ARCH_SUPPORTS_<option>
  option. These gateway options determine whether the common options
  are valid for the architecture.
- To account for the slight differences in the original architecture
  coding of the common options, each now has a corresponding
  ARCH_SELECTS_<option> which are used to elicit the same side effects
  as the original arch/<arch>/Kconfig files for KEXEC and CRASH options.
- NOTE: The existing ARCH_HAS_KEXEC_PURGATORY remains unchanged.

An example, 'make menuconfig' illustrating the submenu:

  > General setup > Kexec and crash features
  [*] Enable kexec system call
  [*] Enable kexec file based system call
  [*]   Verify kernel signature during kexec_file_load() syscall
  [ ]     Require a valid signature in kexec_file_load() syscall
  [ ]     Enable bzImage signature verification support
  [*] kexec jump
  [*] kernel crash dumps
  [*]   Update the crash elfcorehdr on system configuration changes

In the process of consolidating the common options, I encountered
slight differences in the coding of these options in several of the
architectures. As a result, I settled on the following solution:

- Each of the common options has a 'depends on ARCH_SUPPORTS_<option>'
  statement. For example, the KEXEC_FILE option has a 'depends on
  ARCH_SUPPORTS_KEXEC_FILE' statement.

  This approach is needed on all common options so as to prevent
  options from appearing for architectures which previously did not
  allow/enable them. For example, arm supports KEXEC but not
  KEXEC_FILE. The arch/arm/Kconfig does not provide
  ARCH_SUPPORTS_KEXEC_FILE and so KEXEC_FILE and related options are
  not available to arm.

- The boolean ARCH_SUPPORTS_<option> in effect allows the arch to
  determine when the feature is allowed. Archs which don't have the
  feature simply do not provide the corresponding
  ARCH_SUPPORTS_<option>. For each arch, where there previously were
  KEXEC and/or CRASH options, these have been replaced with the
  corresponding boolean ARCH_SUPPORTS_<option>, and an appropriate
  def_bool statement.

  For example, if the arch supports KEXEC_FILE, then the
  ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits
  the KEXEC_FILE option to be available.

  If the arch has a 'depends on' statement in its original coding of
  the option, then that expression becomes part of the def_bool
  expression. For example, arm64 had:

    config KEXEC
      depends on PM_SLEEP_SMP

  and in this solution, this converts to:

    config ARCH_SUPPORTS_KEXEC
      def_bool PM_SLEEP_SMP

- In order to account for the architecture differences in the coding
  for the common options, the ARCH_SELECTS_<option> in the
  arch/<arch>/Kconfig is used. This option has a 'depends on <option>'
  statement to couple it to the main option, and from there can insert
  the differences from the common
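A minimal Kconfig sketch of the gateway/selects pattern described above. This is an illustration only: MYARCH_DEP and MYARCH_EXTRA_OPTION are placeholder symbols, not symbols from the series; the real symbol names and expressions are in the per-arch patches below.

```kconfig
# In kernel/Kconfig.kexec: the common option, gated on the arch-provided
# ARCH_SUPPORTS_<option> symbol.
config KEXEC
	bool "Enable kexec system call"
	depends on ARCH_SUPPORTS_KEXEC
	select KEXEC_CORE

# In arch/<arch>/Kconfig: the gateway, carrying any former 'depends on'
# expression of the old per-arch option (MYARCH_DEP is a placeholder).
config ARCH_SUPPORTS_KEXEC
	def_bool MYARCH_DEP

# In arch/<arch>/Kconfig: side effects that fire only when the common
# option is enabled (MYARCH_EXTRA_OPTION is a placeholder).
config ARCH_SELECTS_KEXEC
	def_bool y
	depends on KEXEC
	select MYARCH_EXTRA_OPTION
```

The 'def_bool y' + 'depends on' + 'select' shape of ARCH_SELECTS_<option> is what lets an always-on helper symbol exist purely to carry selects, without ever being user-visible.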
[PATCH v5 04/13] ia64/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/ia64/Kconfig | 28 +---
 1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 2cd93e6bf0fe..88382f105301 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -361,31 +361,13 @@ config IA64_HP_AML_NFW
           the "force" module parameter, e.g., with the "aml_nfw.force"
           kernel command line option.
 
-config KEXEC
-        bool "kexec system call"
-        depends on !SMP || HOTPLUG_CPU
-        select KEXEC_CORE
-        help
-          kexec is a system call that implements the ability to shutdown your
-          current kernel, and to start another kernel. It is like a reboot
-          but it is independent of the system firmware. And like a reboot
-          you can start any kernel with it, not just Linux.
-
-          The name comes from the similarity to the exec system call.
-
-          It is an ongoing process to be certain the hardware in a machine
-          is properly shutdown, so do not be surprised if this code does not
-          initially work for you. As of this writing the exact hardware
-          interface is strongly in flux, so no good recommendation can be
-          made.
+endmenu
 
-config CRASH_DUMP
-        bool "kernel crash dumps"
-        depends on IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
-        help
-          Generate crash dump after being started by kexec.
+config ARCH_SUPPORTS_KEXEC
+        def_bool !SMP || HOTPLUG_CPU
 
-endmenu
+config ARCH_SUPPORTS_CRASH_DUMP
+        def_bool IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
 
 menu "Power management and ACPI options"
-- 
2.31.1
[PATCH v5 02/13] x86/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/x86/Kconfig | 92 ++--
 1 file changed, 19 insertions(+), 73 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7422db409770..9767a343f7c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2040,88 +2040,34 @@ config EFI_RUNTIME_MAP
 
 source "kernel/Kconfig.hz"
 
-config KEXEC
-        bool "kexec system call"
-        select KEXEC_CORE
-        help
-          kexec is a system call that implements the ability to shutdown your
-          current kernel, and to start another kernel. It is like a reboot
-          but it is independent of the system firmware. And like a reboot
-          you can start any kernel with it, not just Linux.
-
-          The name comes from the similarity to the exec system call.
-
-          It is an ongoing process to be certain the hardware in a machine
-          is properly shutdown, so do not be surprised if this code does not
-          initially work for you. As of this writing the exact hardware
-          interface is strongly in flux, so no good recommendation can be
-          made.
-
-config KEXEC_FILE
-        bool "kexec file based system call"
-        select KEXEC_CORE
-        select HAVE_IMA_KEXEC if IMA
-        depends on X86_64
-        depends on CRYPTO=y
-        depends on CRYPTO_SHA256=y
-        help
-          This is new version of kexec system call. This system call is
-          file based and takes file descriptors as system call argument
-          for kernel and initramfs as opposed to list of segments as
-          accepted by previous system call.
+config ARCH_SUPPORTS_KEXEC
+        def_bool y
 
-config ARCH_HAS_KEXEC_PURGATORY
-        def_bool KEXEC_FILE
+config ARCH_SUPPORTS_KEXEC_FILE
+        def_bool X86_64 && CRYPTO && CRYPTO_SHA256
 
-config KEXEC_SIG
-        bool "Verify kernel signature during kexec_file_load() syscall"
+config ARCH_SELECTS_KEXEC_FILE
+        def_bool y
         depends on KEXEC_FILE
-        help
+        select HAVE_IMA_KEXEC if IMA
 
-          This option makes the kexec_file_load() syscall check for a valid
-          signature of the kernel image. The image can still be loaded without
-          a valid signature unless you also enable KEXEC_SIG_FORCE, though if
-          there's a signature that we can check, then it must be valid.
+config ARCH_HAS_KEXEC_PURGATORY
+        def_bool KEXEC_FILE
 
-          In addition to this option, you need to enable signature
-          verification for the corresponding kernel image type being
-          loaded in order for this to work.
+config ARCH_SUPPORTS_KEXEC_SIG
+        def_bool y
 
-config KEXEC_SIG_FORCE
-        bool "Require a valid signature in kexec_file_load() syscall"
-        depends on KEXEC_SIG
-        help
-          This option makes kernel signature verification mandatory for
-          the kexec_file_load() syscall.
+config ARCH_SUPPORTS_KEXEC_SIG_FORCE
+        def_bool y
 
-config KEXEC_BZIMAGE_VERIFY_SIG
-        bool "Enable bzImage signature verification support"
-        depends on KEXEC_SIG
-        depends on SIGNED_PE_FILE_VERIFICATION
-        select SYSTEM_TRUSTED_KEYRING
-        help
-          Enable bzImage signature verification support.
+config ARCH_SUPPORTS_KEXEC_BZIMAGE_VERIFY_SIG
+        def_bool y
 
-config CRASH_DUMP
-        bool "kernel crash dumps"
-        depends on X86_64 || (X86_32 && HIGHMEM)
-        help
-          Generate crash dump after being started by kexec.
-          This should be normally only set in special crash dump kernels
-          which are loaded in the main kernel with kexec-tools into
-          a specially reserved region and then later executed after
-          a crash by kdump/kexec. The crash dump kernel must be compiled
-          to a memory address not used by the main kernel or BIOS using
-          PHYSICAL_START, or it must be built as a relocatable image
-          (CONFIG_RELOCATABLE=y).
-          For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_KEXEC_JUMP
+        def_bool y
 
-config KEXEC_JUMP
-        bool "kexec jump"
-        depends on KEXEC && HIBERNATION
-        help
-          Jump between original kernel and kexeced kernel and invoke
-          code in physical address mode via KEXEC
+config ARCH_SUPPORTS_CRASH_DUMP
+        def_bool X86_64 || (X86_32 && HIGHMEM)
 
 config PHYSICAL_START
         hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
-- 
2.31.1
[PATCH v5 03/13] arm/kexec: refactor for kernel/Kconfig.kexec
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder
---
 arch/arm/Kconfig | 29 -
 1 file changed, 4 insertions(+), 25 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 7a27550ff3c1..1a6a6eb48a15 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1645,20 +1645,8 @@ config XIP_DEFLATED_DATA
           copied, saving some precious ROM space. A possible drawback is a
           slightly longer boot delay.
 
-config KEXEC
-        bool "Kexec system call (EXPERIMENTAL)"
-        depends on (!SMP || PM_SLEEP_SMP)
-        depends on MMU
-        select KEXEC_CORE
-        help
-          kexec is a system call that implements the ability to shutdown your
-          current kernel, and to start another kernel. It is like a reboot
-          but it is independent of the system firmware. And like a reboot
-          you can start any kernel with it, not just Linux.
-
-          It is an ongoing process to be certain the hardware in a machine
-          is properly shutdown, so do not be surprised if this code does not
-          initially work for you.
+config ARCH_SUPPORTS_KEXEC
+        def_bool (!SMP || PM_SLEEP_SMP) && MMU
 
 config ATAGS_PROC
         bool "Export atags in procfs"
@@ -1668,17 +1656,8 @@ config ATAGS_PROC
           Should the atags used to boot the kernel be exported in an "atags"
           file in procfs. Useful with kexec.
 
-config CRASH_DUMP
-        bool "Build kdump crash kernel (EXPERIMENTAL)"
-        help
-          Generate crash dump after being started by kexec. This should
-          be normally only set in special crash dump kernels which are
-          loaded in the main kernel with kexec-tools into a specially
-          reserved region and then later executed after a crash by
-          kdump/kexec. The crash dump kernel must be compiled to a
-          memory address not used by the main kernel
-
-          For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+        def_bool y
 
 config AUTO_ZRELADDR
         bool "Auto calculation of the decompressed kernel image address" if !ARCH_MULTIPLATFORM
-- 
2.31.1
Re: [apparmor] [PATCH v2 08/92] fs: new helper: simple_rename_timestamp
On Wed, Jul 05, 2023 at 08:04:41PM -0400, Jeff Layton wrote:
>
> I don't believe it's an issue. I've seen nothing in the POSIX spec that
> mandates that timestamp updates to different inodes involved in an
> operation be set to the _same_ value. It just says they must be updated.
>
> It's also hard to believe that any software would depend on this either,
> given that it's very inconsistent across filesystems today. AFAICT, this
> was mostly done in the past just as a matter of convenience.

I've seen this assumption in several programs:

mutt buffy.c
https://sources.debian.org/src/mutt/2.2.9-1/buffy.c/?hl=625#L625

  if (mailbox->newly_created &&
      (sb->st_ctime != sb->st_mtime || sb->st_ctime != sb->st_atime))
    mailbox->newly_created = 0;

neomutt mbox/mbox.c
https://sources.debian.org/src/neomutt/20220429+dfsg1-4.1/mbox/mbox.c/?hl=1820#L1820

  if (m->newly_created &&
      ((st.st_ctime != st.st_mtime) || (st.st_ctime != st.st_atime)))
    m->newly_created = false;

screen logfile.c
https://sources.debian.org/src/screen/4.9.0-4/logfile.c/?hl=130#L130

  if ((!s->st_dev && !s->st_ino) ||   /* stat failed, that's new! */
      !s->st_nlink ||                 /* red alert: file unlinked */
      (s->st_size < o.st_size) ||     /* file truncated */
      (s->st_mtime != o.st_mtime) ||  /* file modified */
      ((s->st_ctime != o.st_ctime) && /* file changed (moved) */
       !(s->st_mtime == s->st_ctime &&  /* and it was not a change */
         o.st_ctime < s->st_ctime)))    /* due to delayed nfs write */
  {

nemo libnemo-private/nemo-vfs-file.c
https://sources.debian.org/src/nemo/5.6.5-1/libnemo-private/nemo-vfs-file.c/?hl=344#L344

  /* mtime is when the contents changed; ctime is when the
   * contents or the permissions (inc. owner/group) changed.
   * So we can only know when the permissions changed if mtime
   * and ctime are different.
   */
  if (file->details->mtime == file->details->ctime) {
      return FALSE;
  }

While looking for more examples, I found a perl test that seems to
suggest that at least Solaris, AFS, AmigaOS, DragonFly BSD do as you
suggest:
https://sources.debian.org/src/perl/5.36.0-7/t/op/stat.t/?hl=158#L140

Thanks
Re: [PATCH v2 00/89] fs: new accessors for inode->i_ctime
On Thu, 2023-07-06 at 10:16 -0500, Eric W. Biederman wrote:
> Jeff Layton writes:
> 
> > On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
> > > v2:
> > > - prepend patches to add missing ctime updates
> > > - add simple_rename_timestamp helper function
> > > - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
> > > - drop individual inode_ctime_set_{sec,nsec} helpers
> > > 
> > > I've been working on a patchset to change how the inode->i_ctime is
> > > accessed in order to give us conditional, high-res timestamps for the
> > > ctime and mtime. struct timespec64 has unused bits in it that we can use
> > > to implement this. In order to do that however, we need to wrap all
> > > accesses of inode->i_ctime to ensure that bits used as flags are
> > > appropriately handled.
> > > 
> > > The patchset starts with reposts of some missing ctime updates that I
> > > spotted in the tree. It then adds a new helper function for updating the
> > > timestamp after a successful rename, and new ctime accessor
> > > infrastructure.
> > > 
> > > The bulk of the patchset is individual conversions of different
> > > subsystems to use the new infrastructure. Finally, the patchset renames
> > > the i_ctime field to __i_ctime to help ensure that I didn't miss
> > > anything.
> > > 
> > > This should apply cleanly to linux-next as of this morning.
> > > 
> > > Most of this conversion was done via 5 different coccinelle scripts, run
> > > in succession, with a large swath of by-hand conversions to clean up the
> > > remainder.
> > 
> > A couple of other things I should note:
> > 
> > If you sent me an Acked-by or Reviewed-by in the previous set, then I
> > tried to keep it on the patch here, since the respun patches are mostly
> > just renaming stuff from v1. Let me know if I've missed any.
> > 
> > I've also pushed the pile to my tree as this tag:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ctime.20230705
> > 
> > In case that's easier to work with.
> 
> Are there any preliminary patches showing what you want your introduced
> accessors to turn into? It is hard to judge the sanity of the
> introduction of wrappers without seeing what the wrappers are ultimately
> going to do.
> 
> Eric

I have a draft version of the multigrain patches on top of the wrapper
conversion I've already posted in my "mgctime-experimental" branch:

https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/log/?h=mgctime-experimental

The rationale is best explained in this changelog:

https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/commit/?h=mgctime-experimental&id=face437a144d3375afb7f70c233b0644b4edccba

The idea will be to enable this on a per-fs basis.
-- 
Jeff Layton
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On 7/6/23 10:58, Alexander Gordeev wrote:

On Wed, Jul 05, 2023 at 08:49:58AM -0700, Nathan Chancellor wrote:
...
I just bisected the following build failure visible with 'ARCH=s390
allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec:
refactor for kernel/Kconfig.kexec") in -next.

arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
  120 | static bool kdump_csum_valid(struct kimage *image)
      |                              ^~
arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
  188 | int machine_kexec_prepare(struct kimage *image)
      |                           ^~
arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare':
arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage'
  192 |         if (image->type == KEXEC_TYPE_CRASH)
      |                  ^~
arch/s390/kernel/machine_kexec.c:192:28: error: 'KEXEC_TYPE_CRASH' undeclared (first use in this function); did you mean 'KEXEC_ON_CRASH'?
  192 |         if (image->type == KEXEC_TYPE_CRASH)
      |                            ^~~~
      |                            KEXEC_ON_CRASH
arch/s390/kernel/machine_kexec.c:192:28: note: each undeclared identifier is reported only once for each function it appears in
arch/s390/kernel/machine_kexec.c:196:18: error: invalid use of undefined type 'struct kimage'
  196 |         if (image->type != KEXEC_TYPE_DEFAULT)
      |                  ^~
arch/s390/kernel/machine_kexec.c:196:28: error: 'KEXEC_TYPE_DEFAULT' undeclared (first use in this function); did you mean 'KEXEC_ARCH_DEFAULT'?
  196 |         if (image->type != KEXEC_TYPE_DEFAULT)
      |                            ^~
      |                            KEXEC_ARCH_DEFAULT
In file included from arch/s390/include/asm/thread_info.h:31,
                 from include/linux/thread_info.h:60,
                 from arch/s390/include/asm/preempt.h:6,
                 from include/linux/preempt.h:79,
                 from arch/s390/include/asm/percpu.h:5,
                 from include/linux/irqflags.h:18,
                 from include/linux/rcupdate.h:26,
                 from include/linux/rculist.h:11,
                 from include/linux/pid.h:5,
                 from include/linux/sched.h:14,
                 from include/linux/ratelimit.h:6,
                 from include/linux/dev_printk.h:16,
                 from include/linux/device.h:15,
                 from arch/s390/kernel/machine_kexec.c:9:
arch/s390/kernel/machine_kexec.c:200:48: error: invalid use of undefined type 'struct kimage'
  200 |         reboot_code_buffer = page_to_virt(image->control_code_page);
      |                                                ^~
arch/s390/include/asm/page.h:186:58: note: in definition of macro '__va'
  186 | #define __va(x) ((void *)(unsigned long)(x))
      |                                                          ^
arch/s390/include/asm/page.h:194:38: note: in expansion of macro 'pfn_to_phys'
  194 | #define pfn_to_virt(pfn)        __va(pfn_to_phys(pfn))
      |                                      ^~~
arch/s390/include/asm/page.h:199:33: note: in expansion of macro 'pfn_to_virt'
  199 | #define page_to_virt(page)      pfn_to_virt(page_to_pfn(page))
      |                                 ^~~
include/asm-generic/memory_model.h:64:21: note: in expansion of macro '__page_to_pfn'
   64 | #define page_to_pfn __page_to_pfn
      |                     ^
arch/s390/kernel/machine_kexec.c:200:30: note: in expansion of macro 'page_to_virt'
  200 |         reboot_code_buffer = page_to_virt(image->control_code_page);
      |                              ^~~~
arch/s390/kernel/machine_kexec.c: At top level:
arch/s390/kernel/machine_kexec.c:207:35: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
  207 | void machine_kexec_cleanup(struct kimage *image)
      |                            ^~
arch/s390/kernel/machine_kexec.c: In function '__do_machine_kexec':
arch/s390/kernel/machine_kexec.c:243:40: error: invalid use of undefined type 'struct kimage'
  243 |         data_mover = page_to_phys(image->control_code_page);
      |                                        ^~
arch/s390/include/asm/page.h:189:35: note: in definition of macro 'pfn_to_phys'
  189 | #define pfn_to_phys(pfn)        ((pfn) << PAGE_SHIFT)
      |                                   ^~~
include/asm-generic/memory_model.h:64:21: note: in expansion
Re: [PATCH v2 00/89] fs: new accessors for inode->i_ctime
Jeff Layton writes:

> On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
>> v2:
>> - prepend patches to add missing ctime updates
>> - add simple_rename_timestamp helper function
>> - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
>> - drop individual inode_ctime_set_{sec,nsec} helpers
>>
>> I've been working on a patchset to change how the inode->i_ctime is
>> accessed in order to give us conditional, high-res timestamps for the
>> ctime and mtime. struct timespec64 has unused bits in it that we can use
>> to implement this. In order to do that however, we need to wrap all
>> accesses of inode->i_ctime to ensure that bits used as flags are
>> appropriately handled.
>>
>> The patchset starts with reposts of some missing ctime updates that I
>> spotted in the tree. It then adds a new helper function for updating the
>> timestamp after a successful rename, and new ctime accessor
>> infrastructure.
>>
>> The bulk of the patchset is individual conversions of different
>> subsystems to use the new infrastructure. Finally, the patchset renames
>> the i_ctime field to __i_ctime to help ensure that I didn't miss
>> anything.
>>
>> This should apply cleanly to linux-next as of this morning.
>>
>> Most of this conversion was done via 5 different coccinelle scripts, run
>> in succession, with a large swath of by-hand conversions to clean up the
>> remainder.
>>
>
> A couple of other things I should note:
>
> If you sent me an Acked-by or Reviewed-by in the previous set, then I
> tried to keep it on the patch here, since the respun patches are mostly
> just renaming stuff from v1. Let me know if I've missed any.
>
> I've also pushed the pile to my tree as this tag:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux.git/tag/?h=ctime.20230705
>
> In case that's easier to work with.

Are there any preliminary patches showing what you want your introduced
accessors to turn into? It is hard to judge the sanity of the
introduction of wrappers without seeing what the wrappers are ultimately
going to do.

Eric
Re: [PATCH v4 12/13] s390/kexec: refactor for kernel/Kconfig.kexec
On Wed, Jul 05, 2023 at 08:49:58AM -0700, Nathan Chancellor wrote:
...
> I just bisected the following build failure visible with 'ARCH=s390
> allnoconfig' to this change as commit 842ce0e1dafa ("s390/kexec:
> refactor for kernel/Kconfig.kexec") in -next.
>
> arch/s390/kernel/machine_kexec.c:120:37: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
>   120 | static bool kdump_csum_valid(struct kimage *image)
>       |                              ^~
> arch/s390/kernel/machine_kexec.c:188:34: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
>   188 | int machine_kexec_prepare(struct kimage *image)
>       |                           ^~
> arch/s390/kernel/machine_kexec.c: In function 'machine_kexec_prepare':
> arch/s390/kernel/machine_kexec.c:192:18: error: invalid use of undefined type 'struct kimage'
>   192 |         if (image->type == KEXEC_TYPE_CRASH)
>       |                  ^~
> arch/s390/kernel/machine_kexec.c:192:28: error: 'KEXEC_TYPE_CRASH' undeclared (first use in this function); did you mean 'KEXEC_ON_CRASH'?
>   192 |         if (image->type == KEXEC_TYPE_CRASH)
>       |                            ^~~~
>       |                            KEXEC_ON_CRASH
> arch/s390/kernel/machine_kexec.c:192:28: note: each undeclared identifier is reported only once for each function it appears in
> arch/s390/kernel/machine_kexec.c:196:18: error: invalid use of undefined type 'struct kimage'
>   196 |         if (image->type != KEXEC_TYPE_DEFAULT)
>       |                  ^~
> arch/s390/kernel/machine_kexec.c:196:28: error: 'KEXEC_TYPE_DEFAULT' undeclared (first use in this function); did you mean 'KEXEC_ARCH_DEFAULT'?
>   196 |         if (image->type != KEXEC_TYPE_DEFAULT)
>       |                            ^~
>       |                            KEXEC_ARCH_DEFAULT
> In file included from arch/s390/include/asm/thread_info.h:31,
>                  from include/linux/thread_info.h:60,
>                  from arch/s390/include/asm/preempt.h:6,
>                  from include/linux/preempt.h:79,
>                  from arch/s390/include/asm/percpu.h:5,
>                  from include/linux/irqflags.h:18,
>                  from include/linux/rcupdate.h:26,
>                  from include/linux/rculist.h:11,
>                  from include/linux/pid.h:5,
>                  from include/linux/sched.h:14,
>                  from include/linux/ratelimit.h:6,
>                  from include/linux/dev_printk.h:16,
>                  from include/linux/device.h:15,
>                  from arch/s390/kernel/machine_kexec.c:9:
> arch/s390/kernel/machine_kexec.c:200:48: error: invalid use of undefined type 'struct kimage'
>   200 |         reboot_code_buffer = page_to_virt(image->control_code_page);
>       |                                                ^~
> arch/s390/include/asm/page.h:186:58: note: in definition of macro '__va'
>   186 | #define __va(x) ((void *)(unsigned long)(x))
>       |                                                          ^
> arch/s390/include/asm/page.h:194:38: note: in expansion of macro 'pfn_to_phys'
>   194 | #define pfn_to_virt(pfn)        __va(pfn_to_phys(pfn))
>       |                                      ^~~
> arch/s390/include/asm/page.h:199:33: note: in expansion of macro 'pfn_to_virt'
>   199 | #define page_to_virt(page)      pfn_to_virt(page_to_pfn(page))
>       |                                 ^~~
> include/asm-generic/memory_model.h:64:21: note: in expansion of macro '__page_to_pfn'
>    64 | #define page_to_pfn __page_to_pfn
>       |                     ^
> arch/s390/kernel/machine_kexec.c:200:30: note: in expansion of macro 'page_to_virt'
>   200 |         reboot_code_buffer = page_to_virt(image->control_code_page);
>       |                              ^~~~
> arch/s390/kernel/machine_kexec.c: At top level:
> arch/s390/kernel/machine_kexec.c:207:35: warning: 'struct kimage' declared inside parameter list will not be visible outside of this definition or declaration
>   207 | void machine_kexec_cleanup(struct kimage *image)
>       |                            ^~
> arch/s390/kernel/machine_kexec.c: In function '__do_machine_kexec':
> arch/s390/kernel/machine_kexec.c:243:40: error: invalid use of undefined type 'struct kimage'
>   243 |         data_mover = page_to_phys(image->control_code_page);
>       |                                        ^~
> arch/s390/include/asm/page.h:189:35: note: in definition of macro 'pfn_to_phys'
>   189 | #define pfn_to_phys(pfn)        ((pfn) << PAGE_SHIFT)
>       |                                   ^~~
>
Re: [PATCH v2 92/92] fs: rename i_ctime field to __i_ctime
On Wed 05-07-23 14:58:12, Jeff Layton wrote: > Now that everything in-tree is converted to use the accessor functions, > rename the i_ctime field in the inode to discourage direct access. > > Signed-off-by: Jeff Layton Looks good. Feel free to add: Reviewed-by: Jan Kara Honza > --- > include/linux/fs.h | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 14e38bd900f1..b66442f91835 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -642,7 +642,7 @@ struct inode { > loff_t i_size; > struct timespec64 i_atime; > struct timespec64 i_mtime; > - struct timespec64 i_ctime; > + struct timespec64 __i_ctime; /* use inode_*_ctime accessors! */ > spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */ > unsigned short i_bytes; > u8 i_blkbits; > @@ -1485,7 +1485,7 @@ struct timespec64 inode_set_ctime_current(struct inode > *inode); > */ > static inline struct timespec64 inode_get_ctime(const struct inode *inode) > { > - return inode->i_ctime; > + return inode->__i_ctime; > } > > /** > @@ -1498,7 +1498,7 @@ static inline struct timespec64 inode_get_ctime(const > struct inode *inode) > static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode, > struct timespec64 ts) > { > - inode->i_ctime = ts; > + inode->__i_ctime = ts; > return ts; > } > > -- > 2.41.0 > -- Jan Kara SUSE Labs, CR
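The accessor pattern this patch completes is easy to mimic outside the kernel. A hypothetical userspace sketch, with plain struct timespec standing in for the kernel's timespec64:

```c
#include <time.h>

/* Simplified stand-in for struct inode; only the renamed field is shown. */
struct inode {
	struct timespec __i_ctime;	/* use inode_*_ctime accessors! */
};

/* Reads go through a helper, never through the raw field. */
static struct timespec inode_get_ctime(const struct inode *inode)
{
	return inode->__i_ctime;
}

/* Writes also go through a helper, which returns the stored value. */
static struct timespec inode_set_ctime_to_ts(struct inode *inode,
					     struct timespec ts)
{
	inode->__i_ctime = ts;
	return ts;
}
```

The double-underscore prefix does not enforce anything by itself; it only signals at review time that direct access is discouraged.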
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
On Thu, Jul 6, 2023 at 8:19 AM Shengjiu Wang wrote: > The clean way for fixing is to remove the code in fsl_sai_set_bclk() > and add "fsl,sai-mclk-direction-output;" property in dts for some > node. Yes, after thinking more about it, I agree. So what you are proposing is basically a revert of the patch in Subject and I assume that the reason you did the original patch was that you missed passing fsl,sai-mclk-direction-output in DT. I will send the revert. Thanks
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Thu, 6 Jul 2023, Gerald Schaefer wrote:
> Since none of my remarks on the comments seem valid or strictly necessary
> any more, and I also could not find functional issues, I think you can add
> this patch as new version for 07/12. And I can now give you this:
>
> Reviewed-by: Gerald Schaefer

Great, thanks a lot Gerald. The one change I'm making to it is then this
merged in:

--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -455,6 +455,11 @@ void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
 	page = virt_to_page(pgtable);
 	SetPageActive(page);
 	page_table_free(mm, (unsigned long *)pgtable);
+	/*
+	 * page_table_free() does not do the pgste gmap_unlink() which
+	 * page_table_free_rcu() does: warn us if pgste ever reaches here.
+	 */
+	WARN_ON_ONCE(mm_alloc_pgste(mm));
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 5/15/23 5:16 PM, Tobias Huschle wrote:
> The current load balancer implementation implies that scheduler groups,
> within the same domain, all host the same number of CPUs. This is
> reflected in the condition that a scheduler group, which is load
> balancing and classified as having spare capacity, should pull work
> from the busiest group, if the local group runs less processes than
> the busiest one. This implies that these two groups should run the
> same number of processes, which is problematic if the groups are not
> of the same size.
>
> The assumption that scheduler groups within the same scheduler domain
> host the same number of CPUs appears to be true for non-s390
> architectures. Nevertheless, s390 can have scheduler groups of unequal
> size.
>
> This introduces a performance degradation in the following scenario:
>
> Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
> the remaining 2 are located on another socket:
>
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
>
> Placing some workload ( x = one task ) yields the following
> scenarios:
>
> The first 5 tasks are distributed evenly across the two groups.
>
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
>          x x x          x x
>
> Adding a 6th task yields the following distribution:
>
> Socket   -----1-----    -2-
> CPU      1 2 3 4 5 6    7 8
> SMT1     x x x          x x
> SMT2                    x
>
> The task is added to the 2nd scheduler group, as the scheduler has the
> assumption that scheduler groups are of the same size, so they should
> also host the same number of tasks. This makes CPU 7 run into SMT
> thread, which comes with a performance penalty. This means, that in
> the window of 6-8 tasks, load balancing is done suboptimally, because
> SMT is used although there is no reason to do so as fully idle CPUs
> are still available.
>
> Taking the weight of the scheduler groups into account ensures that
> a load balancing CPU within a smaller group will not try to pull tasks
> from a bigger group while the bigger group still has idle CPUs
> available.
>
> Signed-off-by: Tobias Huschle
> ---
>  kernel/sched/fair.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..b1307d7e4065 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	 * group's child domain.
>  	 */
>  	if (sds.prefer_sibling && local->group_type == group_has_spare &&
> -	    busiest->sum_nr_running > local->sum_nr_running + 1)
> +	    busiest->sum_nr_running * local->group_weight >
> +	    local->sum_nr_running * busiest->group_weight + 1)
>  		goto force_balance;
>

I assume it's SMT2 here. A sched group is enclosed in [busy_cpus/idle_cpus/weight].

Let's take an example: we will begin with the said imbalance.

[3/9/12] -- local group
[3/1/4]  -- busy group

We will evaluate 3*12 > (4*(3+1)), which is true, and proceed further to
balance. But calculate_imbalance will lead to zero as the imbalance, no? In
the prefer-sibling case it finds the difference of sum_nr_running of the two
groups: it will be 3-3 = 0.

This would need modifications to calculate_imbalance. Maybe something like
this will help? NOT TESTED AT ALL.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a80a73909dc2..e027d4086edc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10296,7 +10296,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		return;
 	}

-	if (busiest->group_weight == 1 || sds->prefer_sibling) {
+	if (busiest->group_weight == 1 ||
+	    (sds->prefer_sibling &&
+	     busiest->group_weight == local->group_weight)) {
 		unsigned int nr_diff = busiest->sum_nr_running;
 		/*
 		 * When prefer sibling, evenly spread running tasks on
@@ -10305,6 +10307,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		env->migration_type = migrate_task;
 		lsub_positive(&nr_diff, local->sum_nr_running);
 		env->imbalance = nr_diff;
+	}
+	if (sds->prefer_sibling &&
+	    busiest->group_weight != local->group_weight) {
+		env->migration_type = migrate_task;
+		env->imbalance = 1;
 	} else {

---

On a tangential dimension. Since prefer_siblings makes the inherent
assumption that all groups have equal weight, it will cause complications
when group_weights are different. I think that
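Stepping back from the diff: the arithmetic in the [3/9/12] vs [3/1/4] example can be checked in isolation. The following is a hypothetical userspace sketch, not kernel code — it models only the two fields the discussed conditions read, with `prefer_sibling_imbalance()` mimicking what lsub_positive() does in calculate_imbalance() today:

```c
/* Local stand-ins for the two fields the discussed conditions read. */
struct group_stats {
	unsigned int sum_nr_running;
	unsigned int group_weight;
};

/* Weighted force-balance test from the patch under discussion. */
static int force_balance(const struct group_stats *busiest,
			 const struct group_stats *local)
{
	return busiest->sum_nr_running * local->group_weight >
	       local->sum_nr_running * busiest->group_weight + 1;
}

/*
 * prefer_sibling imbalance as calculate_imbalance() computes it today:
 * the difference of the two groups' task counts, clamped at zero
 * (which is what lsub_positive() does).
 */
static unsigned int prefer_sibling_imbalance(const struct group_stats *busiest,
					     const struct group_stats *local)
{
	unsigned int nr_diff = busiest->sum_nr_running;

	if (nr_diff > local->sum_nr_running)
		nr_diff -= local->sum_nr_running;
	else
		nr_diff = 0;
	return nr_diff;
}
```

For local [3/9/12] and busiest [3/1/4], force_balance() fires (3*12 = 36 against 3*4 + 1 = 13), yet the computed imbalance is 3 - 3 = 0 — exactly the inconsistency the reply points out.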
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 7/6/23 6:29 PM, David Hildenbrand wrote: > On 06.07.23 14:32, Aneesh Kumar K V wrote: >> On 7/6/23 4:44 PM, David Hildenbrand wrote: >>> On 06.07.23 11:36, Aneesh Kumar K V wrote: On 7/6/23 2:48 PM, David Hildenbrand wrote: > On 06.07.23 10:50, Aneesh Kumar K.V wrote: >> With memmap on memory, some architecture needs more details w.r.t altmap >> such as base_pfn, end_pfn, etc to unmap vmemmap memory. > > Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't? > > IOW, why can't ppc64 simply allocate the vmemmap from the start of the > memblock (-> base_pfn) and use the stored number of vmemmap pages to > calculate the end_pfn? > > To rephrase: if the vmemmap is not at the beginning and doesn't cover > full apgeblocks, memory onlining/offlining would be broken. > > [...] With ppc64 and 64K pagesize and different memory block sizes, we can end up allocating vmemmap backing memory from outside altmap because a single page vmemmap can cover 1024 pages (64 *1024/sizeof(struct page)). and that can point to pages outside the dev_pagemap range. So on free we check >>> >>> So you end up with a mixture of altmap and ordinarily-allocated vmemmap >>> pages? That sound wrong (and is counter-intuitive to the feature in >>> general, where we *don't* want to allocate the vmemmap from outside the >>> altmap). >>> >>> (64 * 1024) / sizeof(struct page) -> 1024 pages >>> >>> 1024 pages * 64k = 64 MiB. >>> >>> What's the memory block size on these systems? If it's >= 64 MiB the >>> vmemmap of a single memory block fits into a single page and we should be >>> fine. >>> >>> Smells like you want to disable the feature on a 64k system. >>> >> >> But that part of vmemmap_free is common for both dax,dax kmem and the new >> memmap on memory feature. ie, ppc64 vmemmap_free have checks which require >> a full altmap structure with all the details in. So for memmap on memmory to >> work on ppc64 we do require similar altmap struct. 
>> Hence the idea
>> of adding vmemmap_altmap to struct memory_block

> I'd suggest making sure that for the memmap_on_memory case you really
> *always* allocate from the altmap (that's what the feature is about after
> all), and otherwise block the feature (i.e., arch_mhp_supports_... should
> reject it).

Sure. How about?

bool mhp_supports_memmap_on_memory(unsigned long size)
{
	unsigned long nr_pages = size >> PAGE_SHIFT;
	unsigned long vmemmap_size = nr_pages * sizeof(struct page);

	if (!radix_enabled())
		return false;
	/*
	 * memmap on memory only supported with memory block size add/remove
	 */
	if (size != memory_block_size_bytes())
		return false;
	/*
	 * Also make sure the vmemmap allocation is fully contained
	 * so that we always allocate vmemmap memory from altmap area.
	 */
	if (!IS_ALIGNED(vmemmap_size, PAGE_SIZE))
		return false;
	/*
	 * The pageblock alignment requirement is met by using
	 * reserve blocks in altmap.
	 */
	return true;
}

> Then, you can reconstruct the altmap layout trivially
>
> base_pfn: start of the range to unplug
> end_pfn: base_pfn + nr_vmemmap_pages
>
> and pass that to the removal code, which will do the right thing, no?
>
> Sure, remembering the altmap might be a potential cleanup (eventually?), but
> the basic reasoning why this is required as patch #1 IMHO is wrong: if you
> say you support memmap_on_memory for a configuration, then you should also
> properly support it (allocate from the hotplugged memory), not silently fall
> back to something else.

I guess you want to keep the altmap introduction as a later patch in the
series and not the preparatory patch? Or are you ok with just adding the
additional check I mentioned above w.r.t size value and keep this patch as
patch 1 as a generic cleanup (avoiding the recomputation of
altmap->alloc/base_pfn/end_pfn)?

-aneesh
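The IS_ALIGNED() test in that check is plain arithmetic. A hypothetical userspace sketch — assuming 64K base pages and a 64-byte struct page, both assumptions chosen to match the ppc64 discussion — shows which memory block sizes keep the vmemmap fully inside the altmap:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed configuration, mirroring the 64K-page ppc64 case discussed. */
#define PAGE_SHIFT		16	/* 64K base pages */
#define PAGE_SIZE		(1ULL << PAGE_SHIFT)
#define STRUCT_PAGE_SIZE	64ULL	/* assumed sizeof(struct page) */

static bool is_aligned(uint64_t val, uint64_t align)
{
	return (val & (align - 1)) == 0;
}

/*
 * One vmemmap page describes PAGE_SIZE / STRUCT_PAGE_SIZE = 1024 pages,
 * i.e. 64 MiB of memory.  If a memory block's vmemmap does not fill
 * whole pages, the last vmemmap page would also describe pfns outside
 * the block, so it could not be allocated from the block's altmap.
 */
static bool vmemmap_fully_contained(uint64_t memory_block_bytes)
{
	uint64_t nr_pages = memory_block_bytes >> PAGE_SHIFT;
	uint64_t vmemmap_size = nr_pages * STRUCT_PAGE_SIZE;

	return is_aligned(vmemmap_size, PAGE_SIZE);
}
```

Under these assumptions, a 16 MiB block needs only 16 KiB of vmemmap (a partial page, rejected), while any multiple of 64 MiB passes.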
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Thu, Jul 06, 2023 at 01:52:08PM +0900, David Stevens wrote: > On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang wrote: > > > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > > > From: David Stevens > > > > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > > > memory into the guest that is backed by un-refcounted struct pages - for > > > example, higher order non-compound pages allocated by the amdgpu driver > > > via ttm_pool_alloc_page. > > > > I guess you mean the tail pages of the higher order non-compound pages? > > And as to the head page, it is said to be set to one coincidentally[*], > > and shall not be considered as refcounted. IIUC, refcount of this head > > page will be increased and decreased soon in hva_to_pfn_remapped(), so > > this may not be a problem(?). But treating this head page differently, > > as a refcounted one(e.g., to set the A/D flags), is weired. > > > > Or maybe I missed some context, e.g., can the head page be allocted to > > guest at all? > > Yes, this is to allow mapping the tail pages of higher order > non-compound pages - I should have been more precise in my wording. > The head pages can already be mapped into the guest. > > Treating the head and tail pages would require changing how KVM > behaves in a situation it supports today (rather than just adding > support for an unsupported situation). Currently, without this series, > KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the > guest. When that happens, KVM sets the A/D flags. I'm not sure whether > that's actually valid behavior, nor do I know whether anyone actually > cares about it. But it's what KVM does today, and I would shy away > from modifying that behavior without good reason. > > > > > > > The bulk of this change is tracking the is_refcounted_page flag so that > > > non-refcounted pages don't trigger page_count() == 0 warnings. This is > > > done by storing the flag in an unused bit in the sptes. 
> > > > Also, maybe we should mention this only works on x86-64. > > > > > > > > Signed-off-by: David Stevens > > > --- > > > arch/x86/kvm/mmu/mmu.c | 44 + > > > arch/x86/kvm/mmu/mmu_internal.h | 1 + > > > arch/x86/kvm/mmu/paging_tmpl.h | 9 --- > > > arch/x86/kvm/mmu/spte.c | 4 ++- > > > arch/x86/kvm/mmu/spte.h | 12 - > > > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++--- > > > 6 files changed, 62 insertions(+), 30 deletions(-) > > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > > > index e44ab512c3a1..b1607e314497 100644 > > > --- a/arch/x86/kvm/mmu/mmu.c > > > +++ b/arch/x86/kvm/mmu/mmu.c > > > > ... > > > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, > > > struct kvm_memory_slot *slot, > > > bool host_writable = !fault || fault->map_writable; > > > bool prefetch = !fault || fault->prefetch; > > > bool write_fault = fault && fault->write; > > > + bool is_refcounted = !fault || fault->is_refcounted_page; > > > > Just wonder, what if a non-refcounted page is prefetched? Or is it > > possible in > > practice? > > Prefetching is still done via gfn_to_page_many_atomic, which sets > FOLL_GET. That's fixable, but it's not something this series currently > does. So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent whether the corresponding page is ref-countable or not, right? Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte) is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit the issue, though. Thanks, -- Isaku Yamahata
[PATCH v8 17/19] powerpc: mm: Convert to GENERIC_IOREMAP
From: Christophe Leroy By taking GENERIC_IOREMAP method, the generic generic_ioremap_prot(), generic_iounmap(), and their generic wrapper ioremap_prot(), ioremap() and iounmap() are all visible and available to arch. Arch needs to provide wrapper functions to override the generic versions if there's arch specific handling in its ioremap_prot(), ioremap() or iounmap(). This change will simplify implementation by removing duplicated code with generic_ioremap_prot() and generic_iounmap(), and has the equivalent functioality as before. Here, add wrapper functions ioremap_prot() and iounmap() for powerpc's special operation when ioremap() and iounmap(). Signed-off-by: Christophe Leroy Signed-off-by: Baoquan He Reviewed-by: Christoph Hellwig Reviewed-by: Mike Rapoport (IBM) Cc: Michael Ellerman Cc: Nicholas Piggin Cc: linuxppc-dev@lists.ozlabs.org --- arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/io.h | 8 +++- arch/powerpc/mm/ioremap.c | 26 +- arch/powerpc/mm/ioremap_32.c | 19 +-- arch/powerpc/mm/ioremap_64.c | 12 ++-- 5 files changed, 16 insertions(+), 50 deletions(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 0b1172cbeccb..9222c138c457 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -193,6 +193,7 @@ config PPC select GENERIC_CPU_VULNERABILITIES if PPC_BARRIER_NOSPEC select GENERIC_EARLY_IOREMAP select GENERIC_GETTIMEOFDAY + select GENERIC_IOREMAP select GENERIC_IRQ_SHOW select GENERIC_IRQ_SHOW_LEVEL select GENERIC_PCI_IOMAPif PCI diff --git a/arch/powerpc/include/asm/io.h b/arch/powerpc/include/asm/io.h index 67a3fb6de498..0732b743e099 100644 --- a/arch/powerpc/include/asm/io.h +++ b/arch/powerpc/include/asm/io.h @@ -889,8 +889,8 @@ static inline void iosync(void) * */ extern void __iomem *ioremap(phys_addr_t address, unsigned long size); -extern void __iomem *ioremap_prot(phys_addr_t address, unsigned long size, - unsigned long flags); +#define ioremap ioremap +#define ioremap_prot ioremap_prot extern void __iomem 
*ioremap_wc(phys_addr_t address, unsigned long size); #define ioremap_wc ioremap_wc @@ -904,14 +904,12 @@ void __iomem *ioremap_coherent(phys_addr_t address, unsigned long size); #define ioremap_cache(addr, size) \ ioremap_prot((addr), (size), pgprot_val(PAGE_KERNEL)) -extern void iounmap(volatile void __iomem *addr); +#define iounmap iounmap void __iomem *ioremap_phb(phys_addr_t paddr, unsigned long size); int early_ioremap_range(unsigned long ea, phys_addr_t pa, unsigned long size, pgprot_t prot); -void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size, -pgprot_t prot, void *caller); extern void __iomem *__ioremap_caller(phys_addr_t, unsigned long size, pgprot_t prot, void *caller); diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c index 4f12504fb405..705e8e8ffde4 100644 --- a/arch/powerpc/mm/ioremap.c +++ b/arch/powerpc/mm/ioremap.c @@ -41,7 +41,7 @@ void __iomem *ioremap_coherent(phys_addr_t addr, unsigned long size) return __ioremap_caller(addr, size, prot, caller); } -void __iomem *ioremap_prot(phys_addr_t addr, unsigned long size, unsigned long flags) +void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long flags) { pte_t pte = __pte(flags); void *caller = __builtin_return_address(0); @@ -74,27 +74,3 @@ int early_ioremap_range(unsigned long ea, phys_addr_t pa, return 0; } - -void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size, -pgprot_t prot, void *caller) -{ - struct vm_struct *area; - int ret; - unsigned long va; - - area = __get_vm_area_caller(size, VM_IOREMAP, IOREMAP_START, IOREMAP_END, caller); - if (area == NULL) - return NULL; - - area->phys_addr = pa; - va = (unsigned long)area->addr; - - ret = ioremap_page_range(va, va + size, pa, prot); - if (!ret) - return (void __iomem *)area->addr + offset; - - vunmap_range(va, va + size); - free_vm_area(area); - - return NULL; -} diff --git a/arch/powerpc/mm/ioremap_32.c b/arch/powerpc/mm/ioremap_32.c index 
9d13143b8be4..ca5bc6be3e6f 100644 --- a/arch/powerpc/mm/ioremap_32.c +++ b/arch/powerpc/mm/ioremap_32.c @@ -21,6 +21,13 @@ __ioremap_caller(phys_addr_t addr, unsigned long size, pgprot_t prot, void *call phys_addr_t p, offset; int err; + /* +* If the address lies within the first 16 MB, assume it's in ISA +* memory space +*/ + if (addr < SZ_16M) + addr += _ISA_MEM_BASE; + /* * Choose an address to map it to. * Once the vmalloc system is running, we use
Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing page
On Wed, 5 Jul 2023 18:20:21 -0700 (PDT) Hugh Dickins wrote: > On Wed, 5 Jul 2023, Gerald Schaefer wrote: > > On Tue, 4 Jul 2023 10:03:57 -0700 (PDT) > > Hugh Dickins wrote: > > > On Tue, 4 Jul 2023, Gerald Schaefer wrote: > > > > On Sat, 1 Jul 2023 21:32:38 -0700 (PDT) > > > > Hugh Dickins wrote: > > > > > On Thu, 29 Jun 2023, Hugh Dickins wrote: > > > ... > > > > > --- a/arch/s390/mm/pgalloc.c > > > > > +++ b/arch/s390/mm/pgalloc.c > > > > > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page) > > > > > * logic described above. Both AA bits are set to 1 to denote a > > > > > 4KB-pgtable > > > > > * while the PP bits are never used, nor such a page is added to or > > > > > removed > > > > > * from mm_context_t::pgtable_list. > > > > > + * > > > > > + * pte_free_defer() overrides those rules: it takes the page off > > > > > pgtable_list, > > > > > + * and prevents both 2K fragments from being reused. > > > > > pte_free_defer() has to > > > > > + * guarantee that its pgtable cannot be reused before the RCU grace > > > > > period > > > > > + * has elapsed (which page_table_free_rcu() does not actually > > > > > guarantee). > > > > > > > > Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not > > > > allow reuse before grace period elapsed. And I hope that it does so, by > > > > setting the PP bits, which would be noticed in page_table_alloc(), in > > > > case the page would be seen there. > > > > > > > > Unlike pte_free_defer(), page_table_free_rcu() would add pages back to > > > > the > > > > end of the list, and so they could be seen in page_table_alloc(), but > > > > they > > > > should not be reused before grace period elapsed and > > > > __tlb_remove_table() > > > > cleared the PP bits, as far as I understand. > > > > > > > > So what exactly do you mean with "which page_table_free_rcu() does not > > > > actually > > > > guarantee"? 
> > > > > > I'll answer without locating and re-reading what Jason explained earlier, > > > perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table(): > > > he may have explained it better. And without working out again all the > > > MMU_GATHER #defines, and which of them do and do not apply to s390 here. > > > > > > The detail that sticks in my mind is the fallback in tlb_remove_table() > > > > Ah ok, I was aware of that "semi-RCU" fallback logic in tlb_remove_table(), > > but that is rather a generic issue, and not s390-specific. > > Yes. > > > I thought you > > meant some s390-oddity here, of which we have a lot, unfortunately... > > Of course, we call tlb_remove_table() from our page_table_free_rcu(), so > > I guess you could say that page_table_free_rcu() cannot guarantee what > > tlb_remove_table() cannot guarantee. > > > > Maybe change to "which page_table_free_rcu() does not actually guarantee, > > by calling tlb_remove_table()", to make it clear that this is not a problem > > of page_table_free_rcu() itself. > > Okay - I'll rephrase slightly to avoid being sued by s390's lawyers :-) > > > > > > in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot > > > batch the tables for freeing by RCU, and resorts instead to an immediate > > > TLB flush (I think: that again involves chasing definitions) followed by > > > tlb_remove_table_sync_one() - which just delivers an interrupt to each > > > CPU, > > > and is commented: > > > /* > > > * This isn't an RCU grace period and hence the page-tables cannot be > > > * assumed to be actually RCU-freed. > > > * > > > * It is however sufficient for software page-table walkers that rely on > > > * IRQ disabling. > > > */ > > > > > > Whether that's good for your PP pages or not, I've given no thought: > > > I've just taken it on trust that what s390 has working today is good. 
> > > > Yes, we should be fine with that, current code can be trusted :-) > > Glad to hear it :-) Yes, I think it's not actually relying on the "rcu" > implied by the function name. Ah ok, now I get it. Never noticed that naming it "free_rcu" could be misleading. It is only ever called from pte_free_tlb(), so always in that "semi-RCU" context. If you just look at the name, you could expect this to always free pagetables by RCU, which would be exactly what you need for pte_free_defer(), and which of course cannot be guaranteed by our page_table_free_rcu(). IOW, exactly what your comment says, and now I think it is actually fine as it is :-) I guess I am a bit lamebrained this week, due to early shift and not enough sleep... > > > > > > > > > If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(), > > > then I would not have written "(which page_table_free_rcu() does not > > > actually guarantee)". But it cannot use call_rcu() because it does > > > not have an rcu_head to work with - it's in some generic code, and > > > there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set. > > > > > >
Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function
On Thu, Jul 06, 2023 at 02:29:24PM +0900, David Stevens wrote: > On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang wrote: > > > > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote: > > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang > > > wrote: > > > > > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long > > > > > addr, bool write_fault, > > > > > * The slow path to get the pfn of the specified host virtual > > > > > address, > > > > > * 1 indicates success, -errno is returned if error is detected. > > > > > */ > > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool > > > > > write_fault, > > > > > -bool interruptible, bool *writable, > > > > > kvm_pfn_t *pfn) > > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t > > > > > *pfn) > > > > > { > > > > > - unsigned int flags = FOLL_HWPOISON; > > > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags; > > > > > struct page *page; > > > > > int npages; > > > > > > > > > > might_sleep(); > > > > > > > > > > - if (writable) > > > > > - *writable = write_fault; > > > > > - > > > > > - if (write_fault) > > > > > - flags |= FOLL_WRITE; > > > > > - if (async) > > > > > - flags |= FOLL_NOWAIT; > > > > > - if (interruptible) > > > > > - flags |= FOLL_INTERRUPTIBLE; > > > > > - > > > > > - npages = get_user_pages_unlocked(addr, 1, , flags); > > > > > + npages = get_user_pages_unlocked(foll->hva, 1, , flags); > > > > > if (npages != 1) > > > > > return npages; > > > > > > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && > > > > > foll->allow_write_mapping; > > > > > + > > > > > /* map read fault as writable if possible */ > > > > > - if (unlikely(!write_fault) && writable) { > > > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) { > > > > > > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here. 
> > > > > > The two statements are logically equivalent, although I guess using > > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more > > > verbose. > > > > Well, as the comment says, we wanna try to map the read fault as writable > > whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE > > for write faults. So I guess using !foll->writable will not allow this. > > Did I miss anything? > > We just set the foll->writable out parameter to be equal to > ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a = > foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have > !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a && > b. Ouch, my bad again... I typed "!foll->writable", but missed the "!" in my head while calculating... Thanks! :) B.R. Yu
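The boolean identity stepped through above is easy to check exhaustively. A throwaway sketch, with a standing in for (foll->flags & FOLL_WRITE) and b for foll->allow_write_mapping (local stand-ins, not the KVM code itself):

```c
#include <stdbool.h>

/* The guard as written: foll->writable = a && b, then !foll->writable && b. */
static bool guard_as_written(bool a, bool b)
{
	bool writable = a && b;

	return !writable && b;
}

/* The simplification derived in the thread: !(a && b) && b == !a && b. */
static bool guard_simplified(bool a, bool b)
{
	return !a && b;
}
```

Both forms agree on all four input combinations, confirming the derivation !(a && b) && b -> (!a || !b) && b -> !a && b.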
Re: [PATCH] powernv/opal-prd: Silence memcpy() run-time false positive warnings
On 2023-07-05 11:06:46 Wed, Jordan Niethe wrote: > > > On 26/6/23 5:04 pm, Mahesh Salgaonkar wrote: > > opal_prd_msg_notifier extracts the opal prd message size from the message > > header and uses it for allocating opal_prd_msg_queue_item that includes > > the correct message size to be copied. However, while running under > > CONFIG_FORTIFY_SOURCE=y, it triggers following run-time warning: > > > > [ 6458.234352] memcpy: detected field-spanning write (size 32) of single > > field ">msg" at arch/powerpc/platforms/powernv/opal-prd.c:355 (size 4) > > [ 6458.234390] WARNING: CPU: 9 PID: 660 at > > arch/powerpc/platforms/powernv/opal-prd.c:355 > > opal_prd_msg_notifier+0x174/0x188 [opal_prd] > > [...] > > [ 6458.234709] NIP [c0080e0c0e6c] opal_prd_msg_notifier+0x174/0x188 > > [opal_prd] > > [ 6458.234723] LR [c0080e0c0e68] opal_prd_msg_notifier+0x170/0x188 > > [opal_prd] > > [ 6458.234736] Call Trace: > > [ 6458.234742] [c002acb23c10] [c0080e0c0e68] > > opal_prd_msg_notifier+0x170/0x188 [opal_prd] (unreliable) > > [ 6458.234759] [c002acb23ca0] [c019ccc0] > > notifier_call_chain+0xc0/0x1b0 > > [ 6458.234774] [c002acb23d00] [c019ceac] > > atomic_notifier_call_chain+0x2c/0x40 > > [ 6458.234788] [c002acb23d20] [c00d69b4] > > opal_message_notify+0xf4/0x2c0 > > [...] > > > > Add a flexible array member to avoid false positive run-time warning. 
> >
> > Reported-by: Aneesh Kumar K.V
> > Signed-off-by: Mahesh Salgaonkar
> > ---
> >  arch/powerpc/platforms/powernv/opal-prd.c | 7 +--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/powernv/opal-prd.c b/arch/powerpc/platforms/powernv/opal-prd.c
> > index 113bdb151f687..9e2c4775f75f5 100644
> > --- a/arch/powerpc/platforms/powernv/opal-prd.c
> > +++ b/arch/powerpc/platforms/powernv/opal-prd.c
> > @@ -30,7 +30,10 @@
> >   */
> >  struct opal_prd_msg_queue_item {
> >  	struct list_head		list;
> > -	struct opal_prd_msg_header	msg;
> > +	union {
> > +		struct opal_prd_msg_header	msg;
> > +		DECLARE_FLEX_ARRAY(__u8, msg_flex);
> > +	};
> >  };
> >
> >  static struct device_node *prd_node;
> > @@ -352,7 +355,7 @@ static int opal_prd_msg_notifier(struct notifier_block *nb,
> >  	if (!item)
> >  		return -ENOMEM;
> >
> > -	memcpy(&item->msg, msg->params, msg_size);
> > +	memcpy(&item->msg_flex, msg->params, msg_size);

> This does silence the warning but it seems like we might be able to go a
> little further.
>
> What about not adding that flex array to struct opal_prd_msg_queue_item,
> but adding one to struct opal_prd_msg. That is what the data format
> actually is.
>
> So we'd have something like:
>
> > struct opal_prd_msg {
> > 	struct opal_prd_msg_header hdr;
> > 	char msg[];
> > }
>
> and change things to use that instead?
>
> But that might be more trouble than it is worth,

Yup, it would cause more changes to the code elsewhere as well.

> alternatively we can just do:
>
> item->msg = *hdr;
> memcpy((char *)&item->msg + sizeof(*hdr), (char *)hdr + sizeof(*hdr),
>        msg_size - sizeof(*hdr));

Agree, this helps. I tried with a small change to the above code which also
works fine:

	item->msg = *hdr;
	hdr++;
	memcpy((char *)&item->msg + sizeof(*hdr), hdr, msg_size - sizeof(*hdr));

Will send out v2.

Thanks,
-Mahesh.
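The split copy settled on for v2 can be demonstrated as a standalone userspace sketch. The structures here are simplified stand-ins, not the real OPAL definitions; the point is that assigning the header by struct assignment and memcpy()ing only the variable-length tail keeps every write within a single named field, which is what CONFIG_FORTIFY_SOURCE's field-spanning check looks at:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for struct opal_prd_msg_header. */
struct msg_header {
	uint8_t  type;
	uint8_t  pad;
	uint16_t size;		/* total message size, header included */
};

/* Stand-in for struct opal_prd_msg_queue_item (list head omitted). */
struct queue_item {
	struct msg_header msg;	/* header of a variable-sized message */
};

/*
 * The caller allocates 'item' large enough for the whole message.  The
 * header is copied by struct assignment; only the tail goes through
 * memcpy(), starting at hdr + 1, mirroring the hdr++ form in the v2 patch.
 */
static void queue_msg(struct queue_item *item, const struct msg_header *hdr,
		      size_t msg_size)
{
	item->msg = *hdr;
	memcpy((char *)&item->msg + sizeof(*hdr), hdr + 1,
	       msg_size - sizeof(*hdr));
}
```

Because each write now targets either the whole `msg` field or bytes past it through an explicit char pointer, the fortified memcpy() no longer sees a "field-spanning write" of the 4-byte header.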
Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper
> Thanks for your patch, which is now commit 322dbe898f82fd8a > ("ARM: dma-mapping: split out arch_dma_mark_clean() helper") in > esmil/jh7100-dmapool. Well, something is wrong with that branch then, and this series still needs more work, and should eventually be merged through the dma-mapping tree.
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
Hi Andreas, On Thu, Jul 6, 2023 at 9:34 AM Andreas Henriksson wrote: > I think your initial review comment was spot on Fabio. There probably > needs to be a(n imx8mm) specific flag that says when this workaround > should be applied and gate the code in bclk function on that. > Atleast that's the only thing I can think of if my interpretation of the > problem for imx8mm is correct. Yes, deciding whether MCLK_EN should act as a clock gate based on the number of SAI registers seems fragile to me. I have sent a proposal as RFC. Please give it a try if you have a chance. I would like Shengjiu to confirm if imx8mn and imx93 should also behave like imx8mm in this aspect. Cheers
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
Hello Fabio,

Maybe I shouldn't comment as I'm already in deep water, but...

On Thu, Jul 06, 2023 at 08:32:57AM -0300, Fabio Estevam wrote:
> On Thu, Jul 6, 2023 at 8:19 AM Shengjiu Wang wrote:
>
> > No, this is the code in probe().
> > The code with the issue is in fsl_sai_set_bclk().
>
> Yes, I put it in the wrong place.
>
> > The clean way for fixing is to remove the code in fsl_sai_set_bclk()
> > and add "fsl,sai-mclk-direction-output;" property in dts for some
> > node.
>
> Yes, what about this?
>
> --- a/sound/soc/fsl/fsl_sai.c
> +++ b/sound/soc/fsl/fsl_sai.c
> @@ -507,7 +507,7 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq)
>  				   savediv / 2 - 1);
>  	}
>
> -	if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
> +	if (sai->soc_data->max_register >= FSL_SAI_MCTL && sai->mclk_direction_output) {
>  		/* SAI is in master mode at this point, so enable MCLK */
>  		regmap_update_bits(sai->regmap, FSL_SAI_MCTL,
>  				   FSL_SAI_MCTL_MCLK_EN, FSL_SAI_MCTL_MCLK_EN);

This is exactly the same as, and thus redundant to, the code already in the
probe function?!

If I understood it correctly, the problem Shengjiu was trying to address was
that apparently on i.MX8MM (and only imx8mm?!), even when using MCLK as
*input*, you still need to enable it because otherwise BCLK will not be
generated. (I personally don't know anything about this or the imx8mm
variant though.)

The problem could probably equally well be worked around by always setting
the fsl,sai-mclk-direction-output; devicetree property on imx8mm, even when
using MCLK as input. But I'm just guessing here to be honest. This is just
how I understood the purpose of the initial patch that was merged.

The current state affects more devices than imx8mm though, making MCLK in
input mode not possible. I think your initial review comment was spot on,
Fabio. There probably needs to be a(n imx8mm-)specific flag that says when
this workaround should be applied, and the code in the bclk function should
be gated on that.

At least that's the only thing I can think of, if my interpretation of the
problem for imx8mm is correct.

Regards,
Andreas Henriksson
Re: [PATCH v4 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
On 7/6/23 07:18, Arnd Bergmann wrote:
> On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote:
> > +
> > +config CRASH_DUMP
> > +	bool "kernel crash dumps"
> > +	depends on ARCH_SUPPORTS_CRASH_DUMP
> > +	select CRASH_CORE
> > +	select KEXEC
>
> Today's linux-next now runs into a warning on arm64 and presumably
> others, with the same problem as on arm earlier:
>
> WARNING: unmet direct dependencies detected for KEXEC
>   Depends on [n]: ARCH_SUPPORTS_KEXEC [=n]
>   Selected by [y]:
>   - CRASH_DUMP [=y] && ARCH_SUPPORTS_CRASH_DUMP [=y]
>
> I think the easiest way to make this reliable would be this fixup:
>
> diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> index d82a7ce59c051..e58ca6128d6ee 100644
> --- a/kernel/Kconfig.kexec
> +++ b/kernel/Kconfig.kexec
> @@ -91,6 +91,7 @@ config KEXEC_JUMP
>  config CRASH_DUMP
>  	bool "kernel crash dumps"
>  	depends on ARCH_SUPPORTS_CRASH_DUMP
> +	depends on ARCH_SUPPORTS_KEXEC
>  	select CRASH_CORE
>  	select KEXEC
>  	help
>
>     Arnd

Will do, thanks! I changed my testing to include allnoconfig and allyesconfig,
and that revealed some minor issues as well. Almost there!
eric
Re: [PATCH v4 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec
On Wed, Jul 5, 2023, at 16:19, Eric DeVolder wrote:
> +
> +config CRASH_DUMP
> +	bool "kernel crash dumps"
> +	depends on ARCH_SUPPORTS_CRASH_DUMP
> +	select CRASH_CORE
> +	select KEXEC

Today's linux-next now runs into a warning on arm64 and presumably
others, with the same problem as on arm earlier:

WARNING: unmet direct dependencies detected for KEXEC
  Depends on [n]: ARCH_SUPPORTS_KEXEC [=n]
  Selected by [y]:
  - CRASH_DUMP [=y] && ARCH_SUPPORTS_CRASH_DUMP [=y]

I think the easiest way to make this reliable would be this fixup:

diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
index d82a7ce59c051..e58ca6128d6ee 100644
--- a/kernel/Kconfig.kexec
+++ b/kernel/Kconfig.kexec
@@ -91,6 +91,7 @@ config KEXEC_JUMP
 config CRASH_DUMP
 	bool "kernel crash dumps"
 	depends on ARCH_SUPPORTS_CRASH_DUMP
+	depends on ARCH_SUPPORTS_KEXEC
 	select CRASH_CORE
 	select KEXEC
 	help

     Arnd
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
On Thu, Jul 6, 2023 at 7:08 PM Fabio Estevam wrote:
> Hi Andreas,
>
> On Thu, Jul 6, 2023 at 5:47 AM Andreas Henriksson wrote:
>
> > We've been working on an i.MX8MP where MCLK needs to be input and found
> > that this enables the MCLK as output despite not having set the
> > `fsl,sai-mclk-direction-output;` devicetree property in our DT.
> > Reverting the patch fixes the issues for us.
> >
> > I have to say that the code comment made me a bit confused, but once
> > I found the commit message I understood why this code existed.
> > If this is really i.MX8MM specific maybe mention that in the code
> > comment and please make the code actually only trigger on i.MX8MM.
> > It seems to me like these all fulfill the current criteria:
> > imx7ulp, imx8mq, imx8mm, imx8mp, imx8ulp, imx93
> >
> > Should I report this in bugzilla.kernel.org ?
>
> Should we do a fix like this?
>
> --- a/sound/soc/fsl/fsl_sai.c
> +++ b/sound/soc/fsl/fsl_sai.c
> @@ -1453,7 +1453,7 @@ static int fsl_sai_probe(struct platform_device *pdev)
>
>  	/* Select MCLK direction */
>  	if (sai->mclk_direction_output &&
> -	    sai->soc_data->max_register >= FSL_SAI_MCTL) {
> +	    sai->soc_data->max_register >= FSL_SAI_MCTL && sai->mclk_direction_output) {
>  		regmap_update_bits(sai->regmap, FSL_SAI_MCTL,
>  				   FSL_SAI_MCTL_MCLK_EN, FSL_SAI_MCTL_MCLK_EN);
>  	}

No, this is the code in probe().
The code with the issue is in fsl_sai_set_bclk().

The clean way for fixing is to remove the code in fsl_sai_set_bclk()
and add the "fsl,sai-mclk-direction-output;" property in dts for some
node.

best regards
wang shengjiu
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
On Thu, Jul 6, 2023 at 4:47 PM Andreas Henriksson wrote:
> Hello Shengjiu, Fabio,
>
> On Thu, May 19, 2022 at 10:23:06AM -0300, Fabio Estevam wrote:
> > Hi Shengjiu,
> >
> > On Thu, May 19, 2022 at 9:49 AM Shengjiu Wang wrote:
> >
> > > diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
> > > index fa950dde5310..dae16a14f177 100644
> > > --- a/sound/soc/fsl/fsl_sai.c
> > > +++ b/sound/soc/fsl/fsl_sai.c
> > > @@ -437,6 +437,12 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq)
> > >  			   FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
> > >  			   savediv / 2 - 1);
> > >
> > > +	if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
> >
> > Isn't it a bit fragile to take this decision based on the number of
> > SAI registers in the SoC?
> >
> > What about adding a specific field in soc_data for such a purpose?
>
> We've been working on an i.MX8MP where MCLK needs to be input and found
> that this enables the MCLK as output despite not having set the
> `fsl,sai-mclk-direction-output;` devicetree property in our DT.
> Reverting the patch fixes the issues for us.

Good catch. Seems there is an issue here.

best regards
wang shengjiu

> I have to say that the code comment made me a bit confused, but once
> I found the commit message I understood why this code existed.
> If this is really i.MX8MM specific maybe mention that in the code
> comment and please make the code actually only trigger on i.MX8MM.
> It seems to me like these all fulfill the current criteria:
> imx7ulp, imx8mq, imx8mm, imx8mp, imx8ulp, imx93
>
> Should I report this in bugzilla.kernel.org ?
>
> Regards,
> Andreas Henriksson
Re: [PATCH v2 08/92] fs: new helper: simple_rename_timestamp
On Wed 05-07-23 14:58:11, Jeff Layton wrote:
> A rename potentially involves updating 4 different inode timestamps. Add
> a function that handles the details sanely, and convert the libfs.c
> callers to use it.
>
> Signed-off-by: Jeff Layton

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara

								Honza

> ---
>  fs/libfs.c         | 36 +++-
>  include/linux/fs.h |  2 ++
>  2 files changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index a7e56baf8bbd..9ee79668c909 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry *dentry)
>  }
>  EXPORT_SYMBOL(simple_rmdir);
>
> +/**
> + * simple_rename_timestamp - update the various inode timestamps for rename
> + * @old_dir: old parent directory
> + * @old_dentry: dentry that is being renamed
> + * @new_dir: new parent directory
> + * @new_dentry: target for rename
> + *
> + * POSIX mandates that the old and new parent directories have their ctime and
> + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), have
> + * their ctime updated.
> + */
> +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
> +			     struct inode *new_dir, struct dentry *new_dentry)
> +{
> +	struct inode *newino = d_inode(new_dentry);
> +
> +	old_dir->i_mtime = inode_set_ctime_current(old_dir);
> +	if (new_dir != old_dir)
> +		new_dir->i_mtime = inode_set_ctime_current(new_dir);
> +	inode_set_ctime_current(d_inode(old_dentry));
> +	if (newino)
> +		inode_set_ctime_current(newino);
> +}
> +EXPORT_SYMBOL_GPL(simple_rename_timestamp);
> +
>  int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
>  			   struct inode *new_dir, struct dentry *new_dentry)
>  {
> @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
>  			inc_nlink(old_dir);
>  		}
>  	}
> -	old_dir->i_ctime = old_dir->i_mtime =
> -		new_dir->i_ctime = new_dir->i_mtime =
> -		d_inode(old_dentry)->i_ctime =
> -		d_inode(new_dentry)->i_ctime = current_time(old_dir);
> -
> +	simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(simple_rename_exchange);
> @@ -720,7 +741,6 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
>  		  struct dentry *old_dentry, struct inode *new_dir,
>  		  struct dentry *new_dentry, unsigned int flags)
>  {
> -	struct inode *inode = d_inode(old_dentry);
>  	int they_are_dirs = d_is_dir(old_dentry);
>
>  	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
> @@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
>  			inc_nlink(new_dir);
>  		}
>
> -	old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime =
> -		new_dir->i_mtime = inode->i_ctime = current_time(old_dir);
> -
> +	simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
>  	return 0;
>  }
>  EXPORT_SYMBOL(simple_rename);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bdfbd11a5811..14e38bd900f1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2979,6 +2979,8 @@ extern int simple_open(struct inode *inode, struct file *file);
>  extern int simple_link(struct dentry *, struct inode *, struct dentry *);
>  extern int simple_unlink(struct inode *, struct dentry *);
>  extern int simple_rmdir(struct inode *, struct dentry *);
> +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
> +			     struct inode *new_dir, struct dentry *new_dentry);
>  extern int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
>  				  struct inode *new_dir, struct dentry *new_dentry);
>  extern int simple_rename(struct mnt_idmap *, struct inode *,
> --
> 2.41.0
>

-- 
Jan Kara
SUSE Labs, CR
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
Hello Shengjiu, Fabio,

On Thu, May 19, 2022 at 10:23:06AM -0300, Fabio Estevam wrote:
> Hi Shengjiu,
>
> On Thu, May 19, 2022 at 9:49 AM Shengjiu Wang wrote:
>
> > diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
> > index fa950dde5310..dae16a14f177 100644
> > --- a/sound/soc/fsl/fsl_sai.c
> > +++ b/sound/soc/fsl/fsl_sai.c
> > @@ -437,6 +437,12 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq)
> >  			   FSL_SAI_CR2_DIV_MASK | FSL_SAI_CR2_BYP,
> >  			   savediv / 2 - 1);
> >
> > +	if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
>
> Isn't it a bit fragile to take this decision based on the number of
> SAI registers in the SoC?
>
> What about adding a specific field in soc_data for such a purpose?

We've been working on an i.MX8MP where MCLK needs to be input and found
that this enables the MCLK as output despite not having set the
`fsl,sai-mclk-direction-output;` devicetree property in our DT.
Reverting the patch fixes the issues for us.

I have to say that the code comment made me a bit confused, but once
I found the commit message I understood why this code existed.
If this is really i.MX8MM specific maybe mention that in the code
comment and please make the code actually only trigger on i.MX8MM.
It seems to me like these all fulfill the current criteria:
imx7ulp, imx8mq, imx8mm, imx8mp, imx8ulp, imx93

Should I report this in bugzilla.kernel.org?

Regards,
Andreas Henriksson
Re: [next-20230705] kernel BUG mm/memcontrol.c:3715! (ltp/madvise06)
On 2023-07-06 11:41:38+0530, Sachin Sant wrote:
> While running LTP tests (madvise06) on IBM Power9 LPAR booted with
> 6.4.0-next-20230705 following crash is seen
>
> Injecting memory failure for pfn 0x3f79 at process virtual address 0x7fff9b74
> Memory failure: 0x3f79: recovery action for clean LRU page: Recovered
> madvise06 (133636): drop_caches: 3
> [ cut here ]
> kernel BUG at mm/memcontrol.c:3715!
> Oops: Exception in kernel mode, sig: 5 [#1]
> LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
> Modules linked in: brd overlay exfat vfat fat xfs loop sctp ip6_udp_tunnel
> udp_tunnel dm_mod nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat
> nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding ip_set tls rfkill
> nf_tables libcrc32c nfnetlink sunrpc pseries_rng vmx_crypto ext4 mbcache jbd2
> sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi scsi_transport_srp ibmveth
> fuse [last unloaded: init_module(O)]
> CPU: 10 PID: 133636 Comm: madvise06 Tainted: G O 6.4.0-next-20230705 #1
> Hardware name: IBM,8375-42A POWER9 (raw) 0x4e0202 0xf05 of:IBM,FW950.80
> (VL950_131) hv:phyp pSeries
> NIP: c054ea88 LR: c028b2a8 CTR: c054e8d0
> REGS: c0029dd7b890 TRAP: 0700 Tainted: G O (6.4.0-next-20230705)
> MSR: 80029033 CR: 28008288 XER:
> CFAR: c054e904 IRQMASK: 0
> GPR00: c028b2a8 c0029dd7bb30 c1431600 c002bc978000
> GPR04: c2b3b288 00010192 0001
> GPR08: c000f9abb180 0002 c002bc978580
> GPR12: c054e8d0 c0001ec53f00
> GPR16:
> GPR20: c0001b2e6578 00400cc0 7fff f000
> GPR24: c0029dd7bd30 c0029dd7bd58 c0001b2e6568
> GPR28: c0029dd7bde0 0001 0001 c0001b2e6540
> NIP [c054ea88] mem_cgroup_read_u64+0x1b8/0x1d0
> LR [c028b2a8] cgroup_seqfile_show+0xb8/0x160
> Call Trace:
> [c0029dd7bb50] [c028b2a8] cgroup_seqfile_show+0xb8/0x160
> [c0029dd7bbc0] [c0673ba4] kernfs_seq_show+0x44/0x60
> [c0029dd7bbe0] [c05c4238] seq_read_iter+0x238/0x620
> [c0029dd7bcb0] [c0675064] kernfs_fop_read_iter+0x1d4/0x2c0
> [c0029dd7bd00] [c057fbac] vfs_read+0x26c/0x350
> [c0029dd7bdc0] [c058077c] ksys_read+0x7c/0x140
> [c0029dd7be10] [c0036900] system_call_exception+0x140/0x350
> [c0029dd7be50] [c000d6a0] system_call_common+0x160/0x2e4
> --- interrupt: c00 at 0x7fff9eb41484
> NIP: 7fff9eb41484 LR: 10008540 CTR:
> REGS: c0029dd7be80 TRAP: 0c00 Tainted: G O (6.4.0-next-20230705)
> MSR: 8280f033 CR: 28002282 XER:
> IRQMASK: 0
> GPR00: 0003 7fffc33de7d0 7fff9ec27300 0013
> GPR04: 7fffc33e0aa0 1fff 0013
> GPR08: 7fffc33e0aa0
> GPR12: 7fff9ecca3a0
> GPR16: 10035520 10035b90 100347a8
> GPR20: 1002fb68 10063900 2000 1002fb68
> GPR24: 004c 1002fa78 7fffc33e0aa0
> GPR28: 0013 1fff 1fff
> NIP [7fff9eb41484] 0x7fff9eb41484
> LR [10008540] 0x10008540
> --- interrupt: c00
> Code: 7fa34800 409effc4 7c0802a6 3861 f8010030 4bfffdfd e8010030 786383e4
> 7c0803a6 4b6c 7c0802a6 f8010030 <0fe0> 7c0802a6 f8010030 0fe0
> ---[ end trace ]---
> pstore: backend (nvram) writing error (-1)
>
> Kernel panic - not syncing: Fatal exception
> Rebooting in 10 seconds..
>
> Git bisect points to following patch:
>
> commit 29bf1eb7d2abbdfc24c4ef7acf7a51b72dc43d2b
> memcg: drop kmem.limit_in_bytes
>
> Does the testcase madvise06 need an update?
>
> 90	tst_res(TINFO, "\tCached: %ld Kb",
> 91		SAFE_READ_MEMINFO("Cached:") - init_cached);
> 92
> 93	print_cgmem("memory.current");
> 94	print_cgmem("memory.swap.current");
> 95	print_cgmem("memory.kmem.usage_in_bytes");   <<== this line
> 96 }
>
> If I comment line 95 from the testcase, it completes successfully.

The handling for _KMEM was removed from mem_cgroup_read_u64() incorrectly.
It is used by the still existing kmem.*usage*_in_bytes in addition to the
now removed kmem.*limit*_in_bytes. (And kmem.max_usage_in_bytes,
kmem.failcnt.)

The testcase seems to be fine; it actually did its job.
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 06.07.23 14:32, Aneesh Kumar K V wrote:
> On 7/6/23 4:44 PM, David Hildenbrand wrote:
>> On 06.07.23 11:36, Aneesh Kumar K V wrote:
>>> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>>>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>>>> With memmap on memory, some architecture needs more details w.r.t altmap
>>>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>>>
>>>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>>>
>>>> IOW, why can't ppc64 simply allocate the vmemmap from the start of the
>>>> memblock (-> base_pfn) and use the stored number of vmemmap pages to
>>>> calculate the end_pfn?
>>>>
>>>> To rephrase: if the vmemmap is not at the beginning and doesn't cover
>>>> full pageblocks, memory onlining/offlining would be broken.
>>>>
>>>> [...]
>>>
>>> With ppc64 and 64K pagesize and different memory block sizes, we can end
>>> up allocating vmemmap backing memory from outside altmap because a single
>>> page vmemmap can cover 1024 pages (64 * 1024 / sizeof(struct page)), and
>>> that can point to pages outside the dev_pagemap range. So on free we check
>>
>> So you end up with a mixture of altmap and ordinarily-allocated vmemmap
>> pages? That sounds wrong (and is counter-intuitive to the feature in
>> general, where we *don't* want to allocate the vmemmap from outside the
>> altmap).
>>
>> (64 * 1024) / sizeof(struct page) -> 1024 pages
>>
>> 1024 pages * 64k = 64 MiB.
>>
>> What's the memory block size on these systems? If it's >= 64 MiB the
>> vmemmap of a single memory block fits into a single page and we should be
>> fine.
>>
>> Smells like you want to disable the feature on a 64k system.
>
> But that part of vmemmap_free is common for both dax, dax kmem and the new
> memmap on memory feature. ie, ppc64 vmemmap_free has checks which require a
> full altmap structure with all the details in. So for memmap on memory to
> work on ppc64 we do require a similar altmap struct. Hence the idea of
> adding vmemmap_altmap to struct memory_block.

I'd suggest making sure that for the memmap_on_memory case you really
*always* allocate from the altmap (that's what the feature is about after
all), and otherwise block the feature (i.e., arch_mhp_supports_... should
reject it).

Then, you can reconstruct the altmap layout trivially:

base_pfn: start of the range to unplug
end_pfn: base_pfn + nr_vmemmap_pages

and pass that to the removal code, which will do the right thing, no?

Sure, remembering the altmap might be a potential cleanup (eventually?), but
the basic reasoning why this is required as patch #1 IMHO is wrong: if you
say you support memmap_on_memory for a configuration, then you should also
properly support it (allocate from the hotplugged memory), not silently fall
back to something else.

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 7/6/23 4:44 PM, David Hildenbrand wrote:
> On 06.07.23 11:36, Aneesh Kumar K V wrote:
>> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>>> With memmap on memory, some architecture needs more details w.r.t altmap
>>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>>
>>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>>
>>> IOW, why can't ppc64 simply allocate the vmemmap from the start of the
>>> memblock (-> base_pfn) and use the stored number of vmemmap pages to
>>> calculate the end_pfn?
>>>
>>> To rephrase: if the vmemmap is not at the beginning and doesn't cover full
>>> pageblocks, memory onlining/offlining would be broken.
>>>
>>> [...]
>>
>> With ppc64 and 64K pagesize and different memory block sizes, we can end up
>> allocating vmemmap backing memory from outside altmap because a single page
>> vmemmap can cover 1024 pages (64 * 1024 / sizeof(struct page)), and that
>> can point to pages outside the dev_pagemap range. So on free we check
>
> So you end up with a mixture of altmap and ordinarily-allocated vmemmap
> pages? That sounds wrong (and is counter-intuitive to the feature in
> general, where we *don't* want to allocate the vmemmap from outside the
> altmap).
>
> (64 * 1024) / sizeof(struct page) -> 1024 pages
>
> 1024 pages * 64k = 64 MiB.
>
> What's the memory block size on these systems? If it's >= 64 MiB the
> vmemmap of a single memory block fits into a single page and we should be
> fine.
>
> Smells like you want to disable the feature on a 64k system.

But that part of vmemmap_free is common for both dax, dax kmem and the new
memmap on memory feature. ie, ppc64 vmemmap_free has checks which require a
full altmap structure with all the details in. So for memmap on memory to
work on ppc64 we do require a similar altmap struct. Hence the idea of
adding vmemmap_altmap to struct memory_block.

>> vmemmap_free() {
>> ...
>> 	if (altmap) {
>> 		alt_start = altmap->base_pfn;
>> 		alt_end = altmap->base_pfn + altmap->reserve +
>> 			  altmap->free + altmap->alloc + altmap->align;
>> 	}
>> ...
>> 	if (base_pfn >= alt_start && base_pfn < alt_end) {
>> 		vmem_altmap_free(altmap, nr_pages);
>>
>> to see whether we did use altmap for the vmemmap allocation.
>> -aneesh
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
On Thu, Jul 6, 2023 at 9:50 AM Geert Uytterhoeven wrote:
> On Thu, Jul 6, 2023 at 8:14 AM Benjamin Gray wrote:
> > On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote:
> > > Le 05/07/2023 à 02:30, Benjamin Gray a écrit :
> > > > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on
> > > > struct regmap_config, which is guarded behind CONFIG_REGMAP.
> > > >
> > > > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP")
> > > > exposed this by disabling the default pick unless KUNIT_ALL_TESTS is
> > > > set, causing the ppc64be allnoconfig build to fail.
> > > >
> > > > Signed-off-by: Benjamin Gray
> > > > ---
> > > >  drivers/rtc/Kconfig | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
> > > > index ffca9a8bb878..7455ebd189fe 100644
> > > > --- a/drivers/rtc/Kconfig
> > > > +++ b/drivers/rtc/Kconfig
> > > > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722
> > > >
> > > >  config RTC_DRV_DS1307
> > > >  	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, ISL12057"
> > > > +	select REGMAP
> > >
> > > As far as I can see, REGMAP defaults to Y when REGMAP_I2C is selected.
> > > Can you explain in more detail why you have to select it explicitly?
> > > If there is something wrong with the logic, then the logic should be
> > > fixed instead of just adding a selection of REGMAP for that particular
> > > RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or RTC_DRV_ABEOZ9
> > > might have the exact same problem.
> >
> > Right, yeah, I don't want to assert this patch is the correct solution,
> > sending it was more to offer a fix and allow discussion if it should be
> > resolved some other way (so thanks for replying, I appreciate it).
> >
> > In terms of why I made this patch, the way I see it, if a config option
> > requires another config option be set, then "selects" is the natural
> > way of phrasing this dependency. "default" on the REGMAP side seems
> > weird. If it's a default, does that mean it can be overridden? But
> > RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The
> > build breaks without it.
> >
> > But maybe KConfig works differently to my assumptions. Maybe the
> > referenced patch that causes the build failure is actually incorrect
> > (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not
> > supposed to depend on REGMAP like KUnit does?
>
> Thanks for CCing me!
>
> Looks like I made a really silly mistake here: my patch not only allows
> the user to enable REGMAP manually (for the test), but also to disable
> it manually, regardless if there are users or not :-(
>
> I think the proper fix is to replace the "default y if ..." by
> "select REGMAP" for all users.

I have sent a patch to do so, followed by a few related fixes:
https://lore.kernel.org/r/525c37a568b10623ffb2d108850afd7e37f9350e.1688643442.git.ge...@linux-m68k.org

Thanks!

Gr{oetje,eeting}s,

						Geert

-- 
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
						-- Linus Torvalds
Re: [PATCH v2 15/92] spufs: convert to ctime accessor functions
On Wed, Jul 5, 2023, at 21:00, Jeff Layton wrote:
> In later patches, we're going to change how the inode's ctime field is
> used. Switch to using accessor functions instead of raw accesses of
> inode->i_ctime.
>
> Acked-by: Jeremy Kerr
> Reviewed-by: Jan Kara
> Signed-off-by: Jeff Layton

Acked-by: Arnd Bergmann
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
On Thu, Jul 6, 2023 at 8:19 AM Shengjiu Wang wrote:
> No, this is the code in probe().
> The code with the issue is in fsl_sai_set_bclk().

Yes, I put it in the wrong place.

> The clean way for fixing is to remove the code in fsl_sai_set_bclk()
> and add "fsl,sai-mclk-direction-output;" property in dts for some
> node.

Yes, what about this?

--- a/sound/soc/fsl/fsl_sai.c
+++ b/sound/soc/fsl/fsl_sai.c
@@ -507,7 +507,7 @@ static int fsl_sai_set_bclk(struct snd_soc_dai *dai, bool tx, u32 freq)
 				   savediv / 2 - 1);
 	}

-	if (sai->soc_data->max_register >= FSL_SAI_MCTL) {
+	if (sai->soc_data->max_register >= FSL_SAI_MCTL && sai->mclk_direction_output) {
 		/* SAI is in master mode at this point, so enable MCLK */
 		regmap_update_bits(sai->regmap, FSL_SAI_MCTL,
 				   FSL_SAI_MCTL_MCLK_EN, FSL_SAI_MCTL_MCLK_EN);
Re: [PATCH v2 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag
On 06.07.23 12:04, Aneesh Kumar K V wrote:
> On 7/6/23 2:54 PM, David Hildenbrand wrote:
>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>> Instead of checking for memmap on memory feature enablement within the
>>> functions checking for alignment, use the kernel parameter to control the
>>> memory hotplug flags. The generic kernel now enables memmap on memory
>>> feature if the hotplug flag request for the same.
>>>
>>> The ACPI code now can pass the flag unconditionally because the kernel
>>> will fallback to not using the feature if the alignment rules are not met.
>>>
>>> Signed-off-by: Aneesh Kumar K.V
>>> ---
>>>  drivers/acpi/acpi_memhotplug.c |  3 +--
>>>  include/linux/memory_hotplug.h | 14 ++
>>>  mm/memory_hotplug.c            | 35 +++---
>>>  3 files changed, 26 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
>>> index 24f662d8bd39..4d0096fc4cc2 100644
>>> --- a/drivers/acpi/acpi_memhotplug.c
>>> +++ b/drivers/acpi/acpi_memhotplug.c
>>> @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>>>  		if (!info->length)
>>>  			continue;
>>> -		if (mhp_supports_memmap_on_memory(info->length))
>>> -			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>>> +		mhp_flags |= get_memmap_on_memory_flags();
>>>  		result = __add_memory(mgid, info->start_addr, info->length,
>>>  				      mhp_flags);
>>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>>> index a769f44b8368..af7017122506 100644
>>> --- a/include/linux/memory_hotplug.h
>>> +++ b/include/linux/memory_hotplug.h
>>> @@ -358,4 +358,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size);
>>>  bool __mhp_supports_memmap_on_memory(unsigned long size);
>>>  #endif /* CONFIG_MEMORY_HOTPLUG */
>>> +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
>>> +extern bool memmap_on_memory;
>>> +static inline unsigned long get_memmap_on_memory_flags(void)
>>> +{
>>> +	if (memmap_on_memory)
>>> +		return MHP_MEMMAP_ON_MEMORY;
>>> +	return 0;
>>> +}
>>> +#else
>>> +static inline unsigned long get_memmap_on_memory_flags(void)
>>> +{
>>> +	return 0;
>>> +}
>>> +#endif
>>
>> That's kind-of ugly TBH. Why do we need this change?
>
> I was trying to avoid rest of the kernel doing
>
> 	if (mhp_supports_memmap_on_memory(info->length))
> 		mhp_flags |= MHP_MEMMAP_ON_MEMORY;

It would look much cleaner if you would simply have:

	mhp_flags |= MHP_MEMMAP_ON_MEMORY;
	result = __add_memory(mgid, info->start_addr, info->length, mhp_flags);

And modify the semantics of MHP_MEMMAP_ON_MEMORY to mean "allocate the
memmap from hotplugged memory if supported and enabled globally".

Then, we can simply ignore the flag in __add_memory() if either the global
toggle is off or if it's not supported for the range / by the arch.

Maybe, in the future we want more fine-grained control (as asked by
dax/kmem) and maybe not have the global toggle. But for now, that should be
good enough I guess.

-- 
Cheers,

David / dhildenb
Re: [RFC 0/1] sched/fair: Consider asymmetric scheduler groups in load balancer
On 04/07/2023 11:11, Tobias Huschle wrote:
> On 2023-05-16 18:35, Dietmar Eggemann wrote:
>> On 15/05/2023 13:46, Tobias Huschle wrote:
>>> The current load balancer implementation implies that scheduler groups,
>>> within the same scheduler domain, all host the same number of CPUs.
>>>
>>> This appears to be valid for non-s390 architectures. Nevertheless, s390
>>> can actually have scheduler groups of unequal size.
>>
>> Arm (classical) big.Little had this for years before we switched to flat
>> scheduling (only MC sched domain) over CPU capacity boundaries for Arm
>> DynamIQ.
>>
>> Arm64 Juno platform in mainline:
>>
>> root@juno:~# cat /sys/devices/system/cpu/cpu*/topology/cluster_cpus_list
>> 0,3-5
>> 1-2
>> 1-2
>> 0,3-5
>> 0,3-5
>> 0,3-5
>>
>> root@juno:~# cat /proc/schedstat | grep ^domain | awk '{print $1, $2}'
>>
>> domain0 39 <--
>> domain1 3f
>> domain0 06 <--
>> domain1 3f
>> domain0 06
>> domain1 3f
>> domain0 39
>> domain1 3f
>> domain0 39
>> domain1 3f
>> domain0 39
>> domain1 3f
>>
>> root@juno:~# cat /sys/kernel/debug/sched/domains/cpu0/domain*/name
>> MC
>> DIE
>>
>> But we don't have SMT on the mobile processors.
>>
>> It looks like you are only interested to get group_weight dependency
>> into this 'prefer_sibling' condition of find_busiest_group()?
>
> Sorry, looks like your reply hit some bad filter of my mail program.
> Let me answer, although it's a bit late.
>
> Yes, I would like to get the group_weight into the prefer_sibling path.
> Unfortunately, we cannot go for a flat hierarchy, as the s390 hardware
> allows CPUs to be pretty far apart (cache-wise), which means the load
> balancer should avoid moving tasks back and forth between those CPUs if
> possible.
>
> We can't remove SD_PREFER_SIBLING either, as this would cause the load
> balancer to aim for having the same number of idle CPUs in all groups,
> which is a problem as well in asymmetric groups, for example:
>
> With SD_PREFER_SIBLING, aiming for same number of non-idle CPUs:
> 00 01 02 03 04 05 06 07 08 09 10 11  ||  12 13 14 15
>  x     x     x     x                      x  x  x  x
>
> Without SD_PREFER_SIBLING, aiming for the same number of idle CPUs:
> 00 01 02 03 04 05 06 07 08 09 10 11  ||  12 13 14 15
>  x  x  x  x  x  x  x  x
>
> Hence the idea to add the group_weight to the prefer_sibling path.
>
> I was wondering if this would be the right place to address this issue
> or if I should go down another route.

Yes, it's the right place to fix it for you. IMHO, there is still some
discussion needed about the correct condition and changes in
calculate_imbalance() for your case, if I read the comments on this thread
correctly.

Arm64 big.Little wouldn't be affected since we explicitly remove
SD_PREFER_SIBLING on MC for our legacy MC,DIE setups to avoid spreading
tasks across DIE sched groups holding CPUs with different capacities.

[...]
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 06.07.23 11:36, Aneesh Kumar K V wrote:
> On 7/6/23 2:48 PM, David Hildenbrand wrote:
>> On 06.07.23 10:50, Aneesh Kumar K.V wrote:
>>> With memmap on memory, some architecture needs more details w.r.t altmap
>>> such as base_pfn, end_pfn, etc to unmap vmemmap memory.
>>
>> Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't?
>>
>> IOW, why can't ppc64 simply allocate the vmemmap from the start of the
>> memblock (-> base_pfn) and use the stored number of vmemmap pages to
>> calculate the end_pfn?
>>
>> To rephrase: if the vmemmap is not at the beginning and doesn't cover full
>> pageblocks, memory onlining/offlining would be broken.
>>
>> [...]
>
> With ppc64 and 64K pagesize and different memory block sizes, we can end up
> allocating vmemmap backing memory from outside altmap because a single page
> vmemmap can cover 1024 pages (64 * 1024 / sizeof(struct page)), and that
> can point to pages outside the dev_pagemap range. So on free we check

So you end up with a mixture of altmap and ordinarily-allocated vmemmap
pages? That sounds wrong (and is counter-intuitive to the feature in
general, where we *don't* want to allocate the vmemmap from outside the
altmap).

(64 * 1024) / sizeof(struct page) -> 1024 pages

1024 pages * 64k = 64 MiB.

What's the memory block size on these systems? If it's >= 64 MiB the vmemmap
of a single memory block fits into a single page and we should be fine.

Smells like you want to disable the feature on a 64k system.

> vmemmap_free() {
> ...
> 	if (altmap) {
> 		alt_start = altmap->base_pfn;
> 		alt_end = altmap->base_pfn + altmap->reserve +
> 			  altmap->free + altmap->alloc + altmap->align;
> 	}
> ...
> 	if (base_pfn >= alt_start && base_pfn < alt_end) {
> 		vmem_altmap_free(altmap, nr_pages);
>
> to see whether we did use altmap for the vmemmap allocation.

>>> +/**
>>> + * struct vmem_altmap - pre-allocated storage for vmemmap_populate
>>> + * @base_pfn: base of the entire dev_pagemap mapping
>>> + * @reserve: pages mapped, but reserved for driver use (relative to @base)
>>> + * @free: free pages set aside in the mapping for memmap storage
>>> + * @align: pages reserved to meet allocation alignments
>>> + * @alloc: track pages consumed, private to vmemmap_populate()
>>> + */
>>> +struct vmem_altmap {
>>> +	unsigned long base_pfn;
>>> +	const unsigned long end_pfn;
>>> +	const unsigned long reserve;
>>> +	unsigned long free;
>>> +	unsigned long align;
>>> +	unsigned long alloc;
>>> +};
>>
>> Instead of embedding that, what about conditionally allocating it and
>> storing a pointer to it in "struct memory_block"?
>>
>> In the general case as of today, we don't have an altmap.
>
> Sure, but with the memmap on memory option it is essentially adding that,
> right?

At least on x86_64 and aarch64 only for 128 MiB DIMMs (and especially, not
memory added by hv-balloon, virtio-mem, xen-balloon). So in the general case
it's not that frequently used. Maybe on ppc64 once wired up.

> Is the concern related to the increase in the size of struct memory_block?

Partially. It looks cleaner to have !mem->altmap if there is no altmap.

-- 
Cheers,

David / dhildenb
Re: [PATCH] ASoC: fsl_sai: Enable MCTL_MCLK_EN bit for master mode
Hi Andreas, On Thu, Jul 6, 2023 at 5:47 AM Andreas Henriksson wrote: > We've been working on an i.MX8MP where MCLK needs to be input and found > that this enables the MCLK as output despite not having set the > `fsl,sai-mclk-direction-output;` devicetree property in our DT. > Reverting the patch fixes the issues for us. > > I have to say that the code comment made me a bit confused, but once > I found the commit message I understood why this code existed. > If this is really i.MX8MM specific maybe mention that in the code > comment and please make the code actually only trigger on i.MX8MM. > It seems to me like these all fulfill the current criteria: > imx7ulp, imx8mq, imx8mm, imx8mp, imx8ulp, imx93 > > Should I report this in bugzilla.kernel.org ? Should we do a fix like this? --- a/sound/soc/fsl/fsl_sai.c +++ b/sound/soc/fsl/fsl_sai.c @@ -1453,7 +1453,8 @@ static int fsl_sai_probe(struct platform_device *pdev) /* Select MCLK direction */ - if (sai->soc_data->max_register >= FSL_SAI_MCTL) { + if (sai->mclk_direction_output && + sai->soc_data->max_register >= FSL_SAI_MCTL) { regmap_update_bits(sai->regmap, FSL_SAI_MCTL, FSL_SAI_MCTL_MCLK_EN, FSL_SAI_MCTL_MCLK_EN); }
Re: [PATCH v2 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag
On 7/6/23 2:54 PM, David Hildenbrand wrote: > On 06.07.23 10:50, Aneesh Kumar K.V wrote: >> Instead of checking for memmap on memory feature enablement within the >> functions checking for alignment, use the kernel parameter to control the >> memory hotplug flags. The generic kernel now enables memmap on memory >> feature if the hotplug flag requests the same. >> >> The ACPI code now can pass the flag unconditionally because the kernel will >> fall back to not using the feature if the alignment rules are not met. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> drivers/acpi/acpi_memhotplug.c | 3 +-- >> include/linux/memory_hotplug.h | 14 ++ >> mm/memory_hotplug.c | 35 +++--- >> 3 files changed, 26 insertions(+), 26 deletions(-) >> >> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c >> index 24f662d8bd39..4d0096fc4cc2 100644 >> --- a/drivers/acpi/acpi_memhotplug.c >> +++ b/drivers/acpi/acpi_memhotplug.c >> @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct >> acpi_memory_device *mem_device) >> if (!info->length) >> continue; >> - if (mhp_supports_memmap_on_memory(info->length)) >> - mhp_flags |= MHP_MEMMAP_ON_MEMORY; >> + mhp_flags |= get_memmap_on_memory_flags(); >> result = __add_memory(mgid, info->start_addr, info->length, >> mhp_flags); >> diff --git a/include/linux/memory_hotplug.h >> b/include/linux/memory_hotplug.h >> index a769f44b8368..af7017122506 100644 >> --- a/include/linux/memory_hotplug.h >> +++ b/include/linux/memory_hotplug.h >> @@ -358,4 +358,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size); >> bool __mhp_supports_memmap_on_memory(unsigned long size); >> #endif /* CONFIG_MEMORY_HOTPLUG */ >> +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY >> +extern bool memmap_on_memory; >> +static inline unsigned long get_memmap_on_memory_flags(void) >> +{ >> + if (memmap_on_memory) >> + return MHP_MEMMAP_ON_MEMORY; >> + return 0; >> +} >> +#else >> +static inline unsigned long get_memmap_on_memory_flags(void) >> +{ >> + return 0; >> +} >> +#endif > > That's kind-of ugly TBH. > > Why do we need this change? I was trying to avoid the rest of the kernel doing if (mhp_supports_memmap_on_memory(info->length)) mhp_flags |= MHP_MEMMAP_ON_MEMORY; I was also following a pattern similar to static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void) { return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT; } The overall goal of the patch is to handle the fallback of not using altmap/memmap on memory inside add_memory_resource, so the callsites only express the desire to use memmap on memory if possible/enabled. -aneesh
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 7/6/23 2:48 PM, David Hildenbrand wrote: > On 06.07.23 10:50, Aneesh Kumar K.V wrote: >> With memmap on memory, some architecture needs more details w.r.t altmap >> such as base_pfn, end_pfn, etc to unmap vmemmap memory. > > Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't? > > IOW, why can't ppc64 simply allocate the vmemmap from the start of the > memblock (-> base_pfn) and use the stored number of vmemmap pages to > calculate the end_pfn? > > To rephrase: if the vmemmap is not at the beginning and doesn't cover full > pageblocks, memory onlining/offlining would be broken. > > [...] With ppc64 and 64K pagesize and different memory block sizes, we can end up allocating vmemmap backing memory from outside altmap because a single page vmemmap can cover 1024 pages (64 * 1024 / sizeof(struct page)), and that can point to pages outside the dev_pagemap range. So on free we check vmemmap_free() { ... if (altmap) { alt_start = altmap->base_pfn; alt_end = altmap->base_pfn + altmap->reserve + altmap->free + altmap->alloc + altmap->align; } ... if (base_pfn >= alt_start && base_pfn < alt_end) { vmem_altmap_free(altmap, nr_pages); to see whether we did use altmap for the vmemmap allocation. > >> +/** >> + * struct vmem_altmap - pre-allocated storage for vmemmap_populate >> + * @base_pfn: base of the entire dev_pagemap mapping >> + * @reserve: pages mapped, but reserved for driver use (relative to @base) >> + * @free: free pages set aside in the mapping for memmap storage >> + * @align: pages reserved to meet allocation alignments >> + * @alloc: track pages consumed, private to vmemmap_populate() >> + */ >> +struct vmem_altmap { >> + unsigned long base_pfn; >> + const unsigned long end_pfn; >> + const unsigned long reserve; >> + unsigned long free; >> + unsigned long align; >> + unsigned long alloc; >> +}; > > Instead of embedding that, what about conditionally allocating it and store a > pointer to it in the "struct memory_block"?
> > In the general case as of today, we don't have an altmap. > Sure but with memmap on memory option it is essentially adding that right? Is the concern related to the increase in the size of struct memory_block ? >> + >> struct memory_block { >> unsigned long start_section_nr; >> unsigned long state; /* serialized by the dev->lock */ >> @@ -77,11 +94,7 @@ struct memory_block { >> */ >> struct zone *zone; >> struct device dev; >> - /* >> - * Number of vmemmap pages. These pages >> - * lay at the beginning of the memory block. >> - */ >> - unsigned long nr_vmemmap_pages; >> + struct vmem_altmap altmap; >> struct memory_group *group; /* group (if any) for this block */ >> struct list_head group_next; /* next block inside memory group */ >> #if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_MEMORY_HOTPLUG) >> @@ -147,7 +160,7 @@ static inline int hotplug_memory_notifier(notifier_fn_t >> fn, int pri) >> extern int register_memory_notifier(struct notifier_block *nb); >> extern void unregister_memory_notifier(struct notifier_block *nb); >> int create_memory_block_devices(unsigned long start, unsigned long size, > > [...] > >> static int check_cpu_on_node(int nid) >> @@ -2036,9 +2042,8 @@ EXPORT_SYMBOL(try_offline_node); >> static int __ref try_remove_memory(u64 start, u64 size) >> { >> - struct vmem_altmap mhp_altmap = {}; >> + int ret; >> struct vmem_altmap *altmap = NULL; >> - unsigned long nr_vmemmap_pages; >> int rc = 0, nid = NUMA_NO_NODE; >> BUG_ON(check_hotplug_memory_range(start, size)); >> @@ -2060,24 +2065,16 @@ static int __ref try_remove_memory(u64 start, u64 >> size) >> * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in >> * the same granularity it was added - a single memory block. >> */ >> + > > ^ unrelated change? >
Re: [PATCH v2 5/5] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 7/6/23 2:37 PM, David Hildenbrand wrote: > On 06.07.23 10:50, Aneesh Kumar K.V wrote: >> Radix vmemmap mapping can map things correctly at the PMD level or PTE >> level based on different device boundary checks. We also use altmap.reserve >> feature to align things correctly at pageblock granularity. We can end up >> losing some pages in memory with this. For ex: with 256MB memory block >> size, we require 4 pages to map vmemmap pages, In order to align things >> correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages >> 28 pages get reserved. >> >> Signed-off-by: Aneesh Kumar K.V >> --- >> arch/powerpc/Kconfig | 1 + >> arch/powerpc/mm/book3s64/radix_pgtable.c | 28 +++ >> .../platforms/pseries/hotplug-memory.c | 4 ++- >> 3 files changed, 32 insertions(+), 1 deletion(-) >> >> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig >> index 116d6add0bb0..f890907e5bbf 100644 >> --- a/arch/powerpc/Kconfig >> +++ b/arch/powerpc/Kconfig >> @@ -157,6 +157,7 @@ config PPC >> select ARCH_HAS_UBSAN_SANITIZE_ALL >> select ARCH_HAVE_NMI_SAFE_CMPXCHG >> select ARCH_KEEP_MEMBLOCK >> + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU >> select ARCH_MIGHT_HAVE_PC_PARPORT >> select ARCH_MIGHT_HAVE_PC_SERIO >> select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX >> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c >> b/arch/powerpc/mm/book3s64/radix_pgtable.c >> index a62729f70f2a..c0bd60b5fb64 100644 >> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c >> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c >> @@ -1678,3 +1678,31 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) >> return 1; >> } >> + >> +/* >> + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details >> + * some of the restrictions. We don't check for PMD_SIZE because our >> + * vmemmap allocation code can fallback correctly.
>> The pageblock > > x86 also has the fallback IIRC; the concern is rather that you end up with a > PTE-mapped vmemmap, which is slower at runtime than a PMD-mapped vmemmap. The memory block size is dependent on a config option at the hypervisor for power and we can have values ranging from 16MB - 512MB. With these different memory block sizes I was thinking relaxing these checks makes the feature more useful. Also with a page size of 64K, using a 2M vmemmap requires a larger memory block size. > >> + * alignment requirement is met using altmap->reserve blocks. >> + */ >> +bool mhp_supports_memmap_on_memory(unsigned long size) >> +{ >> + if (!radix_enabled()) >> + return false; >> + /* >> + * The pageblock alignment requirement is met by using >> + * reserve blocks in altmap. >> + */ >> + return size == memory_block_size_bytes(); >> +} > > I really dislike such arch overrides. > > I think the flow should be something like that, having a generic: > > bool mhp_supports_memmap_on_memory(unsigned long size) > { > ... > return arch_mhp_supports_memmap_on_memory(size) && > size == memory_block_size_bytes() && > ... > } Sure we can do that. But for ppc64 we also want to skip the PMD_SIZE and pageblock alignment restrictions. > where we'll also keep the pageblock check here. For ppc64, I converted that pageblock rule as a reserve blocks allocation with altmap. I am not sure whether we want to do that outside ppc64? > > And a ppc specific > > bool arch_mhp_supports_memmap_on_memory(unsigned long size) > { > /* > * Don't check for the vmemmap covering PMD_SIZE, we accept that > * we might get a PTE-mapped vmemmap. > */ > return radix_enabled(); > } > > We can then move the PMD_SIZE check into arch specific code (x86-aarch64). Sure, will rework the changes accordingly. -aneesh
Re: [PATCH v2 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag
On 06.07.23 10:50, Aneesh Kumar K.V wrote: Instead of checking for memmap on memory feature enablement within the functions checking for alignment, use the kernel parameter to control the memory hotplug flags. The generic kernel now enables memmap on memory feature if the hotplug flag requests the same. The ACPI code now can pass the flag unconditionally because the kernel will fall back to not using the feature if the alignment rules are not met. Signed-off-by: Aneesh Kumar K.V --- drivers/acpi/acpi_memhotplug.c | 3 +-- include/linux/memory_hotplug.h | 14 ++ mm/memory_hotplug.c | 35 +++--- 3 files changed, 26 insertions(+), 26 deletions(-) diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index 24f662d8bd39..4d0096fc4cc2 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (!info->length) continue; - if (mhp_supports_memmap_on_memory(info->length)) - mhp_flags |= MHP_MEMMAP_ON_MEMORY; + mhp_flags |= get_memmap_on_memory_flags(); result = __add_memory(mgid, info->start_addr, info->length, mhp_flags); diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index a769f44b8368..af7017122506 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -358,4 +358,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size); bool __mhp_supports_memmap_on_memory(unsigned long size); #endif /* CONFIG_MEMORY_HOTPLUG */ +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY +extern bool memmap_on_memory; +static inline unsigned long get_memmap_on_memory_flags(void) +{ + if (memmap_on_memory) + return MHP_MEMMAP_ON_MEMORY; + return 0; +} +#else +static inline unsigned long get_memmap_on_memory_flags(void) +{ + return 0; +} +#endif That's kind-of ugly TBH. Why do we need this change? -- Cheers, David / dhildenb
Re: [PATCH v2 2/5] mm/hotplug: Allow architecture override for memmap on memory feature
On 06.07.23 10:50, Aneesh Kumar K.V wrote: Some architectures like ppc64 want to enable this feature only with radix translation, and their vmemmap mappings have different alignment requirements. Add an override for mhp_supports_memmap_on_memory() and also use the altmap.reserve feature to adjust the pageblock alignment requirement. The patch also falls back to allocation of memmap outside memblock if the alignment rules are not met for memmap on memory allocation. This allows using the feature more widely, allocating memmap as much as possible within the memory block getting added. A follow-up patch to enable memmap on memory for ppc64 will use this. See my other reply. Rather have another arch check than overriding it completely. -- Cheers, David / dhildenb
Re: [PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
On 06.07.23 10:50, Aneesh Kumar K.V wrote: With memmap on memory, some architecture needs more details w.r.t altmap such as base_pfn, end_pfn, etc to unmap vmemmap memory. Can you elaborate why ppc64 needs that and x86-64 + aarch64 don't? IOW, why can't ppc64 simply allocate the vmemmap from the start of the memblock (-> base_pfn) and use the stored number of vmemmap pages to calculate the end_pfn? To rephrase: if the vmemmap is not at the beginning and doesn't cover full pageblocks, memory onlining/offlining would be broken. [...] +/** + * struct vmem_altmap - pre-allocated storage for vmemmap_populate + * @base_pfn: base of the entire dev_pagemap mapping + * @reserve: pages mapped, but reserved for driver use (relative to @base) + * @free: free pages set aside in the mapping for memmap storage + * @align: pages reserved to meet allocation alignments + * @alloc: track pages consumed, private to vmemmap_populate() + */ +struct vmem_altmap { + unsigned long base_pfn; + const unsigned long end_pfn; + const unsigned long reserve; + unsigned long free; + unsigned long align; + unsigned long alloc; +}; Instead of embedding that, what about conditionally allocating it and store a pointer to it in the "struct memory_block"? In the general case as of today, we don't have an altmap. + struct memory_block { unsigned long start_section_nr; unsigned long state; /* serialized by the dev->lock */ @@ -77,11 +94,7 @@ struct memory_block { */ struct zone *zone; struct device dev; - /* -* Number of vmemmap pages. These pages -* lay at the beginning of the memory block.
-*/ - unsigned long nr_vmemmap_pages; + struct vmem_altmap altmap; struct memory_group *group; /* group (if any) for this block */ struct list_head group_next;/* next block inside memory group */ #if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_MEMORY_HOTPLUG) @@ -147,7 +160,7 @@ static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri) extern int register_memory_notifier(struct notifier_block *nb); extern void unregister_memory_notifier(struct notifier_block *nb); int create_memory_block_devices(unsigned long start, unsigned long size, [...] static int check_cpu_on_node(int nid) @@ -2036,9 +2042,8 @@ EXPORT_SYMBOL(try_offline_node); static int __ref try_remove_memory(u64 start, u64 size) { - struct vmem_altmap mhp_altmap = {}; + int ret; struct vmem_altmap *altmap = NULL; - unsigned long nr_vmemmap_pages; int rc = 0, nid = NUMA_NO_NODE; BUG_ON(check_hotplug_memory_range(start, size)); @@ -2060,24 +2065,16 @@ static int __ref try_remove_memory(u64 start, u64 size) * We only support removing memory added with MHP_MEMMAP_ON_MEMORY in * the same granularity it was added - a single memory block. */ + ^ unrelated change? -- Cheers, David / dhildenb
Re: [PATCH v2 5/5] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
On 06.07.23 10:50, Aneesh Kumar K.V wrote: Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. We also use altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For ex: with 256MB memory block size, we require 4 pages to map vmemmap pages, In order to align things correctly we end up adding a reserve of 28 pages. ie, for every 4096 pages 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/book3s64/radix_pgtable.c | 28 +++ .../platforms/pseries/hotplug-memory.c | 4 ++- 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index a62729f70f2a..c0bd60b5fb64 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1678,3 +1678,31 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) return 1; } + +/* + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fallback correctly. The pageblock x86 also has the fallback IIRC; the concern is rather that you end up with a PTE-mapped vmemmap, which is slower at runtime than a PMD-mapped vmemmap. + * alignment requirement is met using altmap->reserve blocks.
+ */ +bool mhp_supports_memmap_on_memory(unsigned long size) +{ + if (!radix_enabled()) + return false; + /* +* The pageblock alignment requirement is met by using +* reserve blocks in altmap. +*/ + return size == memory_block_size_bytes(); +} I really dislike such arch overrides. I think the flow should be something like that, having a generic: bool mhp_supports_memmap_on_memory(unsigned long size) { ... return arch_mhp_supports_memmap_on_memory(size) && size == memory_block_size_bytes() && ... } where we'll also keep the pageblock check here. And a ppc specific bool arch_mhp_supports_memmap_on_memory(unsigned long size) { /* * Don't check for the vmemmap covering PMD_SIZE, we accept that * we might get a PTE-mapped vmemmap. */ return radix_enabled(); } We can then move the PMD_SIZE check into arch specific code (x86-aarch64). -- Cheers, David / dhildenb
[PATCH v2 1/5] mm/hotplug: Embed vmem_altmap details in memory block
With memmap on memory, some architectures need more details w.r.t altmap such as base_pfn, end_pfn, etc to unmap vmemmap memory. Embed the vmem_altmap data structure in memory_block and use that instead of nr_vmemmap_pages. On memory unplug, if the kernel finds any memory block in the range to be using vmem_altmap, the kernel fails to unplug the memory if the request is not a single memory block unplug. Signed-off-by: Aneesh Kumar K.V --- drivers/base/memory.c | 28 +++- include/linux/memory.h | 25 +++-- include/linux/memremap.h | 18 +- mm/memory_hotplug.c | 31 ++- 4 files changed, 53 insertions(+), 49 deletions(-) diff --git a/drivers/base/memory.c b/drivers/base/memory.c index b456ac213610..523cc1d37c81 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -106,6 +106,7 @@ static void memory_block_release(struct device *dev) { struct memory_block *mem = to_memory_block(dev); + WARN(mem->altmap.alloc, "Altmap not fully unmapped"); kfree(mem); } @@ -183,7 +184,7 @@ static int memory_block_online(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; - unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; + unsigned long nr_vmemmap_pages = 0; struct zone *zone; int ret; @@ -200,6 +201,9 @@ static int memory_block_online(struct memory_block *mem) * stage helps to keep accounting easier to follow - e.g vmemmaps * belong to the same zone as the memory they backed.
*/ + if (mem->altmap.alloc) + nr_vmemmap_pages = mem->altmap.alloc + mem->altmap.reserve; + if (nr_vmemmap_pages) { ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, zone); if (ret) @@ -230,7 +234,7 @@ static int memory_block_offline(struct memory_block *mem) { unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr); unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block; - unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages; + unsigned long nr_vmemmap_pages = 0; int ret; if (!mem->zone) @@ -240,6 +244,9 @@ static int memory_block_offline(struct memory_block *mem) * Unaccount before offlining, such that unpopulated zone and kthreads * can properly be torn down in offline_pages(). */ + if (mem->altmap.alloc) + nr_vmemmap_pages = mem->altmap.alloc + mem->altmap.reserve; + if (nr_vmemmap_pages) adjust_present_page_count(pfn_to_page(start_pfn), mem->group, -nr_vmemmap_pages); @@ -726,7 +733,7 @@ void memory_block_add_nid(struct memory_block *mem, int nid, #endif static int add_memory_block(unsigned long block_id, unsigned long state, - unsigned long nr_vmemmap_pages, + struct vmem_altmap *altmap, struct memory_group *group) { struct memory_block *mem; @@ -744,7 +751,10 @@ static int add_memory_block(unsigned long block_id, unsigned long state, mem->start_section_nr = block_id * sections_per_block; mem->state = state; mem->nid = NUMA_NO_NODE; - mem->nr_vmemmap_pages = nr_vmemmap_pages; + if (altmap) + memcpy(&mem->altmap, altmap, sizeof(*altmap)); + else + mem->altmap.alloc = 0; INIT_LIST_HEAD(&mem->group_next); #ifndef CONFIG_NUMA @@ -783,14 +793,14 @@ static int __init add_boot_memory_block(unsigned long base_section_nr) if (section_count == 0) return 0; return add_memory_block(memory_block_id(base_section_nr), - MEM_ONLINE, 0, NULL); + MEM_ONLINE, NULL, NULL); } static int add_hotplug_memory_block(unsigned long block_id, - unsigned long nr_vmemmap_pages, + struct vmem_altmap *altmap, struct memory_group *group) { - return
add_memory_block(block_id, MEM_OFFLINE, nr_vmemmap_pages, group); + return add_memory_block(block_id, MEM_OFFLINE, altmap, group); } static void remove_memory_block(struct memory_block *memory) @@ -818,7 +828,7 @@ static void remove_memory_block(struct memory_block *memory) * Called under device_hotplug_lock. */ int create_memory_block_devices(unsigned long start, unsigned long size, - unsigned long vmemmap_pages, + struct vmem_altmap *altmap, struct memory_group *group) { const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start)); @@ -832,7 +842,7 @@ int
[PATCH v3 13/13] powerpc/book3s64/radix: Add debug message to give more details of vmemmap allocation
Add some extra vmemmap pr_debug message that will indicate the type of vmemmap allocations. For ex: with DAX vmemmap optimization we can find the below details: [ 187.166580] radix-mmu: PAGE_SIZE vmemmap mapping [ 187.166587] radix-mmu: PAGE_SIZE vmemmap mapping [ 187.166591] radix-mmu: Tail page reuse vmemmap mapping [ 187.166594] radix-mmu: Tail page reuse vmemmap mapping [ 187.166598] radix-mmu: Tail page reuse vmemmap mapping [ 187.166601] radix-mmu: Tail page reuse vmemmap mapping [ 187.166604] radix-mmu: Tail page reuse vmemmap mapping [ 187.166608] radix-mmu: Tail page reuse vmemmap mapping [ 187.166611] radix-mmu: Tail page reuse vmemmap mapping [ 187.166614] radix-mmu: Tail page reuse vmemmap mapping [ 187.166617] radix-mmu: Tail page reuse vmemmap mapping [ 187.166620] radix-mmu: Tail page reuse vmemmap mapping [ 187.166623] radix-mmu: Tail page reuse vmemmap mapping [ 187.166626] radix-mmu: Tail page reuse vmemmap mapping [ 187.166629] radix-mmu: Tail page reuse vmemmap mapping [ 187.166632] radix-mmu: Tail page reuse vmemmap mapping And without vmemmap optimization [ 293.549931] radix-mmu: PMD_SIZE vmemmap mapping [ 293.549984] radix-mmu: PMD_SIZE vmemmap mapping [ 293.550032] radix-mmu: PMD_SIZE vmemmap mapping [ 293.550076] radix-mmu: PMD_SIZE vmemmap mapping [ 293.550117] radix-mmu: PMD_SIZE vmemmap mapping Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_pgtable.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index dac56c883099..a62729f70f2a 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1026,6 +1026,7 @@ static pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmdp, unsigned long p = vmemmap_alloc_block_buf(PAGE_SIZE, node, NULL); if (!p) return NULL; + pr_debug("PAGE_SIZE vmemmap mapping\n"); } else { /* * When a PTE/PMD entry is freed from the init_mm @@ -1038,6 +1039,7 @@ static 
pte_t * __meminit radix__vmemmap_pte_populate(pmd_t *pmdp, unsigned long */ get_page(reuse); p = page_to_virt(reuse); + pr_debug("Tail page reuse vmemmap mapping\n"); } VM_BUG_ON(!PAGE_ALIGNED(addr)); @@ -1147,6 +1149,7 @@ int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, in p = vmemmap_alloc_block_buf(PMD_SIZE, node, altmap); if (p) { vmemmap_set_pmd(pmd, p, node, addr, next); + pr_debug("PMD_SIZE vmemmap mapping\n"); continue; } else if (altmap) { /* -- 2.41.0
[PATCH v3 12/13] powerpc/book3s64/radix: Remove mmu_vmemmap_psize
This is not used by radix anymore. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/radix_pgtable.c | 11 --- arch/powerpc/mm/init_64.c| 21 ++--- 2 files changed, 14 insertions(+), 18 deletions(-) diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index c05783466562..dac56c883099 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -601,17 +601,6 @@ void __init radix__early_init_mmu(void) #else mmu_virtual_psize = MMU_PAGE_4K; #endif - -#ifdef CONFIG_SPARSEMEM_VMEMMAP - /* vmemmap mapping */ - if (mmu_psize_defs[MMU_PAGE_2M].shift) { - /* -* map vmemmap using 2M if available -*/ - mmu_vmemmap_psize = MMU_PAGE_2M; - } else - mmu_vmemmap_psize = mmu_virtual_psize; -#endif #endif /* * initialize page table size diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c index 5701faca39ef..6db7a063ba63 100644 --- a/arch/powerpc/mm/init_64.c +++ b/arch/powerpc/mm/init_64.c @@ -198,17 +198,12 @@ bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, return false; } -int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, - struct vmem_altmap *altmap) +int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node, +struct vmem_altmap *altmap) { bool altmap_alloc; unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift; -#ifdef CONFIG_PPC_BOOK3S_64 - if (radix_enabled()) - return radix__vmemmap_populate(start, end, node, altmap); -#endif - /* Align to the page size of the linear mapping. 
*/ start = ALIGN_DOWN(start, page_size); @@ -277,6 +272,18 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, return 0; } +int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node, + struct vmem_altmap *altmap) +{ + +#ifdef CONFIG_PPC_BOOK3S_64 + if (radix_enabled()) + return radix__vmemmap_populate(start, end, node, altmap); +#endif + + return __vmemmap_populate(start, end, node, altmap); +} + #ifdef CONFIG_MEMORY_HOTPLUG static unsigned long vmemmap_list_free(unsigned long start) { -- 2.41.0
[PATCH v3 11/13] powerpc/book3s64/radix: Add support for vmemmap optimization for radix
With 2M PMD-level mapping, we require 32 struct pages and a single vmemmap page can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)). Hence with 64K page size, we don't use vmemmap deduplication for PMD-level mapping. Signed-off-by: Aneesh Kumar K.V --- Documentation/mm/vmemmap_dedup.rst | 1 + Documentation/powerpc/index.rst| 1 + Documentation/powerpc/vmemmap_dedup.rst| 101 ++ arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/book3s/64/radix.h | 9 + arch/powerpc/mm/book3s64/radix_pgtable.c | 203 + 6 files changed, 316 insertions(+) create mode 100644 Documentation/powerpc/vmemmap_dedup.rst diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst index a4b12ff906c4..c573e08b5043 100644 --- a/Documentation/mm/vmemmap_dedup.rst +++ b/Documentation/mm/vmemmap_dedup.rst @@ -210,6 +210,7 @@ the device (altmap). The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64), PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64). +For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst The differences with HugeTLB are relatively minor. diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst index d33b554ca7ba..a50834798454 100644 --- a/Documentation/powerpc/index.rst +++ b/Documentation/powerpc/index.rst @@ -36,6 +36,7 @@ powerpc ultravisor vas-api vcpudispatch_stats +vmemmap_dedup features diff --git a/Documentation/powerpc/vmemmap_dedup.rst b/Documentation/powerpc/vmemmap_dedup.rst new file mode 100644 index ..dc4db59fdf87 --- /dev/null +++ b/Documentation/powerpc/vmemmap_dedup.rst @@ -0,0 +1,101 @@ +.. SPDX-License-Identifier: GPL-2.0 + +== +Device DAX +== + +The device-dax interface uses the tail deduplication technique explained in +Documentation/mm/vmemmap_dedup.rst + +On powerpc, vmemmap deduplication is only used with radix MMU translation. Also +with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap +deduplication. 
+ +With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap +page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no +vmemmap deduplication possible. + +With 1G PUD level mapping, we require 16384 struct pages and a single 64K +vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we +require 16 64K pages in vmemmap to map the struct page for 1G PUD level mapping. + +Here's how things look like on device-dax after the sections are populated:: + +---+ ---virt_to_page---> +---+ mapping to +---+ + | | | 0 | -> | 0 | + | | +---++---+ + | | | 1 | -> | 1 | + | | +---++---+ + | | | 2 | ^ ^ ^ ^ ^ ^ + | | +---+ | | | | | + | | | 3 | --+ | | | | + | | +---+ | | | | + | | | 4 | + | | | + |PUD| +---+ | | | + | level | | . | --+ | | + | mapping | +---+ | | + | | | . | + | + | | +---+ | + | | | 15| --+ + | | +---+ + | | + | | + | | + +---+ + + +With 4K page size, 2M PMD level mapping requires 512 struct pages and a single +4K vmemmap page contains 64 struct pages(4K/sizeof(struct page)). Hence we +require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping. + +Here's how things look like on device-dax after the sections are populated:: + + +---+ ---virt_to_page---> +---+ mapping to +---+ + | | | 0 | -> | 0 | + | | +---++---+ + | | | 1 | -> | 1 | + | | +---++---+ + | | | 2 | ^ ^ ^ ^ ^ ^ + | | +---+ | | | | | + | | | 3 | --+ | | | | + | |
[PATCH v3 10/13] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function
This is in preparation to update radix to implement vmemmap optimization for devdax. Below are the rules w.r.t radix vmemmap mapping 1. First try to map things using PMD (2M) 2. With altmap if altmap cross-boundary check returns true, fall back to PAGE_SIZE 3. If we can't allocate PMD_SIZE backing memory for vmemmap, fallback to PAGE_SIZE On removing vmemmap mapping, check if every subsection that is using the vmemmap area is invalid. If found to be invalid, that implies we can safely free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86 because with 64K page size, we need to do the above check even at the PAGE_SIZE granularity. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/radix.h | 2 + arch/powerpc/include/asm/pgtable.h | 4 + arch/powerpc/mm/book3s64/radix_pgtable.c | 318 +++-- arch/powerpc/mm/init_64.c | 26 +- 4 files changed, 319 insertions(+), 31 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h index 2ef92f36340f..f1461289643a 100644 --- a/arch/powerpc/include/asm/book3s/64/radix.h +++ b/arch/powerpc/include/asm/book3s/64/radix.h @@ -331,6 +331,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned long start, unsigned long phys); int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end, int node, struct vmem_altmap *altmap); +void __ref radix__vmemmap_free(unsigned long start, unsigned long end, + struct vmem_altmap *altmap); extern void radix__vmemmap_remove_mapping(unsigned long start, unsigned long page_size); diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h index 6a88bfdaa69b..68817ea7f994 100644 --- a/arch/powerpc/include/asm/pgtable.h +++ b/arch/powerpc/include/asm/pgtable.h @@ -165,6 +165,10 @@ static inline bool is_ioremap_addr(const void *x) return addr >= IOREMAP_BASE && addr < IOREMAP_END; } + +int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size); 
+bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start, + unsigned long page_size); #endif /* CONFIG_PPC64 */ #endif /* __ASSEMBLY__ */ diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index 227fea53c217..8a03e1005fd3 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -744,8 +744,57 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d) p4d_clear(p4d); } +static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long end) +{ + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE); + + return !vmemmap_populated(start, PMD_SIZE); +} + +static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long end) +{ + unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE); + + return !vmemmap_populated(start, PAGE_SIZE); + +} + +static void __meminit free_vmemmap_pages(struct page *page, +struct vmem_altmap *altmap, +int order) +{ + unsigned int nr_pages = 1 << order; + + if (altmap) { + unsigned long alt_start, alt_end; + unsigned long base_pfn = page_to_pfn(page); + + /* +* with 1G vmemmap mapping we can have things set up +* such that even though an altmap is specified we never +* use the altmap. 
+*/ + alt_start = altmap->base_pfn; + alt_end = altmap->base_pfn + altmap->reserve + + altmap->free + altmap->alloc + altmap->align; + + if (base_pfn >= alt_start && base_pfn < alt_end) { + vmem_altmap_free(altmap, nr_pages); + return; + } + } + + if (PageReserved(page)) { + /* allocated from memblock */ + while (nr_pages--) + free_reserved_page(page++); + } else + free_pages((unsigned long)page_address(page), order); +} + static void remove_pte_table(pte_t *pte_start, unsigned long addr, -unsigned long end, bool direct) +unsigned long end, bool direct, +struct vmem_altmap *altmap) { unsigned long next, pages = 0; pte_t *pte; @@ -759,24 +808,23 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr, if (!pte_present(*pte)) continue; - if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) { - /* -* The vmemmap_free() and
[PATCH v3 09/13] powerpc/book3s64/mm: Enable transparent pud hugepage
This is enabled only with radix translation and 1G hugepage size. This will be used with devdax device memory with a namespace alignment of 1G. Anon transparent hugepage is not supported even though we do have helpers checking pud_trans_huge(). We should never find that return true. The only expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP. Some of the helpers are never expected to get called on hash translation and hence is marked to call BUG() in such a case. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/include/asm/book3s/64/pgtable.h | 156 -- arch/powerpc/include/asm/book3s/64/radix.h| 36 .../include/asm/book3s/64/tlbflush-radix.h| 2 + arch/powerpc/include/asm/book3s/64/tlbflush.h | 8 + arch/powerpc/mm/book3s64/pgtable.c| 78 + arch/powerpc/mm/book3s64/radix_pgtable.c | 28 arch/powerpc/mm/book3s64/radix_tlb.c | 7 + arch/powerpc/platforms/Kconfig.cputype| 1 + include/trace/events/thp.h| 10 ++ 9 files changed, 315 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 4acc9690f599..9a05de007956 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -921,8 +921,29 @@ static inline pud_t pte_pud(pte_t pte) { return __pud_raw(pte_raw(pte)); } + +static inline pte_t *pudp_ptep(pud_t *pud) +{ + return (pte_t *)pud; +} + +#define pud_pfn(pud) pte_pfn(pud_pte(pud)) +#define pud_dirty(pud) pte_dirty(pud_pte(pud)) +#define pud_young(pud) pte_young(pud_pte(pud)) +#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud))) +#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud))) +#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud))) +#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud))) +#define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud))) +#define pud_mkwrite(pud) pte_pud(pte_mkwrite(pud_pte(pud))) #define pud_write(pud) pte_write(pud_pte(pud)) +#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY +#define 
pud_soft_dirty(pmd)pte_soft_dirty(pud_pte(pud)) +#define pud_mksoft_dirty(pmd) pte_pud(pte_mksoft_dirty(pud_pte(pud))) +#define pud_clear_soft_dirty(pmd) pte_pud(pte_clear_soft_dirty(pud_pte(pud))) +#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */ + static inline int pud_bad(pud_t pud) { if (radix_enabled()) @@ -1115,15 +1136,24 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write) #ifdef CONFIG_TRANSPARENT_HUGEPAGE extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); +extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot); extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot); extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd); +extern void set_pud_at(struct mm_struct *mm, unsigned long addr, + pud_t *pudp, pud_t pud); + static inline void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd) { } +static inline void update_mmu_cache_pud(struct vm_area_struct *vma, + unsigned long addr, pud_t *pud) +{ +} + extern int hash__has_transparent_hugepage(void); static inline int has_transparent_hugepage(void) { @@ -1133,6 +1163,14 @@ static inline int has_transparent_hugepage(void) } #define has_transparent_hugepage has_transparent_hugepage +static inline int has_transparent_pud_hugepage(void) +{ + if (radix_enabled()) + return radix__has_transparent_pud_hugepage(); + return 0; +} +#define has_transparent_pud_hugepage has_transparent_pud_hugepage + static inline unsigned long pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, unsigned long clr, unsigned long set) @@ -1142,6 +1180,16 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set); } +static inline unsigned long +pud_hugepage_update(struct mm_struct *mm, unsigned long addr, pud_t *pudp, + unsigned long clr, unsigned long set) +{ + if (radix_enabled()) + return 
radix__pud_hugepage_update(mm, addr, pudp, clr, set); + BUG(); + return pud_val(*pudp); +} + /* * returns true for pmd migration entries, THP, devmap, hugetlb * But compile time dependent on THP config @@ -1151,6 +1199,11 @@ static inline int pmd_large(pmd_t pmd) return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE)); } +static inline int pud_large(pud_t pud) +{ + return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE)); +} + /* * For radix we should always find H_PAGE_HASHPTE zero. Hence * the below will work for radix too @@
[PATCH v3 08/13] powerpc/mm/trace: Convert trace event to trace event class
A follow-up patch will add a pud variant for this same event. Using event class makes that addition simpler. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/mm/book3s64/hash_pgtable.c | 2 +- arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +- include/trace/events/thp.h | 23 --- 3 files changed, 18 insertions(+), 9 deletions(-) diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c index 51f48984abca..988948d69bc1 100644 --- a/arch/powerpc/mm/book3s64/hash_pgtable.c +++ b/arch/powerpc/mm/book3s64/hash_pgtable.c @@ -214,7 +214,7 @@ unsigned long hash__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr old = be64_to_cpu(old_be); - trace_hugepage_update(addr, old, clr, set); + trace_hugepage_update_pmd(addr, old, clr, set); if (old & H_PAGE_HASHPTE) hpte_do_hugepage_flush(mm, addr, pmdp, old); return old; diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index e7ea492ac510..02e185d2e4d6 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -962,7 +962,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add #endif old = radix__pte_update(mm, addr, pmdp_ptep(pmdp), clr, set, 1); - trace_hugepage_update(addr, old, clr, set); + trace_hugepage_update_pmd(addr, old, clr, set); return old; } diff --git a/include/trace/events/thp.h b/include/trace/events/thp.h index 202b3e3e67ff..a95c78b10561 100644 --- a/include/trace/events/thp.h +++ b/include/trace/events/thp.h @@ -8,25 +8,29 @@ #include #include -TRACE_EVENT(hugepage_set_pmd, +DECLARE_EVENT_CLASS(hugepage_set, - TP_PROTO(unsigned long addr, unsigned long pmd), - TP_ARGS(addr, pmd), + TP_PROTO(unsigned long addr, unsigned long pte), + TP_ARGS(addr, pte), TP_STRUCT__entry( __field(unsigned long, addr) - __field(unsigned long, pmd) + __field(unsigned long, pte) ), TP_fast_assign( __entry->addr = addr; - __entry->pmd = 
pmd; + __entry->pte = pte; ), - TP_printk("Set pmd with 0x%lx with 0x%lx", __entry->addr, __entry->pmd) + TP_printk("Set page table entry with 0x%lx with 0x%lx", __entry->addr, __entry->pte) ); +DEFINE_EVENT(hugepage_set, hugepage_set_pmd, + TP_PROTO(unsigned long addr, unsigned long pmd), + TP_ARGS(addr, pmd) +); -TRACE_EVENT(hugepage_update, +DECLARE_EVENT_CLASS(hugepage_update, TP_PROTO(unsigned long addr, unsigned long pte, unsigned long clr, unsigned long set), TP_ARGS(addr, pte, clr, set), @@ -48,6 +52,11 @@ TRACE_EVENT(hugepage_update, TP_printk("hugepage update at addr 0x%lx and pte = 0x%lx clr = 0x%lx, set = 0x%lx", __entry->addr, __entry->pte, __entry->clr, __entry->set) ); +DEFINE_EVENT(hugepage_update, hugepage_update_pmd, + TP_PROTO(unsigned long addr, unsigned long pmd, unsigned long clr, unsigned long set), + TP_ARGS(addr, pmd, clr, set) +); + DECLARE_EVENT_CLASS(migration_pmd, TP_PROTO(unsigned long addr, unsigned long pmd), -- 2.41.0
[PATCH v3 07/13] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap optimization includes an update of both the permissions (writeable to read-only) and the output address (pfn) of the vmemmap ptes. That is not supported without unmapping of pte(marking it invalid) by some architectures. With DAX vmemmap optimization we don't require such pte updates and architectures can enable DAX vmemmap optimization while having hugetlb vmemmap optimization disabled. Hence split DAX optimization support into a different config. s390, loongarch and riscv don't have devdax support. So the DAX config is not enabled for them. With this change, arm64 should be able to select DAX optimization [1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP") Signed-off-by: Aneesh Kumar K.V --- arch/loongarch/Kconfig | 2 +- arch/riscv/Kconfig | 2 +- arch/s390/Kconfig | 2 +- arch/x86/Kconfig | 3 ++- fs/Kconfig | 2 +- include/linux/mm.h | 2 +- mm/Kconfig | 5 - 7 files changed, 11 insertions(+), 7 deletions(-) diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig index e55511af4c77..537ca2a4005a 100644 --- a/arch/loongarch/Kconfig +++ b/arch/loongarch/Kconfig @@ -59,7 +59,7 @@ config LOONGARCH select ARCH_USE_QUEUED_SPINLOCKS select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT select ARCH_WANT_LD_ORPHAN_WARN - select ARCH_WANT_OPTIMIZE_VMEMMAP + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP select ARCH_WANTS_NO_INSTR select BUILDTIME_TABLE_SORT select COMMON_CLK diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index b49793cf34eb..fd9b683c4263 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -53,7 +53,7 @@ config RISCV select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT select ARCH_WANT_HUGE_PMD_SHARE if 64BIT select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL - select ARCH_WANT_OPTIMIZE_VMEMMAP + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU select 
BUILDTIME_TABLE_SORT if MMU diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index 5b39918b7042..975fd06e4f4d 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -127,7 +127,7 @@ config S390 select ARCH_WANTS_NO_INSTR select ARCH_WANT_DEFAULT_BPF_JIT select ARCH_WANT_IPC_PARSE_VERSION - select ARCH_WANT_OPTIMIZE_VMEMMAP + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP select BUILDTIME_TABLE_SORT select CLONE_BACKWARDS2 select DMA_OPS if PCI diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7422db409770..78224aa76409 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -128,7 +128,8 @@ config X86 select ARCH_WANT_GENERAL_HUGETLB select ARCH_WANT_HUGE_PMD_SHARE select ARCH_WANT_LD_ORPHAN_WARN - select ARCH_WANT_OPTIMIZE_VMEMMAP if X86_64 + select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP if X86_64 + select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64 select ARCH_WANTS_THP_SWAP if X86_64 select ARCH_HAS_PARANOID_L1D_FLUSH select BUILDTIME_TABLE_SORT diff --git a/fs/Kconfig b/fs/Kconfig index 18d034ec7953..9c104c130a6e 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -252,7 +252,7 @@ config HUGETLB_PAGE config HUGETLB_PAGE_OPTIMIZE_VMEMMAP def_bool HUGETLB_PAGE - depends on ARCH_WANT_OPTIMIZE_VMEMMAP + depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP depends on SPARSEMEM_VMEMMAP config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON diff --git a/include/linux/mm.h b/include/linux/mm.h index 1a2234ee14d2..83f51ec0897d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3640,7 +3640,7 @@ void vmemmap_free(unsigned long start, unsigned long end, #endif #define VMEMMAP_RESERVE_NR 2 -#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP +#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap) { diff --git a/mm/Kconfig b/mm/Kconfig index 09130434e30d..923bd35f81f2 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -487,7 +487,10 @@ config SPARSEMEM_VMEMMAP # Select this config option from the architecture Kconfig, 
if it is preferred # to enable the feature of HugeTLB/dev_dax vmemmap optimization. # -config ARCH_WANT_OPTIMIZE_VMEMMAP +config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP + bool + +config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP bool config HAVE_MEMBLOCK_PHYS_MAP -- 2.41.0
[PATCH v3 06/13] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE
pudp_set_wrprotect and move_huge_pud helpers are only used when CONFIG_TRANSPARENT_HUGEPAGE is enabled. Similar to the pmdp_set_wrprotect and move_huge_pmd helpers, use the architecture override only if CONFIG_TRANSPARENT_HUGEPAGE is set. Signed-off-by: Aneesh Kumar K.V --- include/linux/pgtable.h | 2 ++ mm/mremap.c | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 91def34f7784..b5af3e014606 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -558,6 +558,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm, #endif #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +#ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline void pudp_set_wrprotect(struct mm_struct *mm, unsigned long address, pud_t *pudp) { @@ -571,6 +572,7 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm, { BUILD_BUG(); } +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */ #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ #endif diff --git a/mm/mremap.c b/mm/mremap.c index 11e06e4ab33b..056478c106ee 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -349,7 +349,7 @@ static inline bool move_normal_pud(struct vm_area_struct *vma, } #endif -#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr, unsigned long new_addr, pud_t *old_pud, pud_t *new_pud) { -- 2.41.0
[PATCH v3 05/13] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME
This helps architectures to override pmd_same and pud_same independently. Signed-off-by: Aneesh Kumar K.V --- include/linux/pgtable.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 6fd9b2831338..91def34f7784 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -693,7 +693,9 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b) { return pmd_val(pmd_a) == pmd_val(pmd_b); } +#endif +#ifndef __HAVE_ARCH_PUD_SAME static inline int pud_same(pud_t pud_a, pud_t pud_b) { return pud_val(pud_a) == pud_val(pud_b); -- 2.41.0
[PATCH v3 04/13] mm/vmemmap: Allow architectures to override how vmemmap optimization works
Architectures like powerpc would like to use different page table allocators and mapping mechanisms to implement vmemmap optimization. Similar to vmemmap_populate, allow architectures to implement vmemmap_populate_compound_pages. Signed-off-by: Aneesh Kumar K.V --- mm/sparse-vmemmap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c index a044a130405b..541b3f69a481 100644 --- a/mm/sparse-vmemmap.c +++ b/mm/sparse-vmemmap.c @@ -141,6 +141,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node, start, end - 1); } +#ifndef vmemmap_populate_compound_pages pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node, struct vmem_altmap *altmap, struct page *reuse) @@ -446,6 +447,8 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn, return 0; } +#endif + struct page * __meminit __populate_section_memmap(unsigned long pfn, unsigned long nr_pages, int nid, struct vmem_altmap *altmap, struct dev_pagemap *pgmap) -- 2.41.0
[PATCH v3 03/13] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override
dax vmemmap optimization requires a minimum of 2 PAGE_SIZE area within vmemmap such that tail page mapping can point to the second PAGE_SIZE area. Enforce that in vmemmap_can_optimize() function. Architectures like powerpc also want to enable vmemmap optimization conditionally (only with radix MMU translation). Hence allow architecture override. Signed-off-by: Aneesh Kumar K.V --- include/linux/mm.h | 27 +++ mm/mm_init.c | 2 +- 2 files changed, 24 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 2dd73e4f3d8e..1a2234ee14d2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3639,13 +3639,32 @@ void vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap); #endif +#define VMEMMAP_RESERVE_NR 2 #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP -static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap, - struct dev_pagemap *pgmap) +static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap, + struct dev_pagemap *pgmap) { - return is_power_of_2(sizeof(struct page)) && - pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap; + unsigned long nr_pages; + unsigned long nr_vmemmap_pages; + + if (!pgmap || !is_power_of_2(sizeof(struct page))) + return false; + + nr_pages = pgmap_vmemmap_nr(pgmap); + nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> PAGE_SHIFT); + /* +* For vmemmap optimization with DAX we need minimum 2 vmemmap +* pages. 
See layout diagram in Documentation/mm/vmemmap_dedup.rst +*/ + return !altmap && (nr_vmemmap_pages > VMEMMAP_RESERVE_NR); } +/* + * If we don't have an architecture override, use the generic rule + */ +#ifndef vmemmap_can_optimize +#define vmemmap_can_optimize __vmemmap_can_optimize +#endif + #else static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap, struct dev_pagemap *pgmap) diff --git a/mm/mm_init.c b/mm/mm_init.c index a1963c3322af..245ac69b66a5 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1020,7 +1020,7 @@ static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap, if (!vmemmap_can_optimize(altmap, pgmap)) return pgmap_vmemmap_nr(pgmap); - return 2 * (PAGE_SIZE / sizeof(struct page)); + return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page)); } static void __ref memmap_init_compound(struct page *head, -- 2.41.0
[PATCH v3 01/13] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support
Architectures like powerpc would like to enable transparent huge page pud support only with radix translation. To support that add has_transparent_pud_hugepage() helper that architectures can override. Signed-off-by: Aneesh Kumar K.V --- drivers/nvdimm/pfn_devs.c | 2 +- include/linux/pgtable.h | 3 +++ 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c index af7d9301520c..18ad315581ca 100644 --- a/drivers/nvdimm/pfn_devs.c +++ b/drivers/nvdimm/pfn_devs.c @@ -100,7 +100,7 @@ static unsigned long *nd_pfn_supported_alignments(unsigned long *alignments) if (has_transparent_hugepage()) { alignments[1] = HPAGE_PMD_SIZE; - if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)) + if (has_transparent_pud_hugepage()) alignments[2] = HPAGE_PUD_SIZE; } diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 5063b482e34f..cf13f8d938a8 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1499,6 +1499,9 @@ typedef unsigned int pgtbl_mod_mask; #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE) #endif +#ifndef has_transparent_pud_hugepage +#define has_transparent_pud_hugepage() IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) +#endif /* * On some architectures it depends on the mm if the p4d/pud or pmd * layer of the page table hierarchy is folded or not. -- 2.41.0
[PATCH v3 02/13] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
We will use this in a later patch to do tlb flush when clearing pud entries on powerpc. This is similar to commit 93a98695f2f9 ("mm: change pmdp_huge_get_and_clear_full take vm_area_struct as arg") Signed-off-by: Aneesh Kumar K.V --- include/linux/pgtable.h | 4 ++-- mm/debug_vm_pgtable.c | 2 +- mm/huge_memory.c| 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index cf13f8d938a8..6fd9b2831338 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -450,11 +450,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, #endif #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL -static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm, +static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma, unsigned long address, pud_t *pudp, int full) { - return pudp_huge_get_and_clear(mm, address, pudp); + return pudp_huge_get_and_clear(vma->vm_mm, address, pudp); } #endif #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c index ee119e33fef1..ee2c4c1dcfc8 100644 --- a/mm/debug_vm_pgtable.c +++ b/mm/debug_vm_pgtable.c @@ -385,7 +385,7 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args) WARN_ON(!(pud_write(pud) && pud_dirty(pud))); #ifndef __PAGETABLE_PMD_FOLDED - pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1); + pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1); pud = READ_ONCE(*args->pudp); WARN_ON(!pud_none(pud)); #endif /* __PAGETABLE_PMD_FOLDED */ diff --git a/mm/huge_memory.c b/mm/huge_memory.c index eb3678360b97..ba20cef681a4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1981,7 +1981,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, if (!ptl) return 0; - pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm); + pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm); tlb_remove_pud_tlb_entry(tlb, pud, 
addr); if (vma_is_special_huge(vma)) { spin_unlock(ptl); -- 2.41.0
[PATCH v3 00/13] Add support for DAX vmemmap optimization for ppc64
This patch series implements changes required to support DAX vmemmap optimization for ppc64. The vmemmap optimization is only enabled with radix MMU translation and 1GB PUD mapping with 64K page size. The patch series also split hugetlb vmemmap optimization as a separate Kconfig variable so that architectures can enable DAX vmemmap optimization without enabling hugetlb vmemmap optimization. This should enable architectures like arm64 to enable DAX vmemmap optimization while they can't enable hugetlb vmemmap optimization. More details of the same are in patch "mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization" Changes from v1: * Rebase to latest linus tree * Address review feedback Changes from V1: * Fix make htmldocs warning * Fix vmemmap allocation bugs with different alignment values. * Correctly check for section validity to before we free vmemmap area Aneesh Kumar K.V (13): mm/hugepage pud: Allow arch-specific helper function to check huge page pud support mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override mm/vmemmap: Allow architectures to override how vmemmap optimization works mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization powerpc/mm/trace: Convert trace event to trace event class powerpc/book3s64/mm: Enable transparent pud hugepage powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function powerpc/book3s64/radix: Add support for vmemmap optimization for radix powerpc/book3s64/radix: Remove mmu_vmemmap_psize powerpc/book3s64/radix: Add debug message to give more details of vmemmap allocation Documentation/mm/vmemmap_dedup.rst| 1 + Documentation/powerpc/index.rst | 1 + Documentation/powerpc/vmemmap_dedup.rst | 101 arch/loongarch/Kconfig| 2 +- 
arch/powerpc/Kconfig | 1 + arch/powerpc/include/asm/book3s/64/pgtable.h | 156 - arch/powerpc/include/asm/book3s/64/radix.h| 47 ++ .../include/asm/book3s/64/tlbflush-radix.h| 2 + arch/powerpc/include/asm/book3s/64/tlbflush.h | 8 + arch/powerpc/include/asm/pgtable.h| 4 + arch/powerpc/mm/book3s64/hash_pgtable.c | 2 +- arch/powerpc/mm/book3s64/pgtable.c| 78 +++ arch/powerpc/mm/book3s64/radix_pgtable.c | 565 -- arch/powerpc/mm/book3s64/radix_tlb.c | 7 + arch/powerpc/mm/init_64.c | 37 +- arch/powerpc/platforms/Kconfig.cputype| 1 + arch/riscv/Kconfig| 2 +- arch/s390/Kconfig | 2 +- arch/x86/Kconfig | 3 +- drivers/nvdimm/pfn_devs.c | 2 +- fs/Kconfig| 2 +- include/linux/mm.h| 29 +- include/linux/pgtable.h | 11 +- include/trace/events/thp.h| 33 +- mm/Kconfig| 5 +- mm/debug_vm_pgtable.c | 2 +- mm/huge_memory.c | 2 +- mm/mm_init.c | 2 +- mm/mremap.c | 2 +- mm/sparse-vmemmap.c | 3 + 30 files changed, 1031 insertions(+), 82 deletions(-) create mode 100644 Documentation/powerpc/vmemmap_dedup.rst -- 2.41.0
Re: [PATCH v2 4/5] mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
On 06.07.23 10:50, Aneesh Kumar K.V wrote: Instead of adding menu entry with all supported architectures, add mm/Kconfig variable and select the same from supported architectures. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- arch/arm64/Kconfig | 4 +--- arch/x86/Kconfig | 4 +--- mm/Kconfig | 3 +++ 3 files changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7856c3a3e35a..7e5985c018f8 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -78,6 +78,7 @@ config ARM64 select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_USE_CMPXCHG_LOCKREF select ARCH_USE_GNU_PROPERTY select ARCH_USE_MEMTEST @@ -346,9 +347,6 @@ config GENERIC_CSUM config GENERIC_CALIBRATE_DELAY def_bool y -config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE - def_bool y - config SMP def_bool y diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 78224aa76409..d0258e92a8af 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -102,6 +102,7 @@ config X86 select ARCH_HAS_DEBUG_WX select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_NMI_SAFE_CMPXCHG + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO @@ -2610,9 +2611,6 @@ config ARCH_HAS_ADD_PAGES def_bool y depends on ARCH_ENABLE_MEMORY_HOTPLUG -config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE - def_bool y - menu "Power management and ACPI options" config ARCH_HIBERNATION_HEADER diff --git a/mm/Kconfig b/mm/Kconfig index 923bd35f81f2..2f9d28fee75d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -570,6 +570,9 @@ config MHP_MEMMAP_ON_MEMORY depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + bool + endif # MEMORY_HOTPLUG # Heavily threaded applications may benefit from splitting the mm-wide 
Acked-by: David Hildenbrand -- Cheers, David / dhildenb
[PATCH v2 5/5] powerpc/book3s64/memhotplug: Enable memmap on memory for radix
Radix vmemmap mapping can map things correctly at the PMD level or PTE level based on different device boundary checks. We also use the altmap.reserve feature to align things correctly at pageblock granularity. We can end up losing some pages in memory with this. For example, with a 256MB memory block size, we require 4 pages to map vmemmap pages. In order to align things correctly we end up adding a reserve of 28 pages. That is, for every 4096 pages, 28 pages get reserved. Signed-off-by: Aneesh Kumar K.V --- arch/powerpc/Kconfig | 1 + arch/powerpc/mm/book3s64/radix_pgtable.c | 28 +++ .../platforms/pseries/hotplug-memory.c| 4 ++- 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 116d6add0bb0..f890907e5bbf 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -157,6 +157,7 @@ config PPC select ARCH_HAS_UBSAN_SANITIZE_ALL select ARCH_HAVE_NMI_SAFE_CMPXCHG select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c index a62729f70f2a..c0bd60b5fb64 100644 --- a/arch/powerpc/mm/book3s64/radix_pgtable.c +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c @@ -1678,3 +1678,31 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) return 1; } + +/* + * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details + * of some of the restrictions. We don't check for PMD_SIZE because our + * vmemmap allocation code can fall back correctly. The pageblock + * alignment requirement is met using altmap->reserve blocks. + */ +bool mhp_supports_memmap_on_memory(unsigned long size) +{ + if (!radix_enabled()) + return false; + /* +* The pageblock alignment requirement is met by using +* reserve blocks in altmap. 
+*/ + return size == memory_block_size_bytes(); +} + +unsigned long memory_block_align_base(struct resource *res) +{ + unsigned long base_pfn = PHYS_PFN(res->start); + unsigned long align, size = resource_size(res); + unsigned long nr_vmemmap_pages = size >> PAGE_SHIFT; + unsigned long vmemmap_size = (nr_vmemmap_pages * sizeof(struct page)) >> PAGE_SHIFT; + + align = pageblock_align(base_pfn + vmemmap_size) - (base_pfn + vmemmap_size); + return align; +} diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c index 9c62c2c3b3d0..326db26d773e 100644 --- a/arch/powerpc/platforms/pseries/hotplug-memory.c +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c @@ -617,6 +617,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, u32 drc_index) static int dlpar_add_lmb(struct drmem_lmb *lmb) { + mhp_t mhp_flags = MHP_NONE; unsigned long block_sz; int nid, rc; @@ -637,7 +638,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb) nid = first_online_node; /* Add the memory */ - rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE); + mhp_flags |= get_memmap_on_memory_flags(); + rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags); if (rc) { invalidate_lmb_associativity_index(lmb); return rc; -- 2.41.0
[PATCH v2 4/5] mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
Instead of adding a menu entry listing all supported architectures, add an mm/Kconfig variable and select it from the supported architectures. No functional change in this patch. Signed-off-by: Aneesh Kumar K.V --- arch/arm64/Kconfig | 4 +--- arch/x86/Kconfig | 4 +--- mm/Kconfig | 3 +++ 3 files changed, 5 insertions(+), 6 deletions(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 7856c3a3e35a..7e5985c018f8 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -78,6 +78,7 @@ config ARM64 select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION select ARCH_KEEP_MEMBLOCK + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_USE_CMPXCHG_LOCKREF select ARCH_USE_GNU_PROPERTY select ARCH_USE_MEMTEST @@ -346,9 +347,6 @@ config GENERIC_CSUM config GENERIC_CALIBRATE_DELAY def_bool y -config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE - def_bool y - config SMP def_bool y diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 78224aa76409..d0258e92a8af 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -102,6 +102,7 @@ config X86 select ARCH_HAS_DEBUG_WX select ARCH_HAS_ZONE_DMA_SET if EXPERT select ARCH_HAVE_NMI_SAFE_CMPXCHG + select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI select ARCH_MIGHT_HAVE_PC_PARPORT select ARCH_MIGHT_HAVE_PC_SERIO @@ -2610,9 +2611,6 @@ config ARCH_HAS_ADD_PAGES def_bool y depends on ARCH_ENABLE_MEMORY_HOTPLUG -config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE - def_bool y - menu "Power management and ACPI options" config ARCH_HIBERNATION_HEADER diff --git a/mm/Kconfig b/mm/Kconfig index 923bd35f81f2..2f9d28fee75d 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -570,6 +570,9 @@ config MHP_MEMMAP_ON_MEMORY depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE +config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE + bool + endif # MEMORY_HOTPLUG # Heavily threaded applications may benefit from splitting the mm-wide -- 2.41.0
[PATCH v2 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag
Instead of checking for memmap on memory feature enablement within the functions checking for alignment, use the kernel parameter to control the memory hotplug flags. The generic kernel now enables the memmap on memory feature if the hotplug flags request it. The ACPI code can now pass the flag unconditionally because the kernel will fall back to not using the feature if the alignment rules are not met. Signed-off-by: Aneesh Kumar K.V --- drivers/acpi/acpi_memhotplug.c | 3 +-- include/linux/memory_hotplug.h | 14 ++ mm/memory_hotplug.c| 35 +++--- 3 files changed, 26 insertions(+), 26 deletions(-) diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c index 24f662d8bd39..4d0096fc4cc2 100644 --- a/drivers/acpi/acpi_memhotplug.c +++ b/drivers/acpi/acpi_memhotplug.c @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device) if (!info->length) continue; - if (mhp_supports_memmap_on_memory(info->length)) - mhp_flags |= MHP_MEMMAP_ON_MEMORY; + mhp_flags |= get_memmap_on_memory_flags(); result = __add_memory(mgid, info->start_addr, info->length, mhp_flags); diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h index a769f44b8368..af7017122506 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -358,4 +358,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size); bool __mhp_supports_memmap_on_memory(unsigned long size); #endif /* CONFIG_MEMORY_HOTPLUG */ +#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY +extern bool memmap_on_memory; +static inline unsigned long get_memmap_on_memory_flags(void) +{ + if (memmap_on_memory) + return MHP_MEMMAP_ON_MEMORY; + return 0; +} +#else +static inline unsigned long get_memmap_on_memory_flags(void) +{ + return 0; +} +#endif #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 423f96dd5481..a1542b8f64e6 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -45,19 +45,9 @@ /* * 
memory_hotplug.memmap_on_memory parameter */ -static bool memmap_on_memory __ro_after_init; +bool memmap_on_memory __ro_after_init; module_param(memmap_on_memory, bool, 0444); MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug"); - -static inline bool mhp_memmap_on_memory(void) -{ - return memmap_on_memory; -} -#else -static inline bool mhp_memmap_on_memory(void) -{ - return false; -} #endif enum { @@ -1280,10 +1270,9 @@ bool __mhp_supports_memmap_on_memory(unsigned long size) * altmap as an alternative source of memory, and we do not exactly * populate a single PMD. */ - return mhp_memmap_on_memory() && - size == memory_block_size_bytes() && - IS_ALIGNED(vmemmap_size, PMD_SIZE) && - IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); + return size == memory_block_size_bytes() && + IS_ALIGNED(vmemmap_size, PMD_SIZE) && + IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); } bool __weak mhp_supports_memmap_on_memory(unsigned long size) @@ -2082,15 +2071,13 @@ static int __ref try_remove_memory(u64 start, u64 size) * the same granularity it was added - a single memory block. */ - if (mhp_memmap_on_memory()) { - ret = walk_memory_blocks(start, size, , get_vmemmap_altmap_cb); - if (ret) { - if (size != memory_block_size_bytes()) { - pr_warn("Refuse to remove %#llx - %#llx," - "wrong granularity\n", - start, start + size); - return -EINVAL; - } + ret = walk_memory_blocks(start, size, , get_vmemmap_altmap_cb); + if (ret) { + if (size != memory_block_size_bytes()) { + pr_warn("Refuse to remove %#llx - %#llx," + "wrong granularity\n", + start, start + size); + return -EINVAL; } } -- 2.41.0
[PATCH v2 2/5] mm/hotplug: Allow architecture override for memmap on memory feature
Some architectures like ppc64 want to enable this feature only with radix translation, and their vmemmap mappings have different alignment requirements. Add an override for mhp_supports_memmap_on_memory() and also use the altmap.reserve feature to adjust the pageblock alignment requirement. The patch also falls back to allocating the memmap outside the memory block if the alignment rules are not met for memmap on memory allocation. This allows the feature to be used more widely, allocating as much of the memmap as possible within the memory block being added. A follow-up patch to enable memmap on memory for ppc64 will use this. Signed-off-by: Aneesh Kumar K.V --- arch/arm64/mm/mmu.c| 5 + arch/x86/mm/init_64.c | 6 ++ include/linux/memory_hotplug.h | 3 ++- mm/memory_hotplug.c| 36 -- 4 files changed, 39 insertions(+), 11 deletions(-) diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c index 95d360805f8a..0ff2ec634a58 100644 --- a/arch/arm64/mm/mmu.c +++ b/arch/arm64/mm/mmu.c @@ -1340,6 +1340,11 @@ void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size); } +bool mhp_supports_memmap_on_memory(unsigned long size) +{ + return __mhp_supports_memmap_on_memory(size); +} + /* * This memory hotplug notifier helps prevent boot memory from being * inadvertently removed as it blocks pfn range offlining process in diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index a190aae8ceaf..b318d26a70d4 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1264,6 +1264,12 @@ void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) __remove_pages(start_pfn, nr_pages, altmap); kernel_physical_mapping_remove(start, start + size); } + +bool mhp_supports_memmap_on_memory(unsigned long size) +{ + return __mhp_supports_memmap_on_memory(size); +} + #endif /* CONFIG_MEMORY_HOTPLUG */ static struct kcore_list kcore_vsyscall; diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h 
index 013c69753c91..a769f44b8368 100644 --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -354,7 +354,8 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid, extern int arch_create_linear_mapping(int nid, u64 start, u64 size, struct mhp_params *params); void arch_remove_linear_mapping(u64 start, u64 size); -extern bool mhp_supports_memmap_on_memory(unsigned long size); +bool mhp_supports_memmap_on_memory(unsigned long size); +bool __mhp_supports_memmap_on_memory(unsigned long size); #endif /* CONFIG_MEMORY_HOTPLUG */ #endif /* __LINUX_MEMORY_HOTPLUG_H */ diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index c4bac38cc147..423f96dd5481 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -1247,7 +1247,8 @@ static int online_memory_block(struct memory_block *mem, void *arg) return device_online(>dev); } -bool mhp_supports_memmap_on_memory(unsigned long size) +/* Helper function for architecture to use. */ +bool __mhp_supports_memmap_on_memory(unsigned long size) { unsigned long nr_vmemmap_pages = size / PAGE_SIZE; unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page); @@ -1285,6 +1286,20 @@ bool mhp_supports_memmap_on_memory(unsigned long size) IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT)); } +bool __weak mhp_supports_memmap_on_memory(unsigned long size) +{ + return false; +} + +/* + * Architectures may want to override the altmap reserve details based + * on the alignment requirement for vmemmap mapping. + */ +unsigned __weak long memory_block_align_base(struct resource *res) +{ + return 0; +} + /* * NOTE: The caller must call lock_device_hotplug() to serialize hotplug * and online/offline operations (triggered e.g. by sysfs). 
@@ -1295,7 +1310,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) { struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) }; enum memblock_flags memblock_flags = MEMBLOCK_NONE; - struct vmem_altmap mhp_altmap = {}; + struct vmem_altmap mhp_altmap = { + .base_pfn = PHYS_PFN(res->start), + .end_pfn = PHYS_PFN(res->end), + .reserve = memory_block_align_base(res), + }; struct memory_group *group = NULL; u64 start, size; bool new_node = false; @@ -1339,13 +1358,11 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags) * Self hosted memmap array */ if (mhp_flags & MHP_MEMMAP_ON_MEMORY) { - if (!mhp_supports_memmap_on_memory(size)) { - ret = -EINVAL; - goto error; + if (mhp_supports_memmap_on_memory(size)) { +
[PATCH v2 0/5] Add support for memmap on memory feature on ppc64
This patch series updates the memmap on memory feature to fall back to memmap allocation outside the memory block if the alignment rules are not met. This makes the feature more useful on architectures like ppc64 where the alignment rules are different with a 64K page size. This patch series is dependent on the dax vmemmap optimization series posted here https://lore.kernel.org/linux-mm/20230616110826.344417-1-aneesh.ku...@linux.ibm.com Changes from v1: * update the memblock to store vmemmap_altmap details. This is required so that when we remove the memory we can find the altmap details, which are needed on some architectures. * rebase to latest linus tree Aneesh Kumar K.V (5): mm/hotplug: Embed vmem_altmap details in memory block mm/hotplug: Allow architecture override for memmap on memory feature mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig powerpc/book3s64/memhotplug: Enable memmap on memory for radix arch/arm64/Kconfig| 4 +- arch/arm64/mm/mmu.c | 5 + arch/powerpc/Kconfig | 1 + arch/powerpc/mm/book3s64/radix_pgtable.c | 28 ++ .../platforms/pseries/hotplug-memory.c| 4 +- arch/x86/Kconfig | 4 +- arch/x86/mm/init_64.c | 6 ++ drivers/acpi/acpi_memhotplug.c| 3 +- drivers/base/memory.c | 28 -- include/linux/memory.h| 25 +++-- include/linux/memory_hotplug.h| 17 +++- include/linux/memremap.h | 18 +--- mm/Kconfig| 3 + mm/memory_hotplug.c | 94 +-- 14 files changed, 151 insertions(+), 89 deletions(-) -- 2.41.0
Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu
On Thu, Jul 6, 2023 at 3:09 PM Christophe Leroy wrote: > > > > Le 21/11/2022 à 04:51, Zhouyi Zhou a écrit : > > During CPU-hotplug torture (CONFIG_NO_HZ_FULL=y), if we try to > > offline tick_do_timer_cpu, the operation will fail because in > > function tick_nohz_cpu_down: > > ``` > > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) > >return -EBUSY; > > ``` > > Above bug was first discovered in torture tests performed in PPC VM > > of Open Source Lab of Oregon State University, and reproducable in RISC-V > > and X86-64 (with additional kernel commandline cpu0_hotplug). > > > > In this patch, we avoid offline tick_do_timer_cpu by distribute > > the offlining cpu among remaining cpus. > > > > Signed-off-by: Zhouyi Zhou > > --- > > include/linux/tick.h| 1 + > > kernel/time/tick-common.c | 1 + > > kernel/time/tick-internal.h | 1 - > > kernel/torture.c| 10 ++ > > 4 files changed, 12 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/tick.h b/include/linux/tick.h > > index bfd571f18cfd..23cc0b205853 100644 > > --- a/include/linux/tick.h > > +++ b/include/linux/tick.h > > @@ -14,6 +14,7 @@ > > #include > > > > #ifdef CONFIG_GENERIC_CLOCKEVENTS > > +extern int tick_do_timer_cpu __read_mostly; > > extern void __init tick_init(void); > > /* Should be core only, but ARM BL switcher requires it */ > > extern void tick_suspend_local(void); > > diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c > > index 46789356f856..87b9b9afa320 100644 > > --- a/kernel/time/tick-common.c > > +++ b/kernel/time/tick-common.c > > @@ -48,6 +48,7 @@ ktime_t tick_next_period; > >*procedure also covers cpu hotplug. 
> >*/ > > int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT; > > +EXPORT_SYMBOL_GPL(tick_do_timer_cpu); > > #ifdef CONFIG_NO_HZ_FULL > > /* > >* tick_do_timer_boot_cpu indicates the boot CPU temporarily owns > > diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h > > index 649f2b48e8f0..8953dca10fdd 100644 > > --- a/kernel/time/tick-internal.h > > +++ b/kernel/time/tick-internal.h > > @@ -15,7 +15,6 @@ > > > > DECLARE_PER_CPU(struct tick_device, tick_cpu_device); > > extern ktime_t tick_next_period; > > -extern int tick_do_timer_cpu __read_mostly; > > > > extern void tick_setup_periodic(struct clock_event_device *dev, int > > broadcast); > > extern void tick_handle_periodic(struct clock_event_device *dev); > > diff --git a/kernel/torture.c b/kernel/torture.c > > index 789aeb0e1159..bccbdd33dda2 100644 > > --- a/kernel/torture.c > > +++ b/kernel/torture.c > > @@ -33,6 +33,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -358,7 +359,16 @@ torture_onoff(void *arg) > > schedule_timeout_interruptible(HZ / 10); > > continue; > > } > > +#ifdef CONFIG_NO_HZ_FULL > > + /* do not offline tick do timer cpu */ > > + if (tick_nohz_full_running) { > > Can you use fonction tick_nohz_full_enabled() instead and avoid the #ifdef ? Thank Christophe for your wonderful advice, I will follow your advice next time I prepare a patch. For this false positive report, 58d766824264 ("tick/nohz: Fix cpu_is_hotpluggable() by checking with nohz subsystem") has beaten me in mainline. Thanks again Zhouyi > > > + cpu = (torture_random() >> 4) % maxcpu; > > + if (cpu >= tick_do_timer_cpu) > > + cpu = (cpu + 1) % (maxcpu + 1); > > + } else > > +#else > > cpu = (torture_random() >> 4) % (maxcpu + 1); > > +#endif > > if (!torture_offline(cpu, > >_offline_attempts, > > _offline_successes, > >_offline, _offline, > > _offline))
Re: [PATCH] arch/powerpc: Remove unnecessary endian conversion code in XICS
On 30/6/23 3:56 pm, Gautam Menghani wrote: Remove an unnecessary piece of code that does an endianness conversion but does not use the result. The following warning was reported by Clang's static analyzer: arch/powerpc/sysdev/xics/ics-opal.c:114:2: warning: Value stored to 'server' is never read [deadcode.DeadStores] server = be16_to_cpu(oserver); As the result of endianness conversion is never used, delete the line and fix the warning. Signed-off-by: Gautam Menghani 'server' was used as a parameter to opal_get_xive() in commit 5c7c1e9444d8 ("powerpc/powernv: Add OPAL ICS backend") when it was introduced. 'server' was also used in an error message for the call to opal_get_xive(). 'server' was always later set by a call to ics_opal_mangle_server() before being used. Commit bf8e0f891a32 ("powerpc/powernv: Fix endian issues in OPAL ICS backend") used a new variable 'oserver' as the parameter to opal_get_xive() instead of 'server' for endian correctness. It also removed 'server' from the error message for the call to opal_get_xive(). It was commit bf8e0f891a32 that added the unnecessary conversion and never used the result. Reviewed-by: Jordan Niethe --- arch/powerpc/sysdev/xics/ics-opal.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/sysdev/xics/ics-opal.c b/arch/powerpc/sysdev/xics/ics-opal.c index 6cfbb4fac7fb..5fe73dabab79 100644 --- a/arch/powerpc/sysdev/xics/ics-opal.c +++ b/arch/powerpc/sysdev/xics/ics-opal.c @@ -111,7 +111,6 @@ static int ics_opal_set_affinity(struct irq_data *d, __func__, d->irq, hw_irq, rc); return -1; } - server = be16_to_cpu(oserver); wanted_server = xics_get_irq_server(d->irq, cpumask, 1); if (wanted_server < 0) {
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
Hi Benjamin, On Thu, Jul 6, 2023 at 8:14 AM Benjamin Gray wrote: > On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote: > > Le 05/07/2023 à 02:30, Benjamin Gray a écrit : > > > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on > > > struct regmap_config, which is guarded behind CONFIG_REGMAP. > > > > > > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select > > > REGMAP") > > > exposed this by disabling the default pick unless KUNIT_ALL_TESTS > > > is > > > set, causing the ppc64be allnoconfig build to fail. > > > > > > Signed-off-by: Benjamin Gray > > > --- > > > drivers/rtc/Kconfig | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig > > > index ffca9a8bb878..7455ebd189fe 100644 > > > --- a/drivers/rtc/Kconfig > > > +++ b/drivers/rtc/Kconfig > > > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722 > > > > > > config RTC_DRV_DS1307 > > > tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, > > > EPSON RX-8025, ISL12057" > > > + select REGMAP > > > > As far as I can see, REGMAP defaults to Y when REGMAP_I2C is > > selected. > > Can you explain more in details why you have to select it explicitely > > ? > > If there is something wrong with the logic, then the logic should be > > fixed instead of just adding a selection of REGMAP for that > > particular > > RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or > > RTC_DRV_ABEOZ9 > > might have the exact same problem. > > Right, yeah, I don't want to assert this patch is the correct solution, > sending it was more to offer a fix and allow discussion if it should be > resolved some other way (so thanks for replying, I appreciate it). > > In terms of why I made this patch, the way I see it, if a config option > requires another config option be set, then "selects" is the natural > way of phrasing this dependency. "default" on the REGMAP side seems > weird. If it's a default, does that mean it can be overridden? 
But > RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The > build breaks without it. > > But maybe KConfig works differently to my assumptions. Maybe the > referenced patch that causes the build failure is actually incorrect > (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not > supposed to depend on REGMAP like KUnit does? Thanks for CCing me! Looks like I made a really silly mistake here: my patch not only allows the user to enable REGMAP manually (for the test), but also to disable it manually, regardless if there are users or not :-( I think the proper fix is to replace the "default y if ..." by "select REGMAP" for all users. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
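The change Geert proposes swaps a soft default for a hard dependency. Roughly, as an illustrative (abridged, not the final patch) Kconfig fragment:

```kconfig
# Before: REGMAP merely *defaults* on when a bus helper is selected, so
# a user or randconfig can still end up with REGMAP off while a driver
# that needs it is on, breaking the build:
config REGMAP
	default y if (REGMAP_I2C || REGMAP_SPI)
	bool

# After: each user states the hard dependency directly, so Kconfig
# always enables REGMAP whenever any such driver is enabled:
config RTC_DRV_DS1307
	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, ..."
	select REGMAP
	select REGMAP_I2C
```

The difference is that `default` is only a suggestion the user can override, while `select` forces the symbol on unconditionally, which matches a compile-time requirement.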
Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
On Thu, Jul 06, 2023 at 01:52:08PM +0900, David Stevens wrote: > On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang wrote: > > > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote: > > > From: David Stevens > > > > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map > > > memory into the guest that is backed by un-refcounted struct pages - for > > > example, higher order non-compound pages allocated by the amdgpu driver > > > via ttm_pool_alloc_page. > > > > I guess you mean the tail pages of the higher order non-compound pages? > > And as to the head page, it is said to be set to one coincidentally[*], > > and shall not be considered as refcounted. IIUC, refcount of this head > > page will be increased and decreased soon in hva_to_pfn_remapped(), so > > this may not be a problem(?). But treating this head page differently, > > as a refcounted one(e.g., to set the A/D flags), is weired. > > > > Or maybe I missed some context, e.g., can the head page be allocted to > > guest at all? > > Yes, this is to allow mapping the tail pages of higher order > non-compound pages - I should have been more precise in my wording. > The head pages can already be mapped into the guest. > > Treating the head and tail pages would require changing how KVM > behaves in a situation it supports today (rather than just adding > support for an unsupported situation). Currently, without this series, > KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the > guest. When that happens, KVM sets the A/D flags. I'm not sure whether > that's actually valid behavior, nor do I know whether anyone actually > cares about it. But it's what KVM does today, and I would shy away > from modifying that behavior without good reason. I know the A/D status of the refcounted, VM_PFNMAP|VM_IO backed pages will be recorded. And I have no idea if this is a necessary requirement either. 
But it feels awkward to see the head and the tail ones of non-compound pages be treated inconsistently. After all, the head page just happens to have its refcount being 1, it is not a real refcounted page. So I would suggest mentioning such different behavior in the commit message at least. :) > > > > > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, > > > struct kvm_mmu *mmu, > > > */ > > > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page > > > *sp, int i) > > > { > > > - bool host_writable; > > > + bool host_writable, is_refcounted; > > > gpa_t first_pte_gpa; > > > u64 *sptep, spte; > > > struct kvm_memory_slot *slot; > > > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, > > > struct kvm_mmu_page *sp, int > > > sptep = >spt[i]; > > > spte = *sptep; > > > host_writable = spte & shadow_host_writable_mask; > > > + // TODO: is this correct? > > > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED; > > > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); > > > make_spte(vcpu, sp, slot, pte_access, gfn, > > > spte_to_pfn(spte), spte, true, false, > > > - host_writable, ); > > > + host_writable, is_refcounted, ); > > > > Could we restrict that a non-refcounted page shall not be used as shadow > > page? > > I'm not very familiar with the shadow mmu, so my response might not > make sense. But do you mean not allowing non-refcoutned pages as the > guest page tables shadowed by a kvm_mmu_page? It would probably be > possible to do that, and I doubt anyone would care about the > restriction. But as far as I can tell, the guest page table is only > accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted > pages just fine. Sorry, my brain just got baked... Pls just ignore this question :) B.R. Yu
Re: [PATCH v3] powerpc/fsl_rio: add missing of_node_put() in fsl_rio_setup()
Le 24/11/2022 à 15:33, Yang Yingliang a écrit : > The of node returned by of_find_compatible_node() with refcount > decremented, of_node_put() need be called after using it to avoid > refcount leak. Is that patch still required ? If yes, please rebase and resend. Thanks Christophe > > Fixes: abc3aeae3aaa ("fsl-rio: Add two ports and rapidio message units > support") > Signed-off-by: Yang Yingliang > --- > v2 -> v3: >drop of_node_put() in loop for_each_child_of_node(), it's not needed. > > v1 -> v2: >Add fix tag. > > v1 patch link: > > https://lore.kernel.org/lkml/20220615032137.1878219-1-yangyingli...@huawei.com/ > --- > arch/powerpc/sysdev/fsl_rio.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/arch/powerpc/sysdev/fsl_rio.c b/arch/powerpc/sysdev/fsl_rio.c > index c8f044d62fe2..1ad962d3060e 100644 > --- a/arch/powerpc/sysdev/fsl_rio.c > +++ b/arch/powerpc/sysdev/fsl_rio.c > @@ -550,12 +550,13 @@ int fsl_rio_setup(struct platform_device *dev) > if (!dt_range) { > pr_err("%pOF: unable to find 'reg' property\n", > np); > rc = -ENOMEM; > goto err_pw; > } > + of_node_put(np); > range_start = of_read_number(dt_range, aw); > dbell->dbell_regs = (struct rio_dbell_regs *)(rmu_regs_win + > (u32)range_start); > > /*set up port write node*/ > np = of_find_compatible_node(NULL, NULL, "fsl,srio-port-write-unit"); > @@ -578,12 +579,13 @@ int fsl_rio_setup(struct platform_device *dev) > if (!dt_range) { > pr_err("%pOF: unable to find 'reg' property\n", > np); > rc = -ENOMEM; > goto err; > } > + of_node_put(np); > range_start = of_read_number(dt_range, aw); > pw->pw_regs = (struct rio_pw_regs *)(rmu_regs_win + (u32)range_start); > > /*set up ports node*/ > for_each_child_of_node(dev->dev.of_node, np) { > port_index = of_get_property(np, "cell-index", NULL); > @@ -760,12 +762,13 @@ int fsl_rio_setup(struct platform_device *dev) > err_pw: > kfree(dbell); > dbell = NULL; > err_dbell: > iounmap(rmu_regs_win); > rmu_regs_win = NULL; > + of_node_put(np); > err_rmu: > 
kfree(ops); > err_ops: > iounmap(rio_regs_win); > rio_regs_win = NULL; > err_rio_regs:
Re: [PATCH linux-next][RFC]torture: avoid offline tick_do_timer_cpu
Le 21/11/2022 à 04:51, Zhouyi Zhou a écrit : > During CPU-hotplug torture (CONFIG_NO_HZ_FULL=y), if we try to > offline tick_do_timer_cpu, the operation will fail because in > function tick_nohz_cpu_down: > ``` > if (tick_nohz_full_running && tick_do_timer_cpu == cpu) >return -EBUSY; > ``` > Above bug was first discovered in torture tests performed in PPC VM > of Open Source Lab of Oregon State University, and reproducable in RISC-V > and X86-64 (with additional kernel commandline cpu0_hotplug). > > In this patch, we avoid offline tick_do_timer_cpu by distribute > the offlining cpu among remaining cpus. > > Signed-off-by: Zhouyi Zhou > --- > include/linux/tick.h| 1 + > kernel/time/tick-common.c | 1 + > kernel/time/tick-internal.h | 1 - > kernel/torture.c| 10 ++ > 4 files changed, 12 insertions(+), 1 deletion(-) > > diff --git a/include/linux/tick.h b/include/linux/tick.h > index bfd571f18cfd..23cc0b205853 100644 > --- a/include/linux/tick.h > +++ b/include/linux/tick.h > @@ -14,6 +14,7 @@ > #include > > #ifdef CONFIG_GENERIC_CLOCKEVENTS > +extern int tick_do_timer_cpu __read_mostly; > extern void __init tick_init(void); > /* Should be core only, but ARM BL switcher requires it */ > extern void tick_suspend_local(void); > diff --git a/kernel/time/tick-common.c b/kernel/time/tick-common.c > index 46789356f856..87b9b9afa320 100644 > --- a/kernel/time/tick-common.c > +++ b/kernel/time/tick-common.c > @@ -48,6 +48,7 @@ ktime_t tick_next_period; >*procedure also covers cpu hotplug. 
>    */ > int tick_do_timer_cpu __read_mostly = TICK_DO_TIMER_BOOT; > +EXPORT_SYMBOL_GPL(tick_do_timer_cpu); > #ifdef CONFIG_NO_HZ_FULL > /* >* tick_do_timer_boot_cpu indicates the boot CPU temporarily owns > diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h > index 649f2b48e8f0..8953dca10fdd 100644 > --- a/kernel/time/tick-internal.h > +++ b/kernel/time/tick-internal.h > @@ -15,7 +15,6 @@ > > DECLARE_PER_CPU(struct tick_device, tick_cpu_device); > extern ktime_t tick_next_period; > -extern int tick_do_timer_cpu __read_mostly; > > extern void tick_setup_periodic(struct clock_event_device *dev, int > broadcast); > extern void tick_handle_periodic(struct clock_event_device *dev); > diff --git a/kernel/torture.c b/kernel/torture.c > index 789aeb0e1159..bccbdd33dda2 100644 > --- a/kernel/torture.c > +++ b/kernel/torture.c > @@ -33,6 +33,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -358,7 +359,16 @@ torture_onoff(void *arg) > schedule_timeout_interruptible(HZ / 10); > continue; > } > +#ifdef CONFIG_NO_HZ_FULL > + /* do not offline tick do timer cpu */ > + if (tick_nohz_full_running) { Can you use function tick_nohz_full_enabled() instead and avoid the #ifdef ? > + cpu = (torture_random() >> 4) % maxcpu; > + if (cpu >= tick_do_timer_cpu) > + cpu = (cpu + 1) % (maxcpu + 1); > + } else > +#else > cpu = (torture_random() >> 4) % (maxcpu + 1); > +#endif > if (!torture_offline(cpu, >_offline_attempts, _offline_successes, >_offline, _offline, _offline))
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang wrote: > > On Tue, 4 Jul 2023 16:50:48 +0900 > David Stevens wrote: > > > From: David Stevens > > > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows > > callers to resolve a gfn when the associated pfn has a valid struct page > > that isn't being actively refcounted (e.g. tail pages of non-compound > > higher order pages). For a caller to safely omit FOLL_GET, all usages of > > the returned pfn must be guarded by a mmu notifier. > > > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that > > is set when the returned pfn has an associated struct page with a valid > > refcount. Callers that don't pass FOLL_GET should remember this value > > and use it to avoid places like kvm_is_ad_tracked_page that assume a > > non-zero refcount. > > > > Signed-off-by: David Stevens > > --- > > include/linux/kvm_host.h | 10 ++ > > virt/kvm/kvm_main.c | 67 +--- > > virt/kvm/pfncache.c | 2 +- > > 3 files changed, 47 insertions(+), 32 deletions(-) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index ef2763c2b12e..a45308c7d2d9 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct > > kvm_memory_slot *slot, gfn_t gfn, > > void kvm_release_page_clean(struct page *page); > > void kvm_release_page_dirty(struct page *page); > > > > +void kvm_set_page_accessed(struct page *page); > > +void kvm_set_page_dirty(struct page *page); > > + > > struct kvm_follow_pfn { > > const struct kvm_memory_slot *slot; > > gfn_t gfn; > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn { > > bool atomic; > > /* Allow a read fault to create a writeable mapping. */ > > bool allow_write_mapping; > > + /* > > + * Usage of the returned pfn will be guared by a mmu notifier. Must > ^guarded > > + * be true if FOLL_GET is not set. 
> > +	 */
> > +	bool guarded_by_mmu_notifier;
>
> It seems no one sets the guraded_by_mmu_notifier in this patch. Is
> guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the
> caller of __kvm_follow_pfn()?

Yes, this is the case.

> If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> temporary solution. I don't think it is a good idea to play tricks with
> a temporary solution, more like we are abusing the toleration.

I'm not sure I understand what you're getting at. This series never
calls gup without FOLL_GET.

This series aims to provide kvm_follow_pfn as a unified API on top of
gup+follow_pte. Since one of the major clients of this API uses an mmu
notifier, it makes sense to support returning a pfn without taking a
reference. And we indeed need to do that for certain types of memory.

> Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to
> indicate a tail page?

What do you mean by to indicate a tail page? Do you mean to indicate
that the returned pfn refers to non-refcounted page? That's specified
by is_refcounted_page.

-David
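The rule David confirms here — guarded_by_mmu_notifier is set by the caller and must be true whenever FOLL_GET is omitted — is exactly the kind of invariant a sanity check can enforce. The sketch below is a standalone model of that invariant, not KVM code; the struct, flag value, and function name are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define FOLL_GET 0x01  /* illustrative value, not the kernel's gup flag */

/* Minimal stand-in for the kvm_follow_pfn struct under discussion. */
struct follow_pfn_args {
	unsigned int flags;
	bool guarded_by_mmu_notifier;
};

/*
 * The invariant from the patch comment: a caller may omit FOLL_GET only
 * when every use of the returned pfn is guarded by an mmu notifier.
 */
static bool follow_pfn_args_valid(const struct follow_pfn_args *a)
{
	return (a->flags & FOLL_GET) || a->guarded_by_mmu_notifier;
}
```

In the real series this would correspond to a WARN_ON-style check at the top of the pfn resolution path, rejecting callers that neither take a reference nor promise notifier protection.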
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
Le 06/07/2023 à 08:14, Benjamin Gray a écrit :
> On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote:
>>
>>
>> Le 05/07/2023 à 02:30, Benjamin Gray a écrit :
>>> The drivers/rtc/rtc-ds1307.c driver has a direct dependency on
>>> struct regmap_config, which is guarded behind CONFIG_REGMAP.
>>>
>>> Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP")
>>> exposed this by disabling the default pick unless KUNIT_ALL_TESTS is
>>> set, causing the ppc64be allnoconfig build to fail.
>>>
>>> Signed-off-by: Benjamin Gray
>>> ---
>>>  drivers/rtc/Kconfig | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
>>> index ffca9a8bb878..7455ebd189fe 100644
>>> --- a/drivers/rtc/Kconfig
>>> +++ b/drivers/rtc/Kconfig
>>> @@ -246,6 +246,7 @@ config RTC_DRV_AS3722
>>>
>>>  config RTC_DRV_DS1307
>>>  	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, ISL12057"
>>> +	select REGMAP
>>
>> As far as I can see, REGMAP defaults to Y when REGMAP_I2C is selected.
>> Can you explain more in details why you have to select it explicitely ?
>> If there is something wrong with the logic, then the logic should be
>> fixed instead of just adding a selection of REGMAP for that particular
>> RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or RTC_DRV_ABEOZ9
>> might have the exact same problem.
>
> Right, yeah, I don't want to assert this patch is the correct solution,
> sending it was more to offer a fix and allow discussion if it should be
> resolved some other way (so thanks for replying, I appreciate it).
>
> In terms of why I made this patch, the way I see it, if a config option
> requires another config option be set, then "selects" is the natural
> way of phrasing this dependency. "default" on the REGMAP side seems
> weird. If it's a default, does that mean it can be overridden? But
> RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The
> build breaks without it.
I meant "why doesn't it work as is", and/or "why didn't you fix what
doesn't work".

Well, until commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not
select REGMAP") it was not user-selectable, so it couldn't be
overridden. After commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should
not select REGMAP") it is overridable, so we have an additional
problem.

Does RTC_DRV_DS1307 require REGMAP, or does RTC_DRV_DS1307 require
REGMAP_I2C which in turn requires REGMAP ?

I think that huge default like for REGMAP should be replaced by
individual selection of REGMAP from each of those config items. For
example, REGMAP_I2C should select REGMAP; then I guess it would also
solve your issue, wouldn't it ?

> But maybe KConfig works differently to my assumptions. Maybe the
> referenced patch that causes the build failure is actually incorrect
> (CC Geert). I spoke with Joel Stanley (CC) and he indicated you're not
> supposed to depend on REGMAP like KUnit does?
>
> As for other drivers having the same problem, quite possibly, yes. But
> the PPC configs don't seem to build those drivers, so I'd prefer to
> leave it to anyone who does build them to report the error. I wasn't
> planning to audit all the drivers, I was just trying to fix the PPC
> configs I build and test.

Well ok, but that's probably the indication of a deeper problem which
deserves a more generic fix.
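Christophe's suggested generic fix — have each transport-specific regmap option select the core REGMAP symbol instead of relying on a broad `default y` expression on REGMAP itself — would look roughly like this. This is a hedged sketch of the shape of the change, not the actual Kconfig in the tree:

```kconfig
# Hypothetical sketch: each transport selects the core symbol, so any
# driver that selects REGMAP_I2C automatically gets REGMAP as well.
config REGMAP
	tristate

config REGMAP_I2C
	tristate
	select REGMAP
	depends on I2C
```

With this shape, RTC_DRV_DS1307 would only need `select REGMAP_I2C`, and the `select REGMAP` added by the patch under discussion would become unnecessary.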
Re: [PATCH] rtc: Kconfig: select REGMAP for RTC_DRV_DS1307
On Thu, 2023-07-06 at 05:13 +, Christophe Leroy wrote:
>
>
> Le 05/07/2023 à 02:30, Benjamin Gray a écrit :
> > The drivers/rtc/rtc-ds1307.c driver has a direct dependency on
> > struct regmap_config, which is guarded behind CONFIG_REGMAP.
> >
> > Commit 70a640c0efa7 ("regmap: REGMAP_KUNIT should not select REGMAP")
> > exposed this by disabling the default pick unless KUNIT_ALL_TESTS is
> > set, causing the ppc64be allnoconfig build to fail.
> >
> > Signed-off-by: Benjamin Gray
> > ---
> >  drivers/rtc/Kconfig | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/rtc/Kconfig b/drivers/rtc/Kconfig
> > index ffca9a8bb878..7455ebd189fe 100644
> > --- a/drivers/rtc/Kconfig
> > +++ b/drivers/rtc/Kconfig
> > @@ -246,6 +246,7 @@ config RTC_DRV_AS3722
> >
> >  config RTC_DRV_DS1307
> >  	tristate "Dallas/Maxim DS1307/37/38/39/40/41, ST M41T00, EPSON RX-8025, ISL12057"
> > +	select REGMAP
>
> As far as I can see, REGMAP defaults to Y when REGMAP_I2C is selected.
> Can you explain more in details why you have to select it explicitely ?
> If there is something wrong with the logic, then the logic should be
> fixed instead of just adding a selection of REGMAP for that particular
> RTC_DRV_DS1307. Because others like RTC_DRV_ABB5ZES3 or RTC_DRV_ABEOZ9
> might have the exact same problem.

Right, yeah, I don't want to assert this patch is the correct solution,
sending it was more to offer a fix and allow discussion if it should be
resolved some other way (so thanks for replying, I appreciate it).

In terms of why I made this patch, the way I see it, if a config option
requires another config option be set, then "selects" is the natural
way of phrasing this dependency. "default" on the REGMAP side seems
weird. If it's a default, does that mean it can be overridden? But
RTC_DRV_DS1307 *requires* REGMAP; it's not just a "would be nice". The
build breaks without it.

But maybe KConfig works differently to my assumptions.
Maybe the referenced patch that causes the build failure is actually
incorrect (CC Geert). I spoke with Joel Stanley (CC) and he indicated
you're not supposed to depend on REGMAP like KUnit does?

As for other drivers having the same problem, quite possibly, yes. But
the PPC configs don't seem to build those drivers, so I'd prefer to
leave it to anyone who does build them to report the error. I wasn't
planning to audit all the drivers, I was just trying to fix the PPC
configs I build and test.
[next-20230705] kernel BUG mm/memcontrol.c:3715! (ltp/madvise06)
While running LTP tests (madvise06) on an IBM Power9 LPAR booted with
6.4.0-next-20230705, the following crash is seen:

Injecting memory failure for pfn 0x3f79 at process virtual address 0x7fff9b74
Memory failure: 0x3f79: recovery action for clean LRU page: Recovered
madvise06 (133636): drop_caches: 3
[ cut here ]
kernel BUG at mm/memcontrol.c:3715!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=8192 NUMA pSeries
Modules linked in: brd overlay exfat vfat fat xfs loop sctp ip6_udp_tunnel udp_tunnel dm_mod nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bonding ip_set tls rfkill nf_tables libcrc32c nfnetlink sunrpc pseries_rng vmx_crypto ext4 mbcache jbd2 sd_mod t10_pi crc64_rocksoft crc64 sg ibmvscsi scsi_transport_srp ibmveth fuse [last unloaded: init_module(O)]
CPU: 10 PID: 133636 Comm: madvise06 Tainted: G O 6.4.0-next-20230705 #1
Hardware name: IBM,8375-42A POWER9 (raw) 0x4e0202 0xf05 of:IBM,FW950.80 (VL950_131) hv:phyp pSeries
NIP: c054ea88 LR: c028b2a8 CTR: c054e8d0
REGS: c0029dd7b890 TRAP: 0700 Tainted: G O (6.4.0-next-20230705)
MSR: 80029033 CR: 28008288 XER:
CFAR: c054e904 IRQMASK: 0
GPR00: c028b2a8 c0029dd7bb30 c1431600 c002bc978000
GPR04: c2b3b288 00010192 0001
GPR08: c000f9abb180 0002 c002bc978580
GPR12: c054e8d0 c0001ec53f00
GPR16:
GPR20: c0001b2e6578 00400cc0 7fff f000
GPR24: c0029dd7bd30 c0029dd7bd58 c0001b2e6568
GPR28: c0029dd7bde0 0001 0001 c0001b2e6540
NIP [c054ea88] mem_cgroup_read_u64+0x1b8/0x1d0
LR [c028b2a8] cgroup_seqfile_show+0xb8/0x160
Call Trace:
[c0029dd7bb50] [c028b2a8] cgroup_seqfile_show+0xb8/0x160
[c0029dd7bbc0] [c0673ba4] kernfs_seq_show+0x44/0x60
[c0029dd7bbe0] [c05c4238] seq_read_iter+0x238/0x620
[c0029dd7bcb0] [c0675064] kernfs_fop_read_iter+0x1d4/0x2c0
[c0029dd7bd00] [c057fbac] vfs_read+0x26c/0x350
[c0029dd7bdc0] [c058077c] ksys_read+0x7c/0x140
[c0029dd7be10] [c0036900]
system_call_exception+0x140/0x350
[c0029dd7be50] [c000d6a0] system_call_common+0x160/0x2e4
--- interrupt: c00 at 0x7fff9eb41484
NIP: 7fff9eb41484 LR: 10008540 CTR:
REGS: c0029dd7be80 TRAP: 0c00 Tainted: G O (6.4.0-next-20230705)
MSR: 8280f033 CR: 28002282 XER:
IRQMASK: 0
GPR00: 0003 7fffc33de7d0 7fff9ec27300 0013
GPR04: 7fffc33e0aa0 1fff 0013
GPR08: 7fffc33e0aa0
GPR12: 7fff9ecca3a0
GPR16: 10035520 10035b90 100347a8
GPR20: 1002fb68 10063900 2000 1002fb68
GPR24: 004c 1002fa78 7fffc33e0aa0
GPR28: 0013 1fff 1fff
NIP [7fff9eb41484] 0x7fff9eb41484
LR [10008540] 0x10008540
--- interrupt: c00
Code: 7fa34800 409effc4 7c0802a6 3861 f8010030 4bfffdfd e8010030 786383e4 7c0803a6 4b6c 7c0802a6 f8010030 <0fe0> 7c0802a6 f8010030 0fe0
---[ end trace ]---
pstore: backend (nvram) writing error (-1)
Kernel panic - not syncing: Fatal exception
Rebooting in 10 seconds..

Git bisect points to the following patch:

commit 29bf1eb7d2abbdfc24c4ef7acf7a51b72dc43d2b
memcg: drop kmem.limit_in_bytes

Does the testcase madvise06 need an update?

90  tst_res(TINFO, "\tCached: %ld Kb",
91          SAFE_READ_MEMINFO("Cached:") - init_cached);
92
93  print_cgmem("memory.current");
94  print_cgmem("memory.swap.current");
95  print_cgmem("memory.kmem.usage_in_bytes");  <<== this line
96  }

If I comment out line 95 from the testcase, it completes successfully.

- Sachin
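If the answer is that madvise06 needs updating, one option is to make the read of memory.kmem.usage_in_bytes conditional on the file existing, since the knob was removed by the bisected commit. The sketch below is illustrative only — it is not LTP's actual print_cgmem() helper, and the function name is made up:

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Illustrative sketch (not LTP's print_cgmem()): read a cgroup file
 * only if it exists, so the test keeps working on kernels where
 * memory.kmem.usage_in_bytes has been removed. Returns the value,
 * or -1 if the file is absent or unreadable.
 */
static long read_cgmem_optional(const char *path)
{
	FILE *f;
	long val = -1;

	if (access(path, R_OK) != 0)
		return -1;  /* knob removed on this kernel: skip, don't crash */

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}
```

A test updated this way would report the value where available and skip it on kernels at or after the commit that dropped kmem.limit_in_bytes, instead of tripping the BUG via the legacy read path.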
Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET
On Wed, Jul 5, 2023 at 8:56 PM Yu Zhang wrote:
>
> On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote:
> > From: David Stevens
> >
> > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > callers to resolve a gfn when the associated pfn has a valid struct page
> > that isn't being actively refcounted (e.g. tail pages of non-compound
> > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > the returned pfn must be guarded by a mmu notifier.
> >
> > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > is set when the returned pfn has an associated struct page with a valid
> > refcount. Callers that don't pass FOLL_GET should remember this value
> > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > non-zero refcount.
> >
> > Signed-off-by: David Stevens
> > ---
> >  include/linux/kvm_host.h | 10 ++
> >  virt/kvm/kvm_main.c      | 67 +---
> >  virt/kvm/pfncache.c      |  2 +-
> >  3 files changed, 47 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ef2763c2b12e..a45308c7d2d9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> >  void kvm_release_page_clean(struct page *page);
> >  void kvm_release_page_dirty(struct page *page);
> >
> > +void kvm_set_page_accessed(struct page *page);
> > +void kvm_set_page_dirty(struct page *page);
> > +
> >  struct kvm_follow_pfn {
> >  	const struct kvm_memory_slot *slot;
> >  	gfn_t gfn;
> > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> >  	bool atomic;
> >  	/* Allow a read fault to create a writeable mapping. */
> >  	bool allow_write_mapping;
> > +	/*
> > +	 * Usage of the returned pfn will be guared by a mmu notifier. Must
> > +	 * be true if FOLL_GET is not set.
> > +	 */
> > +	bool guarded_by_mmu_notifier;
>
> And how?
> Any place to check the invalidate seq?

kvm_follow_pfn can't meaningfully validate the seq number, since the
mmu notifier locking is handled by the caller. This is more of a
sanity check that the API is being used properly, as proposed here [1].
I did deviate from the proposal with a bool instead of some type of
integer, since the exact value of mmu_seq wouldn't be useful.

[1] https://lore.kernel.org/all/zgvusf7lmkrnd...@google.com/#t

-David
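For context, the pattern David says is "handled by the caller" is KVM's invalidation-sequence retry: sample the sequence counter, do the (potentially stale) lookup, and retry if an invalidation ran in between. The sketch below is a standalone single-threaded model of that protocol — the names follow KVM's style (mmu_seq, retry), but it is not KVM code:

```c
#include <assert.h>
#include <stdbool.h>

/* Standalone model of KVM's mmu-notifier sequence/retry protocol. */
struct mmu_model {
	unsigned long seq;   /* bumped after every invalidation */
	bool in_progress;    /* an invalidation is currently running */
	long mapping;        /* the "translation" being cached */
};

static void invalidate(struct mmu_model *m, long new_mapping)
{
	m->in_progress = true;
	m->mapping = new_mapping;
	m->in_progress = false;
	m->seq++;            /* published once the invalidation finishes */
}

/* True if a snapshot taken at 'seq' must be thrown away and redone. */
static bool mmu_retry_needed(const struct mmu_model *m, unsigned long seq)
{
	return m->in_progress || m->seq != seq;
}

static long lookup(struct mmu_model *m)
{
	long snapshot;
	unsigned long seq;

	do {
		seq = m->seq;          /* kvm: mmu_seq = kvm->mmu_invalidate_seq */
		snapshot = m->mapping; /* the potentially stale walk */
		/* in the concurrent original, an invalidation may run here */
	} while (mmu_retry_needed(m, seq)); /* kvm: retry under the mmu lock */

	return snapshot;
}
```

The point of the thread is that this check lives in the caller, so __kvm_follow_pfn itself can only assert that the caller has promised to do it (the guarded_by_mmu_notifier bool), not verify the sequence number on its behalf.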