Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Björn Töpel
Oscar Salvador  writes:

> On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote:
>> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
>> +   struct vmem_altmap *altmap)
>> +{
>> +if (altmap)
>> +vmem_altmap_free(altmap, size >> PAGE_SHIFT);
>> +else
>> +free_pages((unsigned long)page_address(page), get_order(size));
>
> David already pointed this out, but can check
> arch/x86/mm/init_64.c:free_pagetable().
>
> You will see that we have to do some magic for bootmem memory (DIMMs
> which were not hotplugged but already present)

Thank you!
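(For reference, the x86 function Oscar points to handles the bootmem case
roughly like this; paraphrased from arch/x86/mm/init_64.c around v6.9, with
comments added, so treat it as a sketch rather than an exact quote:)

	static void __meminit free_pagetable(struct page *page, int order)
	{
		unsigned long magic;
		unsigned int nr_pages = 1 << order;

		/* bootmem pages (present at boot, not hotplugged) carry PG_reserved */
		if (PageReserved(page)) {
			__ClearPageReserved(page);

			magic = page->index;
			if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
				while (nr_pages--)
					put_page_bootmem(page++);
			} else
				while (nr_pages--)
					free_reserved_page(page++);
		} else
			free_pages((unsigned long)page_address(page), order);
	}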

>> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
>> +void __ref vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap)
>> +{
>> +remove_pgd_mapping(start, end, true, altmap);
>> +}
>> +#endif /* CONFIG_SPARSEMEM_VMEMMAP */
>> +#endif /* CONFIG_MEMORY_HOTPLUG */
>
> I will comment on the patch where you add support for hotplug and the
> dependency, but on a track in LSFMM today, we decided that most likely
> we will drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP
> environments.
> So, since you are adding this fresh, please consider tying the
> hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP.
> As a bonus, you will only have to maintain one flavour of functions.

Ah, yeah, I saw it mentioned on the LSF/MM/BPF topics. Less is
definitely more -- I'll make the next version depend on
SPARSEMEM_VMEMMAP.
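(For reference, tightening the dependency is a one-word change to the select
added later in this series; a hypothetical v3 hunk, not the actual resend:)

	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM_VMEMMAP && 64BIT && MMU
	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG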


Björn



Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-14 Thread Björn Töpel
Oscar Salvador  writes:

> On Tue, May 14, 2024 at 04:04:40PM +0200, Björn Töpel wrote:
>> From: Björn Töpel 
>> 
>> Prepare for memory hotplugging support by changing from __init to
>> __meminit for the page table functions that are used by the upcoming
>> architecture specific callbacks.
>> 
>> Changing the __init attribute to __meminit avoids the functions being
>> removed after init. The __meminit attribute makes sure the
>> functions are kept in the kernel text post init, but only if memory
>> hotplugging is enabled for the build.
>> 
>> Also, make sure that the altmap parameter is properly passed on to
>> vmemmap_populate_hugepages().
>> 
>> Signed-off-by: Björn Töpel 
>
> Reviewed-by: Oscar Salvador 
>
>> +static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
>> +  uintptr_t fixed_map_size)
>>  {
>>  phys_addr_t pa;
>>  uintptr_t va, map_size;
>> @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>   * memory hotplug, we are not able to update all the page tables with
>>   * the new PMDs.
>>   */
>> -return vmemmap_populate_hugepages(start, end, node, NULL);
>> +return vmemmap_populate_hugepages(start, end, node, altmap);
>
> I would have put this into a separate patch.

Thanks for the review, Oscar!

I'll split this up (also suggested by Alex!).


Cheers,
Björn



Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-14 Thread Greg Kroah-Hartman
On Tue, May 14, 2024 at 06:21:57PM -0700, Yuanchu Xie wrote:
> On Tue, May 14, 2024 at 9:06 AM Greg Kroah-Hartman
>  wrote:
> >
> > On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote:
> > > Memctl provides a way for the guest to control its physical memory
> > > properties, and enables optimizations and security features. For
> > > example, the guest can provide information to the host about where parts
> > > of a hugepage may be unbacked, or that sensitive data may not be swapped out, etc.
> > >...
> > Pretty generic name for a hardware-specific driver :(
> It's not for real hardware btw. Its use case is similar to pvpanic
> where the device is emulated by the VMM. I can change the name if it's
> a problem.

This file is only used for a single PCI device, which is very
hardware-specific even if that hardware is "fake" :)

Please make the name more specific as well.

thanks,

greg k-h



[PATCH] ring-buffer: Add cast to unsigned long addr passed to virt_to_page()

2024-05-14 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The sub-buffer pages are held in an unsigned long array, and a cast is
needed when an entry is passed to virt_to_page().

Link: https://lore.kernel.org/all/20240515124808.06279...@canb.auug.org.au/

Fixes: 117c39200d9d ("ring-buffer: Introducing ring-buffer mapping functions")
Reported-by: Stephen Rothwell 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a02c7a52a0f5..7345a8b625fb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -6283,7 +6283,7 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer,
 	}
 
 	while (p < nr_pages) {
-		struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]);
+		struct page *page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]);
 		int off = 0;
 
 		if (WARN_ON_ONCE(s >= nr_subbufs)) {
-- 
2.43.0




Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-14 Thread Deepak Gupta

On Wed, May 15, 2024 at 01:10:03AM +, Edgecombe, Rick P wrote:

> On Mon, 2024-05-13 at 15:23 -0600, Jiri Olsa wrote:
> > so at the moment the patch 6 changes shadow stack for
> > 
> > 1) current uretprobe which are not working at the moment and we change
> >    the top value of shadow stack with shstk_push_frame
> > 2) optimized uretprobe which needs to push new frame on shadow stack
> >    with shstk_update_last_frame
> > 
> > I think we should do 1) and have current uretprobe working with shadow
> > stack, which is broken at the moment
> > 
> > I'm ok with not using optimized uretprobe when shadow stack is detected
> > as enabled and we go with current uretprobe in that case
> > 
> > would this work for you?
> 
> Sorry for the delay. It seems reasonable to me due to 1 being at a fixed
> address where 2 was at an arbitrary address. But Peterz might have felt the
> opposite earlier. Not sure.
> 
> I'd also love to get some second opinions from broonie (arm shadow stack) and
> Deepak (riscv shadow stack).
> 
> Deepak, even if riscv has a special instruction that pushes to the shadow
> stack, will it be ok if there is a callable operation that does the same
> thing? Like, aren't you relying on endbranches or the compiler or something
> such that arbitrary data can't be pushed via that instruction?


The instruction is `sspush x1/ra`. It pushes the contents of the return
address register (ra, also called x1) onto the shadow stack. `ra` is RISC-V's
equivalent of arm's link register.

The prologue of a function is supposed to have `sspush x1` to save it away.
The ISA doesn't allow encodings with any other RISC-V GPR (except register x5,
because some embedded RISC-V toolchains have used x5 as ra too).
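(For illustration, a Zicfiss-instrumented prologue/epilogue could look roughly
like this; assumed compiler output, annotated, not taken from this thread:)

	func:
		addi     sp, sp, -16
		sd       ra, 8(sp)       # save ra on the regular stack
		sspush   x1              # ... and on the shadow stack
		...
		ld       ra, 8(sp)
		addi     sp, sp, 16
		sspopchk x1              # fault if ra != shadow stack copy
		ret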

On the question of a callable operation, I think I still need to fully
understand who manages the probe and ensures forward progress.

Question:

Is it the kernel that maintains all return probes, i.e., are the original
return addresses saved in kernel data structures on a per-task basis? Once the
uretprobe has done its job, is it then the kernel that ensures the return to
the original return address?



BTW Jiri, thanks for considering shadow stack in your work.




Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-14 Thread Yuanchu Xie
On Tue, May 14, 2024 at 9:06 AM Greg Kroah-Hartman
 wrote:
>
> On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote:
> > Memctl provides a way for the guest to control its physical memory
> > properties, and enables optimizations and security features. For
> > example, the guest can provide information to the host about where parts
> > of a hugepage may be unbacked, or that sensitive data may not be swapped out, etc.
> >...
> Pretty generic name for a hardware-specific driver :(
It's not for real hardware btw. Its use case is similar to pvpanic
where the device is emulated by the VMM. I can change the name if it's
a problem.

> Yup, you write this to hardware, please use proper structures and types
> for that, otherwise you will have problems in the near future.
Thanks for the review and comments on endianness and using proper
types. Will do.
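(For illustration, "proper structures and types" for a little-endian PCI MMIO
register block could look roughly like this; field and function names are
hypothetical, not from the RFC:)

	#include <linux/io.h>
	#include <linux/types.h>

	struct memctl_regs {
		__le32 version;    /* RO: device/protocol version */
		__le32 command;    /* WO: per-cpu command doorbell */
		__le64 buf_gpa;    /* RW: guest-physical address of the shared buffer */
	} __packed;

	static void memctl_ring_doorbell(void __iomem *base, u32 cmd)
	{
		/* iowrite32() performs the CPU-to-little-endian conversion */
		iowrite32(cmd, base + offsetof(struct memctl_regs, command));
	}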

Thanks,
Yuanchu



Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support

2024-05-14 Thread Edgecombe, Rick P
On Mon, 2024-05-13 at 15:23 -0600, Jiri Olsa wrote:
> so at the moment the patch 6 changes shadow stack for
> 
> 1) current uretprobe which are not working at the moment and we change
>    the top value of shadow stack with shstk_push_frame
> 2) optimized uretprobe which needs to push new frame on shadow stack
>    with shstk_update_last_frame
> 
> I think we should do 1) and have current uretprobe working with shadow
> stack, which is broken at the moment
> 
> I'm ok with not using optimized uretprobe when shadow stack is detected
> as enabled and we go with current uretprobe in that case
> 
> would this work for you?

Sorry for the delay. It seems reasonable to me due to 1 being at a fixed address
where 2 was at an arbitrary address. But Peterz might have felt the opposite
earlier. Not sure.

I'd also love to get some second opinions from broonie (arm shadow stack) and
Deepak (riscv shadow stack).

Deepak, even if riscv has a special instruction that pushes to the shadow stack,
will it be ok if there is a callable operation that does the same thing? Like,
aren't you relying on endbranches or the compiler or something such that
arbitrary data can't be pushed via that instruction?

BTW Jiri, thanks for considering shadow stack in your work.


Re: [PATCH] sched/rt: Clean up usage of rt_task()

2024-05-14 Thread Phil Auld


Hi Qais,

On Wed, May 15, 2024 at 12:41:12AM +0100 Qais Yousef wrote:
> rt_task() checks if a task has RT priority. But depending on your
> dictionary, this could mean it belongs to the RT class, or is a 'realtime'
> task, which includes both the RT and DL classes.
> 
> Since this has already caused some confusion in discussion [1], it
> seemed a cleanup was due.
> 
> I define rt_task() to cover tasks that belong to the RT class. Make sure
> that it returns true only for the RT class, audit the users, and replace
> those that need the old behavior with the new realtime_task(), which
> returns true for both RT and DL classes. Introduce a similar
> realtime_prio() to create the same distinction relative to rt_prio(), and
> update the users.

I think making the difference clear is good. However, I think rt_task() is
a better name. We still have dl_task(). And rt tasks are, basically, things
managed by rt.c, not realtime.c :) I know that naming doesn't hold for
deadline.c and dl_, but this change would be the reverse of that pattern.

> 
> Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.
> 
> Document the functions to make it more obvious what is the difference
> between them. PI-boosted tasks is a factor that must be taken into
> account when choosing which function to use.
> 
> Rename task_is_realtime() to task_has_realtime_policy() as the old name
> is confusing against the new realtime_task().

Keeping rt_task() as above would mean this could stay as it was, but this
change makes sense as you have written it too.



Cheers,
Phil

> 
> No functional changes were intended.
> 
> [1] 
> https://lore.kernel.org/lkml/20240506100509.gl40...@noisy.programming.kicks-ass.net/
> 
> Signed-off-by: Qais Yousef 
> ---
>  fs/select.c   |  2 +-
>  include/linux/ioprio.h|  2 +-
>  include/linux/sched/deadline.h|  6 --
>  include/linux/sched/prio.h|  1 +
>  include/linux/sched/rt.h  | 27 ++-
>  kernel/locking/rtmutex.c  |  4 ++--
>  kernel/locking/rwsem.c|  4 ++--
>  kernel/locking/ww_mutex.h |  2 +-
>  kernel/sched/core.c   |  6 +++---
>  kernel/time/hrtimer.c |  6 +++---
>  kernel/trace/trace_sched_wakeup.c |  2 +-
>  mm/page-writeback.c   |  4 ++--
>  mm/page_alloc.c   |  2 +-
>  13 files changed, 48 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/select.c b/fs/select.c
> index 9515c3fa1a03..8d5c1419416c 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -82,7 +82,7 @@ u64 select_estimate_accuracy(struct timespec64 *tv)
>* Realtime tasks get a slack of 0 for obvious reasons.
>*/
>  
> - if (rt_task(current))
> + if (realtime_task(current))
>   return 0;
>  
>   ktime_get_ts64();
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index db1249cd9692..6c00342b6166 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -40,7 +40,7 @@ static inline int task_nice_ioclass(struct task_struct 
> *task)
>  {
>   if (task->policy == SCHED_IDLE)
>   return IOPRIO_CLASS_IDLE;
> - else if (task_is_realtime(task))
> + else if (task_has_realtime_policy(task))
>   return IOPRIO_CLASS_RT;
>   else
>   return IOPRIO_CLASS_BE;
> diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
> index df3aca89d4f5..5cb88b748ad6 100644
> --- a/include/linux/sched/deadline.h
> +++ b/include/linux/sched/deadline.h
> @@ -10,8 +10,6 @@
>  
>  #include 
>  
> -#define MAX_DL_PRIO  0
> -
>  static inline int dl_prio(int prio)
>  {
>   if (unlikely(prio < MAX_DL_PRIO))
> @@ -19,6 +17,10 @@ static inline int dl_prio(int prio)
>   return 0;
>  }
>  
> +/*
> + * Returns true if a task has a priority that belongs to DL class. PI-boosted
> + * tasks will return true. Use dl_policy() to ignore PI-boosted tasks.
> + */
>  static inline int dl_task(struct task_struct *p)
>  {
>   return dl_prio(p->prio);
> diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
> index ab83d85e1183..6ab43b4f72f9 100644
> --- a/include/linux/sched/prio.h
> +++ b/include/linux/sched/prio.h
> @@ -14,6 +14,7 @@
>   */
>  
>  #define MAX_RT_PRIO  100
> +#define MAX_DL_PRIO  0
>  
>  #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
>  #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
> diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
> index b2b9e6eb9683..b31be3c50152 100644
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -7,18 +7,43 @@
>  struct task_struct;
>  
>  static inline int rt_prio(int prio)
> +{
> + if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO))
> + return 1;
> + return 0;
> +}
> +
> +static inline int realtime_prio(int prio)
>  {
>   if (unlikely(prio < MAX_RT_PRIO))
>   return 1;
>   return 0;
>  }
>  
> +/*
> + * Returns true if a task has a 

[PATCH] sched/rt: Clean up usage of rt_task()

2024-05-14 Thread Qais Yousef
rt_task() checks if a task has RT priority. But depending on your
dictionary, this could mean it belongs to the RT class, or is a 'realtime'
task, which includes both the RT and DL classes.

Since this has already caused some confusion in discussion [1], it
seemed a cleanup was due.

I define rt_task() to cover tasks that belong to the RT class. Make sure
that it returns true only for the RT class, audit the users, and replace
those that need the old behavior with the new realtime_task(), which returns
true for both RT and DL classes. Introduce a similar realtime_prio() to
create the same distinction relative to rt_prio(), and update the users.
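
(A standalone sketch, plain userspace C with illustrative priority values, of
the distinction the new helpers encode: prio < 0 means DL, 0 <= prio < 100
means RT, anything else is a fair/normal task:)

	#include <stdio.h>

	#define MAX_DL_PRIO 0
	#define MAX_RT_PRIO 100

	static int rt_prio(int prio)       { return prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO; }
	static int realtime_prio(int prio) { return prio < MAX_RT_PRIO; }

	int main(void)
	{
		/* DL task: realtime_prio() is true, rt_prio() is not */
		printf("DL  (prio -1):  rt_prio=%d realtime_prio=%d\n", rt_prio(-1), realtime_prio(-1));
		printf("RT  (prio 10):  rt_prio=%d realtime_prio=%d\n", rt_prio(10), realtime_prio(10));
		printf("CFS (prio 120): rt_prio=%d realtime_prio=%d\n", rt_prio(120), realtime_prio(120));
		return 0;
	}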

Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.

Document the functions to make it more obvious what is the difference
between them. PI-boosted tasks is a factor that must be taken into
account when choosing which function to use.

Rename task_is_realtime() to task_has_realtime_policy() as the old name
is confusing against the new realtime_task().

No functional changes were intended.

[1] 
https://lore.kernel.org/lkml/20240506100509.gl40...@noisy.programming.kicks-ass.net/

Signed-off-by: Qais Yousef 
---
 fs/select.c   |  2 +-
 include/linux/ioprio.h|  2 +-
 include/linux/sched/deadline.h|  6 --
 include/linux/sched/prio.h|  1 +
 include/linux/sched/rt.h  | 27 ++-
 kernel/locking/rtmutex.c  |  4 ++--
 kernel/locking/rwsem.c|  4 ++--
 kernel/locking/ww_mutex.h |  2 +-
 kernel/sched/core.c   |  6 +++---
 kernel/time/hrtimer.c |  6 +++---
 kernel/trace/trace_sched_wakeup.c |  2 +-
 mm/page-writeback.c   |  4 ++--
 mm/page_alloc.c   |  2 +-
 13 files changed, 48 insertions(+), 20 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 9515c3fa1a03..8d5c1419416c 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -82,7 +82,7 @@ u64 select_estimate_accuracy(struct timespec64 *tv)
 * Realtime tasks get a slack of 0 for obvious reasons.
 */
 
-   if (rt_task(current))
+   if (realtime_task(current))
return 0;
 
ktime_get_ts64();
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index db1249cd9692..6c00342b6166 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -40,7 +40,7 @@ static inline int task_nice_ioclass(struct task_struct *task)
 {
if (task->policy == SCHED_IDLE)
return IOPRIO_CLASS_IDLE;
-   else if (task_is_realtime(task))
+   else if (task_has_realtime_policy(task))
return IOPRIO_CLASS_RT;
else
return IOPRIO_CLASS_BE;
diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
index df3aca89d4f5..5cb88b748ad6 100644
--- a/include/linux/sched/deadline.h
+++ b/include/linux/sched/deadline.h
@@ -10,8 +10,6 @@
 
 #include 
 
-#define MAX_DL_PRIO0
-
 static inline int dl_prio(int prio)
 {
if (unlikely(prio < MAX_DL_PRIO))
@@ -19,6 +17,10 @@ static inline int dl_prio(int prio)
return 0;
 }
 
+/*
+ * Returns true if a task has a priority that belongs to DL class. PI-boosted
+ * tasks will return true. Use dl_policy() to ignore PI-boosted tasks.
+ */
 static inline int dl_task(struct task_struct *p)
 {
return dl_prio(p->prio);
diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
index ab83d85e1183..6ab43b4f72f9 100644
--- a/include/linux/sched/prio.h
+++ b/include/linux/sched/prio.h
@@ -14,6 +14,7 @@
  */
 
 #define MAX_RT_PRIO100
+#define MAX_DL_PRIO0
 
 #define MAX_PRIO   (MAX_RT_PRIO + NICE_WIDTH)
 #define DEFAULT_PRIO   (MAX_RT_PRIO + NICE_WIDTH / 2)
diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
index b2b9e6eb9683..b31be3c50152 100644
--- a/include/linux/sched/rt.h
+++ b/include/linux/sched/rt.h
@@ -7,18 +7,43 @@
 struct task_struct;
 
 static inline int rt_prio(int prio)
+{
+   if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO))
+   return 1;
+   return 0;
+}
+
+static inline int realtime_prio(int prio)
 {
if (unlikely(prio < MAX_RT_PRIO))
return 1;
return 0;
 }
 
+/*
+ * Returns true if a task has a priority that belongs to RT class. PI-boosted
+ * tasks will return true. Use rt_policy() to ignore PI-boosted tasks.
+ */
 static inline int rt_task(struct task_struct *p)
 {
return rt_prio(p->prio);
 }
 
-static inline bool task_is_realtime(struct task_struct *tsk)
+/*
+ * Returns true if a task has a priority that belongs to RT or DL classes.
+ * PI-boosted tasks will return true. Use task_has_realtime_policy() to ignore
+ * PI-boosted tasks.
+ */
+static inline int realtime_task(struct task_struct *p)
+{
+   return realtime_prio(p->prio);
+}
+
+/*
+ * Returns true if a task has a policy that belongs to RT or DL classes.
+ * PI-boosted tasks will return false.
+ */
+static inline bool 

Re: [PATCH v2 2/6] trace: add CONFIG_BUILTIN_MODULE_RANGES option

2024-05-14 Thread kernel test robot
Hi Kris,

kernel test robot noticed the following build warnings:

[auto build test WARNING on dd5a440a31fae6e459c0d627162825505361]

url:
https://github.com/intel-lab-lkp/linux/commits/Kris-Van-Hees/kbuild-add-modules-builtin-objs/20240512-065954
base:   dd5a440a31fae6e459c0d627162825505361
patch link:
https://lore.kernel.org/r/20240511224035.27775-3-kris.van.hees%40oracle.com
patch subject: [PATCH v2 2/6] trace: add CONFIG_BUILTIN_MODULE_RANGES option
config: arc-kismet-CONFIG_VMLINUX_MAP-CONFIG_BUILTIN_MODULE_RANGES-0-0 
(https://download.01.org/0day-ci/archive/20240515/202405150623.lms5svhm-...@intel.com/config)
reproduce: 
(https://download.01.org/0day-ci/archive/20240515/202405150623.lms5svhm-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202405150623.lms5svhm-...@intel.com/

kismet warnings: (new ones prefixed by >>)
>> kismet: WARNING: unmet direct dependencies detected for VMLINUX_MAP when 
>> selected by BUILTIN_MODULE_RANGES
   WARNING: unmet direct dependencies detected for VMLINUX_MAP
 Depends on [n]: EXPERT [=n]
 Selected by [y]:
 - BUILTIN_MODULE_RANGES [=y] && FTRACE [=y]
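
(Two hypothetical ways to resolve the unmet dependency, neither necessarily
the actual follow-up patch; symbol names are from the warning, the prompt
text is illustrative:)

	config BUILTIN_MODULE_RANGES
		bool "Generate address range information for builtin modules"
		depends on FTRACE && EXPERT	# option 1: carry VMLINUX_MAP's dependency
		select VMLINUX_MAP
	#	depends on FTRACE && VMLINUX_MAP	# option 2: depend instead of select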

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:44PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> RISC-V.
> 
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 6bec1bce6586..b9398b64bb69 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -16,6 +16,8 @@ config RISCV
>   select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>   select ARCH_DMA_DEFAULT_COHERENT
>   select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> + select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU

Hopefully this should be SPARSEMEM_VMEMMAP.
We are trying to deprecate memory-hotplug on !SPARSEMEM_VMEMMAP.

And it is always easier to do it now that when the code goes already in,
so please consider if you really need SPARSEMEM and why (I do not think
you do).

 

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:43PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> During memory hot remove, the ptdump functionality can end up touching
> stale data. Avoid any potential crashes (or worse), by holding the
> memory hotplug read-lock while traversing the page table.
> 
> This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
> Hold memory hotplug lock while walking for kernel page table dump").
> 
> Signed-off-by: Björn Töpel 

Reviewed-by: Oscar Salvador 

funny enough, it seems arm64 and riscv are the only ones holding the
hotplug lock here.
I think we have the same problem on the other arches as well (at least
on x86_64 that I can see).

If we happen to finally need the lock in those, I would rather have a
central function in the generic mm code that takes the lock and then
calls an arch-specific ptdump_show function, so the locking is not
scattered. But that is another story.
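
(A rough sketch of that centralized variant; get_online_mems()/put_online_mems()
are the existing hotplug read-lock helpers, while arch_ptdump_walk() is a
hypothetical hook that does not exist today:)

	static int ptdump_show(struct seq_file *m, void *v)
	{
		get_online_mems();              /* block concurrent memory hot-remove */
		arch_ptdump_walk(m, m->private); /* per-arch page table walker */
		put_online_mems();

		return 0;
	}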

 

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote:
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +struct vmem_altmap *altmap)
> +{
> + if (altmap)
> + vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> + else
> + free_pages((unsigned long)page_address(page), get_order(size));

David already pointed this out, but can check
arch/x86/mm/init_64.c:free_pagetable().

You will see that we have to do some magic for bootmem memory (DIMMs
which were not hotplugged but already present)

> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
> +void __ref vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap)
> +{
> + remove_pgd_mapping(start, end, true, altmap);
> +}
> +#endif /* CONFIG_SPARSEMEM_VMEMMAP */
> +#endif /* CONFIG_MEMORY_HOTPLUG */

I will comment on the patch where you add support for hotplug and the
dependency, but on a track in LSFMM today, we decided that most likely
we will drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP
environments.
So, since you are adding this fresh, please consider tying the
hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP.
As a bonus, you will only have to maintain one flavour of functions.


-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread David Hildenbrand

On 14.05.24 20:17, Björn Töpel wrote:

Alexandre Ghiti  writes:

> On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>>
>> From: Björn Töpel 
>>
>> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
>> RISC-V.
>>
>> Signed-off-by: Björn Töpel 
>> ---
>>  arch/riscv/Kconfig | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 6bec1bce6586..b9398b64bb69 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -16,6 +16,8 @@ config RISCV
>>  	select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>>  	select ARCH_DMA_DEFAULT_COHERENT
>>  	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
>> +	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
>
> I think this should be SPARSEMEM_VMEMMAP here.

Hmm, care to elaborate? I thought that was optional.


There was a discussion at LSF/MM today to maybe require 
SPARSEMEM_VMEMMAP for hotplug. Would that work here as well?


--
Cheers,

David / dhildenb




Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel 
> 
> During memory hot remove, the ptdump functionality can end up touching
> stale data. Avoid any potential crashes (or worse), by holding the
> memory hotplug read-lock while traversing the page table.
> 
> This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
> Hold memory hotplug lock while walking for kernel page table dump").
> 
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/mm/ptdump.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
> index 1289cc6d3700..9d5f657a251b 100644
> --- a/arch/riscv/mm/ptdump.c
> +++ b/arch/riscv/mm/ptdump.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -370,7 +371,9 @@ bool ptdump_check_wx(void)
>  
>  static int ptdump_show(struct seq_file *m, void *v)
>  {
> +	get_online_mems();
>  	ptdump_walk(m, m->private);
> +	put_online_mems();
>  
>  	return 0;
>  }


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:41PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> Add a parameter to the direct map setup function, so it can be used in
> arch_add_memory() later.
> 
> Signed-off-by: Björn Töpel 

Reviewed-by: Oscar Salvador 

> ---
>  arch/riscv/mm/init.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index c969427eab88..6f72b0b2b854 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  }
>  
>  static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
> -   uintptr_t fixed_map_size)
> +   uintptr_t fixed_map_size, const pgprot_t *pgprot)
>  {
>   phys_addr_t pa;
>   uintptr_t va, map_size;
> @@ -1238,7 +1238,7 @@ static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t
>   best_map_size(pa, va, end - pa);
>  
>   create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
> -pgprot_from_va(va));
> +pgprot ? *pgprot : pgprot_from_va(va));
>   }
>  }
>  
> @@ -1282,22 +1282,19 @@ static void __init create_linear_mapping_page_table(void)
>   if (end >= __pa(PAGE_OFFSET) + memory_limit)
>   end = __pa(PAGE_OFFSET) + memory_limit;
>  
> - create_linear_mapping_range(start, end, 0);
> + create_linear_mapping_range(start, end, 0, NULL);
>   }
>  
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> - create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0);
> - create_linear_mapping_range(krodata_start,
> - krodata_start + krodata_size, 0);
> + create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, NULL);
> + create_linear_mapping_range(krodata_start, krodata_start + krodata_size, 0, NULL);
>  
>   memblock_clear_nomap(ktext_start,  ktext_size);
>   memblock_clear_nomap(krodata_start, krodata_size);
>  #endif
>  
>  #ifdef CONFIG_KFENCE
> - create_linear_mapping_range(kfence_pool,
> - kfence_pool + KFENCE_POOL_SIZE,
> - PAGE_SIZE);
> + create_linear_mapping_range(kfence_pool, kfence_pool + KFENCE_POOL_SIZE, PAGE_SIZE, NULL);
>  
>   memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>  #endif
> -- 
> 2.40.1
> 

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-14 Thread Oscar Salvador
On Tue, May 14, 2024 at 04:04:40PM +0200, Björn Töpel wrote:
> From: Björn Töpel 
> 
> Prepare for memory hotplugging support by changing from __init to
> __meminit for the page table functions that are used by the upcoming
> architecture specific callbacks.
> 
> Changing the __init attribute to __meminit avoids the functions being
> removed after init. The __meminit attribute makes sure the
> functions are kept in the kernel text post init, but only if memory
> hotplugging is enabled for the build.
> 
> Also, make sure that the altmap parameter is properly passed on to
> vmemmap_populate_hugepages().
> 
> Signed-off-by: Björn Töpel 

Reviewed-by: Oscar Salvador 

> +static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
> +   uintptr_t fixed_map_size)
>  {
>   phys_addr_t pa;
>   uintptr_t va, map_size;
> @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>* memory hotplug, we are not able to update all the page tables with
>* the new PMDs.
>*/
> - return vmemmap_populate_hugepages(start, end, node, NULL);
> + return vmemmap_populate_hugepages(start, end, node, altmap);

I would have put this into a separate patch.

 

-- 
Oscar Salvador
SUSE Labs



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Björn Töpel
Alexandre Ghiti  writes:

> On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:

>> +int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
>> +{
>> +   int ret;
>> +
>> +   create_linear_mapping_range(start, start + size, 0, &params->pgprot);
>> +   flush_tlb_all();
>> +   ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params);
>> +   if (ret) {
>> +   remove_linear_mapping(start, size);
>> +   return ret;
>> +   }
>> +
>
> You need to flush the TLB here too since __add_pages() populates the
> page table with the new vmemmap mapping (only because riscv allows to
> cache invalid entries, I'll adapt this in my next version of Svvptc
> support).
>
>> +   max_pfn = PFN_UP(start + size);
>> +   max_low_pfn = max_pfn;
>> +   return 0;
>> +}
>> +
>> +void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
>> +{
>> +   __remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
>> +   remove_linear_mapping(start, size);
>
> You need to flush the TLB here too.

I'll address all of the above in the next version. Thanks for reviewing
the series!


Björn
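
(For concreteness, a hedged sketch of what the reworked hunks might look like
with Alexandre's extra TLB flushes folded in; this is not the actual v3
patch:)

	int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
	{
		int ret;

		create_linear_mapping_range(start, start + size, 0, &params->pgprot);
		flush_tlb_all();
		ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params);
		if (ret) {
			remove_linear_mapping(start, size);
			flush_tlb_all();
			return ret;
		}
		flush_tlb_all();	/* __add_pages() populated new vmemmap entries */

		max_pfn = PFN_UP(start + size);
		max_low_pfn = max_pfn;
		return 0;
	}

	void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
	{
		__remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
		remove_linear_mapping(start, size);
		flush_tlb_all();
	}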



Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 8:17 PM Björn Töpel  wrote:
>
> Alexandre Ghiti  writes:
>
> > On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
> >>
> >> From: Björn Töpel 
> >>
> >> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> >> RISC-V.
> >>
> >> Signed-off-by: Björn Töpel 
> >> ---
> >>  arch/riscv/Kconfig | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> >> index 6bec1bce6586..b9398b64bb69 100644
> >> --- a/arch/riscv/Kconfig
> >> +++ b/arch/riscv/Kconfig
> >> @@ -16,6 +16,8 @@ config RISCV
> >> select ACPI_REDUCED_HARDWARE_ONLY if ACPI
> >> select ARCH_DMA_DEFAULT_COHERENT
> >> select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> >> +   select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
> >
> > I think this should be SPARSEMEM_VMEMMAP here.
>
> Hmm, care to elaborate? I thought that was optional.

My bad, I thought VMEMMAP was required in your patchset. Sorry for the noise!



Re: [GIT PULL] OpenRISC updates for 6.10

2024-05-14 Thread pr-tracker-bot
The pull request you sent on Tue, 14 May 2024 16:34:42 +0100:

> https://github.com/openrisc/linux.git tags/for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/590103732442b4bb83886f03f2ddd39d129c3289

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html



Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread Björn Töpel
Alexandre Ghiti  writes:

> On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>>
>> From: Björn Töpel 
>>
>> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
>> RISC-V.
>>
>> Signed-off-by: Björn Töpel 
>> ---
>>  arch/riscv/Kconfig | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>> index 6bec1bce6586..b9398b64bb69 100644
>> --- a/arch/riscv/Kconfig
>> +++ b/arch/riscv/Kconfig
>> @@ -16,6 +16,8 @@ config RISCV
>> select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>> select ARCH_DMA_DEFAULT_COHERENT
>> select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
>> +   select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
>
> I think this should be SPARSEMEM_VMEMMAP here.

Hmm, care to elaborate? I thought that was optional.



Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> RISC-V.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 6bec1bce6586..b9398b64bb69 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -16,6 +16,8 @@ config RISCV
> select ACPI_REDUCED_HARDWARE_ONLY if ACPI
> select ARCH_DMA_DEFAULT_COHERENT
> select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> +   select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU

I think this should be SPARSEMEM_VMEMMAP here.

> +   select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
> select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
> select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
> select ARCH_HAS_BINFMT_FLAT
> --
> 2.40.1
>



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> For an architecture to support memory hotplugging, a couple of
> callbacks need to be implemented:
>
>  arch_add_memory()
>   This callback is responsible for adding the physical memory into the
>   direct map, and call into the memory hotplugging generic code via
>   __add_pages() that adds the corresponding struct page entries, and
>   updates the vmemmap mapping.
>
>  arch_remove_memory()
>   This is the inverse of the callback above.
>
>  vmemmap_free()
>   This function tears down the vmemmap mappings (if
>   CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
>   backing vmemmap pages. Note that for persistent memory, an
>   alternative allocator for the backing pages can be used; The
>   vmem_altmap. This means that when the backing pages are cleared,
>   extra care is needed so that the correct deallocation method is
>   used.
>
>  arch_get_mappable_range()
>   This function returns the PA range that the direct map can map.
>   Used by the MHP internals for sanity checks.
>
> The page table unmap/teardown functions are heavily based on code from
> the x86 tree. The same remove_pgd_mapping() function is used in both
> vmemmap_free() and arch_remove_memory(), but in the latter function
> the backing pages are not removed.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/mm/init.c | 242 +++
>  1 file changed, 242 insertions(+)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 6f72b0b2b854..7f0b921a3d3a 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
> }
>  }
>  #endif
> +
> +#ifdef CONFIG_MEMORY_HOTPLUG
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +   pte_t *pte;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PTE; i++) {
> +   pte = pte_start + i;
> +   if (!pte_none(*pte))
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
> +   pmd_clear(pmd);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +   pmd_t *pmd;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PMD; i++) {
> +   pmd = pmd_start + i;
> +   if (!pmd_none(*pmd))
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(pud_page(*pud)), 0);
> +   pud_clear(pud);
> +}
> +
> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
> +{
> +   pud_t *pud;
> +   int i;
> +
> +   for (i = 0; i < PTRS_PER_PUD; i++) {
> +   pud = pud_start + i;
> +   if (!pud_none(*pud))
> +   return;
> +   }
> +
> +   free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
> +   p4d_clear(p4d);
> +}
> +
> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
> +  struct vmem_altmap *altmap)
> +{
> +   if (altmap)
> +   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
> +   else
> +   free_pages((unsigned long)page_address(page), get_order(size));
> +}
> +
> +static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long addr, unsigned long end,
> + bool is_vmemmap, struct vmem_altmap *altmap)
> +{
> +   unsigned long next;
> +   pte_t *ptep, pte;
> +
> +   for (; addr < end; addr = next) {
> +   next = (addr + PAGE_SIZE) & PAGE_MASK;
> +   if (next > end)
> +   next = end;
> +
> +   ptep = pte_base + pte_index(addr);
> +   pte = READ_ONCE(*ptep);
> +
> +   if (!pte_present(*ptep))
> +   continue;
> +
> +   pte_clear(&init_mm, addr, ptep);
> +   if (is_vmemmap)
> +   free_vmemmap_storage(pte_page(pte), PAGE_SIZE, altmap);
> +   }
> +}
> +
> +static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long addr, unsigned long end,
> + bool is_vmemmap, struct vmem_altmap *altmap)
> +{
> +   unsigned long next;
> +   pte_t *pte_base;
> +   pmd_t *pmdp, pmd;
> +
> +   for (; addr < end; addr = next) {
> +   next = pmd_addr_end(addr, end);
> +   pmdp = pmd_base + pmd_index(addr);
> +   pmd = READ_ONCE(*pmdp);
> +
> +   if (!pmd_present(pmd))
> +   continue;
> +
> +   if (pmd_leaf(pmd)) {
> +   pmd_clear(pmdp);
> +   if (is_vmemmap)
> +   free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, altmap);
> +   continue;
> +   }
> +
> +   pte_base = 

Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-14 Thread Björn Töpel
Alexandre Ghiti  writes:

> On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>>
>> From: Björn Töpel 
>>
>> Prepare for memory hotplugging support by changing from __init to
>> __meminit for the page table functions that are used by the upcoming
>> architecture specific callbacks.
>>
>> Changing the __init attribute to __meminit avoids the functions being
>> removed after init. The __meminit attribute makes sure the
>> functions are kept in the kernel text post init, but only if memory
>> hotplugging is enabled for the build.
>>
>> Also, make sure that the altmap parameter is properly passed on to
>> vmemmap_populate_hugepages().
>>
>> Signed-off-by: Björn Töpel 
>> ---
>>  arch/riscv/include/asm/mmu.h |  4 +--
>>  arch/riscv/include/asm/pgtable.h |  2 +-
>>  arch/riscv/mm/init.c | 58 ++--
>>  3 files changed, 29 insertions(+), 35 deletions(-)
>>
>> diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
>> index 60be458e94da..c09c3c79f496 100644
>> --- a/arch/riscv/include/asm/mmu.h
>> +++ b/arch/riscv/include/asm/mmu.h
>> @@ -28,8 +28,8 @@ typedef struct {
>>  #endif
>>  } mm_context_t;
>>
>> -void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
>> -  phys_addr_t sz, pgprot_t prot);
>> +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
>> + pgprot_t prot);
>>  #endif /* __ASSEMBLY__ */
>>
>>  #endif /* _ASM_RISCV_MMU_H */
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 58fd7b70b903..7933f493db71 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -162,7 +162,7 @@ struct pt_alloc_ops {
>>  #endif
>>  };
>>
>> -extern struct pt_alloc_ops pt_ops __initdata;
>> +extern struct pt_alloc_ops pt_ops __meminitdata;
>>
>>  #ifdef CONFIG_MMU
>>  /* Number of PGD entries that a user-mode program can use */
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 5b8cdfafb52a..c969427eab88 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -295,7 +295,7 @@ static void __init setup_bootmem(void)
>>  }
>>
>>  #ifdef CONFIG_MMU
>> -struct pt_alloc_ops pt_ops __initdata;
>> +struct pt_alloc_ops pt_ops __meminitdata;
>>
>>  pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
>>  pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
>> @@ -357,7 +357,7 @@ static inline pte_t *__init 
>> get_pte_virt_fixmap(phys_addr_t pa)
>> return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
>>  }
>>
>> -static inline pte_t *__init get_pte_virt_late(phys_addr_t pa)
>> +static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa)
>>  {
>> return (pte_t *) __va(pa);
>>  }
>> @@ -376,7 +376,7 @@ static inline phys_addr_t __init 
>> alloc_pte_fixmap(uintptr_t va)
>> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>  }
>>
>> -static phys_addr_t __init alloc_pte_late(uintptr_t va)
>> +static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
>>  {
>> struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
>>
>> @@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va)
>> return __pa((pte_t *)ptdesc_address(ptdesc));
>>  }
>>
>> -static void __init create_pte_mapping(pte_t *ptep,
>> - uintptr_t va, phys_addr_t pa,
>> - phys_addr_t sz, pgprot_t prot)
>> +static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
>> + pgprot_t prot)
>>  {
>> uintptr_t pte_idx = pte_index(va);
>>
>> @@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa)
>> return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
>>  }
>>
>> -static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
>> +static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa)
>>  {
>> return (pmd_t *) __va(pa);
>>  }
>> @@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
>> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>>  }
>>
>> -static phys_addr_t __init alloc_pmd_late(uintptr_t va)
>> +static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
>>  {
>> struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
>>
>> @@ -465,9 +464,9 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va)
>> return __pa((pmd_t *)ptdesc_address(ptdesc));
>>  }
>>
>> -static void __init create_pmd_mapping(pmd_t *pmdp,
>> - uintptr_t va, phys_addr_t pa,
>> - phys_addr_t sz, pgprot_t prot)
>> +static void __meminit create_pmd_mapping(pmd_t *pmdp,
>> +uintptr_t va, phys_addr_t pa,
>> +phys_addr_t sz, 

Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Add a parameter to the direct map setup function, so it can be used in
> arch_add_memory() later.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/mm/init.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index c969427eab88..6f72b0b2b854 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  }
>
>  static void __meminit create_linear_mapping_range(phys_addr_t start, 
> phys_addr_t end,
> - uintptr_t fixed_map_size)
> + uintptr_t fixed_map_size, 
> const pgprot_t *pgprot)
>  {
> phys_addr_t pa;
> uintptr_t va, map_size;
> @@ -1238,7 +1238,7 @@ static void __meminit 
> create_linear_mapping_range(phys_addr_t start, phys_addr_t
> best_map_size(pa, va, end - pa);
>
> create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
> -  pgprot_from_va(va));
> +  pgprot ? *pgprot : pgprot_from_va(va));
> }
>  }
>
> @@ -1282,22 +1282,19 @@ static void __init 
> create_linear_mapping_page_table(void)
> if (end >= __pa(PAGE_OFFSET) + memory_limit)
> end = __pa(PAGE_OFFSET) + memory_limit;
>
> -   create_linear_mapping_range(start, end, 0);
> +   create_linear_mapping_range(start, end, 0, NULL);
> }
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> -   create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0);
> -   create_linear_mapping_range(krodata_start,
> -   krodata_start + krodata_size, 0);
> +   create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, 
> NULL);
> +   create_linear_mapping_range(krodata_start, krodata_start + 
> krodata_size, 0, NULL);
>
> memblock_clear_nomap(ktext_start,  ktext_size);
> memblock_clear_nomap(krodata_start, krodata_size);
>  #endif
>
>  #ifdef CONFIG_KFENCE
> -   create_linear_mapping_range(kfence_pool,
> -   kfence_pool + KFENCE_POOL_SIZE,
> -   PAGE_SIZE);
> +   create_linear_mapping_range(kfence_pool, kfence_pool + 
> KFENCE_POOL_SIZE, PAGE_SIZE, NULL);
>
> memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>  #endif
> --
> 2.40.1
>

You can add:

Reviewed-by: Alexandre Ghiti 

Thanks,

Alex



Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-14 Thread Alexandre Ghiti
On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> Prepare for memory hotplugging support by changing from __init to
> __meminit for the page table functions that are used by the upcoming
> architecture specific callbacks.
>
> Changing the __init attribute to __meminit avoids the functions being
> removed after init. The __meminit attribute makes sure the
> functions are kept in the kernel text post init, but only if memory
> hotplugging is enabled for the build.
>
> Also, make sure that the altmap parameter is properly passed on to
> vmemmap_populate_hugepages().
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/include/asm/mmu.h |  4 +--
>  arch/riscv/include/asm/pgtable.h |  2 +-
>  arch/riscv/mm/init.c | 58 ++--
>  3 files changed, 29 insertions(+), 35 deletions(-)
>
> diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
> index 60be458e94da..c09c3c79f496 100644
> --- a/arch/riscv/include/asm/mmu.h
> +++ b/arch/riscv/include/asm/mmu.h
> @@ -28,8 +28,8 @@ typedef struct {
>  #endif
>  } mm_context_t;
>
> -void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
> -  phys_addr_t sz, pgprot_t prot);
> +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
> + pgprot_t prot);
>  #endif /* __ASSEMBLY__ */
>
>  #endif /* _ASM_RISCV_MMU_H */
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 58fd7b70b903..7933f493db71 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -162,7 +162,7 @@ struct pt_alloc_ops {
>  #endif
>  };
>
> -extern struct pt_alloc_ops pt_ops __initdata;
> +extern struct pt_alloc_ops pt_ops __meminitdata;
>
>  #ifdef CONFIG_MMU
>  /* Number of PGD entries that a user-mode program can use */
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 5b8cdfafb52a..c969427eab88 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -295,7 +295,7 @@ static void __init setup_bootmem(void)
>  }
>
>  #ifdef CONFIG_MMU
> -struct pt_alloc_ops pt_ops __initdata;
> +struct pt_alloc_ops pt_ops __meminitdata;
>
>  pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
>  pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
> @@ -357,7 +357,7 @@ static inline pte_t *__init 
> get_pte_virt_fixmap(phys_addr_t pa)
> return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
>  }
>
> -static inline pte_t *__init get_pte_virt_late(phys_addr_t pa)
> +static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa)
>  {
> return (pte_t *) __va(pa);
>  }
> @@ -376,7 +376,7 @@ static inline phys_addr_t __init 
> alloc_pte_fixmap(uintptr_t va)
> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t __init alloc_pte_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
>  {
> struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
>
> @@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va)
> return __pa((pte_t *)ptdesc_address(ptdesc));
>  }
>
> -static void __init create_pte_mapping(pte_t *ptep,
> - uintptr_t va, phys_addr_t pa,
> - phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
> + pgprot_t prot)
>  {
> uintptr_t pte_idx = pte_index(va);
>
> @@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa)
> return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
>  }
>
> -static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
> +static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa)
>  {
> return (pmd_t *) __va(pa);
>  }
> @@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
>  }
>
> -static phys_addr_t __init alloc_pmd_late(uintptr_t va)
> +static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
>  {
> struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
>
> @@ -465,9 +464,9 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va)
> return __pa((pmd_t *)ptdesc_address(ptdesc));
>  }
>
> -static void __init create_pmd_mapping(pmd_t *pmdp,
> - uintptr_t va, phys_addr_t pa,
> - phys_addr_t sz, pgprot_t prot)
> +static void __meminit create_pmd_mapping(pmd_t *pmdp,
> +uintptr_t va, phys_addr_t pa,
> +phys_addr_t sz, pgprot_t prot)
>  {
> pte_t *ptep;
> phys_addr_t pte_phys;
> @@ -503,7 +502,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
>   

Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Björn Töpel
David Hildenbrand  writes:

> On 14.05.24 16:04, Björn Töpel wrote:
>> From: Björn Töpel 
>> 
>> For an architecture to support memory hotplugging, a couple of
>> callbacks need to be implemented:
>> 
>>   arch_add_memory()
>>This callback is responsible for adding the physical memory into the
>>direct map, and call into the memory hotplugging generic code via
>>__add_pages() that adds the corresponding struct page entries, and
>>updates the vmemmap mapping.
>> 
>>   arch_remove_memory()
>>This is the inverse of the callback above.
>> 
>>   vmemmap_free()
>>This function tears down the vmemmap mappings (if
>>CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
>>backing vmemmap pages. Note that for persistent memory, an
>>alternative allocator for the backing pages can be used; The
>>vmem_altmap. This means that when the backing pages are cleared,
>>extra care is needed so that the correct deallocation method is
>>used.
>> 
>>   arch_get_mappable_range()
>>This function returns the PA range that the direct map can map.
>>Used by the MHP internals for sanity checks.
>> 
>> The page table unmap/teardown functions are heavily based on code from
>> the x86 tree. The same remove_pgd_mapping() function is used in both
>> vmemmap_free() and arch_remove_memory(), but in the latter function
>> the backing pages are not removed.
>> 
>> Signed-off-by: Björn Töpel 
>> ---
>>   arch/riscv/mm/init.c | 242 +++
>>   1 file changed, 242 insertions(+)
>> 
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 6f72b0b2b854..7f0b921a3d3a 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
>>  }
>>   }
>>   #endif
>> +
>> +#ifdef CONFIG_MEMORY_HOTPLUG
>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>> +{
>> +pte_t *pte;
>> +int i;
>> +
>> +for (i = 0; i < PTRS_PER_PTE; i++) {
>> +pte = pte_start + i;
>> +if (!pte_none(*pte))
>> +return;
>> +}
>> +
>> +free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
>> +pmd_clear(pmd);
>> +}
>> +
>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>> +{
>> +pmd_t *pmd;
>> +int i;
>> +
>> +for (i = 0; i < PTRS_PER_PMD; i++) {
>> +pmd = pmd_start + i;
>> +if (!pmd_none(*pmd))
>> +return;
>> +}
>> +
>> +free_pages((unsigned long)page_address(pud_page(*pud)), 0);
>> +pud_clear(pud);
>> +}
>> +
>> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
>> +{
>> +pud_t *pud;
>> +int i;
>> +
>> +for (i = 0; i < PTRS_PER_PUD; i++) {
>> +pud = pud_start + i;
>> +if (!pud_none(*pud))
>> +return;
>> +}
>> +
>> +free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
>> +p4d_clear(p4d);
>> +}
>> +
>> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
>> +   struct vmem_altmap *altmap)
>> +{
>> +if (altmap)
>> +vmem_altmap_free(altmap, size >> PAGE_SHIFT);
>> +else
>> +free_pages((unsigned long)page_address(page), get_order(size));
>
> If you unplug a DIMM that was added during boot (can happen on x86-64, 
> can it happen on riscv?), free_pages() would not be sufficient. You'd be 
> freeing a PG_reserved page that has to be freed differently.

I'd say if it can happen on x86-64, it probably can on RISC-V. I'll look
into this for the next spin!

Thanks for spending time on the series!


Cheers,
Björn



Re: [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries

2024-05-14 Thread Björn Töpel
Alexandre Ghiti  writes:

> Hi Björn,
>
> On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>>
>> From: Björn Töpel 
>>
>> The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to
>> all userland page tables, which means that if the PGD level table is
>> changed, other page tables have to be updated as well.
>>
>> Instead of having the PGD changes ripple out to all tables, the
>> synchronization can be avoided by pre-allocating the PGD entries/pages
>> at boot, removing the need for synchronization altogether.
>>
>> This is currently done for the bpf/modules, and vmalloc PGD regions.
>> Extend this scheme for the PGD regions touched by memory hotplugging.
>>
>> Prepare the RISC-V port for memory hotplug by pre-allocating
>> vmemmap/direct map entries at the PGD level. This will waste roughly
>> 128 4K pages when memory hotplugging is enabled in the
>> kernel configuration.
>>
>> Signed-off-by: Björn Töpel 
>> ---
>>  arch/riscv/include/asm/kasan.h | 4 ++--
>>  arch/riscv/mm/init.c   | 7 +++
>>  2 files changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
>> index 0b85e363e778..e6a0071bdb56 100644
>> --- a/arch/riscv/include/asm/kasan.h
>> +++ b/arch/riscv/include/asm/kasan.h
>> @@ -6,8 +6,6 @@
>>
>>  #ifndef __ASSEMBLY__
>>
>> -#ifdef CONFIG_KASAN
>> -
>>  /*
>>   * The following comment was copied from arm64:
>>   * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
>> @@ -34,6 +32,8 @@
>>   */
>>  #define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & 
>> PGDIR_MASK)
>>  #define KASAN_SHADOW_END   MODULES_LOWEST_VADDR
>> +
>> +#ifdef CONFIG_KASAN
>>  #define KASAN_SHADOW_OFFSET_AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>>
>>  void kasan_init(void);
>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>> index 2574f6a3b0e7..5b8cdfafb52a 100644
>> --- a/arch/riscv/mm/init.c
>> +++ b/arch/riscv/mm/init.c
>> @@ -27,6 +27,7 @@
>>
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -1488,10 +1489,16 @@ static void __init 
>> preallocate_pgd_pages_range(unsigned long start, unsigned lon
>> panic("Failed to pre-allocate %s pages for %s area\n", lvl, area);
>>  }
>>
>> +#define PAGE_END KASAN_SHADOW_START
>> +
>>  void __init pgtable_cache_init(void)
>>  {
>> preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc");
>> if (IS_ENABLED(CONFIG_MODULES))
>> preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, 
>> "bpf/modules");
>> +   if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {
>> +   preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, 
>> "vmemmap");
>> +   preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct 
>> map");
>> +   }
>>  }
>>  #endif
>> --
>> 2.40.1
>>
>
> As you asked, with
> https://lore.kernel.org/linux-riscv/20240514133614.87813-1-alexgh...@rivosinc.com/T/#u,
> you will be able to remove the usage of KASAN_SHADOW_START.

Very nice -- consistency! I'll need to respin, so I'll clean this up for
the next version.

> But anyhow, you can add:
>
> Reviewed-by: Alexandre Ghiti 


Thank you!
Björn



Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-14 Thread Greg Kroah-Hartman
On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote:
> +/*
> + * Used for internal kernel memctl calls, i.e. to better support kernel 
> stacks,
> + * or to efficiently zero hugetlb pages.
> + */
> +long memctl_vmm_call(__u64 func_code, __u64 addr, __u64 length, __u64 arg,
> +  struct memctl_buf *buf)
> +{
> + buf->call.func_code = func_code;
> + buf->call.addr = addr;
> + buf->call.length = length;
> + buf->call.arg = arg;
> +
> + return __memctl_vmm_call(buf);
> +}
> +EXPORT_SYMBOL(memctl_vmm_call);

You export something that is never actually called, which implies that
this is not tested at all (i.e. it is dead code.)  Please remove.

Also, why not EXPORT_SYMBOL_GPL()?   (I have to ask, sorry.)

thanks,

greg k-h



Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties

2024-05-14 Thread Greg Kroah-Hartman
On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote:
> Memctl provides a way for the guest to control its physical memory
> properties, and enables optimizations and security features. For
> example, the guest can provide information to the host where parts of a
> hugepage may be unbacked, or sensitive data may not be swapped out, etc.
> 
> Memctl allows a guest to manipulate its gPTE entries in the SLAT, and
> also some other properties of the host memory that backs the guest's
> memory map.
> This is achieved by using the KVM_CAP_SYNC_MMU capability. When this
> capability is available, the changes in the backing of the memory region
> on the host are automatically reflected into the guest. For example, an
> mmap() or madvise() that affects the region will be made visible
> immediately.
> 
> There are two components of the implementation: the guest Linux driver
> and Virtual Machine Monitor (VMM) device. A guest-allocated shared
> buffer is negotiated per-cpu through a few PCI MMIO registers, the VMM
> device assigns a unique command for each per-cpu buffer. The guest
> writes its memctl request in the per-cpu buffer, then writes the
> corresponding command into the command register, calling into the VMM
> device to perform the memctl request.
> 
> The synchronous per-cpu shared buffer approach avoids the kick and busy
> waiting that the guest would have to do with virtio virtqueue transport.
> 
> We provide both kernel and userspace APIs
> Kernel API
> long memctl_vmm_call(__u64 func_code, __u64 addr, __u64 length, __u64 arg,
>struct memctl_buf *buf);
> 
> Kernel drivers can take advantage of the memctl calls to provide
> paravirtualization of kernel stacks or page zeroing.
> 
> User API
> From the userland, the memctl guest driver is controlled via ioctl(2)
> call. It requires CAP_SYS_ADMIN.
> 
> ioctl(fd, MEMCTL_IOCTL, union memctl_vmm *memctl_vmm);
> 
> Guest userland applications can tag VMAs and guest hugepages, or advise
> the host on how to handle sensitive guest pages.
> 
> Supported function codes and their use cases:
> MEMCTL_FREE/REMOVE/DONTNEED/PAGEOUT. For the guest, one can reduce the
> struct page and page table lookup overhead by using hugepages backed by
> smaller pages on the host. These memctl commands can allow for partial
> freeing of private guest hugepages to save memory. They also allow
> kernel memory, such as kernel stacks and task_structs to be
> paravirtualized.
> 
> MEMCTL_UNMERGEABLE is useful for security, when the VM does not want to
> share its backing pages.
> The same goes for MADV_DONTDUMP, so sensitive pages are not included in
> a dump.
> MLOCK/UNLOCK can advise the host that sensitive information should not
> be swapped out on the host.
> 
> MEMCTL_MPROTECT_NONE/R/W/RW. For guest stacks backed by hugepages, stack
> guard pages can be handled in the host and memory can be saved in the
> hugepage.
> 
> MEMCTL_SET_VMA_ANON_NAME is useful for observability and debugging how
> guest memory is being mapped on the host.
> 
> Sample program making use of MEMCTL_SET_VMA_ANON_NAME and
> MEMCTL_DONTNEED:
> https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/main
> https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/dontneed
> 
> The VMM implementation is being proposed for Cloud Hypervisor:
> https://github.com/Dummyc0m/cloud-hypervisor/
> 
> Cloud Hypervisor issue:
> https://github.com/cloud-hypervisor/cloud-hypervisor/issues/6318
> 
> Signed-off-by: Yuanchu Xie 
> ---
>  .../userspace-api/ioctl/ioctl-number.rst  |   2 +
>  drivers/virt/Kconfig  |   2 +
>  drivers/virt/Makefile |   1 +
>  drivers/virt/memctl/Kconfig   |  10 +
>  drivers/virt/memctl/Makefile  |   2 +
>  drivers/virt/memctl/memctl.c  | 425 ++
>  include/linux/memctl.h|  27 ++
>  include/uapi/linux/memctl.h   |  81 

You are mixing your PCI driver in with the memctl core code, is that
intentional?  Will there never be another PCI device for this type of
interface other than this one PCI device?

And if so, why export anything, why isn't this all in one body of code?

>  8 files changed, 550 insertions(+)
>  create mode 100644 drivers/virt/memctl/Kconfig
>  create mode 100644 drivers/virt/memctl/Makefile
>  create mode 100644 drivers/virt/memctl/memctl.c
>  create mode 100644 include/linux/memctl.h
>  create mode 100644 include/uapi/linux/memctl.h
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 457e16f06e04..789d1251c0be 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -368,6 +368,8 @@ Code  Seq#    Include File                        Comments
>  0xCD  01 linux/reiserfs_fs.h
>  0xCE  01-02  uapi/linux/cxl_mem.h

Re: [PATCH v1 1/1] Input: gpio-keys - expose wakeup keys in sysfs

2024-05-14 Thread Guido Günther
Hi,
On Mon, May 13, 2024 at 03:13:53PM -0700, Dmitry Torokhov wrote:
> Hi Guido,
> 
> On Thu, May 09, 2024 at 02:00:28PM +0200, Guido Günther wrote:
> > This helps user space to figure out which keys should be used to unidle a
> > device. E.g on phones the volume rocker should usually not unblank the
> > screen.
> 
> How exactly this is supposed to be used? We have "disabled" keys and
> switches attribute because this function can be controlled at runtime
> from userspace while wakeup control is a static device setting.

Current Linux userspace usually unblanks/unidles a device on every
keypress. That is often not the expected result on phones, where typically
only the power button and e.g. some home buttons should do this.

These keys usually match the keys that are used as wakeup sources to
bring a device out of suspend. So if we export the wakeup keys to
userspace we can pick some sensible defaults (overridable via hwdb¹).

> Kernel also does not really know if the screen should be unblanked or
> not, if a button or switch is configured for wake up the kernel will go
> through wakeup process all the same and then userspace can decide if it
> should stay woken up or not.

Yes, we merely want that as a hint to figure out sensible defaults in
userspace (which might be a subset of the wakeup keys).

Cheers,
 -- Guido

¹) See 
https://gitlab.gnome.org/World/Phosh/gmobile/-/blob/main/data/61-gmobile-wakeup.hwdb?ref_type=heads#L57-L59

> 
> Thanks.
> 
> -- 
> Dmitry
> 



Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

For an architecture to support memory hotplugging, a couple of
callbacks need to be implemented:

  arch_add_memory()
   This callback is responsible for adding the physical memory into the
   direct map, and calls into the memory hotplugging generic code via
   __add_pages() that adds the corresponding struct page entries, and
   updates the vmemmap mapping.

  arch_remove_memory()
   This is the inverse of the callback above.

  vmemmap_free()
   This function tears down the vmemmap mappings (if
   CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
   backing vmemmap pages. Note that for persistent memory, an
   alternative allocator for the backing pages can be used: the
   vmem_altmap. This means that when the backing pages are cleared,
   extra care is needed so that the correct deallocation method is
   used.

  arch_get_mappable_range()
   This function returns the PA range that the direct map can map.
   Used by the MHP internals for sanity checks.

The page table unmap/teardown functions are heavily based on code from
the x86 tree. The same remove_pgd_mapping() function is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

Signed-off-by: Björn Töpel 
---
  arch/riscv/mm/init.c | 242 +++
  1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
}
  }
  #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+   pte_t *pte;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PTE; i++) {
+   pte = pte_start + i;
+   if (!pte_none(*pte))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+   pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+   pmd_t *pmd;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PMD; i++) {
+   pmd = pmd_start + i;
+   if (!pmd_none(*pmd))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+   pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (!pud_none(*pud))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+   p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+  struct vmem_altmap *altmap)
+{
+   if (altmap)
+   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+   else
+   free_pages((unsigned long)page_address(page), get_order(size));


If you unplug a DIMM that was added during boot (can happen on x86-64, 
can it happen on riscv?), free_pages() would not be sufficient. You'd be 
freeing a PG_reserved page that has to be freed differently.


--
Cheers,

David / dhildenb




Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

Add a parameter to the direct map setup function, so it can be used in
arch_add_memory() later.

Signed-off-by: Björn Töpel 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

Prepare for memory hotplugging support by changing from __init to
__meminit for the page table functions that are used by the upcoming
architecture specific callbacks.

Changing the __init attribute to __meminit prevents the functions from
being removed after init. The __meminit attribute makes sure the
functions are kept in the kernel text post init, but only if memory
hotplugging is enabled for the build.

Also, make sure that the altmap parameter is properly passed on to
vmemmap_populate_hugepages().

Signed-off-by: Björn Töpel 
---


Reviewed-by: David Hildenbrand 

--
Cheers,

David / dhildenb




Re: [PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V

2024-05-14 Thread David Hildenbrand

On 14.05.24 16:04, Björn Töpel wrote:

From: Björn Töpel 

Now that RISC-V has memory hotplugging support, virtio-mem can be used
on the platform.

Signed-off-by: Björn Töpel 
---
  drivers/virtio/Kconfig | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index c17193544268..4e5cebf1b82a 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -122,7 +122,7 @@ config VIRTIO_BALLOON
  
 config VIRTIO_MEM
	tristate "Virtio mem driver"
-   depends on X86_64 || ARM64
+   depends on X86_64 || ARM64 || RISCV
depends on VIRTIO
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE



Nice!

Acked-by: David Hildenbrand 
--
Cheers,

David / dhildenb




Re: [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries

2024-05-14 Thread Alexandre Ghiti
Hi Björn,

On Tue, May 14, 2024 at 4:05 PM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to
> all userland page tables, which means that if the PGD level table is
> changed, other page tables have to be updated as well.
>
> Instead of having the PGD changes ripple out to all tables, the
> synchronization can be avoided by pre-allocating the PGD entries/pages
> at boot, avoiding the synchronization all together.
>
> This is currently done for the bpf/modules, and vmalloc PGD regions.
> Extend this scheme for the PGD regions touched by memory hotplugging.
>
> Prepare the RISC-V port for memory hotplug by pre-allocating
> vmemmap/direct map entries at the PGD level. This will roughly waste
> ~128 4K pages when memory hotplugging is enabled in the
> kernel configuration.
>
> Signed-off-by: Björn Töpel 
> ---
>  arch/riscv/include/asm/kasan.h | 4 ++--
>  arch/riscv/mm/init.c   | 7 +++
>  2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
> index 0b85e363e778..e6a0071bdb56 100644
> --- a/arch/riscv/include/asm/kasan.h
> +++ b/arch/riscv/include/asm/kasan.h
> @@ -6,8 +6,6 @@
>
>  #ifndef __ASSEMBLY__
>
> -#ifdef CONFIG_KASAN
> -
>  /*
>   * The following comment was copied from arm64:
>   * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
> @@ -34,6 +32,8 @@
>   */
>  #define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & 
> PGDIR_MASK)
>  #define KASAN_SHADOW_END   MODULES_LOWEST_VADDR
> +
> +#ifdef CONFIG_KASAN
>  #define KASAN_SHADOW_OFFSET_AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
>
>  void kasan_init(void);
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 2574f6a3b0e7..5b8cdfafb52a 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -27,6 +27,7 @@
>
>  #include 
>  #include 
> +#include <asm/kasan.h>
>  #include 
>  #include 
>  #include 
> @@ -1488,10 +1489,16 @@ static void __init 
> preallocate_pgd_pages_range(unsigned long start, unsigned lon
> panic("Failed to pre-allocate %s pages for %s area\n", lvl, area);
>  }
>
> +#define PAGE_END KASAN_SHADOW_START
> +
>  void __init pgtable_cache_init(void)
>  {
> preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc");
> if (IS_ENABLED(CONFIG_MODULES))
> preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, 
> "bpf/modules");
> +   if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {
> +   preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, 
> "vmemmap");
> +   preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct 
> map");
> +   }
>  }
>  #endif
> --
> 2.40.1
>

As you asked, with
https://lore.kernel.org/linux-riscv/20240514133614.87813-1-alexgh...@rivosinc.com/T/#u,
you will be able to remove the usage of KASAN_SHADOW_START.

But anyhow, you can add:

Reviewed-by: Alexandre Ghiti 

Thanks,

Alex



[GIT PULL] OpenRISC updates for 6.10

2024-05-14 Thread Stafford Horne
Hello Linus,

Please consider for pull,

The following changes since commit 4cece764965020c22cff7665b18a012006359095:

  Linux 6.9-rc1 (2024-03-24 14:10:05 -0700)

are available in the Git repository at:

  https://github.com/openrisc/linux.git tags/for-linus

for you to fetch changes up to 4dc70e1aadfadf968676d983587c6f5d455aba85:

  openrisc: Move FPU state out of pt_regs (2024-04-15 15:20:39 +0100)


OpenRISC updates for 6.10

A few cleanups and fixups from me:

 - Add a few missing relocations to fix module loading.
 - Cleanup FPU state save and restore to be more efficient.
 - Cleanups to traps handling and logging.
 - Fix issue with poweroff being broken after recent power driver
   refactorings.


Stafford Horne (8):
  openrisc: Use do_kernel_power_off()
  openrisc: Define openrisc relocation types
  openrisc: Add support for more module relocations
  openrisc: traps: Convert printks to pr_ macros
  openrisc: traps: Remove calls to show_registers before die
  openrisc: traps: Don't send signals to kernel mode threads
  openrisc: Add FPU config
  openrisc: Move FPU state out of pt_regs

 arch/openrisc/Kconfig |   9 +++
 arch/openrisc/include/asm/fpu.h   |  22 ++
 arch/openrisc/include/asm/processor.h |   1 +
 arch/openrisc/include/asm/ptrace.h|   3 +-
 arch/openrisc/include/uapi/asm/elf.h  |  75 +++---
 arch/openrisc/kernel/entry.S  |  15 +---
 arch/openrisc/kernel/module.c |  18 -
 arch/openrisc/kernel/process.c|  13 +--
 arch/openrisc/kernel/ptrace.c |  18 ++---
 arch/openrisc/kernel/signal.c |  36 -
 arch/openrisc/kernel/traps.c  | 144 ++
 11 files changed, 243 insertions(+), 111 deletions(-)
 create mode 100644 arch/openrisc/include/asm/fpu.h



Re: [PATCH v3] module: create weak dependencies

2024-05-14 Thread Lucas De Marchi

On Fri, May 10, 2024 at 10:57:22AM GMT, Jose Ignacio Tornos Martinez wrote:

It has been seen that for some network mac drivers (i.e. lan78xx) the
related module for the phy is loaded dynamically depending on the current
hardware. In this case, the associated phy is read using mdio bus and then
the associated phy module is loaded during runtime (kernel function
phy_request_driver_module). However, no software dependency is defined, so
the user tools will no be able to get this dependency. For example, if
dracut is used and the hardware is present, lan78xx will be included but no
phy module will be added, and in the next restart the device will not work
from boot because no related phy will be found during initramfs stage.

In order to solve this, we could define a normal 'pre' software dependency
in lan78xx module with all the possible phy modules (there may be some),
but proceeding in that way, all the possible phy modules would be loaded
while only one is necessary.

The idea is to create a new type of dependency, that we are going to call
'weak' to be used only by the user tools that need to detect this situation.
In that way, for example, dracut could check the 'weak' dependency of the
modules involved in order to install these dependencies in initramfs too.
That is, for the commented lan78xx module, defining the 'weak' dependency
with the possible phy modules list, only the necessary phy would be loaded
on demand keeping the same behavior, but all the possible phy modules would
be available from initramfs.

The 'weak' dependency support has been included in kmod:
https://github.com/kmod-project/kmod/commit/05828b4a6e9327a63ef94df544a042b5e9ce4fe7
But, take into account that this can only be used if depmod is new enough.
If it isn't, depmod will have the same behavior as always (keeping backward
compatibility) and the information for the 'weak' dependency will not be
provided.

Signed-off-by: Jose Ignacio Tornos Martinez 



Reviewed-by: Lucas De Marchi 

thanks
Lucas De Marchi


---
V2 -> V3:
- Include note about backward compatibility.
- Balance the /* and */.
V1 -> V2:
- Include reference to 'weak' dependency support in kmod.

include/linux/module.h | 6 ++
1 file changed, 6 insertions(+)

diff --git a/include/linux/module.h b/include/linux/module.h
index 1153b0d99a80..2a056017df5b 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -173,6 +173,12 @@ extern void cleanup_module(void);
 */
#define MODULE_SOFTDEP(_softdep) MODULE_INFO(softdep, _softdep)

+/*
+ * Weak module dependencies. See man modprobe.d for details.
+ * Example: MODULE_WEAKDEP("module-foo")
+ */
+#define MODULE_WEAKDEP(_weakdep) MODULE_INFO(weakdep, _weakdep)
+
/*
 * MODULE_FILE is used for generating modules.builtin
 * So, make it no-op when this is being built as a module
--
2.44.0
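
As a usage illustration (hypothetical module/PHY names, not from this
patch):

	/* MAC driver that may load a PHY driver at runtime via
	 * phy_request_driver_module(); declare it as a weak dep so
	 * initramfs generators can pull it in without modprobe
	 * pre-loading it:
	 */
	MODULE_WEAKDEP("foo-phy");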





[PATCH v2 8/8] riscv: mm: Add support for ZONE_DEVICE

2024-05-14 Thread Björn Töpel
From: Björn Töpel 

ZONE_DEVICE pages need DEVMAP PTEs support to function
(ARCH_HAS_PTE_DEVMAP). Claim another RSW (reserved for software) bit
in the PTE for DEVMAP mark, add the corresponding helpers, and enable
ARCH_HAS_PTE_DEVMAP for riscv64.

Signed-off-by: Björn Töpel 
---
 arch/riscv/Kconfig|  1 +
 arch/riscv/include/asm/pgtable-64.h   | 20 
 arch/riscv/include/asm/pgtable-bits.h |  1 +
 arch/riscv/include/asm/pgtable.h  | 15 +++
 4 files changed, 37 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index b9398b64bb69..6d426afdd904 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -36,6 +36,7 @@ config RISCV
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PMEM_API
select ARCH_HAS_PREPARE_SYNC_CORE_CMD
+   select ARCH_HAS_PTE_DEVMAP if 64BIT && MMU
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_DIRECT_MAP if MMU
select ARCH_HAS_SET_MEMORY if MMU
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 221a5c1ee287..c67a9bbfd010 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -400,4 +400,24 @@ static inline struct page *pgd_page(pgd_t pgd)
 #define p4d_offset p4d_offset
 p4d_t *p4d_offset(pgd_t *pgd, unsigned long address);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pte_devmap(pte_t pte);
+static inline pte_t pmd_pte(pmd_t pmd);
+
+static inline int pmd_devmap(pmd_t pmd)
+{
+   return pte_devmap(pmd_pte(pmd));
+}
+
+static inline int pud_devmap(pud_t pud)
+{
+   return 0;
+}
+
+static inline int pgd_devmap(pgd_t pgd)
+{
+   return 0;
+}
+#endif
+
 #endif /* _ASM_RISCV_PGTABLE_64_H */
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index 179bd4afece4..a8f5205cea54 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -19,6 +19,7 @@
 #define _PAGE_SOFT  (3 << 8)/* Reserved for software */
 
 #define _PAGE_SPECIAL   (1 << 8)/* RSW: 0x1 */
+#define _PAGE_DEVMAP(1 << 9)/* RSW, devmap */
 #define _PAGE_TABLE _PAGE_PRESENT
 
 /*
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 7933f493db71..216de1db3cd0 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -387,6 +387,11 @@ static inline int pte_special(pte_t pte)
return pte_val(pte) & _PAGE_SPECIAL;
 }
 
+static inline int pte_devmap(pte_t pte)
+{
+   return pte_val(pte) & _PAGE_DEVMAP;
+}
+
 /* static inline pte_t pte_rdprotect(pte_t pte) */
 
 static inline pte_t pte_wrprotect(pte_t pte)
@@ -428,6 +433,11 @@ static inline pte_t pte_mkspecial(pte_t pte)
return __pte(pte_val(pte) | _PAGE_SPECIAL);
 }
 
+static inline pte_t pte_mkdevmap(pte_t pte)
+{
+   return __pte(pte_val(pte) | _PAGE_DEVMAP);
+}
+
 static inline pte_t pte_mkhuge(pte_t pte)
 {
return pte;
@@ -711,6 +721,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
return pte_pmd(pte_mkdirty(pmd_pte(pmd)));
 }
 
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+   return pte_pmd(pte_mkdevmap(pmd_pte(pmd)));
+}
+
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
 {
-- 
2.40.1
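
For illustration, the new helpers compose like this (sketch only, not part
of the patch; the pfn and protection are placeholders):

	static pte_t make_devmap_pte(unsigned long pfn)
	{
		pte_t pte = pfn_pte(pfn, PAGE_KERNEL);

		/* Sets the _PAGE_DEVMAP RSW bit; GUP checks it via pte_devmap(). */
		return pte_mkdevmap(pte);
	}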




[PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V

2024-05-14 Thread Björn Töpel
From: Björn Töpel 

Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
RISC-V.

Signed-off-by: Björn Töpel 
---
 arch/riscv/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 6bec1bce6586..b9398b64bb69 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -16,6 +16,8 @@ config RISCV
select ACPI_REDUCED_HARDWARE_ONLY if ACPI
select ARCH_DMA_DEFAULT_COHERENT
select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
+   select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
+   select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
select ARCH_HAS_BINFMT_FLAT
-- 
2.40.1




[PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V

2024-05-14 Thread Björn Töpel
From: Björn Töpel 

Now that RISC-V has memory hotplugging support, virtio-mem can be used
on the platform.

Signed-off-by: Björn Töpel 
---
 drivers/virtio/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index c17193544268..4e5cebf1b82a 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -122,7 +122,7 @@ config VIRTIO_BALLOON
 
 config VIRTIO_MEM
tristate "Virtio mem driver"
-   depends on X86_64 || ARM64
+   depends on X86_64 || ARM64 || RISCV
depends on VIRTIO
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
-- 
2.40.1
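
For testing, a virtio-mem device can be attached roughly like this
(illustrative QEMU invocation; exact options depend on the QEMU version):

  qemu-system-riscv64 ... \
    -object memory-backend-ram,id=mem0,size=8G \
    -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=2G

The requested-size property can then be changed at runtime (e.g. via QMP
qom-set) to plug and unplug memory.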




[PATCH v2 4/8] riscv: mm: Add memory hotplugging support

2024-05-14 Thread Björn Töpel
From: Björn Töpel 

For an architecture to support memory hotplugging, a couple of
callbacks need to be implemented:

 arch_add_memory()
  This callback is responsible for adding the physical memory into the
  direct map, and calls into the memory hotplugging generic code via
  __add_pages() that adds the corresponding struct page entries, and
  updates the vmemmap mapping.

 arch_remove_memory()
  This is the inverse of the callback above.

 vmemmap_free()
  This function tears down the vmemmap mappings (if
  CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the
  backing vmemmap pages. Note that for persistent memory, an
  alternative allocator for the backing pages can be used: the
  vmem_altmap. This means that when the backing pages are cleared,
  extra care is needed so that the correct deallocation method is
  used.

 arch_get_mappable_range()
  This function returns the PA range that the direct map can map.
  Used by the MHP internals for sanity checks.

The page table unmap/teardown functions are heavily based on code from
the x86 tree. The same remove_pgd_mapping() function is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

Signed-off-by: Björn Töpel 
---
 arch/riscv/mm/init.c | 242 +++
 1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
}
 }
 #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+   pte_t *pte;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PTE; i++) {
+   pte = pte_start + i;
+   if (!pte_none(*pte))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+   pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+   pmd_t *pmd;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PMD; i++) {
+   pmd = pmd_start + i;
+   if (!pmd_none(*pmd))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+   pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (!pud_none(*pud))
+   return;
+   }
+
+   free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+   p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+  struct vmem_altmap *altmap)
+{
+   if (altmap)
+   vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+   else
+   free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+   unsigned long next;
+   pte_t *ptep, pte;
+
+   for (; addr < end; addr = next) {
+   next = (addr + PAGE_SIZE) & PAGE_MASK;
+   if (next > end)
+   next = end;
+
+   ptep = pte_base + pte_index(addr);
+   pte = READ_ONCE(*ptep);
+
+   if (!pte_present(*ptep))
+   continue;
+
+   pte_clear(&init_mm, addr, ptep);
+   if (is_vmemmap)
+   free_vmemmap_storage(pte_page(pte), PAGE_SIZE, altmap);
+   }
+}
+
+static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+   unsigned long next;
+   pte_t *pte_base;
+   pmd_t *pmdp, pmd;
+
+   for (; addr < end; addr = next) {
+   next = pmd_addr_end(addr, end);
+   pmdp = pmd_base + pmd_index(addr);
+   pmd = READ_ONCE(*pmdp);
+
+   if (!pmd_present(pmd))
+   continue;
+
+   if (pmd_leaf(pmd)) {
+   pmd_clear(pmdp);
+   if (is_vmemmap)
+   free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, altmap);
+   continue;
+   }
+
+   pte_base = (pte_t *)pmd_page_vaddr(*pmdp);
+   remove_pte_mapping(pte_base, addr, next, is_vmemmap, altmap);
+   free_pte_table(pte_base, pmdp);
+   }
+}
+
+static void __meminit remove_pud_mapping(pud_t *pud_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)

[PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump

2024-05-14 Thread Björn Töpel
From: Björn Töpel 

During memory hot remove, the ptdump functionality can end up touching
stale data. Avoid any potential crashes (or worse), by holding the
memory hotplug read-lock while traversing the page table.

This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
Hold memory hotplug lock while walking for kernel page table dump").

Signed-off-by: Björn Töpel 
---
 arch/riscv/mm/ptdump.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 1289cc6d3700..9d5f657a251b 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include <linux/memory_hotplug.h>
 #include 
 #include 
 
@@ -370,7 +371,9 @@ bool ptdump_check_wx(void)
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
+   get_online_mems();
ptdump_walk(m, m->private);
+   put_online_mems();
 
return 0;
 }
-- 
2.40.1




[PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add

2024-05-14 Thread Björn Töpel
From: Björn Töpel 

Add a parameter to the direct map setup function, so it can be used in
arch_add_memory() later.

Signed-off-by: Björn Töpel 
---
 arch/riscv/mm/init.c | 15 ++-
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index c969427eab88..6f72b0b2b854 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
 }
 
 static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
-						  uintptr_t fixed_map_size)
+						  uintptr_t fixed_map_size, const pgprot_t *pgprot)
 {
phys_addr_t pa;
uintptr_t va, map_size;
@@ -1238,7 +1238,7 @@ static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t
best_map_size(pa, va, end - pa);
 
create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
-  pgprot_from_va(va));
+  pgprot ? *pgprot : pgprot_from_va(va));
}
 }
 
@@ -1282,22 +1282,19 @@ static void __init create_linear_mapping_page_table(void)
if (end >= __pa(PAGE_OFFSET) + memory_limit)
end = __pa(PAGE_OFFSET) + memory_limit;
 
-   create_linear_mapping_range(start, end, 0);
+   create_linear_mapping_range(start, end, 0, NULL);
}
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
-   create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0);
-   create_linear_mapping_range(krodata_start,
-   krodata_start + krodata_size, 0);
+   create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, NULL);
+   create_linear_mapping_range(krodata_start, krodata_start + krodata_size, 0, NULL);
 
memblock_clear_nomap(ktext_start,  ktext_size);
memblock_clear_nomap(krodata_start, krodata_size);
 #endif
 
 #ifdef CONFIG_KFENCE
-   create_linear_mapping_range(kfence_pool,
-   kfence_pool + KFENCE_POOL_SIZE,
-   PAGE_SIZE);
+   create_linear_mapping_range(kfence_pool, kfence_pool + KFENCE_POOL_SIZE, PAGE_SIZE, NULL);
 
memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
 #endif
-- 
2.40.1
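
For context, the new pgprot parameter is consumed by the hotplug path in a
later patch, roughly like this (sketch; see patch 4 for the real hunk):

	create_linear_mapping_range(start, start + size, 0, &params->pgprot);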




[PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions

2024-05-14 Thread Björn Töpel
From: Björn Töpel 

Prepare for memory hotplugging support by changing from __init to
__meminit for the page table functions that are used by the upcoming
architecture specific callbacks.

Changing the __init attribute to __meminit prevents the functions from
being removed after init. The __meminit attribute makes sure the
functions are kept in the kernel text post init, but only if memory
hotplugging is enabled for the build.

Also, make sure that the altmap parameter is properly passed on to
vmemmap_populate_hugepages().

Signed-off-by: Björn Töpel 
---
 arch/riscv/include/asm/mmu.h |  4 +--
 arch/riscv/include/asm/pgtable.h |  2 +-
 arch/riscv/mm/init.c | 58 ++--
 3 files changed, 29 insertions(+), 35 deletions(-)

diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h
index 60be458e94da..c09c3c79f496 100644
--- a/arch/riscv/include/asm/mmu.h
+++ b/arch/riscv/include/asm/mmu.h
@@ -28,8 +28,8 @@ typedef struct {
 #endif
 } mm_context_t;
 
-void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa,
-  phys_addr_t sz, pgprot_t prot);
+void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+				  pgprot_t prot);
 #endif /* __ASSEMBLY__ */
 
 #endif /* _ASM_RISCV_MMU_H */
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 58fd7b70b903..7933f493db71 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -162,7 +162,7 @@ struct pt_alloc_ops {
 #endif
 };
 
-extern struct pt_alloc_ops pt_ops __initdata;
+extern struct pt_alloc_ops pt_ops __meminitdata;
 
 #ifdef CONFIG_MMU
 /* Number of PGD entries that a user-mode program can use */
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 5b8cdfafb52a..c969427eab88 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -295,7 +295,7 @@ static void __init setup_bootmem(void)
 }
 
 #ifdef CONFIG_MMU
-struct pt_alloc_ops pt_ops __initdata;
+struct pt_alloc_ops pt_ops __meminitdata;
 
 pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
 pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss;
@@ -357,7 +357,7 @@ static inline pte_t *__init get_pte_virt_fixmap(phys_addr_t pa)
return (pte_t *)set_fixmap_offset(FIX_PTE, pa);
 }
 
-static inline pte_t *__init get_pte_virt_late(phys_addr_t pa)
+static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa)
 {
return (pte_t *) __va(pa);
 }
@@ -376,7 +376,7 @@ static inline phys_addr_t __init alloc_pte_fixmap(uintptr_t va)
return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t __init alloc_pte_late(uintptr_t va)
+static phys_addr_t __meminit alloc_pte_late(uintptr_t va)
 {
struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
@@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va)
return __pa((pte_t *)ptdesc_address(ptdesc));
 }
 
-static void __init create_pte_mapping(pte_t *ptep,
- uintptr_t va, phys_addr_t pa,
- phys_addr_t sz, pgprot_t prot)
+static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, phys_addr_t pa, phys_addr_t sz,
+					 pgprot_t prot)
 {
uintptr_t pte_idx = pte_index(va);
 
@@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa)
return (pmd_t *)set_fixmap_offset(FIX_PMD, pa);
 }
 
-static pmd_t *__init get_pmd_virt_late(phys_addr_t pa)
+static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa)
 {
return (pmd_t *) __va(pa);
 }
@@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE);
 }
 
-static phys_addr_t __init alloc_pmd_late(uintptr_t va)
+static phys_addr_t __meminit alloc_pmd_late(uintptr_t va)
 {
struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
@@ -465,9 +464,9 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va)
return __pa((pmd_t *)ptdesc_address(ptdesc));
 }
 
-static void __init create_pmd_mapping(pmd_t *pmdp,
- uintptr_t va, phys_addr_t pa,
- phys_addr_t sz, pgprot_t prot)
+static void __meminit create_pmd_mapping(pmd_t *pmdp,
+uintptr_t va, phys_addr_t pa,
+phys_addr_t sz, pgprot_t prot)
 {
pte_t *ptep;
phys_addr_t pte_phys;
@@ -503,7 +502,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa)
return (pud_t *)set_fixmap_offset(FIX_PUD, pa);
 }
 
-static pud_t *__init get_pud_virt_late(phys_addr_t pa)
+static pud_t *__meminit get_pud_virt_late(phys_addr_t pa)
 {
return (pud_t *)__va(pa);
 }
@@ -521,7 +520,7 @@ static phys_addr_t __init alloc_pud_fixmap(uintptr_t 

[PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries

2024-05-14 Thread Björn Töpel
From: Björn Töpel 

The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to
all userland page tables, which means that if the PGD level table is
changed, other page tables have to be updated as well.

Instead of having the PGD changes ripple out to all tables, the
synchronization can be avoided by pre-allocating the PGD entries/pages
at boot, avoiding the synchronization all together.

This is currently done for the bpf/modules, and vmalloc PGD regions.
Extend this scheme for the PGD regions touched by memory hotplugging.

Prepare the RISC-V port for memory hotplug by pre-allocating
vmemmap/direct map entries at the PGD level. This will roughly waste
~128 4K pages when memory hotplugging is enabled in the
kernel configuration.

Signed-off-by: Björn Töpel 
---
 arch/riscv/include/asm/kasan.h | 4 ++--
 arch/riscv/mm/init.c   | 7 +++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
index 0b85e363e778..e6a0071bdb56 100644
--- a/arch/riscv/include/asm/kasan.h
+++ b/arch/riscv/include/asm/kasan.h
@@ -6,8 +6,6 @@
 
 #ifndef __ASSEMBLY__
 
-#ifdef CONFIG_KASAN
-
 /*
  * The following comment was copied from arm64:
  * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
@@ -34,6 +32,8 @@
  */
 #define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
 #define KASAN_SHADOW_END   MODULES_LOWEST_VADDR
+
+#ifdef CONFIG_KASAN
 #define KASAN_SHADOW_OFFSET_AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
 
 void kasan_init(void);
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 2574f6a3b0e7..5b8cdfafb52a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -27,6 +27,7 @@
 
 #include 
 #include 
+#include <asm/kasan.h>
 #include 
 #include 
 #include 
@@ -1488,10 +1489,16 @@ static void __init preallocate_pgd_pages_range(unsigned long start, unsigned lon
panic("Failed to pre-allocate %s pages for %s area\n", lvl, area);
 }
 
+#define PAGE_END KASAN_SHADOW_START
+
 void __init pgtable_cache_init(void)
 {
preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc");
if (IS_ENABLED(CONFIG_MODULES))
preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, "bpf/modules");
+   if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {
+   preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, "vmemmap");
+   preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct map");
+   }
 }
 #endif
-- 
2.40.1




[PATCH v2 0/8] riscv: Memory Hot(Un)Plug support

2024-05-14 Thread Björn Töpel
From: Björn Töpel 


Memory Hot(Un)Plug support (and ZONE_DEVICE) for the RISC-V port


Introduction


To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime."

This series adds memory hot(un)plugging, and ZONE_DEVICE support for
the RISC-V Linux port.

I'm sending this series while LSF/MM/BPF is on-going, and with some
luck some MM person can review the series while zoning out on a talk.
;-)

MM configuration


RISC-V MM has the following configuration:

 * Memory blocks are 128M, analogous to x86-64. It uses PMD
   ("hugepage") vmemmaps. From that it follows that 2M (PMD) worth of
   vmemmap spans 32768 pages à 4K, which gets us 128M.

 * The pageblock size is the minimum virtio_mem size, and on
   RISC-V it's 2M (2^9 * 4K).
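
  Spelled out (assuming the usual 64-byte struct page):

    2M vmemmap / 64B per struct page = 32768 struct pages
    32768 struct pages * 4K          = 128M covered per memory block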

Implementation
==

The PGD table on RISC-V is shared/copied between all processes. To
avoid doing page table synchronization, the first patch (patch 1)
pre-allocates the PGD entries for vmemmap/direct map. By doing that
the init_mm PGD will be fixed at kernel init, and synchronization can
be avoided altogether.

The following two patches (patch 2-3) do some preparation, followed
by the actual MHP implementation (patch 4-5). Then, MHP and virtio-mem
are enabled (patch 6-7), and finally ZONE_DEVICE support is added
(patch 8).

MHP and locking
===

TL;DR: The MHP does not step on any toes, except for ptdump.
Additional locking is required for ptdump.

Long version: For v2 I spent some time digging into init_mm
synchronization/update. Here are my findings, and I'd love them to be
corrected if incorrect.

It's been a gnarly path...

The `init_mm` structure is a special mm (perhaps not a "real" one).
It's a "lazy context" that tracks kernel page table resources, e.g.,
the kernel page table (swapper_pg_dir), a kernel page_table_lock (more
about the usage below), mmap_lock, and such.

`init_mm` does not track/contain any VMAs. Having the `init_mm` is
convenient, so that the regular kernel page table walk/modify
functions can be used.

Now, `init_mm` being special means that the locking for kernel page
tables is special as well.

On RISC-V the PGD (top-level page table structure), similar to x86, is
shared (copied) with user processes. If the kernel PGD is modified, it
has to be synced to user-mode processes' PGDs. This is avoided by
pre-populating the PGD, so it'll be fixed from boot.

The in-kernel pgd regions are documented in
`Documentation/arch/riscv/vm-layout.rst`.

The distinct regions are:
 * vmemmap
 * vmalloc/ioremap space
 * direct mapping of all physical memory
 * kasan
 * modules, BPF
 * kernel

Memory hotplug is the process of adding/removing memory to/from the
kernel.

Adding is done in two phases:
 1. Add the memory to the kernel
 2. Online memory, making it available to the page allocator.

Step 1 is partially architecture dependent, and updates the init_mm
page table:
 * Update the direct map page tables. The direct map is a linear map,
   representing all physical memory: `virt = phys + PAGE_OFFSET`
 * Add a `struct page` for each added page of memory. Update the
   vmemmap (virtual mapping to the `struct page`, so we can easily
   transform a kernel virtual address to a `struct page *` address.
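
Step 2 is then driven through the generic sysfs interface (standard
memory hotplug ABI, nothing specific to this series), e.g.:

  echo online > /sys/devices/system/memory/memory<N>/state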

From an MHP perspective, there are two regions of the PGD that are
updated:
 * vmemmap
 * direct mapping of all physical memory

The `struct mm_struct` has a couple of locks in play:
 * `spinlock_t page_table_lock` protects the page table, and some
counters
 * `struct rw_semaphore mmap_lock` protects an mm's VMAs

Note again that `init_mm` does not contain any VMAs, but still uses
the mmap_lock in some places.

The `page_table_lock` was originally used to protect all page
tables, but more recently a split page table lock has been introduced.
The split lock has a per-table lock for the PTE and PMD tables. If
split lock is disabled, all tables are guarded by
`mm->page_table_lock` (for user processes). Split page table locks are
not used for init_mm.

MHP operations are typically synchronized using
`DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock)`.

Actors
--

The following non-MHP actors in the kernel traverse (read) and/or
modify the kernel PGD.

 * `ptdump`

   Walks the entire `init_mm`, via `ptdump_walk_pgd()` with the
   `mmap_write_lock(&init_mm)` taken.

   Observation: ptdump can race with MHP, and needs additional locking
   to avoid crashes/races.

 * `set_direct_*` / `arch/riscv/mm/pageattr.c`

   The `set_direct_*` functionality is used to "synchronize" the
   direct map to other kernel mappings, e.g. modules/kernel text. The
   direct map uses "as large huge table mappings as possible",
   which means that the `set_direct_*` might need to 

[GIT PULL] Modules changes for v6.10-rc1

2024-05-14 Thread Luis Chamberlain
The following changes since commit a5131c3fdf2608f1c15f3809e201cf540eb28489:

  Merge tag 'x86-shstk-2024-05-13' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2024-05-13 19:33:23 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ 
tags/modules-6.10-rc1

for you to fetch changes up to 2c9e5d4a008293407836d29d35dfd4353615bd2f:

  bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of (2024-05-14 
00:36:29 -0700)


Modules changes for v6.10-rc1

Finally something fun. Mike Rapoport does some cleanup to allow us to
move module_alloc() out of modules into a new paint-shedded execmem_alloc()
and execmem_free(), to emphasize that these helpers are actually used
outside of modules. It starts with a no-functional-changes API rename /
placeholders to then allow architectures to define their requirements in a
new shiny struct execmem_info with ranges, and requirements for those
ranges. Architectures can now initialize this execmem_info as the last part
of mm_core_init() if they have to diverge from the norm. Each range is a
known type clearly articulated and spelled out in enum execmem_type.

Although a lot of this is major cleanup and prep work for future
enhancements, an immediate clear gain is that we get to enable KPROBES
without MODULES now. That is ultimately what motivated picking this work
up again, now with a smaller goal as a concrete stepping stone.
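
As a teaser, the new interface boils down to the following (sketch; the
real signatures live in include/linux/execmem.h in this pull):

	/* e.g. a kprobes out-of-line slot, no CONFIG_MODULES needed: */
	void *slot = execmem_alloc(EXECMEM_KPROBES, PAGE_SIZE);
	if (slot)
		execmem_free(slot);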

This has been sitting on linux-next for a little less than a month, a few issues
were found already and fixed, in particular an odd mips boot issue. Arch folks
reviewed the code too. This is ready for wider exposure and testing.


Justin Stitt (1):
  kallsyms: replace deprecated strncpy with strscpy

Mike Rapoport (IBM) (16):
  arm64: module: remove unneeded call to kasan_alloc_module_shadow()
  mips: module: rename MODULE_START to MODULES_VADDR
  nios2: define virtual address space for modules
  sparc: simplify module_alloc()
  module: make module_memory_{alloc,free} more self-contained
  mm: introduce execmem_alloc() and execmem_free()
  mm/execmem, arch: convert simple overrides of module_alloc to execmem
  mm/execmem, arch: convert remaining overrides of module_alloc to execmem
  riscv: extend execmem_params for generated code allocations
  arm64: extend execmem_info for generated code allocations
  powerpc: extend execmem_params for kprobes allocations
  arch: make execmem setup available regardless of CONFIG_MODULES
  x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
  powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
  kprobes: remove dependency on CONFIG_MODULES
  bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of

Yifan Hong (1):
  module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.

 arch/Kconfig |  10 ++-
 arch/arm/kernel/module.c |  34 -
 arch/arm/mm/init.c   |  45 +++
 arch/arm64/Kconfig   |   1 +
 arch/arm64/kernel/module.c   | 126 --
 arch/arm64/kernel/probes/kprobes.c   |   7 --
 arch/arm64/mm/init.c | 140 ++
 arch/arm64/net/bpf_jit_comp.c|  11 ---
 arch/loongarch/kernel/module.c   |   6 --
 arch/loongarch/mm/init.c |  21 +
 arch/mips/include/asm/pgtable-64.h   |   4 +-
 arch/mips/kernel/module.c|  10 ---
 arch/mips/mm/fault.c |   4 +-
 arch/mips/mm/init.c  |  23 ++
 arch/nios2/include/asm/pgtable.h |   5 +-
 arch/nios2/kernel/module.c   |  20 -
 arch/nios2/mm/init.c |  21 +
 arch/parisc/kernel/module.c  |  12 ---
 arch/parisc/mm/init.c|  23 +-
 arch/powerpc/Kconfig |   2 +-
 arch/powerpc/include/asm/kasan.h |   2 +-
 arch/powerpc/kernel/head_8xx.S   |   4 +-
 arch/powerpc/kernel/head_book3s_32.S |   6 +-
 arch/powerpc/kernel/kprobes.c|  22 +-
 arch/powerpc/kernel/module.c |  38 --
 arch/powerpc/lib/code-patching.c |   2 +-
 arch/powerpc/mm/book3s32/mmu.c   |   2 +-
 arch/powerpc/mm/mem.c|  64 
 arch/riscv/include/asm/pgtable.h |   3 +
 arch/riscv/kernel/module.c   |  12 ---
 arch/riscv/kernel/probes/kprobes.c   |  10 ---
 arch/riscv/mm/init.c |  35 +
 arch/riscv/net/bpf_jit_core.c|  13 
 arch/s390/kernel/ftrace.c|   4 +-
 arch/s390/kernel/kprobes.c   |   4 +-
 arch/s390/kernel/module.c|  42 +-
 arch/s390/mm/init.c  |  30 
 arch/sparc/include/asm/pgtable_32.h  |   2 +
 arch/sparc/kernel/module.c   |  30 
 

WARNING: kmalloc bug in bpf_uprobe_multi_link_attach

2024-05-14 Thread Ubisectech Sirius
Hello.
We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec.
Recently, our team discovered an issue in Linux kernel 6.7. Attached to this
email is a PoC file for the issue.

Stack dump:

loop3: detected capacity change from 0 to 8
MTD: Attempt to mount non-MTD device "/dev/loop3"
[ cut here ]
WARNING: CPU: 1 PID: 10075 at mm/util.c:632 kvmalloc_node+0x199/0x1b0 
mm/util.c:632
Modules linked in:
CPU: 1 PID: 10075 Comm: syz-executor.3 Not tainted 6.7.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:kvmalloc_node+0x199/0x1b0 mm/util.c:632
Code: 02 1d 00 eb aa e8 a7 49 c6 ff 41 81 e5 00 20 00 00 31 ff 44 89 ee e8 36 
45 c6 ff 45 85 ed 0f 85 1b ff ff ff e8 88 49 c6 ff 90 <0f> 0b 90 e9 dd fe ff ff 
66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
RSP: 0018:c90002007b60 EFLAGS: 00010212
RAX: 23e4 RBX: 0400 RCX: c90003aaa000
RDX: 0004 RSI: 81c3acc8 RDI: 0005
RBP: 0037cec8 R08: 0005 R09: 
R10:  R11:  R12: 
R13:  R14:  R15: 88805ff6e1b8
FS:  7fc62205f640() GS:88807ec0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 001b2e026000 CR3: 5f338000 CR4: 00750ef0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Call Trace:
 
 kvmalloc include/linux/slab.h:738 [inline]
 kvmalloc_array include/linux/slab.h:756 [inline]
 kvcalloc include/linux/slab.h:761 [inline]
 bpf_uprobe_multi_link_attach+0x3fe/0xf60 kernel/trace/bpf_trace.c:3239
 link_create kernel/bpf/syscall.c:5012 [inline]
 __sys_bpf+0x2e85/0x4e00 kernel/bpf/syscall.c:5453
 __do_sys_bpf kernel/bpf/syscall.c:5487 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:5485 [inline]
 __x64_sys_bpf+0x78/0xc0 kernel/bpf/syscall.c:5485
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7fc62128fd6d
Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 
c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:7fc62205f028 EFLAGS: 0246 ORIG_RAX: 0141
RAX: ffda RBX: 7fc6213cbf80 RCX: 7fc62128fd6d
RDX: 0040 RSI: 21c0 RDI: 001c
RBP: 7fc6212f14cd R08:  R09: 
R10:  R11: 0246 R12: 
R13: 000b R14: 7fc6213cbf80 R15: 7fc62203f000
 

Thank you for taking the time to read this email and we look forward to working 
with you further.





poc.c
Description: Binary data