Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
Oscar Salvador writes:
> On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote:
>> +static void __meminit free_vmemmap_storage(struct page *page, size_t size,
>> +					   struct vmem_altmap *altmap)
>> +{
>> +	if (altmap)
>> +		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
>> +	else
>> +		free_pages((unsigned long)page_address(page), get_order(size));
>
> David already pointed this out, but can check
> arch/x86/mm/init_64.c:free_pagetable().
>
> You will see that we have to do some magic for bootmem memory (DIMMs
> which were not hotplugged but already present)

Thank you!

>> +#ifdef CONFIG_SPARSEMEM_VMEMMAP
>> +void __ref vmemmap_free(unsigned long start, unsigned long end, struct vmem_altmap *altmap)
>> +{
>> +	remove_pgd_mapping(start, end, true, altmap);
>> +}
>> +#endif /* CONFIG_SPARSEMEM_VMEMMAP */
>> +#endif /* CONFIG_MEMORY_HOTPLUG */
>
> I will comment on the patch where you add support for hotplug and the
> dependency, but on a track in LSFMM today, we decided that most likely
> we will drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP
> environments.
>
> So, since you are adding this plain fresh, please consider to tight the
> hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP.
>
> As a bonus, you will only have to maintain one flavour of functions.

Ah, yeah, I saw it mentioned on the LSF/MM/BPF topics. Less is definitely
more -- I'll make the next version depend on SPARSEMEM_VMEMMAP.

Björn
Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
Oscar Salvador writes:
> On Tue, May 14, 2024 at 04:04:40PM +0200, Björn Töpel wrote:
>> From: Björn Töpel
>>
>> Prepare for memory hotplugging support by changing from __init to
>> __meminit for the page table functions that are used by the upcoming
>> architecture specific callbacks.
>>
>> Changing the __init attribute to __meminit, avoids that the functions
>> are removed after init. The __meminit attribute makes sure the
>> functions are kept in the kernel text post init, but only if memory
>> hotplugging is enabled for the build.
>>
>> Also, make sure that the altmap parameter is properly passed on to
>> vmemmap_populate_hugepages().
>>
>> Signed-off-by: Björn Töpel
>
> Reviewed-by: Oscar Salvador
>
>> +static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
>> +						  uintptr_t fixed_map_size)
>> {
>> 	phys_addr_t pa;
>> 	uintptr_t va, map_size;
>> @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>> 	 * memory hotplug, we are not able to update all the page tables with
>> 	 * the new PMDs.
>> 	 */
>> -	return vmemmap_populate_hugepages(start, end, node, NULL);
>> +	return vmemmap_populate_hugepages(start, end, node, altmap);
>
> I would have put this into a separate patch.

Thanks for the review, Oscar! I'll split this up (also suggested by Alex!).

Cheers,
Björn
Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties
On Tue, May 14, 2024 at 06:21:57PM -0700, Yuanchu Xie wrote: > On Tue, May 14, 2024 at 9:06 AM Greg Kroah-Hartman > wrote: > > > > On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote: > > > Memctl provides a way for the guest to control its physical memory > > > properties, and enables optimizations and security features. For > > > example, the guest can provide information to the host where parts of a > > > hugepage may be unbacked, or sensitive data may not be swapped out, etc. > > >... > > Pretty generic name for a hardware-specific driver :( > It's not for real hardware btw. Its use case is similar to pvpanic > where the device is emulated by the VMM. I can change the name if it's > a problem. This file is only used for a single PCI device, that is very hardware-specific even if that hardware is "fake" :) Please make the name more specific as well. thanks, greg k-h
[PATCH] ring-buffer: Add cast to unsigned long addr passed to virt_to_page()
From: "Steven Rostedt (Google)" The sub-buffer pages are held in an unsigned long array, and when it is passed to virt_to_page() a cast is needed. Link: https://lore.kernel.org/all/20240515124808.06279...@canb.auug.org.au/ Fixes: 117c39200d9d ("ring-buffer: Introducing ring-buffer mapping functions") Reported-by: Stephen Rothwell Signed-off-by: Steven Rostedt (Google) --- kernel/trace/ring_buffer.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index a02c7a52a0f5..7345a8b625fb 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -6283,7 +6283,7 @@ static int __rb_map_vma(struct ring_buffer_per_cpu *cpu_buffer, } while (p < nr_pages) { - struct page *page = virt_to_page(cpu_buffer->subbuf_ids[s]); + struct page *page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]); int off = 0; if (WARN_ON_ONCE(s >= nr_subbufs)) { -- 2.43.0
Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support
On Wed, May 15, 2024 at 01:10:03AM +, Edgecombe, Rick P wrote:
> On Mon, 2024-05-13 at 15:23 -0600, Jiri Olsa wrote:
> > so at the moment the patch 6 changes shadow stack for
> >
> > 1) current uretprobe which are not working at the moment and we change
> >    the top value of shadow stack with shstk_push_frame
> > 2) optimized uretprobe which needs to push new frame on shadow stack
> >    with shstk_update_last_frame
> >
> > I think we should do 1) and have current uretprobe working with shadow
> > stack, which is broken at the moment
> >
> > I'm ok with not using optimized uretprobe when shadow stack is detected
> > as enabled and we go with current uretprobe in that case
> >
> > would this work for you?
>
> Sorry for the delay. It seems reasonable to me due to 1 being at a fixed
> address where 2 was arbitrary address. But Peterz might have felt the
> opposite earlier. Not sure. I'd also love to get some second opinions
> from broonie (arm shadow stack) and Deepak (riscv shadow stack).
>
> Deepak, even if riscv has a special instruction that pushes to the
> shadow stack, will it be ok if there is a callable operation that does
> the same thing? Like, aren't you relying on endbranches or the compiler
> or something such that arbitrary data can't be pushed via that
> instruction?

The instruction is `sspush x1/ra`. It pushes the contents of the
return-address register (`ra`, also called `x1`) onto the shadow stack.
`ra` is the RISC-V equivalent of arm's link register. A function
prologue is supposed to contain `sspush x1` to save it away. The ISA
doesn't allow encodings with other RISC-V GPRs (except register `x5`,
because some embedded RISC-V toolchains have used `x5` as `ra` too).

On the question of a callable operation, I think I still need to fully
understand who manages the probe and forward progress. Is it the kernel
that maintains all return probes, i.e. are the original return addresses
saved in kernel data structures on a per-task basis? And once a
uretprobe has done its job, is it the kernel that ensures the return to
the original return address?
> BTW Jiri, thanks for considering shadow stack in your work.
Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties
On Tue, May 14, 2024 at 9:06 AM Greg Kroah-Hartman wrote: > > On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote: > > Memctl provides a way for the guest to control its physical memory > > properties, and enables optimizations and security features. For > > example, the guest can provide information to the host where parts of a > > hugepage may be unbacked, or sensitive data may not be swapped out, etc. > >... > Pretty generic name for a hardware-specific driver :( It's not for real hardware btw. Its use case is similar to pvpanic where the device is emulated by the VMM. I can change the name if it's a problem. > Yup, you write this to hardware, please use proper structures and types > for that, otherwise you will have problems in the near future. Thanks for the review and comments on endianness and using proper types. Will do. Thanks, Yuanchu
Re: [PATCHv5 bpf-next 6/8] x86/shstk: Add return uprobe support
On Mon, 2024-05-13 at 15:23 -0600, Jiri Olsa wrote: > so at the moment the patch 6 changes shadow stack for > > 1) current uretprobe which are not working at the moment and we change > the top value of shadow stack with shstk_push_frame > 2) optimized uretprobe which needs to push new frame on shadow stack > with shstk_update_last_frame > > I think we should do 1) and have current uretprobe working with shadow > stack, which is broken at the moment > > I'm ok with not using optimized uretprobe when shadow stack is detected > as enabled and we go with current uretprobe in that case > > would this work for you? Sorry for the delay. It seems reasonable to me due to 1 being at a fixed address where 2 was arbitrary address. But Peterz might have felt the opposite earlier. Not sure. I'd also love to get some second opinions from broonie (arm shadow stack) and Deepak (riscv shadow stack). Deepak, even if riscv has a special instruction that pushes to the shadow stack, will it be ok if there is a callable operation that does the same thing? Like, aren't you relying on endbranches or the compiler or something such that arbitrary data can't be pushed via that instruction? BTW Jiri, thanks for considering shadow stack in your work.
Re: [PATCH] sched/rt: Clean up usage of rt_task()
Hi Qais, On Wed, May 15, 2024 at 12:41:12AM +0100 Qais Yousef wrote: > rt_task() checks if a task has RT priority. But depends on your > dictionary, this could mean it belongs to RT class, or is a 'realtime' > task, which includes RT and DL classes. > > Since this has caused some confusion already on discussion [1], it > seemed a clean up is due. > > I define the usage of rt_task() to be tasks that belong to RT class. > Make sure that it returns true only for RT class and audit the users and > replace them with the new realtime_task() which returns true for RT and > DL classes - the old behavior. Introduce similar realtime_prio() to > create similar distinction to rt_prio() and update the users. I think making the difference clear is good. However, I think rt_task() is a better name. We have dl_task() still. And rt tasks are things managed by rt.c, basically. Not realtime.c :) I know that doesn't work for deadline.c and dl_ but this change would be the reverse of that pattern. > > Move MAX_DL_PRIO to prio.h so it can be used in the new definitions. > > Document the functions to make it more obvious what is the difference > between them. PI-boosted tasks is a factor that must be taken into > account when choosing which function to use. > > Rename task_is_realtime() to task_has_realtime_policy() as the old name > is confusing against the new realtime_task(). Keeping it rt_task() above could mean this stays as it was but this change makes sense as you have written it too. Cheers, Phil > > No functional changes were intended. 
> > [1] > https://lore.kernel.org/lkml/20240506100509.gl40...@noisy.programming.kicks-ass.net/ > > Signed-off-by: Qais Yousef > --- > fs/select.c | 2 +- > include/linux/ioprio.h| 2 +- > include/linux/sched/deadline.h| 6 -- > include/linux/sched/prio.h| 1 + > include/linux/sched/rt.h | 27 ++- > kernel/locking/rtmutex.c | 4 ++-- > kernel/locking/rwsem.c| 4 ++-- > kernel/locking/ww_mutex.h | 2 +- > kernel/sched/core.c | 6 +++--- > kernel/time/hrtimer.c | 6 +++--- > kernel/trace/trace_sched_wakeup.c | 2 +- > mm/page-writeback.c | 4 ++-- > mm/page_alloc.c | 2 +- > 13 files changed, 48 insertions(+), 20 deletions(-) > > diff --git a/fs/select.c b/fs/select.c > index 9515c3fa1a03..8d5c1419416c 100644 > --- a/fs/select.c > +++ b/fs/select.c > @@ -82,7 +82,7 @@ u64 select_estimate_accuracy(struct timespec64 *tv) >* Realtime tasks get a slack of 0 for obvious reasons. >*/ > > - if (rt_task(current)) > + if (realtime_task(current)) > return 0; > > ktime_get_ts64(); > diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h > index db1249cd9692..6c00342b6166 100644 > --- a/include/linux/ioprio.h > +++ b/include/linux/ioprio.h > @@ -40,7 +40,7 @@ static inline int task_nice_ioclass(struct task_struct > *task) > { > if (task->policy == SCHED_IDLE) > return IOPRIO_CLASS_IDLE; > - else if (task_is_realtime(task)) > + else if (task_has_realtime_policy(task)) > return IOPRIO_CLASS_RT; > else > return IOPRIO_CLASS_BE; > diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h > index df3aca89d4f5..5cb88b748ad6 100644 > --- a/include/linux/sched/deadline.h > +++ b/include/linux/sched/deadline.h > @@ -10,8 +10,6 @@ > > #include > > -#define MAX_DL_PRIO 0 > - > static inline int dl_prio(int prio) > { > if (unlikely(prio < MAX_DL_PRIO)) > @@ -19,6 +17,10 @@ static inline int dl_prio(int prio) > return 0; > } > > +/* > + * Returns true if a task has a priority that belongs to DL class. PI-boosted > + * tasks will return true. 
Use dl_policy() to ignore PI-boosted tasks. > + */ > static inline int dl_task(struct task_struct *p) > { > return dl_prio(p->prio); > diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h > index ab83d85e1183..6ab43b4f72f9 100644 > --- a/include/linux/sched/prio.h > +++ b/include/linux/sched/prio.h > @@ -14,6 +14,7 @@ > */ > > #define MAX_RT_PRIO 100 > +#define MAX_DL_PRIO 0 > > #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH) > #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2) > diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h > index b2b9e6eb9683..b31be3c50152 100644 > --- a/include/linux/sched/rt.h > +++ b/include/linux/sched/rt.h > @@ -7,18 +7,43 @@ > struct task_struct; > > static inline int rt_prio(int prio) > +{ > + if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO)) > + return 1; > + return 0; > +} > + > +static inline int realtime_prio(int prio) > { > if (unlikely(prio < MAX_RT_PRIO)) > return 1; > return 0; > } > > +/* > + * Returns true if a task has a
[PATCH] sched/rt: Clean up usage of rt_task()
rt_task() checks if a task has RT priority. But depends on your dictionary, this could mean it belongs to RT class, or is a 'realtime' task, which includes RT and DL classes. Since this has caused some confusion already on discussion [1], it seemed a clean up is due. I define the usage of rt_task() to be tasks that belong to RT class. Make sure that it returns true only for RT class and audit the users and replace them with the new realtime_task() which returns true for RT and DL classes - the old behavior. Introduce similar realtime_prio() to create similar distinction to rt_prio() and update the users. Move MAX_DL_PRIO to prio.h so it can be used in the new definitions. Document the functions to make it more obvious what is the difference between them. PI-boosted tasks is a factor that must be taken into account when choosing which function to use. Rename task_is_realtime() to task_has_realtime_policy() as the old name is confusing against the new realtime_task(). No functional changes were intended. [1] https://lore.kernel.org/lkml/20240506100509.gl40...@noisy.programming.kicks-ass.net/ Signed-off-by: Qais Yousef --- fs/select.c | 2 +- include/linux/ioprio.h| 2 +- include/linux/sched/deadline.h| 6 -- include/linux/sched/prio.h| 1 + include/linux/sched/rt.h | 27 ++- kernel/locking/rtmutex.c | 4 ++-- kernel/locking/rwsem.c| 4 ++-- kernel/locking/ww_mutex.h | 2 +- kernel/sched/core.c | 6 +++--- kernel/time/hrtimer.c | 6 +++--- kernel/trace/trace_sched_wakeup.c | 2 +- mm/page-writeback.c | 4 ++-- mm/page_alloc.c | 2 +- 13 files changed, 48 insertions(+), 20 deletions(-) diff --git a/fs/select.c b/fs/select.c index 9515c3fa1a03..8d5c1419416c 100644 --- a/fs/select.c +++ b/fs/select.c @@ -82,7 +82,7 @@ u64 select_estimate_accuracy(struct timespec64 *tv) * Realtime tasks get a slack of 0 for obvious reasons. 
*/ - if (rt_task(current)) + if (realtime_task(current)) return 0; ktime_get_ts64(); diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h index db1249cd9692..6c00342b6166 100644 --- a/include/linux/ioprio.h +++ b/include/linux/ioprio.h @@ -40,7 +40,7 @@ static inline int task_nice_ioclass(struct task_struct *task) { if (task->policy == SCHED_IDLE) return IOPRIO_CLASS_IDLE; - else if (task_is_realtime(task)) + else if (task_has_realtime_policy(task)) return IOPRIO_CLASS_RT; else return IOPRIO_CLASS_BE; diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h index df3aca89d4f5..5cb88b748ad6 100644 --- a/include/linux/sched/deadline.h +++ b/include/linux/sched/deadline.h @@ -10,8 +10,6 @@ #include -#define MAX_DL_PRIO0 - static inline int dl_prio(int prio) { if (unlikely(prio < MAX_DL_PRIO)) @@ -19,6 +17,10 @@ static inline int dl_prio(int prio) return 0; } +/* + * Returns true if a task has a priority that belongs to DL class. PI-boosted + * tasks will return true. Use dl_policy() to ignore PI-boosted tasks. + */ static inline int dl_task(struct task_struct *p) { return dl_prio(p->prio); diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h index ab83d85e1183..6ab43b4f72f9 100644 --- a/include/linux/sched/prio.h +++ b/include/linux/sched/prio.h @@ -14,6 +14,7 @@ */ #define MAX_RT_PRIO100 +#define MAX_DL_PRIO0 #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH) #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2) diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h index b2b9e6eb9683..b31be3c50152 100644 --- a/include/linux/sched/rt.h +++ b/include/linux/sched/rt.h @@ -7,18 +7,43 @@ struct task_struct; static inline int rt_prio(int prio) +{ + if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO)) + return 1; + return 0; +} + +static inline int realtime_prio(int prio) { if (unlikely(prio < MAX_RT_PRIO)) return 1; return 0; } +/* + * Returns true if a task has a priority that belongs to RT class. 
PI-boosted + * tasks will return true. Use rt_policy() to ignore PI-boosted tasks. + */ static inline int rt_task(struct task_struct *p) { return rt_prio(p->prio); } -static inline bool task_is_realtime(struct task_struct *tsk) +/* + * Returns true if a task has a priority that belongs to RT or DL classes. + * PI-boosted tasks will return true. Use task_has_realtime_policy() to ignore + * PI-boosted tasks. + */ +static inline int realtime_task(struct task_struct *p) +{ + return realtime_prio(p->prio); +} + +/* + * Returns true if a task has a policy that belongs to RT or DL classes. + * PI-boosted tasks will return false. + */ +static inline bool
Re: [PATCH v2 2/6] trace: add CONFIG_BUILTIN_MODULE_RANGES option
Hi Kris, kernel test robot noticed the following build warnings: [auto build test WARNING on dd5a440a31fae6e459c0d627162825505361] url: https://github.com/intel-lab-lkp/linux/commits/Kris-Van-Hees/kbuild-add-modules-builtin-objs/20240512-065954 base: dd5a440a31fae6e459c0d627162825505361 patch link: https://lore.kernel.org/r/20240511224035.27775-3-kris.van.hees%40oracle.com patch subject: [PATCH v2 2/6] trace: add CONFIG_BUILTIN_MODULE_RANGES option config: arc-kismet-CONFIG_VMLINUX_MAP-CONFIG_BUILTIN_MODULE_RANGES-0-0 (https://download.01.org/0day-ci/archive/20240515/202405150623.lms5svhm-...@intel.com/config) reproduce: (https://download.01.org/0day-ci/archive/20240515/202405150623.lms5svhm-...@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202405150623.lms5svhm-...@intel.com/ kismet warnings: (new ones prefixed by >>) >> kismet: WARNING: unmet direct dependencies detected for VMLINUX_MAP when >> selected by BUILTIN_MODULE_RANGES WARNING: unmet direct dependencies detected for VMLINUX_MAP Depends on [n]: EXPERT [=n] Selected by [y]: - BUILTIN_MODULE_RANGES [=y] && FTRACE [=y] -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki
Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
On Tue, May 14, 2024 at 04:04:44PM +0200, Björn Töpel wrote:
> From: Björn Töpel
>
> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
> RISC-V.
>
> Signed-off-by: Björn Töpel
> ---
>  arch/riscv/Kconfig | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> index 6bec1bce6586..b9398b64bb69 100644
> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -16,6 +16,8 @@ config RISCV
>  	select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>  	select ARCH_DMA_DEFAULT_COHERENT
>  	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
> +	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU

Hopefully this should be SPARSEMEM_VMEMMAP. We are trying to deprecate
memory-hotplug on !SPARSEMEM_VMEMMAP, and it is always easier to do it
now than when the code has already gone in, so please consider whether
you really need plain SPARSEMEM and why (I do not think you do).

--
Oscar Salvador
SUSE Labs
Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump
On Tue, May 14, 2024 at 04:04:43PM +0200, Björn Töpel wrote:
> From: Björn Töpel
>
> During memory hot remove, the ptdump functionality can end up touching
> stale data. Avoid any potential crashes (or worse), by holding the
> memory hotplug read-lock while traversing the page table.
>
> This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
> Hold memory hotplug lock while walking for kernel page table dump").
>
> Signed-off-by: Björn Töpel

Reviewed-by: Oscar Salvador

Funnily enough, it seems arm64 and riscv are the only ones holding the
hotplug lock here. I think we have the same problem on the other arches
as well (at least on x86_64, as far as I can see). If we do end up
needing the lock there, I would rather have a central function in the
generic mm code that takes the lock and then calls an arch-specific
ptdump_show function, so the locking is not scattered around. But that
is another story.

--
Oscar Salvador
SUSE Labs
Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
On Tue, May 14, 2024 at 04:04:42PM +0200, Björn Töpel wrote: > +static void __meminit free_vmemmap_storage(struct page *page, size_t size, > +struct vmem_altmap *altmap) > +{ > + if (altmap) > + vmem_altmap_free(altmap, size >> PAGE_SHIFT); > + else > + free_pages((unsigned long)page_address(page), get_order(size)); David already pointed this out, but can check arch/x86/mm/init_64.c:free_pagetable(). You will see that we have to do some magic for bootmem memory (DIMMs which were not hotplugged but already present) > +#ifdef CONFIG_SPARSEMEM_VMEMMAP > +void __ref vmemmap_free(unsigned long start, unsigned long end, struct > vmem_altmap *altmap) > +{ > + remove_pgd_mapping(start, end, true, altmap); > +} > +#endif /* CONFIG_SPARSEMEM_VMEMMAP */ > +#endif /* CONFIG_MEMORY_HOTPLUG */ I will comment on the patch where you add support for hotplug and the dependency, but on a track in LSFMM today, we decided that most likely we will drop memory-hotplug support for !CONFIG_SPARSEMEM_VMEMMAP environments. So, since you are adding this plain fresh, please consider to tight the hotplug dependency to CONFIG_SPARSEMEM_VMEMMAP. As a bonus, you will only have to maintain one flavour of functions. -- Oscar Salvador SUSE Labs
Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
On 14.05.24 20:17, Björn Töpel wrote:
> Alexandre Ghiti writes:
>> On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote:
>>> From: Björn Töpel
>>>
>>> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
>>> RISC-V.
>>>
>>> Signed-off-by: Björn Töpel
>>> ---
>>>  arch/riscv/Kconfig | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
>>> index 6bec1bce6586..b9398b64bb69 100644
>>> --- a/arch/riscv/Kconfig
>>> +++ b/arch/riscv/Kconfig
>>> @@ -16,6 +16,8 @@ config RISCV
>>>  	select ACPI_REDUCED_HARDWARE_ONLY if ACPI
>>>  	select ARCH_DMA_DEFAULT_COHERENT
>>>  	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
>>> +	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
>>
>> I think this should be SPARSEMEM_VMEMMAP here.
>
> Hmm, care to elaborate? I thought that was optional.

There was a discussion at LSF/MM today to maybe require SPARSEMEM_VMEMMAP
for hotplug. Would that work here as well?

--
Cheers,

David / dhildenb
Re: [PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump
On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel
>
> During memory hot remove, the ptdump functionality can end up touching
> stale data. Avoid any potential crashes (or worse), by holding the
> memory hotplug read-lock while traversing the page table.
>
> This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
> Hold memory hotplug lock while walking for kernel page table dump").
>
> Signed-off-by: Björn Töpel
> ---
>  arch/riscv/mm/ptdump.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
> index 1289cc6d3700..9d5f657a251b 100644
> --- a/arch/riscv/mm/ptdump.c
> +++ b/arch/riscv/mm/ptdump.c
> @@ -6,6 +6,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
> @@ -370,7 +371,9 @@ bool ptdump_check_wx(void)
>  static int ptdump_show(struct seq_file *m, void *v)
>  {
> +	get_online_mems();
>  	ptdump_walk(m, m->private);
> +	put_online_mems();
>  	return 0;
>  }

Reviewed-by: David Hildenbrand

--
Cheers,

David / dhildenb
Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
On Tue, May 14, 2024 at 04:04:41PM +0200, Björn Töpel wrote:
> From: Björn Töpel
>
> Add a parameter to the direct map setup function, so it can be used in
> arch_add_memory() later.
>
> Signed-off-by: Björn Töpel

Reviewed-by: Oscar Salvador

> ---
>  arch/riscv/mm/init.c | 15 ++-
>  1 file changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index c969427eab88..6f72b0b2b854 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa)
>  }
>
>  static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end,
> -					       uintptr_t fixed_map_size)
> +					       uintptr_t fixed_map_size, const pgprot_t *pgprot)
>  {
>  	phys_addr_t pa;
>  	uintptr_t va, map_size;
> @@ -1238,7 +1238,7 @@ static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t
>  		best_map_size(pa, va, end - pa);
>
>  		create_pgd_mapping(swapper_pg_dir, va, pa, map_size,
> -				   pgprot_from_va(va));
> +				   pgprot ? *pgprot : pgprot_from_va(va));
>  	}
>  }
>
> @@ -1282,22 +1282,19 @@ static void __init create_linear_mapping_page_table(void)
>  		if (end >= __pa(PAGE_OFFSET) + memory_limit)
>  			end = __pa(PAGE_OFFSET) + memory_limit;
>
> -		create_linear_mapping_range(start, end, 0);
> +		create_linear_mapping_range(start, end, 0, NULL);
>  	}
>
>  #ifdef CONFIG_STRICT_KERNEL_RWX
> -	create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0);
> -	create_linear_mapping_range(krodata_start,
> -				    krodata_start + krodata_size, 0);
> +	create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, NULL);
> +	create_linear_mapping_range(krodata_start, krodata_start + krodata_size, 0, NULL);
>
>  	memblock_clear_nomap(ktext_start, ktext_size);
>  	memblock_clear_nomap(krodata_start, krodata_size);
>  #endif
>
>  #ifdef CONFIG_KFENCE
> -	create_linear_mapping_range(kfence_pool,
> -				    kfence_pool + KFENCE_POOL_SIZE,
> -				    PAGE_SIZE);
> +	create_linear_mapping_range(kfence_pool, kfence_pool + KFENCE_POOL_SIZE, PAGE_SIZE, NULL);
>
>  	memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE);
>  #endif
> --
> 2.40.1

--
Oscar Salvador
SUSE Labs
Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
On Tue, May 14, 2024 at 04:04:40PM +0200, Björn Töpel wrote: > From: Björn Töpel > > Prepare for memory hotplugging support by changing from __init to > __meminit for the page table functions that are used by the upcoming > architecture specific callbacks. > > Changing the __init attribute to __meminit, avoids that the functions > are removed after init. The __meminit attribute makes sure the > functions are kept in the kernel text post init, but only if memory > hotplugging is enabled for the build. > > Also, make sure that the altmap parameter is properly passed on to > vmemmap_populate_hugepages(). > > Signed-off-by: Björn Töpel Reviewed-by: Oscar Salvador > +static void __meminit create_linear_mapping_range(phys_addr_t start, > phys_addr_t end, > + uintptr_t fixed_map_size) > { > phys_addr_t pa; > uintptr_t va, map_size; > @@ -1435,7 +1429,7 @@ int __meminit vmemmap_populate(unsigned long start, > unsigned long end, int node, >* memory hotplug, we are not able to update all the page tables with >* the new PMDs. >*/ > - return vmemmap_populate_hugepages(start, end, node, NULL); > + return vmemmap_populate_hugepages(start, end, node, altmap); I would have put this into a separate patch. -- Oscar Salvador SUSE Labs
Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
Alexandre Ghiti writes:
> On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote:
>> +int __ref arch_add_memory(int nid, u64 start, u64 size, struct mhp_params *params)
>> +{
>> +	int ret;
>> +
>> +	create_linear_mapping_range(start, start + size, 0, &params->pgprot);
>> +	flush_tlb_all();
>> +	ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT, params);
>> +	if (ret) {
>> +		remove_linear_mapping(start, size);
>> +		return ret;
>> +	}
>> +
>
> You need to flush the TLB here too since __add_pages() populates the
> page table with the new vmemmap mapping (only because riscv allows to
> cache invalid entries, I'll adapt this in my next version of Svvptc
> support).
>
>> +	max_pfn = PFN_UP(start + size);
>> +	max_low_pfn = max_pfn;
>> +	return 0;
>> +}
>> +
>> +void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
>> +{
>> +	__remove_pages(start >> PAGE_SHIFT, size >> PAGE_SHIFT, altmap);
>> +	remove_linear_mapping(start, size);
>
> You need to flush the TLB here too.

I'll address all of the above in the next version. Thanks for reviewing
the series!

Björn
Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
On Tue, May 14, 2024 at 8:17 PM Björn Töpel wrote: > > Alexandre Ghiti writes: > > > On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: > >> > >> From: Björn Töpel > >> > >> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for > >> RISC-V. > >> > >> Signed-off-by: Björn Töpel > >> --- > >> arch/riscv/Kconfig | 2 ++ > >> 1 file changed, 2 insertions(+) > >> > >> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > >> index 6bec1bce6586..b9398b64bb69 100644 > >> --- a/arch/riscv/Kconfig > >> +++ b/arch/riscv/Kconfig > >> @@ -16,6 +16,8 @@ config RISCV > >> select ACPI_REDUCED_HARDWARE_ONLY if ACPI > >> select ARCH_DMA_DEFAULT_COHERENT > >> select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION > >> + select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU > > > > I think this should be SPARSEMEM_VMEMMAP here. > > Hmm, care to elaborate? I thought that was optional. My bad, I thought VMEMMAP was required in your patchset. Sorry for the noise!
Re: [GIT PULL] OpenRISC updates for 6.10
The pull request you sent on Tue, 14 May 2024 16:34:42 +0100: > https://github.com/openrisc/linux.git tags/for-linus has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/590103732442b4bb83886f03f2ddd39d129c3289 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html
Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
Alexandre Ghiti writes: > On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: >> >> From: Björn Töpel >> >> Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for >> RISC-V. >> >> Signed-off-by: Björn Töpel >> --- >> arch/riscv/Kconfig | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig >> index 6bec1bce6586..b9398b64bb69 100644 >> --- a/arch/riscv/Kconfig >> +++ b/arch/riscv/Kconfig >> @@ -16,6 +16,8 @@ config RISCV >> select ACPI_REDUCED_HARDWARE_ONLY if ACPI >> select ARCH_DMA_DEFAULT_COHERENT >> select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION >> + select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU > > I think this should be SPARSEMEM_VMEMMAP here. Hmm, care to elaborate? I thought that was optional.
Re: [PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: > > From: Björn Töpel > > Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for > RISC-V. > > Signed-off-by: Björn Töpel > --- > arch/riscv/Kconfig | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > index 6bec1bce6586..b9398b64bb69 100644 > --- a/arch/riscv/Kconfig > +++ b/arch/riscv/Kconfig > @@ -16,6 +16,8 @@ config RISCV > select ACPI_REDUCED_HARDWARE_ONLY if ACPI > select ARCH_DMA_DEFAULT_COHERENT > select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION > + select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU I think this should be SPARSEMEM_VMEMMAP here. > + select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG > select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2 > select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE > select ARCH_HAS_BINFMT_FLAT > -- > 2.40.1 >
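[Editorial note: Alexandre's suggestion above — echoed by Oscar's LSF/MM/BPF comment in the patch 4/8 thread — would tighten the hotplug dependency to vmemmap-backed sparsemem. Illustratively, the select lines in arch/riscv/Kconfig might then read as follows; this is a sketch of the proposed direction, not the final patch:]

```
	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM_VMEMMAP && 64BIT && MMU
	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
```

With SPARSEMEM_VMEMMAP as a hard dependency, the !VMEMMAP flavour of the hotplug helpers never has to be built or maintained for riscv.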
Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: > > From: Björn Töpel > > For an architecture to support memory hotplugging, a couple of > callbacks need to be implemented: > > arch_add_memory() > This callback is responsible for adding the physical memory into the > direct map, and calls into the memory hotplugging generic code via > __add_pages() that adds the corresponding struct page entries, and > updates the vmemmap mapping. > > arch_remove_memory() > This is the inverse of the callback above. > > vmemmap_free() > This function tears down the vmemmap mappings (if > CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the > backing vmemmap pages. Note that for persistent memory, an > alternative allocator for the backing pages can be used; the > vmem_altmap. This means that when the backing pages are cleared, > extra care is needed so that the correct deallocation method is > used. > > arch_get_mappable_range() > This function returns the PA range that the direct map can map. > Used by the MHP internals for sanity checks. > > The page table unmap/teardown functions are heavily based on code from > the x86 tree. The same remove_pgd_mapping() function is used in both > vmemmap_free() and arch_remove_memory(), but in the latter function > the backing pages are not removed.
> > Signed-off-by: Björn Töpel > --- > arch/riscv/mm/init.c | 242 +++ > 1 file changed, 242 insertions(+) > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c > index 6f72b0b2b854..7f0b921a3d3a 100644 > --- a/arch/riscv/mm/init.c > +++ b/arch/riscv/mm/init.c > @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void) > } > } > #endif > + > +#ifdef CONFIG_MEMORY_HOTPLUG > +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd) > +{ > + pte_t *pte; > + int i; > + > + for (i = 0; i < PTRS_PER_PTE; i++) { > + pte = pte_start + i; > + if (!pte_none(*pte)) > + return; > + } > + > + free_pages((unsigned long)page_address(pmd_page(*pmd)), 0); > + pmd_clear(pmd); > +} > + > +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud) > +{ > + pmd_t *pmd; > + int i; > + > + for (i = 0; i < PTRS_PER_PMD; i++) { > + pmd = pmd_start + i; > + if (!pmd_none(*pmd)) > + return; > + } > + > + free_pages((unsigned long)page_address(pud_page(*pud)), 0); > + pud_clear(pud); > +} > + > +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d) > +{ > + pud_t *pud; > + int i; > + > + for (i = 0; i < PTRS_PER_PUD; i++) { > + pud = pud_start + i; > + if (!pud_none(*pud)) > + return; > + } > + > + free_pages((unsigned long)page_address(p4d_page(*p4d)), 0); > + p4d_clear(p4d); > +} > + > +static void __meminit free_vmemmap_storage(struct page *page, size_t size, > + struct vmem_altmap *altmap) > +{ > + if (altmap) > + vmem_altmap_free(altmap, size >> PAGE_SHIFT); > + else > + free_pages((unsigned long)page_address(page), > get_order(size)); > +} > + > +static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long > addr, unsigned long end, > +bool is_vmemmap, struct vmem_altmap > *altmap) > +{ > + unsigned long next; > + pte_t *ptep, pte; > + > + for (; addr < end; addr = next) { > + next = (addr + PAGE_SIZE) & PAGE_MASK; > + if (next > end) > + next = end; > + > + ptep = pte_base + pte_index(addr); > + pte = READ_ONCE(*ptep); > + > + if 
(!pte_present(*ptep)) > + continue; > + pte_clear(&init_mm, addr, ptep); > + if (is_vmemmap) > + free_vmemmap_storage(pte_page(pte), PAGE_SIZE, > altmap); > + } > +} > + > +static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long > addr, unsigned long end, > +bool is_vmemmap, struct vmem_altmap > *altmap) > +{ > + unsigned long next; > + pte_t *pte_base; > + pmd_t *pmdp, pmd; > + > + for (; addr < end; addr = next) { > + next = pmd_addr_end(addr, end); > + pmdp = pmd_base + pmd_index(addr); > + pmd = READ_ONCE(*pmdp); > + > + if (!pmd_present(pmd)) > + continue; > + > + if (pmd_leaf(pmd)) { > + pmd_clear(pmdp); > + if (is_vmemmap) > + free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, > altmap); > + continue; > + } > + > + pte_base =
Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
Alexandre Ghiti writes: > On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: >> >> From: Björn Töpel >> >> Prepare for memory hotplugging support by changing from __init to >> __meminit for the page table functions that are used by the upcoming >> architecture specific callbacks. >> >> Changing the __init attribute to __meminit, avoids that the functions >> are removed after init. The __meminit attribute makes sure the >> functions are kept in the kernel text post init, but only if memory >> hotplugging is enabled for the build. >> >> Also, make sure that the altmap parameter is properly passed on to >> vmemmap_populate_hugepages(). >> >> Signed-off-by: Björn Töpel >> --- >> arch/riscv/include/asm/mmu.h | 4 +-- >> arch/riscv/include/asm/pgtable.h | 2 +- >> arch/riscv/mm/init.c | 58 ++-- >> 3 files changed, 29 insertions(+), 35 deletions(-) >> >> diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h >> index 60be458e94da..c09c3c79f496 100644 >> --- a/arch/riscv/include/asm/mmu.h >> +++ b/arch/riscv/include/asm/mmu.h >> @@ -28,8 +28,8 @@ typedef struct { >> #endif >> } mm_context_t; >> >> -void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, >> - phys_addr_t sz, pgprot_t prot); >> +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t >> pa, phys_addr_t sz, >> + pgprot_t prot); >> #endif /* __ASSEMBLY__ */ >> >> #endif /* _ASM_RISCV_MMU_H */ >> diff --git a/arch/riscv/include/asm/pgtable.h >> b/arch/riscv/include/asm/pgtable.h >> index 58fd7b70b903..7933f493db71 100644 >> --- a/arch/riscv/include/asm/pgtable.h >> +++ b/arch/riscv/include/asm/pgtable.h >> @@ -162,7 +162,7 @@ struct pt_alloc_ops { >> #endif >> }; >> >> -extern struct pt_alloc_ops pt_ops __initdata; >> +extern struct pt_alloc_ops pt_ops __meminitdata; >> >> #ifdef CONFIG_MMU >> /* Number of PGD entries that a user-mode program can use */ >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c >> index 5b8cdfafb52a..c969427eab88 100644 >> --- 
a/arch/riscv/mm/init.c >> +++ b/arch/riscv/mm/init.c >> @@ -295,7 +295,7 @@ static void __init setup_bootmem(void) >> } >> >> #ifdef CONFIG_MMU >> -struct pt_alloc_ops pt_ops __initdata; >> +struct pt_alloc_ops pt_ops __meminitdata; >> >> pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss; >> pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss; >> @@ -357,7 +357,7 @@ static inline pte_t *__init >> get_pte_virt_fixmap(phys_addr_t pa) >> return (pte_t *)set_fixmap_offset(FIX_PTE, pa); >> } >> >> -static inline pte_t *__init get_pte_virt_late(phys_addr_t pa) >> +static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa) >> { >> return (pte_t *) __va(pa); >> } >> @@ -376,7 +376,7 @@ static inline phys_addr_t __init >> alloc_pte_fixmap(uintptr_t va) >> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE); >> } >> >> -static phys_addr_t __init alloc_pte_late(uintptr_t va) >> +static phys_addr_t __meminit alloc_pte_late(uintptr_t va) >> { >> struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, >> 0); >> >> @@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va) >> return __pa((pte_t *)ptdesc_address(ptdesc)); >> } >> >> -static void __init create_pte_mapping(pte_t *ptep, >> - uintptr_t va, phys_addr_t pa, >> - phys_addr_t sz, pgprot_t prot) >> +static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, >> phys_addr_t pa, phys_addr_t sz, >> +pgprot_t prot) >> { >> uintptr_t pte_idx = pte_index(va); >> >> @@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa) >> return (pmd_t *)set_fixmap_offset(FIX_PMD, pa); >> } >> >> -static pmd_t *__init get_pmd_virt_late(phys_addr_t pa) >> +static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa) >> { >> return (pmd_t *) __va(pa); >> } >> @@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va) >> return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE); >> } >> >> -static phys_addr_t __init alloc_pmd_late(uintptr_t va) >> +static phys_addr_t 
__meminit alloc_pmd_late(uintptr_t va) >> { >> struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, >> 0); >> >> @@ -465,9 +464,9 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va) >> return __pa((pmd_t *)ptdesc_address(ptdesc)); >> } >> >> -static void __init create_pmd_mapping(pmd_t *pmdp, >> - uintptr_t va, phys_addr_t pa, >> - phys_addr_t sz, pgprot_t prot) >> +static void __meminit create_pmd_mapping(pmd_t *pmdp, >> +uintptr_t va, phys_addr_t pa, >> +phys_addr_t sz,
Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: > > From: Björn Töpel > > Add a parameter to the direct map setup function, so it can be used in > arch_add_memory() later. > > Signed-off-by: Björn Töpel > --- > arch/riscv/mm/init.c | 15 ++- > 1 file changed, 6 insertions(+), 9 deletions(-) > > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c > index c969427eab88..6f72b0b2b854 100644 > --- a/arch/riscv/mm/init.c > +++ b/arch/riscv/mm/init.c > @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa) > } > > static void __meminit create_linear_mapping_range(phys_addr_t start, > phys_addr_t end, > - uintptr_t fixed_map_size) > + uintptr_t fixed_map_size, > const pgprot_t *pgprot) > { > phys_addr_t pa; > uintptr_t va, map_size; > @@ -1238,7 +1238,7 @@ static void __meminit > create_linear_mapping_range(phys_addr_t start, phys_addr_t > best_map_size(pa, va, end - pa); > > create_pgd_mapping(swapper_pg_dir, va, pa, map_size, > - pgprot_from_va(va)); > + pgprot ? *pgprot : pgprot_from_va(va)); > } > } > > @@ -1282,22 +1282,19 @@ static void __init > create_linear_mapping_page_table(void) > if (end >= __pa(PAGE_OFFSET) + memory_limit) > end = __pa(PAGE_OFFSET) + memory_limit; > > - create_linear_mapping_range(start, end, 0); > + create_linear_mapping_range(start, end, 0, NULL); > } > > #ifdef CONFIG_STRICT_KERNEL_RWX > - create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0); > - create_linear_mapping_range(krodata_start, > - krodata_start + krodata_size, 0); > + create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, > NULL); > + create_linear_mapping_range(krodata_start, krodata_start + > krodata_size, 0, NULL); > > memblock_clear_nomap(ktext_start, ktext_size); > memblock_clear_nomap(krodata_start, krodata_size); > #endif > > #ifdef CONFIG_KFENCE > - create_linear_mapping_range(kfence_pool, > - kfence_pool + KFENCE_POOL_SIZE, > - PAGE_SIZE); > + create_linear_mapping_range(kfence_pool, kfence_pool + > 
KFENCE_POOL_SIZE, PAGE_SIZE, NULL); > > memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE); > #endif > -- > 2.40.1 > You can add: Reviewed-by: Alexandre Ghiti Thanks, Alex
Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: > > From: Björn Töpel > > Prepare for memory hotplugging support by changing from __init to > __meminit for the page table functions that are used by the upcoming > architecture specific callbacks. > > Changing the __init attribute to __meminit, avoids that the functions > are removed after init. The __meminit attribute makes sure the > functions are kept in the kernel text post init, but only if memory > hotplugging is enabled for the build. > > Also, make sure that the altmap parameter is properly passed on to > vmemmap_populate_hugepages(). > > Signed-off-by: Björn Töpel > --- > arch/riscv/include/asm/mmu.h | 4 +-- > arch/riscv/include/asm/pgtable.h | 2 +- > arch/riscv/mm/init.c | 58 ++-- > 3 files changed, 29 insertions(+), 35 deletions(-) > > diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h > index 60be458e94da..c09c3c79f496 100644 > --- a/arch/riscv/include/asm/mmu.h > +++ b/arch/riscv/include/asm/mmu.h > @@ -28,8 +28,8 @@ typedef struct { > #endif > } mm_context_t; > > -void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, > - phys_addr_t sz, pgprot_t prot); > +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, > phys_addr_t sz, > + pgprot_t prot); > #endif /* __ASSEMBLY__ */ > > #endif /* _ASM_RISCV_MMU_H */ > diff --git a/arch/riscv/include/asm/pgtable.h > b/arch/riscv/include/asm/pgtable.h > index 58fd7b70b903..7933f493db71 100644 > --- a/arch/riscv/include/asm/pgtable.h > +++ b/arch/riscv/include/asm/pgtable.h > @@ -162,7 +162,7 @@ struct pt_alloc_ops { > #endif > }; > > -extern struct pt_alloc_ops pt_ops __initdata; > +extern struct pt_alloc_ops pt_ops __meminitdata; > > #ifdef CONFIG_MMU > /* Number of PGD entries that a user-mode program can use */ > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c > index 5b8cdfafb52a..c969427eab88 100644 > --- a/arch/riscv/mm/init.c > +++ b/arch/riscv/mm/init.c > @@ -295,7 +295,7 @@ static 
void __init setup_bootmem(void) > } > > #ifdef CONFIG_MMU > -struct pt_alloc_ops pt_ops __initdata; > +struct pt_alloc_ops pt_ops __meminitdata; > > pgd_t swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss; > pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss; > @@ -357,7 +357,7 @@ static inline pte_t *__init > get_pte_virt_fixmap(phys_addr_t pa) > return (pte_t *)set_fixmap_offset(FIX_PTE, pa); > } > > -static inline pte_t *__init get_pte_virt_late(phys_addr_t pa) > +static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa) > { > return (pte_t *) __va(pa); > } > @@ -376,7 +376,7 @@ static inline phys_addr_t __init > alloc_pte_fixmap(uintptr_t va) > return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE); > } > > -static phys_addr_t __init alloc_pte_late(uintptr_t va) > +static phys_addr_t __meminit alloc_pte_late(uintptr_t va) > { > struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, > 0); > > @@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va) > return __pa((pte_t *)ptdesc_address(ptdesc)); > } > > -static void __init create_pte_mapping(pte_t *ptep, > - uintptr_t va, phys_addr_t pa, > - phys_addr_t sz, pgprot_t prot) > +static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, > phys_addr_t pa, phys_addr_t sz, > +pgprot_t prot) > { > uintptr_t pte_idx = pte_index(va); > > @@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa) > return (pmd_t *)set_fixmap_offset(FIX_PMD, pa); > } > > -static pmd_t *__init get_pmd_virt_late(phys_addr_t pa) > +static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa) > { > return (pmd_t *) __va(pa); > } > @@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va) > return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE); > } > > -static phys_addr_t __init alloc_pmd_late(uintptr_t va) > +static phys_addr_t __meminit alloc_pmd_late(uintptr_t va) > { > struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, > 0); > > @@ -465,9 +464,9 @@ 
static phys_addr_t __init alloc_pmd_late(uintptr_t va) > return __pa((pmd_t *)ptdesc_address(ptdesc)); > } > > -static void __init create_pmd_mapping(pmd_t *pmdp, > - uintptr_t va, phys_addr_t pa, > - phys_addr_t sz, pgprot_t prot) > +static void __meminit create_pmd_mapping(pmd_t *pmdp, > +uintptr_t va, phys_addr_t pa, > +phys_addr_t sz, pgprot_t prot) > { > pte_t *ptep; > phys_addr_t pte_phys; > @@ -503,7 +502,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa) >
Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
David Hildenbrand writes: > On 14.05.24 16:04, Björn Töpel wrote: >> From: Björn Töpel >> >> For an architecture to support memory hotplugging, a couple of >> callbacks need to be implemented: >> >> arch_add_memory() >>This callback is responsible for adding the physical memory into the >>direct map, and calls into the memory hotplugging generic code via >>__add_pages() that adds the corresponding struct page entries, and >>updates the vmemmap mapping. >> >> arch_remove_memory() >>This is the inverse of the callback above. >> >> vmemmap_free() >>This function tears down the vmemmap mappings (if >>CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the >>backing vmemmap pages. Note that for persistent memory, an >>alternative allocator for the backing pages can be used; the >>vmem_altmap. This means that when the backing pages are cleared, >>extra care is needed so that the correct deallocation method is >>used. >> >> arch_get_mappable_range() >>This function returns the PA range that the direct map can map. >>Used by the MHP internals for sanity checks. >> >> The page table unmap/teardown functions are heavily based on code from >> the x86 tree. The same remove_pgd_mapping() function is used in both >> vmemmap_free() and arch_remove_memory(), but in the latter function >> the backing pages are not removed.
>> >> Signed-off-by: Björn Töpel >> --- >> arch/riscv/mm/init.c | 242 +++ >> 1 file changed, 242 insertions(+) >> >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c >> index 6f72b0b2b854..7f0b921a3d3a 100644 >> --- a/arch/riscv/mm/init.c >> +++ b/arch/riscv/mm/init.c >> @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void) >> } >> } >> #endif >> + >> +#ifdef CONFIG_MEMORY_HOTPLUG >> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd) >> +{ >> +pte_t *pte; >> +int i; >> + >> +for (i = 0; i < PTRS_PER_PTE; i++) { >> +pte = pte_start + i; >> +if (!pte_none(*pte)) >> +return; >> +} >> + >> +free_pages((unsigned long)page_address(pmd_page(*pmd)), 0); >> +pmd_clear(pmd); >> +} >> + >> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud) >> +{ >> +pmd_t *pmd; >> +int i; >> + >> +for (i = 0; i < PTRS_PER_PMD; i++) { >> +pmd = pmd_start + i; >> +if (!pmd_none(*pmd)) >> +return; >> +} >> + >> +free_pages((unsigned long)page_address(pud_page(*pud)), 0); >> +pud_clear(pud); >> +} >> + >> +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d) >> +{ >> +pud_t *pud; >> +int i; >> + >> +for (i = 0; i < PTRS_PER_PUD; i++) { >> +pud = pud_start + i; >> +if (!pud_none(*pud)) >> +return; >> +} >> + >> +free_pages((unsigned long)page_address(p4d_page(*p4d)), 0); >> +p4d_clear(p4d); >> +} >> + >> +static void __meminit free_vmemmap_storage(struct page *page, size_t size, >> + struct vmem_altmap *altmap) >> +{ >> +if (altmap) >> +vmem_altmap_free(altmap, size >> PAGE_SHIFT); >> +else >> +free_pages((unsigned long)page_address(page), get_order(size)); > > If you unplug a DIMM that was added during boot (can happen on x86-64, > can it happen on riscv?), free_pages() would not be sufficient. You'd be > freeing a PG_reserved page that has to be freed differently. I'd say if it can happen on x86-64, it probably can on RISC-V. I'll look into this for the next spin! Thanks for spending time on the series! Cheers, Björn
Re: [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries
Alexandre Ghiti writes: > Hi Björn, > > On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: >> >> From: Björn Töpel >> >> The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to >> all userland page tables, which means that if the PGD level table is >> changed, other page tables have to be updated as well. >> >> Instead of having the PGD changes ripple out to all tables, the >> synchronization can be avoided by pre-allocating the PGD entries/pages >> at boot, avoiding the synchronization altogether. >> >> This is currently done for the bpf/modules, and vmalloc PGD regions. >> Extend this scheme for the PGD regions touched by memory hotplugging. >> >> Prepare the RISC-V port for memory hotplug by pre-allocating >> vmemmap/direct map entries at the PGD level. This will roughly waste >> ~128 worth of 4K pages when memory hotplugging is enabled in the >> kernel configuration. >> >> Signed-off-by: Björn Töpel >> --- >> arch/riscv/include/asm/kasan.h | 4 ++-- >> arch/riscv/mm/init.c | 7 +++ >> 2 files changed, 9 insertions(+), 2 deletions(-) >> >> diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h >> index 0b85e363e778..e6a0071bdb56 100644 >> --- a/arch/riscv/include/asm/kasan.h >> +++ b/arch/riscv/include/asm/kasan.h >> @@ -6,8 +6,6 @@ >> >> #ifndef __ASSEMBLY__ >> >> -#ifdef CONFIG_KASAN >> - >> /* >> * The following comment was copied from arm64: >> * KASAN_SHADOW_START: beginning of the kernel virtual addresses. 
>> @@ -34,6 +32,8 @@ >> */ >> #define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & >> PGDIR_MASK) >> #define KASAN_SHADOW_END MODULES_LOWEST_VADDR >> + >> +#ifdef CONFIG_KASAN >> #define KASAN_SHADOW_OFFSET_AC(CONFIG_KASAN_SHADOW_OFFSET, UL) >> >> void kasan_init(void); >> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c >> index 2574f6a3b0e7..5b8cdfafb52a 100644 >> --- a/arch/riscv/mm/init.c >> +++ b/arch/riscv/mm/init.c >> @@ -27,6 +27,7 @@ >> >> #include >> #include >> +#include >> #include >> #include >> #include >> @@ -1488,10 +1489,16 @@ static void __init >> preallocate_pgd_pages_range(unsigned long start, unsigned lon >> panic("Failed to pre-allocate %s pages for %s area\n", lvl, area); >> } >> >> +#define PAGE_END KASAN_SHADOW_START >> + >> void __init pgtable_cache_init(void) >> { >> preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc"); >> if (IS_ENABLED(CONFIG_MODULES)) >> preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, >> "bpf/modules"); >> + if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) { >> + preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, >> "vmemmap"); >> + preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct >> map"); >> + } >> } >> #endif >> -- >> 2.40.1 >> > > As you asked, with > https://lore.kernel.org/linux-riscv/20240514133614.87813-1-alexgh...@rivosinc.com/T/#u, > you will be able to remove the usage of KASAN_SHADOW_START. Very nice -- consistency! I'll need to respin, so I'll clean this up for the next version. > But anyhow, you can add: > > Reviewed-by: Alexandre Ghiti Thank you! Björn
Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties
On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote: > +/* > + * Used for internal kernel memctl calls, i.e. to better support kernel > stacks, > + * or to efficiently zero hugetlb pages. > + */ > +long memctl_vmm_call(__u64 func_code, __u64 addr, __u64 length, __u64 arg, > + struct memctl_buf *buf) > +{ > + buf->call.func_code = func_code; > + buf->call.addr = addr; > + buf->call.length = length; > + buf->call.arg = arg; > + > + return __memctl_vmm_call(buf); > +} > +EXPORT_SYMBOL(memctl_vmm_call); You export something that is never actually called, which implies that this is not tested at all (i.e. it is dead code.) Please remove. Also, why not EXPORT_SYMBOL_GPL()? (I have to ask, sorry.) thanks, greg k-h
Re: [RFC PATCH v1 1/2] virt: memctl: control guest physical memory properties
On Mon, May 13, 2024 at 07:03:00PM -0700, Yuanchu Xie wrote: > Memctl provides a way for the guest to control its physical memory > properties, and enables optimizations and security features. For > example, the guest can provide information to the host where parts of a > hugepage may be unbacked, or sensitive data may not be swapped out, etc. > > Memctl allows a guest to manipulate its gPTE entries in the SLAT, and > also some other properties of the host memory map that backs it. > This is achieved by using the KVM_CAP_SYNC_MMU capability. When this > capability is available, the changes in the backing of the memory region > on the host are automatically reflected into the guest. For example, an > mmap() or madvise() that affects the region will be made visible > immediately. > > There are two components of the implementation: the guest Linux driver > and Virtual Machine Monitor (VMM) device. A guest-allocated shared > buffer is negotiated per-cpu through a few PCI MMIO registers; the VMM > device assigns a unique command for each per-cpu buffer. The guest > writes its memctl request in the per-cpu buffer, then writes the > corresponding command into the command register, calling into the VMM > device to perform the memctl request. > > The synchronous per-cpu shared buffer approach avoids the kick and busy > waiting that the guest would have to do with virtio virtqueue transport. > > We provide both kernel and userspace APIs > Kernel API > long memctl_vmm_call(__u64 func_code, __u64 addr, __u64 length, __u64 arg, >struct memctl_buf *buf); > > Kernel drivers can take advantage of the memctl calls to provide > paravirtualization of kernel stacks or page zeroing. > > User API > From the userland, the memctl guest driver is controlled via ioctl(2) > call. It requires CAP_SYS_ADMIN. 
> > ioctl(fd, MEMCTL_IOCTL, union memctl_vmm *memctl_vmm); > > Guest userland applications can tag VMAs and guest hugepages, or advise > the host on how to handle sensitive guest pages. > > Supported function codes and their use cases: > MEMCTL_FREE/REMOVE/DONTNEED/PAGEOUT. For the guest. One can reduce the > struct page and page table lookup overhead by using hugepages backed by > smaller pages on the host. These memctl commands can allow for partial > freeing of private guest hugepages to save memory. They also allow > kernel memory, such as kernel stacks and task_structs to be > paravirtualized. > > MEMCTL_UNMERGEABLE is useful for security, when the VM does not want to > share its backing pages. > The same with MADV_DONTDUMP, so sensitive pages are not included in a > dump. > MLOCK/UNLOCK can advise the host that sensitive information is not > swapped out on the host. > > MEMCTL_MPROTECT_NONE/R/W/RW. For guest stacks backed by hugepages, stack > guard pages can be handled in the host and memory can be saved in the > hugepage. > > MEMCTL_SET_VMA_ANON_NAME is useful for observability and debugging how > guest memory is being mapped on the host. > > Sample program making use of MEMCTL_SET_VMA_ANON_NAME and > MEMCTL_DONTNEED: > https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/main > https://github.com/Dummyc0m/memctl-set-anon-vma-name/tree/dontneed > > The VMM implementation is being proposed for Cloud Hypervisor: > https://github.com/Dummyc0m/cloud-hypervisor/ > > Cloud Hypervisor issue: > https://github.com/cloud-hypervisor/cloud-hypervisor/issues/6318 > > Signed-off-by: Yuanchu Xie > --- > .../userspace-api/ioctl/ioctl-number.rst | 2 + > drivers/virt/Kconfig | 2 + > drivers/virt/Makefile | 1 + > drivers/virt/memctl/Kconfig | 10 + > drivers/virt/memctl/Makefile | 2 + > drivers/virt/memctl/memctl.c | 425 ++ > include/linux/memctl.h| 27 ++ > include/uapi/linux/memctl.h | 81 You are mixing your PCI driver in with the memctl core code, is that intentional? 
Will there never be another PCI device for this type of interface other than this one PCI device? And if so, why export anything, why isn't this all in one body of code? > 8 files changed, 550 insertions(+) > create mode 100644 drivers/virt/memctl/Kconfig > create mode 100644 drivers/virt/memctl/Makefile > create mode 100644 drivers/virt/memctl/memctl.c > create mode 100644 include/linux/memctl.h > create mode 100644 include/uapi/linux/memctl.h > > diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst > b/Documentation/userspace-api/ioctl/ioctl-number.rst > index 457e16f06e04..789d1251c0be 100644 > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst > @@ -368,6 +368,8 @@ Code Seq#Include File >Comments > 0xCD 01 linux/reiserfs_fs.h > 0xCE 01-02 uapi/linux/cxl_mem.h
Re: [PATCH v1 1/1] Input: gpio-keys - expose wakeup keys in sysfs
Hi, On Mon, May 13, 2024 at 03:13:53PM -0700, Dmitry Torokhov wrote: > Hi Guido, > > On Thu, May 09, 2024 at 02:00:28PM +0200, Guido Günther wrote: > > This helps user space to figure out which keys should be used to unidle a > > device. E.g on phones the volume rocker should usually not unblank the > > screen. > > How exactly this is supposed to be used? We have "disabled" keys and > switches attribute because this function can be controlled at runtime > from userspace while wakeup control is a static device setting. Current Linux userspace usually unblanks/unidles a device on every keypress. That is usually not the expected result on phones where often only the power button and e.g. some home buttons should do this. These keys usually match the keys that are used as wakeup sources to bring a device out of suspend. So if we export the wakeup keys to userspace we can pick some sensible defaults (overridable via hwdb¹). > Kernel also does not really know if the screen should be unblanked or > not, if a button or switch is configured for wake up the kernel will go > through wakeup process all the same and then userspace can decide if it > should stay woken up or not. Yes, we merely want that as a hint to figure out sensible defaults in userspace (which might be a subset of the wakeup keys). Cheers, -- Guido ¹) See https://gitlab.gnome.org/World/Phosh/gmobile/-/blob/main/data/61-gmobile-wakeup.hwdb?ref_type=heads#L57-L59 > > Thanks. > > -- > Dmitry >
Re: [PATCH v2 4/8] riscv: mm: Add memory hotplugging support
On 14.05.24 16:04, Björn Töpel wrote: From: Björn Töpel For an architecture to support memory hotplugging, a couple of callbacks need to be implemented: arch_add_memory() This callback is responsible for adding the physical memory into the direct map, and calls into the memory hotplugging generic code via __add_pages() that adds the corresponding struct page entries, and updates the vmemmap mapping. arch_remove_memory() This is the inverse of the callback above. vmemmap_free() This function tears down the vmemmap mappings (if CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the backing vmemmap pages. Note that for persistent memory, an alternative allocator for the backing pages can be used; the vmem_altmap. This means that when the backing pages are cleared, extra care is needed so that the correct deallocation method is used. arch_get_mappable_range() This function returns the PA range that the direct map can map. Used by the MHP internals for sanity checks. The page table unmap/teardown functions are heavily based on code from the x86 tree. The same remove_pgd_mapping() function is used in both vmemmap_free() and arch_remove_memory(), but in the latter function the backing pages are not removed. 
Signed-off-by: Björn Töpel --- arch/riscv/mm/init.c | 242 +++ 1 file changed, 242 insertions(+) diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index 6f72b0b2b854..7f0b921a3d3a 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void) } } #endif + +#ifdef CONFIG_MEMORY_HOTPLUG +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd) +{ + pte_t *pte; + int i; + + for (i = 0; i < PTRS_PER_PTE; i++) { + pte = pte_start + i; + if (!pte_none(*pte)) + return; + } + + free_pages((unsigned long)page_address(pmd_page(*pmd)), 0); + pmd_clear(pmd); +} + +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud) +{ + pmd_t *pmd; + int i; + + for (i = 0; i < PTRS_PER_PMD; i++) { + pmd = pmd_start + i; + if (!pmd_none(*pmd)) + return; + } + + free_pages((unsigned long)page_address(pud_page(*pud)), 0); + pud_clear(pud); +} + +static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d) +{ + pud_t *pud; + int i; + + for (i = 0; i < PTRS_PER_PUD; i++) { + pud = pud_start + i; + if (!pud_none(*pud)) + return; + } + + free_pages((unsigned long)page_address(p4d_page(*p4d)), 0); + p4d_clear(p4d); +} + +static void __meminit free_vmemmap_storage(struct page *page, size_t size, + struct vmem_altmap *altmap) +{ + if (altmap) + vmem_altmap_free(altmap, size >> PAGE_SHIFT); + else + free_pages((unsigned long)page_address(page), get_order(size)); If you unplug a DIMM that was added during boot (can happen on x86-64, can it happen on riscv?), free_pages() would not be sufficient. You'd be freeing a PG_reserved page that has to be freed differently. -- Cheers, David / dhildenb
Re: [PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel
>
> Add a parameter to the direct map setup function, so it can be used
> in arch_add_memory() later.
>
> Signed-off-by: Björn Töpel
> ---

Reviewed-by: David Hildenbrand

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel
>
> Prepare for memory hotplugging support by changing from __init to
> __meminit for the page table functions that are used by the upcoming
> architecture specific callbacks.
>
> Changing the __init attribute to __meminit, avoids that the functions
> are removed after init. The __meminit attribute makes sure the
> functions are kept in the kernel text post init, but only if memory
> hotplugging is enabled for the build.
>
> Also, make sure that the altmap parameter is properly passed on to
> vmemmap_populate_hugepages().
>
> Signed-off-by: Björn Töpel
> ---

Reviewed-by: David Hildenbrand

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V
On 14.05.24 16:04, Björn Töpel wrote:
> From: Björn Töpel
>
> Now that RISC-V has memory hotplugging support, virtio-mem can be
> used on the platform.
>
> Signed-off-by: Björn Töpel
> ---
>  drivers/virtio/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index c17193544268..4e5cebf1b82a 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -122,7 +122,7 @@ config VIRTIO_BALLOON
>  
>  config VIRTIO_MEM
>  	tristate "Virtio mem driver"
> -	depends on X86_64 || ARM64
> +	depends on X86_64 || ARM64 || RISCV
>  	depends on VIRTIO
>  	depends on MEMORY_HOTPLUG
>  	depends on MEMORY_HOTREMOVE

Nice!

Acked-by: David Hildenbrand

-- 
Cheers,

David / dhildenb
Re: [PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries
Hi Björn, On Tue, May 14, 2024 at 4:05 PM Björn Töpel wrote: > > From: Björn Töpel > > The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to > all userland page tables, which means that if the PGD level table is > changed, other page tables has to be updated as well. > > Instead of having the PGD changes ripple out to all tables, the > synchronization can be avoided by pre-allocating the PGD entries/pages > at boot, avoiding the synchronization all together. > > This is currently done for the bpf/modules, and vmalloc PGD regions. > Extend this scheme for the PGD regions touched by memory hotplugging. > > Prepare the RISC-V port for memory hotplug by pre-allocate > vmemmap/direct map entries at the PGD level. This will roughly waste > ~128 worth of 4K pages when memory hotplugging is enabled in the > kernel configuration. > > Signed-off-by: Björn Töpel > --- > arch/riscv/include/asm/kasan.h | 4 ++-- > arch/riscv/mm/init.c | 7 +++ > 2 files changed, 9 insertions(+), 2 deletions(-) > > diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h > index 0b85e363e778..e6a0071bdb56 100644 > --- a/arch/riscv/include/asm/kasan.h > +++ b/arch/riscv/include/asm/kasan.h > @@ -6,8 +6,6 @@ > > #ifndef __ASSEMBLY__ > > -#ifdef CONFIG_KASAN > - > /* > * The following comment was copied from arm64: > * KASAN_SHADOW_START: beginning of the kernel virtual addresses. 
> @@ -34,6 +32,8 @@ > */ > #define KASAN_SHADOW_START ((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & > PGDIR_MASK) > #define KASAN_SHADOW_END MODULES_LOWEST_VADDR > + > +#ifdef CONFIG_KASAN > #define KASAN_SHADOW_OFFSET_AC(CONFIG_KASAN_SHADOW_OFFSET, UL) > > void kasan_init(void); > diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c > index 2574f6a3b0e7..5b8cdfafb52a 100644 > --- a/arch/riscv/mm/init.c > +++ b/arch/riscv/mm/init.c > @@ -27,6 +27,7 @@ > > #include > #include > +#include > #include > #include > #include > @@ -1488,10 +1489,16 @@ static void __init > preallocate_pgd_pages_range(unsigned long start, unsigned lon > panic("Failed to pre-allocate %s pages for %s area\n", lvl, area); > } > > +#define PAGE_END KASAN_SHADOW_START > + > void __init pgtable_cache_init(void) > { > preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc"); > if (IS_ENABLED(CONFIG_MODULES)) > preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, > "bpf/modules"); > + if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) { > + preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, > "vmemmap"); > + preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct > map"); > + } > } > #endif > -- > 2.40.1 > As you asked, with https://lore.kernel.org/linux-riscv/20240514133614.87813-1-alexgh...@rivosinc.com/T/#u, you will be able to remove the usage of KASAN_SHADOW_START. But anyhow, you can add: Reviewed-by: Alexandre Ghiti Thanks, Alex
[GIT PULL] OpenRISC updates for 6.10
Hello Linus,

Please consider for pull,

The following changes since commit 4cece764965020c22cff7665b18a012006359095:

  Linux 6.9-rc1 (2024-03-24 14:10:05 -0700)

are available in the Git repository at:

  https://github.com/openrisc/linux.git tags/for-linus

for you to fetch changes up to 4dc70e1aadfadf968676d983587c6f5d455aba85:

  openrisc: Move FPU state out of pt_regs (2024-04-15 15:20:39 +0100)

----------------------------------------------------------------
OpenRISC updates for 6.10

A few cleanups and fixups from me:

 - Add a few missing relocations to fix module loading.
 - Cleanup FPU state save and restore to be more efficient.
 - Cleanups to traps handling and logging.
 - Fix issue with poweroff being broken after recent power driver
   refactorings.

----------------------------------------------------------------
Stafford Horne (8):
      openrisc: Use do_kernel_power_off()
      openrisc: Define openrisc relocation types
      openrisc: Add support for more module relocations
      openrisc: traps: Convert printks to pr_ macros
      openrisc: traps: Remove calls to show_registers before die
      openrisc: traps: Don't send signals to kernel mode threads
      openrisc: Add FPU config
      openrisc: Move FPU state out of pt_regs

 arch/openrisc/Kconfig                 |   9 ++
 arch/openrisc/include/asm/fpu.h       |  22 ++
 arch/openrisc/include/asm/processor.h |   1 +
 arch/openrisc/include/asm/ptrace.h    |   3 +-
 arch/openrisc/include/uapi/asm/elf.h  |  75 +++---
 arch/openrisc/kernel/entry.S          |  15 +---
 arch/openrisc/kernel/module.c         |  18 -
 arch/openrisc/kernel/process.c        |  13 +--
 arch/openrisc/kernel/ptrace.c         |  18 ++---
 arch/openrisc/kernel/signal.c         |  36 -
 arch/openrisc/kernel/traps.c          | 144 ++
 11 files changed, 243 insertions(+), 111 deletions(-)
 create mode 100644 arch/openrisc/include/asm/fpu.h
Re: [PATCH v3] module: create weak dependencies
On Fri, May 10, 2024 at 10:57:22AM GMT, Jose Ignacio Tornos Martinez wrote:
> It has been seen that for some network mac drivers (i.e. lan78xx) the
> related module for the phy is loaded dynamically depending on the
> current hardware. In this case, the associated phy is read using the
> mdio bus and then the associated phy module is loaded during runtime
> (kernel function phy_request_driver_module). However, no software
> dependency is defined, so the user tools will not be able to get this
> dependency. For example, if dracut is used and the hardware is present,
> lan78xx will be included but no phy module will be added, and in the
> next restart the device will not work from boot because no related phy
> will be found during the initramfs stage.
>
> In order to solve this, we could define a normal 'pre' software
> dependency in the lan78xx module with all the possible phy modules
> (there may be some), but proceeding in that way, all the possible phy
> modules would be loaded while only one is necessary.
>
> The idea is to create a new type of dependency, that we are going to
> call 'weak', to be used only by the user tools that need to detect this
> situation. In that way, for example, dracut could check the 'weak'
> dependency of the modules involved in order to install these
> dependencies in initramfs too. That is, for the commented lan78xx
> module, defining the 'weak' dependency with the possible phy modules
> list, only the necessary phy would be loaded on demand keeping the same
> behavior, but all the possible phy modules would be available from
> initramfs.
>
> The 'weak' dependency support has been included in kmod:
> https://github.com/kmod-project/kmod/commit/05828b4a6e9327a63ef94df544a042b5e9ce4fe7
>
> But, take into account that this can only be used if depmod is new
> enough. If it isn't, depmod will have the same behavior as always
> (keeping backward compatibility) and the information for the 'weak'
> dependency will not be provided.
>
> Signed-off-by: Jose Ignacio Tornos Martinez

Reviewed-by: Lucas De Marchi

thanks
Lucas De Marchi

> ---
> V2 -> V3:
>  - Include note about backward compatibility.
>  - Balance the /* and */.
> V1 -> V2:
>  - Include reference to 'weak' dependency support in kmod.
>
>  include/linux/module.h | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 1153b0d99a80..2a056017df5b 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -173,6 +173,12 @@ extern void cleanup_module(void);
>   */
>  #define MODULE_SOFTDEP(_softdep) MODULE_INFO(softdep, _softdep)
>  
> +/*
> + * Weak module dependencies. See man modprobe.d for details.
> + * Example: MODULE_WEAKDEP("module-foo")
> + */
> +#define MODULE_WEAKDEP(_weakdep) MODULE_INFO(weakdep, _weakdep)
> +
>  /*
>   * MODULE_FILE is used for generating modules.builtin
>   * So, make it no-op when this is being built as a module
> -- 
> 2.44.0
[PATCH v2 8/8] riscv: mm: Add support for ZONE_DEVICE
From: Björn Töpel

ZONE_DEVICE pages need DEVMAP PTEs support to function
(ARCH_HAS_PTE_DEVMAP). Claim another RSW (reserved for software) bit
in the PTE for DEVMAP mark, add the corresponding helpers, and enable
ARCH_HAS_PTE_DEVMAP for riscv64.

Signed-off-by: Björn Töpel
---
 arch/riscv/Kconfig                    |  1 +
 arch/riscv/include/asm/pgtable-64.h   | 20
 arch/riscv/include/asm/pgtable-bits.h |  1 +
 arch/riscv/include/asm/pgtable.h      | 15 +++
 4 files changed, 37 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index b9398b64bb69..6d426afdd904 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -36,6 +36,7 @@ config RISCV
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API
 	select ARCH_HAS_PREPARE_SYNC_CORE_CMD
+	select ARCH_HAS_PTE_DEVMAP if 64BIT && MMU
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_SET_DIRECT_MAP if MMU
 	select ARCH_HAS_SET_MEMORY if MMU
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 221a5c1ee287..c67a9bbfd010 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -400,4 +400,24 @@ static inline struct page *pgd_page(pgd_t pgd)
 #define p4d_offset p4d_offset
 p4d_t *p4d_offset(pgd_t *pgd, unsigned long address);
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int pte_devmap(pte_t pte);
+static inline pte_t pmd_pte(pmd_t pmd);
+
+static inline int pmd_devmap(pmd_t pmd)
+{
+	return pte_devmap(pmd_pte(pmd));
+}
+
+static inline int pud_devmap(pud_t pud)
+{
+	return 0;
+}
+
+static inline int pgd_devmap(pgd_t pgd)
+{
+	return 0;
+}
+#endif
+
 #endif /* _ASM_RISCV_PGTABLE_64_H */
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index 179bd4afece4..a8f5205cea54 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -19,6 +19,7 @@
 #define _PAGE_SOFT	(3 << 8)	/* Reserved for software */
 
 #define _PAGE_SPECIAL	(1 << 8)	/* RSW: 0x1 */
+#define _PAGE_DEVMAP	(1 << 9)	/* RSW, devmap */
 #define _PAGE_TABLE	_PAGE_PRESENT
 
 /*
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 7933f493db71..216de1db3cd0 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -387,6 +387,11 @@ static inline int pte_special(pte_t pte)
 	return pte_val(pte) & _PAGE_SPECIAL;
 }
 
+static inline int pte_devmap(pte_t pte)
+{
+	return pte_val(pte) & _PAGE_DEVMAP;
+}
+
 /* static inline pte_t pte_rdprotect(pte_t pte) */
 
 static inline pte_t pte_wrprotect(pte_t pte)
@@ -428,6 +433,11 @@ static inline pte_t pte_mkspecial(pte_t pte)
 	return __pte(pte_val(pte) | _PAGE_SPECIAL);
 }
 
+static inline pte_t pte_mkdevmap(pte_t pte)
+{
+	return __pte(pte_val(pte) | _PAGE_DEVMAP);
+}
+
 static inline pte_t pte_mkhuge(pte_t pte)
 {
 	return pte;
@@ -711,6 +721,11 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
 	return pte_pmd(pte_mkdirty(pmd_pte(pmd)));
 }
 
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+	return pte_pmd(pte_mkdevmap(pmd_pte(pmd)));
+}
+
 static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
 				pmd_t *pmdp, pmd_t pmd)
 {
-- 
2.40.1
[PATCH v2 6/8] riscv: Enable memory hotplugging for RISC-V
From: Björn Töpel

Enable ARCH_ENABLE_MEMORY_HOTPLUG and ARCH_ENABLE_MEMORY_HOTREMOVE for
RISC-V.

Signed-off-by: Björn Töpel
---
 arch/riscv/Kconfig | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 6bec1bce6586..b9398b64bb69 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -16,6 +16,8 @@ config RISCV
 	select ACPI_REDUCED_HARDWARE_ONLY if ACPI
 	select ARCH_DMA_DEFAULT_COHERENT
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
+	select ARCH_ENABLE_MEMORY_HOTPLUG if SPARSEMEM && 64BIT && MMU
+	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if PGTABLE_LEVELS > 2
 	select ARCH_ENABLE_THP_MIGRATION if TRANSPARENT_HUGEPAGE
 	select ARCH_HAS_BINFMT_FLAT
-- 
2.40.1
[PATCH v2 7/8] virtio-mem: Enable virtio-mem for RISC-V
From: Björn Töpel

Now that RISC-V has memory hotplugging support, virtio-mem can be used
on the platform.

Signed-off-by: Björn Töpel
---
 drivers/virtio/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index c17193544268..4e5cebf1b82a 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -122,7 +122,7 @@ config VIRTIO_BALLOON
 
 config VIRTIO_MEM
 	tristate "Virtio mem driver"
-	depends on X86_64 || ARM64
+	depends on X86_64 || ARM64 || RISCV
 	depends on VIRTIO
 	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
-- 
2.40.1
[PATCH v2 4/8] riscv: mm: Add memory hotplugging support
From: Björn Töpel

For an architecture to support memory hotplugging, a couple of
callbacks need to be implemented:

arch_add_memory()
This callback is responsible for adding the physical memory into the
direct map, and calls into the memory hotplugging generic code via
__add_pages(), which adds the corresponding struct page entries and
updates the vmemmap mapping.

arch_remove_memory()
This is the inverse of the callback above.

vmemmap_free()
This function tears down the vmemmap mappings (if
CONFIG_SPARSEMEM_VMEMMAP is enabled), and also deallocates the backing
vmemmap pages. Note that for persistent memory, an alternative
allocator for the backing pages can be used: the vmem_altmap. This
means that when the backing pages are cleared, extra care is needed so
that the correct deallocation method is used.

arch_get_mappable_range()
This function returns the PA range that the direct map can map. Used
by the MHP internals for sanity checks.

The page table unmap/teardown functions are heavily based on code from
the x86 tree. The same remove_pgd_mapping() function is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function the
backing pages are not removed.
Signed-off-by: Björn Töpel
---
 arch/riscv/mm/init.c | 242 +++
 1 file changed, 242 insertions(+)

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 6f72b0b2b854..7f0b921a3d3a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1493,3 +1493,245 @@ void __init pgtable_cache_init(void)
 		}
 	}
 }
 #endif
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+	pte_t *pte;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PTE; i++) {
+		pte = pte_start + i;
+		if (!pte_none(*pte))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pmd_page(*pmd)), 0);
+	pmd_clear(pmd);
+}
+
+static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+	pmd_t *pmd;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PMD; i++) {
+		pmd = pmd_start + i;
+		if (!pmd_none(*pmd))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(pud_page(*pud)), 0);
+	pud_clear(pud);
+}
+
+static void __meminit free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+	pud_t *pud;
+	int i;
+
+	for (i = 0; i < PTRS_PER_PUD; i++) {
+		pud = pud_start + i;
+		if (!pud_none(*pud))
+			return;
+	}
+
+	free_pages((unsigned long)page_address(p4d_page(*p4d)), 0);
+	p4d_clear(p4d);
+}
+
+static void __meminit free_vmemmap_storage(struct page *page, size_t size,
+					   struct vmem_altmap *altmap)
+{
+	if (altmap)
+		vmem_altmap_free(altmap, size >> PAGE_SHIFT);
+	else
+		free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void __meminit remove_pte_mapping(pte_t *pte_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	pte_t *ptep, pte;
+
+	for (; addr < end; addr = next) {
+		next = (addr + PAGE_SIZE) & PAGE_MASK;
+		if (next > end)
+			next = end;
+
+		ptep = pte_base + pte_index(addr);
+		pte = READ_ONCE(*ptep);
+
+		if (!pte_present(*ptep))
+			continue;
+
+		pte_clear(&init_mm, addr, ptep);
+		if (is_vmemmap)
+			free_vmemmap_storage(pte_page(pte), PAGE_SIZE, altmap);
+	}
+}
+
+static void __meminit remove_pmd_mapping(pmd_t *pmd_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
+{
+	unsigned long next;
+	pte_t *pte_base;
+	pmd_t *pmdp, pmd;
+
+	for (; addr < end; addr = next) {
+		next = pmd_addr_end(addr, end);
+		pmdp = pmd_base + pmd_index(addr);
+		pmd = READ_ONCE(*pmdp);
+
+		if (!pmd_present(pmd))
+			continue;
+
+		if (pmd_leaf(pmd)) {
+			pmd_clear(pmdp);
+			if (is_vmemmap)
+				free_vmemmap_storage(pmd_page(pmd), PMD_SIZE, altmap);
+			continue;
+		}
+
+		pte_base = (pte_t *)pmd_page_vaddr(*pmdp);
+		remove_pte_mapping(pte_base, addr, next, is_vmemmap, altmap);
+		free_pte_table(pte_base, pmdp);
+	}
+}
+
+static void __meminit remove_pud_mapping(pud_t *pud_base, unsigned long addr, unsigned long end,
+					 bool is_vmemmap, struct vmem_altmap *altmap)
[PATCH v2 5/8] riscv: mm: Take memory hotplug read-lock during kernel page table dump
From: Björn Töpel

During memory hot remove, the ptdump functionality can end up touching
stale data. Avoid any potential crashes (or worse), by holding the
memory hotplug read-lock while traversing the page table.

This change is analogous to arm64's commit bf2b59f60ee1 ("arm64/mm:
Hold memory hotplug lock while walking for kernel page table dump").

Signed-off-by: Björn Töpel
---
 arch/riscv/mm/ptdump.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/riscv/mm/ptdump.c b/arch/riscv/mm/ptdump.c
index 1289cc6d3700..9d5f657a251b 100644
--- a/arch/riscv/mm/ptdump.c
+++ b/arch/riscv/mm/ptdump.c
@@ -6,6 +6,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -370,7 +371,9 @@ bool ptdump_check_wx(void)
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
+	get_online_mems();
 	ptdump_walk(m, m->private);
+	put_online_mems();
 	return 0;
 }
-- 
2.40.1
[PATCH v2 3/8] riscv: mm: Refactor create_linear_mapping_range() for memory hot add
From: Björn Töpel Add a parameter to the direct map setup function, so it can be used in arch_add_memory() later. Signed-off-by: Björn Töpel --- arch/riscv/mm/init.c | 15 ++- 1 file changed, 6 insertions(+), 9 deletions(-) diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index c969427eab88..6f72b0b2b854 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -1227,7 +1227,7 @@ asmlinkage void __init setup_vm(uintptr_t dtb_pa) } static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t end, - uintptr_t fixed_map_size) + uintptr_t fixed_map_size, const pgprot_t *pgprot) { phys_addr_t pa; uintptr_t va, map_size; @@ -1238,7 +1238,7 @@ static void __meminit create_linear_mapping_range(phys_addr_t start, phys_addr_t best_map_size(pa, va, end - pa); create_pgd_mapping(swapper_pg_dir, va, pa, map_size, - pgprot_from_va(va)); + pgprot ? *pgprot : pgprot_from_va(va)); } } @@ -1282,22 +1282,19 @@ static void __init create_linear_mapping_page_table(void) if (end >= __pa(PAGE_OFFSET) + memory_limit) end = __pa(PAGE_OFFSET) + memory_limit; - create_linear_mapping_range(start, end, 0); + create_linear_mapping_range(start, end, 0, NULL); } #ifdef CONFIG_STRICT_KERNEL_RWX - create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0); - create_linear_mapping_range(krodata_start, - krodata_start + krodata_size, 0); + create_linear_mapping_range(ktext_start, ktext_start + ktext_size, 0, NULL); + create_linear_mapping_range(krodata_start, krodata_start + krodata_size, 0, NULL); memblock_clear_nomap(ktext_start, ktext_size); memblock_clear_nomap(krodata_start, krodata_size); #endif #ifdef CONFIG_KFENCE - create_linear_mapping_range(kfence_pool, - kfence_pool + KFENCE_POOL_SIZE, - PAGE_SIZE); + create_linear_mapping_range(kfence_pool, kfence_pool + KFENCE_POOL_SIZE, PAGE_SIZE, NULL); memblock_clear_nomap(kfence_pool, KFENCE_POOL_SIZE); #endif -- 2.40.1
[PATCH v2 2/8] riscv: mm: Change attribute from __init to __meminit for page functions
From: Björn Töpel Prepare for memory hotplugging support by changing from __init to __meminit for the page table functions that are used by the upcoming architecture specific callbacks. Changing the __init attribute to __meminit, avoids that the functions are removed after init. The __meminit attribute makes sure the functions are kept in the kernel text post init, but only if memory hotplugging is enabled for the build. Also, make sure that the altmap parameter is properly passed on to vmemmap_populate_hugepages(). Signed-off-by: Björn Töpel --- arch/riscv/include/asm/mmu.h | 4 +-- arch/riscv/include/asm/pgtable.h | 2 +- arch/riscv/mm/init.c | 58 ++-- 3 files changed, 29 insertions(+), 35 deletions(-) diff --git a/arch/riscv/include/asm/mmu.h b/arch/riscv/include/asm/mmu.h index 60be458e94da..c09c3c79f496 100644 --- a/arch/riscv/include/asm/mmu.h +++ b/arch/riscv/include/asm/mmu.h @@ -28,8 +28,8 @@ typedef struct { #endif } mm_context_t; -void __init create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, - phys_addr_t sz, pgprot_t prot); +void __meminit create_pgd_mapping(pgd_t *pgdp, uintptr_t va, phys_addr_t pa, phys_addr_t sz, + pgprot_t prot); #endif /* __ASSEMBLY__ */ #endif /* _ASM_RISCV_MMU_H */ diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h index 58fd7b70b903..7933f493db71 100644 --- a/arch/riscv/include/asm/pgtable.h +++ b/arch/riscv/include/asm/pgtable.h @@ -162,7 +162,7 @@ struct pt_alloc_ops { #endif }; -extern struct pt_alloc_ops pt_ops __initdata; +extern struct pt_alloc_ops pt_ops __meminitdata; #ifdef CONFIG_MMU /* Number of PGD entries that a user-mode program can use */ diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c index 5b8cdfafb52a..c969427eab88 100644 --- a/arch/riscv/mm/init.c +++ b/arch/riscv/mm/init.c @@ -295,7 +295,7 @@ static void __init setup_bootmem(void) } #ifdef CONFIG_MMU -struct pt_alloc_ops pt_ops __initdata; +struct pt_alloc_ops pt_ops __meminitdata; pgd_t 
swapper_pg_dir[PTRS_PER_PGD] __page_aligned_bss; pgd_t trampoline_pg_dir[PTRS_PER_PGD] __page_aligned_bss; @@ -357,7 +357,7 @@ static inline pte_t *__init get_pte_virt_fixmap(phys_addr_t pa) return (pte_t *)set_fixmap_offset(FIX_PTE, pa); } -static inline pte_t *__init get_pte_virt_late(phys_addr_t pa) +static inline pte_t *__meminit get_pte_virt_late(phys_addr_t pa) { return (pte_t *) __va(pa); } @@ -376,7 +376,7 @@ static inline phys_addr_t __init alloc_pte_fixmap(uintptr_t va) return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE); } -static phys_addr_t __init alloc_pte_late(uintptr_t va) +static phys_addr_t __meminit alloc_pte_late(uintptr_t va) { struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); @@ -384,9 +384,8 @@ static phys_addr_t __init alloc_pte_late(uintptr_t va) return __pa((pte_t *)ptdesc_address(ptdesc)); } -static void __init create_pte_mapping(pte_t *ptep, - uintptr_t va, phys_addr_t pa, - phys_addr_t sz, pgprot_t prot) +static void __meminit create_pte_mapping(pte_t *ptep, uintptr_t va, phys_addr_t pa, phys_addr_t sz, +pgprot_t prot) { uintptr_t pte_idx = pte_index(va); @@ -440,7 +439,7 @@ static pmd_t *__init get_pmd_virt_fixmap(phys_addr_t pa) return (pmd_t *)set_fixmap_offset(FIX_PMD, pa); } -static pmd_t *__init get_pmd_virt_late(phys_addr_t pa) +static pmd_t *__meminit get_pmd_virt_late(phys_addr_t pa) { return (pmd_t *) __va(pa); } @@ -457,7 +456,7 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va) return memblock_phys_alloc(PAGE_SIZE, PAGE_SIZE); } -static phys_addr_t __init alloc_pmd_late(uintptr_t va) +static phys_addr_t __meminit alloc_pmd_late(uintptr_t va) { struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0); @@ -465,9 +464,9 @@ static phys_addr_t __init alloc_pmd_late(uintptr_t va) return __pa((pmd_t *)ptdesc_address(ptdesc)); } -static void __init create_pmd_mapping(pmd_t *pmdp, - uintptr_t va, phys_addr_t pa, - phys_addr_t sz, pgprot_t prot) +static void __meminit 
create_pmd_mapping(pmd_t *pmdp, +uintptr_t va, phys_addr_t pa, +phys_addr_t sz, pgprot_t prot) { pte_t *ptep; phys_addr_t pte_phys; @@ -503,7 +502,7 @@ static pud_t *__init get_pud_virt_fixmap(phys_addr_t pa) return (pud_t *)set_fixmap_offset(FIX_PUD, pa); } -static pud_t *__init get_pud_virt_late(phys_addr_t pa) +static pud_t *__meminit get_pud_virt_late(phys_addr_t pa) { return (pud_t *)__va(pa); } @@ -521,7 +520,7 @@ static phys_addr_t __init alloc_pud_fixmap(uintptr_t
[PATCH v2 1/8] riscv: mm: Pre-allocate vmemmap/direct map PGD entries
From: Björn Töpel

The RISC-V port copies the PGD table from init_mm/swapper_pg_dir to
all userland page tables, which means that if the PGD level table is
changed, other page tables have to be updated as well.

Instead of having the PGD changes ripple out to all tables, the
synchronization can be avoided by pre-allocating the PGD entries/pages
at boot, avoiding the synchronization all together.

This is currently done for the bpf/modules, and vmalloc PGD regions.
Extend this scheme for the PGD regions touched by memory hotplugging.

Prepare the RISC-V port for memory hotplug by pre-allocating
vmemmap/direct map entries at the PGD level. This will roughly waste
~128 worth of 4K pages when memory hotplugging is enabled in the
kernel configuration.

Signed-off-by: Björn Töpel
---
 arch/riscv/include/asm/kasan.h | 4 ++--
 arch/riscv/mm/init.c           | 7 +++
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/include/asm/kasan.h b/arch/riscv/include/asm/kasan.h
index 0b85e363e778..e6a0071bdb56 100644
--- a/arch/riscv/include/asm/kasan.h
+++ b/arch/riscv/include/asm/kasan.h
@@ -6,8 +6,6 @@
 
 #ifndef __ASSEMBLY__
 
-#ifdef CONFIG_KASAN
-
 /*
  * The following comment was copied from arm64:
  * KASAN_SHADOW_START: beginning of the kernel virtual addresses.
@@ -34,6 +32,8 @@
  */
 #define KASAN_SHADOW_START	((KASAN_SHADOW_END - KASAN_SHADOW_SIZE) & PGDIR_MASK)
 #define KASAN_SHADOW_END	MODULES_LOWEST_VADDR
+
+#ifdef CONFIG_KASAN
 #define KASAN_SHADOW_OFFSET	_AC(CONFIG_KASAN_SHADOW_OFFSET, UL)
 
 void kasan_init(void);
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 2574f6a3b0e7..5b8cdfafb52a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -27,6 +27,7 @@
 
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1488,10 +1489,16 @@ static void __init preallocate_pgd_pages_range(unsigned long start, unsigned lon
 		panic("Failed to pre-allocate %s pages for %s area\n", lvl, area);
 }
 
+#define PAGE_END KASAN_SHADOW_START
+
 void __init pgtable_cache_init(void)
 {
 	preallocate_pgd_pages_range(VMALLOC_START, VMALLOC_END, "vmalloc");
 	if (IS_ENABLED(CONFIG_MODULES))
 		preallocate_pgd_pages_range(MODULES_VADDR, MODULES_END, "bpf/modules");
+	if (IS_ENABLED(CONFIG_MEMORY_HOTPLUG)) {
+		preallocate_pgd_pages_range(VMEMMAP_START, VMEMMAP_END, "vmemmap");
+		preallocate_pgd_pages_range(PAGE_OFFSET, PAGE_END, "direct map");
+	}
 }
 #endif
-- 
2.40.1
[PATCH v2 0/8] riscv: Memory Hot(Un)Plug support
From: Björn Töpel

Memory Hot(Un)Plug support (and ZONE_DEVICE) for the RISC-V port

Introduction
============

To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime."

This series adds memory hot(un)plugging, and ZONE_DEVICE support for
the RISC-V Linux port.

I'm sending this series while LSF/MM/BPF is on-going, and with some
luck some MM person can review the series while zoning out on a
talk. ;-)

MM configuration
================

RISC-V MM has the following configuration:

 * Memory blocks are 128M, analogous to x86-64. It uses PMD
   ("hugepage") vmemmaps. From that follows that 2M (PMD) worth of
   vmemmap spans 32768 pages á 4K which gets us 128M.

 * The pageblock size is the minimum virtio_mem size, and on RISC-V
   it's 2M (2^9 * 4K).

Implementation
==============

The PGD table on RISC-V is shared/copied between all processes. To
avoid doing page table synchronization, the first patch (patch 1)
pre-allocates the PGD entries for vmemmap/direct map. By doing that
the init_mm PGD will be fixed at kernel init, and synchronization can
be avoided all together.

The following two patches (patch 2-3) do some preparations, followed
by the actual MHP implementation (patch 4-5). Then, MHP and virtio-mem
are enabled (patch 6-7), and finally ZONE_DEVICE support is added
(patch 8).

MHP and locking
===============

TL;DR: The MHP does not step on any toes, except for ptdump.
Additional locking is required for ptdump.

Long version: For v2 I spent some time digging into init_mm
synchronization/update. Here are my findings, and I'd love them to be
corrected if incorrect.

It's been a gnarly path...

The `init_mm` structure is a special mm (perhaps not a "real" one).
It's a "lazy context" that tracks kernel page table resources, e.g.,
the kernel page table (swapper_pg_dir), a kernel page_table_lock (more
about the usage below), mmap_lock, and such. `init_mm` does not
track/contain any VMAs.
Having the `init_mm` is convenient, so that the regular kernel page
table walk/modify functions can be used.

Now, `init_mm` being special means that the locking for kernel page
tables is special as well.

On RISC-V the PGD (top-level page table structure), similar to x86, is
shared (copied) with user processes. If the kernel PGD is modified, it
has to be synched to user-mode processes PGDs. This is avoided by
pre-populating the PGD, so it'll be fixed from boot. The in-kernel pgd
regions are documented in `Documentation/arch/riscv/vm-layout.rst`.

The distinct regions are:
 * vmemmap
 * vmalloc/ioremap space
 * direct mapping of all physical memory
 * kasan
 * modules, BPF
 * kernel

Memory hotplug is the process of adding/removing memory to/from the
kernel. Adding is done in two phases:

 1. Add the memory to the kernel
 2. Online memory, making it available to the page allocator.

Step 1 is partially architecture dependent, and updates the init_mm
page table:

 * Update the direct map page tables. The direct map is a linear map,
   representing all physical memory: `virt = phys + PAGE_OFFSET`
 * Add a `struct page` for each added page of memory. Update the
   vmemmap (virtual mapping to the `struct page`, so we can easily
   transform a kernel virtual address to a `struct page *` address.

From an MHP perspective, there are two regions of the PGD that are
updated:
 * vmemmap
 * direct mapping of all physical memory

The `struct mm_struct` has a couple of locks in play:
 * `spinlock_t page_table_lock` protects the page table, and some
   counters
 * `struct rw_semaphore mmap_lock` protects an mm's VMAs

Note again that `init_mm` does not contain any VMAs, but still uses
the mmap_lock in some places.

The `page_table_lock` was originally used to protect all page tables,
but more recently a split page table lock has been introduced. The
split lock has a per-table lock for the PTE and PMD tables. If split
lock is disabled, all tables are guarded by `mm->page_table_lock` (for
user processes).
Split page table locks are not used for init_mm.

MHP operations are typically synchronized using
`DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock)`.

Actors
------

The following non-MHP actors in the kernel traverse (read), and/or
modify the kernel PGD.

 * `ptdump`

   Walks the entire `init_mm`, via `ptdump_walk_pgd()` with the
   `mmap_write_lock(init_mm)` taken.

   Observation: ptdump can race with MHP, and needs additional locking
   to avoid crashes/races.

 * `set_direct_*` / `arch/riscv/mm/pageattr.c`

   The `set_direct_*` functionality is used to "synchronize" the
   direct map to other kernel mappings, e.g. modules/kernel text. The
   direct map is using "as large huge table mappings as possible",
   which means that the `set_direct_*` might need to
[GIT PULL] Modules changes for v6.10-rc1
The following changes since commit a5131c3fdf2608f1c15f3809e201cf540eb28489:

  Merge tag 'x86-shstk-2024-05-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip (2024-05-13 19:33:23 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/ tags/modules-6.10-rc1

for you to fetch changes up to 2c9e5d4a008293407836d29d35dfd4353615bd2f:

  bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of (2024-05-14 00:36:29 -0700)

----------------------------------------------------------------
Modules changes for v6.10-rc1

Finally something fun. Mike Rapoport does some cleanup to allow us to
take module_alloc() out of modules and into a new paint-shedded
execmem_alloc() and execmem_free(), to emphasize that these helpers
are actually used outside of modules. It starts with non-functional
API renames / placeholders, to then allow architectures to define
their requirements in a new shiny struct execmem_info with ranges, and
requirements for those ranges. Archs can now initialize this
execmem_info as the last part of mm_core_init() if they have to
diverge from the norm. Each range is a known type, clearly articulated
and spelled out in enum execmem_type.

Although a lot of this is major cleanup and prep work for future
enhancements, an immediate clear gain is that we get to enable KPROBES
without MODULES now. That is ultimately what motivated picking this
work up again, now with a smaller goal as a concrete stepping stone.

This has been sitting in linux-next for a little less than a month; a
few issues were found already and fixed, in particular an odd mips
boot issue. Arch folks reviewed the code too. This is ready for wider
exposure and testing.
Justin Stitt (1):
      kallsyms: replace deprecated strncpy with strscpy

Mike Rapoport (IBM) (16):
      arm64: module: remove unneeded call to kasan_alloc_module_shadow()
      mips: module: rename MODULE_START to MODULES_VADDR
      nios2: define virtual address space for modules
      sparc: simplify module_alloc()
      module: make module_memory_{alloc,free} more self-contained
      mm: introduce execmem_alloc() and execmem_free()
      mm/execmem, arch: convert simple overrides of module_alloc to execmem
      mm/execmem, arch: convert remaining overrides of module_alloc to execmem
      riscv: extend execmem_params for generated code allocations
      arm64: extend execmem_info for generated code allocations
      powerpc: extend execmem_params for kprobes allocations
      arch: make execmem setup available regardless of CONFIG_MODULES
      x86/ftrace: enable dynamic ftrace without CONFIG_MODULES
      powerpc: use CONFIG_EXECMEM instead of CONFIG_MODULES where appropriate
      kprobes: remove dependency on CONFIG_MODULES
      bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of

Yifan Hong (1):
      module: allow UNUSED_KSYMS_WHITELIST to be relative against objtree.
 arch/Kconfig                         |  10 ++-
 arch/arm/kernel/module.c             |  34 -
 arch/arm/mm/init.c                   |  45 +++
 arch/arm64/Kconfig                   |   1 +
 arch/arm64/kernel/module.c           | 126 --
 arch/arm64/kernel/probes/kprobes.c   |   7 --
 arch/arm64/mm/init.c                 | 140 ++
 arch/arm64/net/bpf_jit_comp.c        |  11 ---
 arch/loongarch/kernel/module.c       |   6 --
 arch/loongarch/mm/init.c             |  21 +
 arch/mips/include/asm/pgtable-64.h   |   4 +-
 arch/mips/kernel/module.c            |  10 ---
 arch/mips/mm/fault.c                 |   4 +-
 arch/mips/mm/init.c                  |  23 ++
 arch/nios2/include/asm/pgtable.h     |   5 +-
 arch/nios2/kernel/module.c           |  20 -
 arch/nios2/mm/init.c                 |  21 +
 arch/parisc/kernel/module.c          |  12 ---
 arch/parisc/mm/init.c                |  23 +-
 arch/powerpc/Kconfig                 |   2 +-
 arch/powerpc/include/asm/kasan.h     |   2 +-
 arch/powerpc/kernel/head_8xx.S       |   4 +-
 arch/powerpc/kernel/head_book3s_32.S |   6 +-
 arch/powerpc/kernel/kprobes.c        |  22 +-
 arch/powerpc/kernel/module.c         |  38 --
 arch/powerpc/lib/code-patching.c     |   2 +-
 arch/powerpc/mm/book3s32/mmu.c       |   2 +-
 arch/powerpc/mm/mem.c                |  64
 arch/riscv/include/asm/pgtable.h     |   3 +
 arch/riscv/kernel/module.c           |  12 ---
 arch/riscv/kernel/probes/kprobes.c   |  10 ---
 arch/riscv/mm/init.c                 |  35 +
 arch/riscv/net/bpf_jit_core.c        |  13
 arch/s390/kernel/ftrace.c            |   4 +-
 arch/s390/kernel/kprobes.c           |   4 +-
 arch/s390/kernel/module.c            |  42 +-
 arch/s390/mm/init.c                  |  30
 arch/sparc/include/asm/pgtable_32.h  |   2 +
 arch/sparc/kernel/module.c           |  30
WARNING: kmalloc bug in bpf_uprobe_multi_link_attach
Hello.

We are Ubisectech Sirius Team, the vulnerability lab of China
ValiantSec. Recently, our team has discovered an issue in Linux kernel
6.7. Attached to the email was a PoC file of the issue.

Stack dump:

loop3: detected capacity change from 0 to 8
MTD: Attempt to mount non-MTD device "/dev/loop3"
[ cut here ]
WARNING: CPU: 1 PID: 10075 at mm/util.c:632 kvmalloc_node+0x199/0x1b0 mm/util.c:632
Modules linked in:
CPU: 1 PID: 10075 Comm: syz-executor.3 Not tainted 6.7.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
RIP: 0010:kvmalloc_node+0x199/0x1b0 mm/util.c:632
Code: 02 1d 00 eb aa e8 a7 49 c6 ff 41 81 e5 00 20 00 00 31 ff 44 89 ee e8 36 45 c6 ff 45 85 ed 0f 85 1b ff ff ff e8 88 49 c6 ff 90 <0f> 0b 90 e9 dd fe ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
RSP: 0018:c90002007b60 EFLAGS: 00010212
RAX: 23e4 RBX: 0400 RCX: c90003aaa000
RDX: 0004 RSI: 81c3acc8 RDI: 0005
RBP: 0037cec8 R08: 0005 R09: R10: R11:
R12: R13: R14: R15: 88805ff6e1b8
FS: 7fc62205f640() GS:88807ec0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: 001b2e026000 CR3: 5f338000 CR4: 00750ef0
DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400
PKRU: 5554
Call Trace:
 kvmalloc include/linux/slab.h:738 [inline]
 kvmalloc_array include/linux/slab.h:756 [inline]
 kvcalloc include/linux/slab.h:761 [inline]
 bpf_uprobe_multi_link_attach+0x3fe/0xf60 kernel/trace/bpf_trace.c:3239
 link_create kernel/bpf/syscall.c:5012 [inline]
 __sys_bpf+0x2e85/0x4e00 kernel/bpf/syscall.c:5453
 __do_sys_bpf kernel/bpf/syscall.c:5487 [inline]
 __se_sys_bpf kernel/bpf/syscall.c:5485 [inline]
 __x64_sys_bpf+0x78/0xc0 kernel/bpf/syscall.c:5485
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x6f/0x77
RIP: 0033:0x7fc62128fd6d
Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:7fc62205f028 EFLAGS: 0246 ORIG_RAX: 0141
RAX: ffda RBX: 7fc6213cbf80 RCX: 7fc62128fd6d
RDX: 0040 RSI: 21c0 RDI: 001c
RBP: 7fc6212f14cd R08: R09: R10:
R11: 0246 R12: R13: 000b
R14: 7fc6213cbf80 R15: 7fc62203f000

Thank you for taking the time to read this email and we look forward
to working with you further.

poc.c
Description: Binary data