On Thu, Jun 05, 2025 at 08:56:59AM +0200, Vlastimil Babka wrote:
> On 6/4/25 16:19, Lorenzo Stoakes wrote:
> > The walk_page_range_novma() function is rather confusing - it supports two
> > modes, one used often, the other used only for debugging.
> >
> > The first mode is the common case of traversal of kernel page tables, which
> > is what nearly all callers use this for.
> >
> > Secondly it provides an unusual debugging interface that allows for the
> > traversal of page tables in a userland range of memory, even for memory
> > which is not described by a VMA.
> >
> > It is far from certain that such page tables should even exist, but perhaps
> > this is precisely why it is useful as a debugging mechanism.
> >
> > As a result, this is utilised by ptdump only. Historically, things were
> > reversed - ptdump was the only user, and other parts of the kernel evolved
> > to use the kernel page table walking here.
> >
> > Since we have some complicated and confusing locking rules for the novma
> > case, it makes sense to separate the two usages into their own functions.
> >
> > Doing this also provides self-documentation as to the intent of the caller -
> > are they doing something rather unusual, or are they simply doing a standard
> > kernel page table walk?
> >
> > We therefore establish two separate functions - walk_page_range_debug() for
> > this single usage, and walk_kernel_page_table_range() for general kernel
> > page table walking.
> >
> > We additionally make walk_page_range_debug() internal to mm.
> >
> > Note that ptdump uses the precise same function for kernel walking as a
>
> IMHO it's not clear at this point what "the precise same function" means.
>
> > convenience, so we permit this but make it very explicit by having
> > walk_page_range_novma() invoke walk_kernel_page_table_range() in this case.
>
> ^ walk_page_range_debug()

Oops, will fix.

> Maybe this could be reworded in the sense (AFAIU) that
> walk_page_range_debug() can be used for both user space and kernel page
> table walking depending on what mm is passed, so in the case of init_mm it
> invokes walk_kernel_page_table_range() internally.

Sure.
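To be explicit about the intent, something like this (a rough sketch only -
__walk_user_page_tables() is a made-up name standing in for the existing
novma walk logic):

	/* mm-internal, ptdump is the sole user. */
	int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
				  unsigned long end,
				  const struct mm_walk_ops *ops,
				  pgd_t *pgd, void *private)
	{
		/*
		 * Walking kernel page tables is perfectly ordinary, so defer
		 * to the standard helper explicitly.
		 */
		if (mm == &init_mm)
			return walk_kernel_page_table_range(start, end, ops,
							    pgd, private);

		/*
		 * Otherwise perform the unusual, debug-only walk of userland
		 * page tables unconstrained by VMAs.
		 */
		return __walk_user_page_tables(mm, start, end, ops, pgd,
					       private);
	}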
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoa...@oracle.com>
> > Acked-by: Mike Rapoport (Microsoft) <r...@kernel.org>
> > ---
> > v2:
> > * Renamed walk_page_range_novma() to walk_page_range_debug() as per David.
> > * Moved walk_page_range_debug() definition to mm/internal.h as per Mike.
> > * Renamed walk_page_range_kernel() to walk_kernel_page_table_range() as
> >   per David.
> >
> > v1 resend:
> > * Actually cc'd lists...
> > * Fixed mistake in walk_page_range_novma() not handling kernel mappings and
> >   updated the commit message to reference this.
> > * Added Mike's off-list Acked-by.
> > * Fixed up comments as per Mike.
> > * Added some historic flavour to the commit message as per Mike.
> > https://lore.kernel.org/all/20250603192213.182931-1-lorenzo.stoa...@oracle.com/
> >
> > v1:
> > (accidentally sent off-list due to error in scripting)
> >
> >  arch/loongarch/mm/pageattr.c |  2 +-
> >  arch/openrisc/kernel/dma.c   |  4 +-
> >  arch/riscv/mm/pageattr.c     |  8 +--
> >  include/linux/pagewalk.h     |  7 ++-
> >  mm/hugetlb_vmemmap.c         |  2 +-
> >  mm/internal.h                |  4 ++
> >  mm/pagewalk.c                | 98 ++++++++++++++++++++++++------------
> >  mm/ptdump.c                  |  3 +-
> >  8 files changed, 82 insertions(+), 46 deletions(-)
> >
> > diff --git a/arch/loongarch/mm/pageattr.c b/arch/loongarch/mm/pageattr.c
> > index 99165903908a..f5e910b68229 100644
> > --- a/arch/loongarch/mm/pageattr.c
> > +++ b/arch/loongarch/mm/pageattr.c
> > @@ -118,7 +118,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask, pgp
> >  		return 0;
> >
> >  	mmap_write_lock(&init_mm);
> > -	ret = walk_page_range_novma(&init_mm, start, end, &pageattr_ops, NULL, &masks);
> > +	ret = walk_kernel_page_table_range(start, end, &pageattr_ops, NULL, &masks);
> >  	mmap_write_unlock(&init_mm);
>
> You've removed init_mm from walk_page_range_novma() but I see most callers
> do the locking of init_mm immediately around it. This suggests a version
> handling that automatically? A bit complicated by the read/write
> possibilities, so maybe not worth wrapping? Just a thought, as David says ;)

Interestingly, most callers write-lock, but one read-locks, so we can't just
assume and would need to pass a boolean, which would kind of suck.
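To illustrate (entirely hypothetical, I'm not proposing this!) - an
auto-locking wrapper would end up looking something like:

	static int walk_kernel_page_table_range_locked(unsigned long start,
			unsigned long end, const struct mm_walk_ops *ops,
			pgd_t *pgd, void *private, bool write)
	{
		int ret;

		/* The caller still has to know which lock the walk needs... */
		if (write)
			mmap_write_lock(&init_mm);
		else
			mmap_read_lock(&init_mm);

		ret = walk_kernel_page_table_range(start, end, ops, pgd,
						   private);

		if (write)
			mmap_write_unlock(&init_mm);
		else
			mmap_read_unlock(&init_mm);

		return ret;
	}

At which point the boolean is doing all the work and callers aren't really
any simpler.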
Also other walkers assume the caller has the lock, so it's consistent to
keep it this way.

> >
> >  	flush_tlb_kernel_range(start, end);
> > diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
> > index 3a7b5baaa450..af932a4ad306 100644
> > --- a/arch/openrisc/kernel/dma.c
> > +++ b/arch/openrisc/kernel/dma.c
> > @@ -72,7 +72,7 @@ void *arch_dma_set_uncached(void *cpu_addr, size_t size)
> >  	 * them and setting the cache-inhibit bit.
> >  	 */
> >  	mmap_write_lock(&init_mm);
> > -	error = walk_page_range_novma(&init_mm, va, va + size,
> > +	error = walk_kernel_page_table_range(va, va + size,
> >  			&set_nocache_walk_ops, NULL, NULL);
> >  	mmap_write_unlock(&init_mm);
> >
> > @@ -87,7 +87,7 @@ void arch_dma_clear_uncached(void *cpu_addr, size_t size)
> >
> >  	mmap_write_lock(&init_mm);
> >  	/* walk_page_range shouldn't be able to fail here */
> > -	WARN_ON(walk_page_range_novma(&init_mm, va, va + size,
> > +	WARN_ON(walk_kernel_page_table_range(va, va + size,
> >  			&clear_nocache_walk_ops, NULL, NULL));
> >  	mmap_write_unlock(&init_mm);
> >  }
> > diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> > index d815448758a1..3f76db3d2769 100644
> > --- a/arch/riscv/mm/pageattr.c
> > +++ b/arch/riscv/mm/pageattr.c
> > @@ -299,7 +299,7 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
> >  		if (ret)
> >  			goto unlock;
> >
> > -		ret = walk_page_range_novma(&init_mm, lm_start, lm_end,
> > +		ret = walk_kernel_page_table_range(lm_start, lm_end,
> >  					    &pageattr_ops, NULL,
> >  					    &masks);
>
> Note this and other places break the second line's arguments alignment on
> the opening bracket. Maybe it just shows it's a bit fragile style...

Yeah I know :)

I know you won't believe this coming from me, but I was trying to minimise
the churn :P
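(For reference, re-aligning on the new, longer opening bracket would mean
churning every continuation line too, e.g.:

		ret = walk_kernel_page_table_range(lm_start, lm_end,
						   &pageattr_ops, NULL,
						   &masks);

so leaving them as-is seemed the lesser evil.)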