Re: [PATCH 01/15] ftruncate: pass a signed offset
On Thu, Jun 20, 2024 at 06:23:02PM GMT, Arnd Bergmann wrote: > From: Arnd Bergmann > > The old ftruncate() syscall, using the 32-bit off_t misses a sign > extension when called in compat mode on 64-bit architectures. As a > result, passing a negative length accidentally succeeds in truncating > to file size between 2GiB and 4GiB. > > Changing the type of the compat syscall to the signed compat_off_t > changes the behavior so it instead returns -EINVAL. > > The native entry point, the truncate() syscall and the corresponding > loff_t based variants are all correct already and do not suffer > from this mistake. > > Fixes: 3f6d078d4acc ("fix compat truncate/ftruncate") > Cc: sta...@vger.kernel.org > Signed-off-by: Arnd Bergmann > --- Looks good to me, Reviewed-by: Christian Brauner
Re: [PATCH] powerpc/pseries: Whitelist dtl slub object for copying to userspace
Hi Michael On 18/06/24 12:41, Michael Ellerman wrote: > I guess there isn't a kmem_cache_create_user_readonly() ? Thank you for your review. My understanding of the question is whether there's a way to whitelist a region such that it can be copied to userspace, but not written to using copy_from_user(). No, we don't have a function to whitelist only for copy_to_user() and not copy_from_user(). Thank you Anjali K
Re: [PATCH 03/15] mips: fix compat_sys_lseek syscall
On Thu, Jun 20, 2024 at 06:23:04PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > This is almost compatible, but passing a negative offset should result > in an EINVAL error; on mips o32 compat mode it would instead seek to a large > 32-bit byte offset. > > Use compat_sys_lseek() to correctly sign-extend the argument. > > Signed-off-by: Arnd Bergmann > --- > arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl > b/arch/mips/kernel/syscalls/syscall_o32.tbl > index 85751c9b9cdb..2439a2491cff 100644 > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl > @@ -27,7 +27,7 @@ > 17 o32 break sys_ni_syscall > # 18 was sys_stat > 18 o32 unused18 sys_ni_syscall > -19 o32 lseek sys_lseek > +19 o32 lseek sys_lseek compat_sys_lseek > 20 o32 getpid sys_getpid > 21 o32 mount sys_mount > 22 o32 umount sys_oldumount > -- > 2.39.2 applied to mips-fixes. Thomas. -- Crap can work. Given enough thrust pigs will fly, but it's not necessarily a good idea. [ RFC1925, 2.3 ]
Re: [Patch v4 08/10] mtd: rawnand: lpc32xx: Request DMA channels using DT entries
Hi Piotr, piotr.wojtaszc...@timesys.com wrote on Thu, 20 Jun 2024 19:56:39 +0200: > Move away from pl08x platform data towards device tree. > > Signed-off-by: Piotr Wojtaszczyk I don't see any change regarding the NAND controller node in the device tree, is there any dependency with other patches from the same patchset or may I apply this directly to nand/next? Thanks, Miquèl
Re: [PATCH 09/15] sh: rework sync_file_range ABI
Hi Arnd, thanks for your patch! On Thu, 2024-06-20 at 18:23 +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > The unusual function calling conventions on superh ended up causing ^^ It's spelled SuperH > sync_file_range to have the wrong argument order, with the 'flags' > argument getting sorted before 'nbytes' by the compiler. > > In userspace, I found that musl, glibc, uclibc and strace all expect the > normal calling conventions with 'nbytes' last, so changing the kernel > to match them should make all of those work. > > In order to be able to also fix libc implementations to work with existing > kernels, they need to be able to tell which ABI is used. An easy way > to do this is to add yet another system call using the sync_file_range2 > ABI that works the same on all architectures. > > Old user binaries can now work on new kernels, and new binaries can > try the new sync_file_range2() to work with new kernels or fall back > to the old sync_file_range() version if that doesn't exist. > > Cc: sta...@vger.kernel.org > Fixes: 75c92acdd5b1 ("sh: Wire up new syscalls.") > Signed-off-by: Arnd Bergmann > --- > arch/sh/kernel/sys_sh32.c | 11 +++ > arch/sh/kernel/syscalls/syscall.tbl | 3 ++- > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/arch/sh/kernel/sys_sh32.c b/arch/sh/kernel/sys_sh32.c > index 9dca568509a5..d5a4f7c697d8 100644 > --- a/arch/sh/kernel/sys_sh32.c > +++ b/arch/sh/kernel/sys_sh32.c > @@ -59,3 +59,14 @@ asmlinkage int sys_fadvise64_64_wrapper(int fd, u32 > offset0, u32 offset1, >(u64)len0 << 32 | len1, advice); > #endif > } > + > +/* > + * swap the arguments the way that libc wants it instead of I think "swap the arguments to the order that libc wants them" would be easier to understand here. 
> + * moving flags ahead of the 64-bit nbytes argument > + */ > +SYSCALL_DEFINE6(sh_sync_file_range6, int, fd, SC_ARG64(offset), > +SC_ARG64(nbytes), unsigned int, flags) > +{ > +return ksys_sync_file_range(fd, SC_VAL64(loff_t, offset), > +SC_VAL64(loff_t, nbytes), flags); > +} > diff --git a/arch/sh/kernel/syscalls/syscall.tbl > b/arch/sh/kernel/syscalls/syscall.tbl > index bbf83a2db986..c55fd7696d40 100644 > --- a/arch/sh/kernel/syscalls/syscall.tbl > +++ b/arch/sh/kernel/syscalls/syscall.tbl > @@ -321,7 +321,7 @@ > 311 common set_robust_list sys_set_robust_list > 312 common get_robust_list sys_get_robust_list > 313 common splice sys_splice > -314 common sync_file_range sys_sync_file_range > +314 common sync_file_range sys_sh_sync_file_range6 ^^ Why the suffix 6 here? > 315 common tee sys_tee > 316 common vmsplice sys_vmsplice > 317 common move_pages sys_move_pages > @@ -395,6 +395,7 @@ > 385 common pkey_alloc sys_pkey_alloc > 386 common pkey_free sys_pkey_free > 387 common rseq sys_rseq > +388 common sync_file_range2 sys_sync_file_range2 > # room for arch specific syscalls > 393 common semget sys_semget > 394 common semctl sys_semctl I wonder how you discovered this bug. Did you look up the calling convention on SuperH and compare the argument order for the sys_sync_file_range system call documented there with the order in the kernel? Did you also check what order libc uses? I would expect libc on SuperH to misorder the arguments as well unless I am missing something. Or do we know that the code is actually currently broken? Thanks, Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer `. `' Physicist `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
Hi Helge and Arnd, On Thu, 2024-06-20 at 23:21 +0200, Helge Deller wrote: > The patch looks good at first sight. > I'll pick it up in my parisc git tree and will do some testing the > next few days and then push forward for 6.11 when it opens Isn't this supposed to go in as one series or can arch maintainers actually pick the patches for their architecture and merge them individually? If yes, I would prefer to do that for the SuperH patch as well as I usually prefer merging SuperH patches in my own tree. Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer `. `' Physicist `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
Hi, On Fri, 2024-06-21 at 08:28 +0200, Arnd Bergmann wrote: > It's more likely to be related to the upward growing stack. > I checked the gcc sources and found that out of the 50 supported > architectures, ARGS_GROW_DOWNWARD is set on everything except > for gcn, stormy16 and 32-bit parisc. The other two are > little-endian though. STACK_GROWS_DOWNWARD in turn is set on > everything other than parisc (both 32-bit and 64-bit). Wait a second! Does that mean that on 64-bit PA-RISC, the stack is actually growing downwards? If yes, that would be a strong argument for creating a 64-bit PA-RISC port in Debian and replacing the 32-bit port. Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer `. `' Physicist `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On Fri, Jun 21, 2024, at 10:52, John Paul Adrian Glaubitz wrote: > Hi Helge and Arnd, > > On Thu, 2024-06-20 at 23:21 +0200, Helge Deller wrote: >> The patch looks good at first sight. >> I'll pick it up in my parisc git tree and will do some testing the >> next few days and then push forward for 6.11 when it opens > > Isn't this supposed to go in as one series or can arch maintainers actually > pick the patches for their architecture and merge them individually? > > If yes, I would prefer to do that for the SuperH patch as well as I usually > prefer merging SuperH patches in my own tree. The patches are all independent of one another, except for a couple of context changes where multiple patches touch the same lines. Feel free to pick up the sh patch directly, I'll just merge whatever is left in the end. I mainly want to ensure we can get all the bugfixes done for v6.10 so I can build my longer cleanup series on top of it for 6.11. Arnd
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On Fri, 2024-06-21 at 10:56 +0200, Arnd Bergmann wrote: > The patches are all independent of one another, except for a couple > of context changes where multiple patches touch the same lines. OK. > Feel free to pick up the sh patch directly, I'll just merge whatever > is left in the end. I mainly want to ensure we can get all the bugfixes > done for v6.10 so I can build my longer cleanup series on top of it > for 6.11. This series is still for 6.10? Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer `. `' Physicist `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Wed, Jun 19, 2024 at 11:28:13AM +0200, Vlastimil Babka wrote: > On 6/18/24 7:53 PM, Paul E. McKenney wrote: > > On Tue, Jun 18, 2024 at 07:21:42PM +0200, Vlastimil Babka wrote: > >> On 6/18/24 6:48 PM, Paul E. McKenney wrote: > >> > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > >> >> > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> >> > >> + > >> >> > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> >> > >> + > >> >> > >> + // XXX use the real kmem_cache_free_barrier() or similar thing > >> >> > >> here > >> >> > > It implies that we need to introduce kfree_rcu_barrier(), a new > >> >> > > API, which i > >> >> > > wanted to avoid initially. > >> >> > > >> >> > I wanted to avoid new API or flags for kfree_rcu() users and this > >> >> > would > >> >> > be achieved. The barrier is used internally so I don't consider that > >> >> > an > >> >> > API to avoid. How difficult is the implementation is another question, > >> >> > depending on how the current batching works. Once (if) we have sheaves > >> >> > proven to work and move kfree_rcu() fully into SLUB, the barrier might > >> >> > also look different and hopefully easier. So maybe it's not worth to > >> >> > invest too much into that barrier and just go for the potentially > >> >> > longer, but easier to implement? > >> >> > > >> >> Right. I agree here. If the cache is not empty, OK, we just defer the > >> >> work, even we can use a big 21 seconds delay, after that we just "warn" > >> >> if it is still not empty and leave it as it is, i.e. emit a warning and > >> >> we are done. > >> >> > >> >> Destroying the cache is not something that must happen right away. > >> > > >> > OK, I have to ask... > >> > > >> > Suppose that the cache is created and destroyed by a module and > >> > init/cleanup time, respectively. Suppose that this module is rmmod'ed > >> > then very quickly insmod'ed. 
> >> > > >> > Do we need to fail the insmod if the kmem_cache has not yet been fully > >> > cleaned up? > >> > >> We don't have any such link between kmem_cache and module to detect that, > >> so > >> we would have to start tracking that. Probably not worth the trouble. > > > > Fair enough! > > > >> > If not, do we have two versions of the same kmem_cache in > >> > /proc during the overlap time? > >> > >> Hm could happen in /proc/slabinfo but without being harmful other than > >> perhaps confusing someone. We could filter out the caches being destroyed > >> trivially. > > > > Or mark them in /proc/slabinfo? Yet another column, yay!!! Or script > > breakage from flagging the name somehow, for example, trailing "/" > > character. > > Yeah I've been resisting such changes to the layout and this wouldn't be > worth it, apart from changing the name itself but not in a dangerous way > like with "/" :) > > >> Sysfs and debugfs might be more problematic as I suppose directory names > >> would clash. I'll have to check... might be even happening now when we do > >> detect leaked objects and just leave the cache around... thanks for the > >> question. > > > > "It is a service that I provide." ;-) > > > > But yes, we might be living with it already and there might already > > be ways people deal with it. > > So it seems if the sysfs/debugfs directories already exist, they will > silently not be created. Wonder if we have such cases today already because > caches with same name exist. I think we do with the zsmalloc using 32 caches > with same name that we discussed elsewhere just recently. > > Also indeed if the cache has leaked objects and won't be thus destroyed, > these directories indeed stay around, as well as the slabinfo entry, and can > prevent new ones from being created (slabinfo lines with same name are not > prevented). 
> > But it wouldn't be great to introduce this possibility to happen for the > temporarily delayed removal due to kfree_rcu() and a module re-insert, since > that's a legitimate case and not buggy state due to leaks. > > The debugfs directory we could remove immediately before handing over to the > scheduled workfn, but if it turns out there was a leak and the workfn leaves > the cache around, debugfs dir will be gone and we can't check the > alloc_traces/free_traces files there (but we have the per-object info > including the traces in the dmesg splat). > > The sysfs directory is currently removed only with the whole cache being > destroyed due to the sysfs/kobject lifetime model. I'd love to untangle it for > other reasons too, but haven't investigated it yet. But again it might be > useful for sysfs dir to stay around for inspection, as for the debugfs. > > We could rename the sysfs/debugfs directories before queuing the work? Add > some prefix like GOING_AWAY-$name. If a leak is detected and the cache stays > forever, another rename to LEAKED-$name. (and same for the slabinfo). But > multiple ones with same name might pile up, so try adding a counter then? > Probably messy to implement,
Re: [PATCH 09/15] sh: rework sync_file_range ABI
On Fri, Jun 21, 2024, at 10:44, John Paul Adrian Glaubitz wrote: > On Thu, 2024-06-20 at 18:23 +0200, Arnd Bergmann wrote: >> From: Arnd Bergmann >> >> The unusual function calling conventions on superh ended up causing > ^^ >It's spelled SuperH Fixed now. >> diff --git a/arch/sh/kernel/sys_sh32.c b/arch/sh/kernel/sys_sh32.c >> index 9dca568509a5..d5a4f7c697d8 100644 >> --- a/arch/sh/kernel/sys_sh32.c >> +++ b/arch/sh/kernel/sys_sh32.c >> @@ -59,3 +59,14 @@ asmlinkage int sys_fadvise64_64_wrapper(int fd, u32 >> offset0, u32 offset1, >> (u64)len0 << 32 | len1, advice); >> #endif >> } >> + >> +/* >> + * swap the arguments the way that libc wants it instead of > > I think "swap the arguments to the order that libc wants them" would > be easier to understand here. Done >> diff --git a/arch/sh/kernel/syscalls/syscall.tbl >> b/arch/sh/kernel/syscalls/syscall.tbl >> index bbf83a2db986..c55fd7696d40 100644 >> --- a/arch/sh/kernel/syscalls/syscall.tbl >> +++ b/arch/sh/kernel/syscalls/syscall.tbl >> @@ -321,7 +321,7 @@ >> 311 common set_robust_list sys_set_robust_list >> 312 common get_robust_list sys_get_robust_list >> 313 common splice sys_splice >> -314 common sync_file_range sys_sync_file_range >> +314 common sync_file_range sys_sh_sync_file_range6 > ^^ > Why the suffix 6 here? In a later part of my cleanup, I'm consolidating all the copies of this function (arm64, mips, parisc, powerpc, s390, sh, sparc, x86) and picked the name sys_sync_file_range6() for common implementation. I end up with four entry points here, so the naming is a bit confusing: - sys_sync_file_range() is only used on 64-bit architectures, on x32 and on mips-n32. This uses four arguments, including two 64-bit wide ones. - sys_sync_file_range2() continues to be used on arm, powerpc, xtensa and now on sh, hexagon and csky. I change the implementation to take six 32-bit arguments, but the ABI remains the same as before, with the flags before offset. 
- sys_sync_file_range6() is used for most other 32-bit ABIs: arc, m68k, microblaze, nios2, openrisc, parisc, s390, sh, sparc and x86. This also has six 32-bit arguments but in the default order (fd, offset, nbytes, flags). - sys_sync_file_range7() is exclusive to mips-o32, this one has an unused argument and is otherwise the same as sys_sync_file_range6(). My plan is to then have some infrastructure to ensure userspace tools (libc, strace, qemu, rust, ...) use the same calling conventions as the kernel. I'm doing the same thing for all other syscalls that have architecture specific calling conventions, so far I'm using fadvise64_64_7 fanotify_mark6 truncate3 truncate4 ftruncate3 ftruncate4 fallocate6 pread5 pread6 pwrite5 pwrite6 preadv5 preadv6 pwritev5 pwritev6 sync_file_range6 fadvise64_64_2 fadvise64_64_6 fadvise64_5 fadvise64_6 readahead4 readahead5 The last number here is usually the number of 32-bit arguments, except for fadvise64_64_2 that uses the same argument reordering trick as sync_file_range2. I'm not too happy with the naming but couldn't come up with anything clearer either, so let me know if you have any ideas there. >> 315 common tee sys_tee >> 316 common vmsplicesys_vmsplice >> 317 common move_pages sys_move_pages >> @@ -395,6 +395,7 @@ >> 385 common pkey_alloc sys_pkey_alloc >> 386 common pkey_free sys_pkey_free >> 387 common rseqsys_rseq >> +388 common sync_file_range2sys_sync_file_range2 >> # room for arch specific syscalls >> 393 common semget sys_semget >> 394 common semctl sys_semctl > > I wonder how you discovered this bug. Did you look up the calling > convention on SuperH > and compare the argument order for the sys_sync_file_range system call > documented there > with the order in the kernel? 
I had to categorize all architectures based on their calling conventions to see if 64-bit arguments need aligned pairs or not, so I wrote a set of simple C files that I compiled for all architectures to see in which cases they insert unused arguments or swap the order of the upper and lower halves. SuperH, parisc and s390 are each slightly different from all the others here, so I ended up reading the ELF psABI docs and/or the compiler sources to be sure. I also read a lot of git history. > Did you also check what order libc uses? I would expect libc on SuperH > misordering the > arguments as well unless I am missing something. Or do we know that the > code is actually > currently broken? Yes, I checked glibc, musl and uclibc-ng for all
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On Fri, Jun 21, 2024, at 11:03, John Paul Adrian Glaubitz wrote: > On Fri, 2024-06-21 at 10:56 +0200, Arnd Bergmann wrote: >> Feel free to pick up the sh patch directly, I'll just merge whatever >> is left in the end. I mainly want to ensure we can get all the bugfixes >> done for v6.10 so I can build my longer cleanup series on top of it >> for 6.11. > > This series is still for 6.10? Yes, these are all the bugfixes that I think we want to backport to stable kernels, so it makes sense to merge them as quickly as possible. The actual stuff I'm working on will come as soon as I have it in a state for public review and won't need to be backported. Arnd
Re: [PATCH] powerpc/pseries: Whitelist dtl slub object for copying to userspace
Anjali K writes: > Hi Michael > > On 18/06/24 12:41, Michael Ellerman wrote: >> I guess there isn't a kmem_cache_create_user_readonly() ? > Thank you for your review. > > My understanding of the question is whether there's a way to whitelist a > region such that it can be copied to userspace, but not written to using > copy_from_user(). Yes that's what I meant, and I pretty much worked that out from looking at the implementation, but was hoping Kees would tell me it was there somewhere, or implement it :) Apologies for being cryptic. > No, we don't have a function to whitelist only for copy_to_user() and not > copy_from_user(). Yep. I'll take this patch as-is, I think we've established that it's pretty low risk to whitelist the whole cache. cheers
Re: [Patch v4 10/10] i2c: pnx: Use threaded irq to fix warning from del_timer_sync()
Hi Andi, On Fri, Jun 21, 2024 at 12:57 AM Andi Shyti wrote: > On Thu, Jun 20, 2024 at 07:56:41PM GMT, Piotr Wojtaszczyk wrote: > > When del_timer_sync() is called in an interrupt context it throws a warning > > because of potential deadlock. Threaded irq handler fixes the potential > > problem. > > > > Signed-off-by: Piotr Wojtaszczyk > > did you run into a lockdep splat? > > Anything against using del_timer(), instead? Have you tried? I didn't get a lockdep splat, but the console was flooded with warnings from https://github.com/torvalds/linux/blob/v6.10-rc4/kernel/time/timer.c#L1655 In the Linux kernel v5.15 I didn't see these warnings. I'm not a maintainer of the driver, and I didn't do any research on what kind of impact using del_timer() instead would have. Maybe Vladimir Zapolskiy will know that. -- Piotr Wojtaszczyk Timesys
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On 2024-06-21 4:54 a.m., John Paul Adrian Glaubitz wrote: > Hi, > On Fri, 2024-06-21 at 08:28 +0200, Arnd Bergmann wrote: >> It's more likely to be related to the upward growing stack. >> I checked the gcc sources and found that out of the 50 supported architectures, >> ARGS_GROW_DOWNWARD is set on everything except for gcn, stormy16 and 32-bit parisc. >> The other two are little-endian though. STACK_GROWS_DOWNWARD in turn is set on >> everything other than parisc (both 32-bit and 64-bit). > Wait a second! Does that mean that on 64-bit PA-RISC, the stack is actually growing > downwards? If yes, that would be a strong argument for creating a 64-bit PA-RISC > port in Debian and replacing the 32-bit port. No, the stack grows upward on both 32 and 64-bit parisc. But stack arguments grow upwards on 64-bit parisc. The argument pointer is needed to access these arguments. In 32-bit parisc, the argument pointer is at a fixed offset relative to the stack pointer and it can be eliminated. Dave -- John David Anglin dave.ang...@bell.net
Re: [Patch v4 08/10] mtd: rawnand: lpc32xx: Request DMA channels using DT entries
On Fri, Jun 21, 2024 at 10:30 AM Miquel Raynal wrote: > > Hi Piotr, > > piotr.wojtaszc...@timesys.com wrote on Thu, 20 Jun 2024 19:56:39 +0200: > > > Move away from pl08x platform data towards device tree. > > > > Signed-off-by: Piotr Wojtaszczyk > > I don't see any change regarding the NAND controller node in the device > tree, is there any dependency with other patches from the same patchset > or may I apply this directly to nand/next? > > Thanks, > Miquèl Yes, this patch depends on "[v4,04/10] ARM: dts: lpc32xx: Add missing dma and i2s properties", which will be split into two or more separate patches as requested in the comments. I'd like to keep driver changes and the corresponding DTS changes in the same patch, but I made a separate patch for the DTS per a request on v2 of the patch set. -- Piotr Wojtaszczyk Timesys
Re: [PATCH 12/15] s390: remove native mmap2() syscall
On Thu, Jun 20, 2024 at 06:23:13PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > The mmap2() syscall has never been used on 64-bit s390x and should > have been removed as part of 5a79859ae0f3 ("s390: remove 31 bit > support"). > > Remove it now. > > Signed-off-by: Arnd Bergmann > --- > arch/s390/kernel/syscall.c | 27 --- > 1 file changed, 27 deletions(-) Acked-by: Heiko Carstens
Re: [PATCH 02/15] syscalls: fix compat_sys_io_pgetevents_time64 usage
On Thu, Jun 20, 2024 at 06:23:03PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > Using sys_io_pgetevents() as the entry point for compat mode tasks > works almost correctly, but misses the sign extension for the min_nr > and nr arguments. > > This was addressed on parisc by switching to > compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc: > io_pgetevents_time64() needs compat syscall in 32-bit compat mode"), > as well as by using more sophisticated system call wrappers on x86 and > s390. However, arm64, mips, powerpc, sparc and riscv still have the > same bug. > > Changes all of them over to use compat_sys_io_pgetevents_time64() > like parisc already does. This was clearly the intention when the > function was originally added, but it got hooked up incorrectly in > the tables. > > Cc: sta...@vger.kernel.org > Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit > architectures") > Signed-off-by: Arnd Bergmann > --- > arch/arm64/include/asm/unistd32.h | 2 +- > arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +- > arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +- > arch/powerpc/kernel/syscalls/syscall.tbl | 2 +- > arch/s390/kernel/syscalls/syscall.tbl | 2 +- > arch/sparc/kernel/syscalls/syscall.tbl| 2 +- > arch/x86/entry/syscalls/syscall_32.tbl| 2 +- > include/uapi/asm-generic/unistd.h | 2 +- > 8 files changed, 8 insertions(+), 8 deletions(-) Acked-by: Heiko Carstens # s390
[PATCH 0/7] mm/mprotect: Fix dax puds
[Based on mm-unstable, commit a53138cdbe3e] Dax has supported pud pages for a while, but mprotect() on puds has been missing since the start. This series tries to fix that by providing pud handling in mprotect(). My real goal is adding more types of pud mappings like hugetlb or pfnmaps, but we probably want pud mprotect() to work first and build the rest on top. Considering nobody reported this until I looked at those other types of pud mappings, I am thinking it may not need to be a fix for stable or be backported. I would guess whoever cares about mprotect() doesn't care about 1G dax puds yet, and vice versa. I hope fixing this in new kernels is fine, but I'm open to suggestions. Quite a few small things changed here and there to teach mprotect() to work on PUDs. E.g. it starts with dropping NUMA_HUGE_PTE_UPDATES, which may stop making much sense when there can be more than one type of huge pte (meanwhile it doesn't sound right at all to account non-NUMA operations too.. more in the commit message of the relevant patch). OTOH, we'll also need to push the mmu notifiers from the pmd to the pud layer, which might need some attention but so far I think it's safe. For these small details, please refer to each patch's commit message. The mprotect() pud path is hopefully straightforward enough, as I kept it as simple as possible. There's no NUMA handling, as dax simply doesn't support that. There's also no userfault involvement, as file memory (even when working with userfault-wp async mode) will need to split a pud, so pud entries don't yet need to know about userfault's existence (but hugetlb entries will; that's also for later).
Tests: What I did test: - cross-build tests that I normally cover [1], except a known issue elsewhere on hugetlb [2] - smoke tested on x86_64 the simplest program [3] on dev_dax 1G PUD mprotect() using QEMU's nvdimm emulation [4] and ndctl to create namespaces with proper alignments, which used to throw "bad pud" but now runs through fine. I also checked that SIGBUS happens on illegal access to protected puds. What I didn't test: - fsdax: I wanted to also give it a shot, but only then noticed it doesn't seem to be supported (according to dax_iomap_fault(), which will always fall back on PUD_ORDER). I remember it being supported before; I could be missing something important there.. please shoot if so. - userfault wp-async: I also wanted to test userfault-wp async being able to split huge puds (here it's simply a clear_pud.. though), but it won't work for devdax anyway, due to not being allowed to do smaller than 1G faults in this case. So skip too. - Power, as no hardware on hand. Thanks, [1] https://gitlab.com/peterx/lkb-harness/-/blob/main/config.json [2] https://lore.kernel.org/all/202406190956.9j1ucie5-...@intel.com [3] https://github.com/xzpeter/clibs/blob/master/misc/dax.c [4] https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt Peter Xu (7): mm/dax: Dump start address in fault handler mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES mm/mprotect: Push mmu notifier to PUDs mm/powerpc: Add missing pud helpers mm/x86: Make pud_leaf() only cares about PSE bit mm/x86: Add missing pud helpers mm/mprotect: fix dax pud handlings arch/powerpc/include/asm/book3s/64/pgtable.h | 3 + arch/powerpc/mm/book3s64/pgtable.c | 20 ++ arch/x86/include/asm/pgtable.h | 39 ++- arch/x86/mm/pgtable.c| 11 +++ drivers/dax/device.c | 6 +- include/linux/huge_mm.h | 24 +++ include/linux/vm_event_item.h| 1 - mm/huge_memory.c | 52 ++ mm/mprotect.c| 74 mm/vmstat.c | 1 - 10 files changed, 195 insertions(+), 36 deletions(-) -- 2.45.0
[PATCH 1/7] mm/dax: Dump start address in fault handler
Currently the dax fault handler dumps the vma range when dynamic debugging enabled. That's mostly not useful. Dump the (aligned) address instead with the order info. Signed-off-by: Peter Xu --- drivers/dax/device.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index eb61598247a9..714174844ca5 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -235,9 +235,9 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order) int id; struct dev_dax *dev_dax = filp->private_data; - dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm, - (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read", - vmf->vma->vm_start, vmf->vma->vm_end, order); + dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm, + (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read", + vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order); id = dax_read_lock(); if (order == 0) -- 2.45.0
[PATCH 2/7] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages altered by protection changes") introduced the "numa_huge_pte_updates" vmstat entry, trying to capture how many huge ptes (in reality, PMD thps at that time) are marked by NUMA balancing. This patch proposes to remove it for several reasons. Firstly, the name is misleading. Nowadays there is more than one way to have a "huge pte", and that's also the major goal of this patch, which paves the way for PUD handling in the change protection code paths. PUDs are coming not only for dax (which has already come, and is broken..), but also for pfnmaps and hugetlb pages. The name will simply stop making sense when PUDs start to be involved in the mprotect() world, and it wouldn't be reasonable either to bump the counter for both pmds and puds. In short, the current accounting won't be right when PUDs come, so the scheme was only suitable at a point in time when PUDs weren't even possible. Secondly, the accounting was simply not right from the start, as it was also affected by call sites other than NUMA. mprotect() is one, and userfaultfd-wp also leverages the change protection path to modify pgtables. To be correct, the counter would need to check the caller, but it never did; at least mprotect() existed even in 2013. This gives me the impression that nobody is seriously using this field, and that it's hardly possible to. We may want to do it right if any NUMA developers would like it to exist, but we should do that with all of the above resolved: considering PUDs, and also making the accounting correct. That can be done on top when there's a real need.
Cc: Huang Ying Cc: Mel Gorman Cc: Alex Thorlton Cc: Rik van Riel Signed-off-by: Peter Xu --- include/linux/vm_event_item.h | 1 - mm/mprotect.c | 8 +--- mm/vmstat.c | 1 - 3 files changed, 1 insertion(+), 9 deletions(-) diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 73fa5fbf33a3..a76b75d6a8f4 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -59,7 +59,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, OOM_KILL, #ifdef CONFIG_NUMA_BALANCING NUMA_PTE_UPDATES, - NUMA_HUGE_PTE_UPDATES, NUMA_HINT_FAULTS, NUMA_HINT_FAULTS_LOCAL, NUMA_PAGE_MIGRATE, diff --git a/mm/mprotect.c b/mm/mprotect.c index 222ab434da54..21172272695e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -363,7 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, pmd_t *pmd; unsigned long next; long pages = 0; - unsigned long nr_huge_updates = 0; struct mmu_notifier_range range; range.start = 0; @@ -411,11 +410,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb, ret = change_huge_pmd(tlb, vma, pmd, addr, newprot, cp_flags); if (ret) { - if (ret == HPAGE_PMD_NR) { + if (ret == HPAGE_PMD_NR) pages += HPAGE_PMD_NR; - nr_huge_updates++; - } - /* huge pmd was handled */ goto next; } @@ -435,8 +431,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, if (range.start) mmu_notifier_invalidate_range_end(&range); - if (nr_huge_updates) - count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates); return pages; } diff --git a/mm/vmstat.c b/mm/vmstat.c index 6e3347789eb2..be774893f69e 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1313,7 +1313,6 @@ const char * const vmstat_text[] = { #ifdef CONFIG_NUMA_BALANCING "numa_pte_updates", - "numa_huge_pte_updates", "numa_hint_faults", "numa_hint_faults_local", "numa_pages_migrated", -- 2.45.0
[PATCH 3/7] mm/mprotect: Push mmu notifier to PUDs
mprotect() issues mmu notifiers at the PMD level. It has been that way since 2014, commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to change_pmd_range"). At that time, the issue was that NUMA balancing could be applied to a huge range of VM memory even if nothing was populated. The notification can be avoided in that case if no valid pmd is detected, which covers either a THP or a PTE pgtable page.

Now, to pave the way for PUD handling, this isn't enough. We need to generate mmu notifications properly on PUD entries too. mprotect() is currently broken on PUDs (e.g., one can easily trigger a kernel error with dax 1G mappings already); this is the start of fixing it. This patch therefore proposes to push such notifications up to the PUD layer.

There is a risk of regressing the problem Rik wanted to resolve back then, but I don't think it should really happen, and I still chose this solution for a few reasons:

1) Consider a large VM that should definitely contain more than GBs of memory: it's highly likely that the PUDs are also none, in which case there will be no regression.

2) KVM has evolved a lot over the years to get rid of rmap walks, which might have been the major cause of the previous soft-lockup. At least the TDP MMU already avoids rmap as long as not nested (which should be the major use case, IIUC), so the TDP MMU pgtable walker will simply see an empty VM pgtable (e.g. EPT on x86); invalidating a fully empty region should in most cases be pretty fast now, compared to 2014.

3) KVM now has explicit code paths to give way to mmu notifiers just like this one, e.g. commit d02c357e5bfa ("KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing"). That also avoids contention which may contribute to a soft-lockup.

4) Sticking with the PMD layer simply doesn't work when PUDs are there... We need one way or another to fix PUD mappings in mprotect(). Pushing the notifiers to the PUD level should be the safest approach as of now, e.g.
there's yet no sign of huge P4D coming on any known archs. Cc: k...@vger.kernel.org Cc: Sean Christopherson Cc: Paolo Bonzini Cc: David Rientjes Cc: Rik van Riel Signed-off-by: Peter Xu --- mm/mprotect.c | 26 -- 1 file changed, 12 insertions(+), 14 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 21172272695e..fb8bf3ff7cd9 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -363,9 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, pmd_t *pmd; unsigned long next; long pages = 0; - struct mmu_notifier_range range; - - range.start = 0; pmd = pmd_offset(pud, addr); do { @@ -383,14 +380,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, if (pmd_none(*pmd)) goto next; - /* invoke the mmu notifier if the pmd is populated */ - if (!range.start) { - mmu_notifier_range_init(&range, - MMU_NOTIFY_PROTECTION_VMA, 0, - vma->vm_mm, addr, end); - mmu_notifier_invalidate_range_start(&range); - } - _pmd = pmdp_get_lockless(pmd); if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) { if ((next - addr != HPAGE_PMD_SIZE) || @@ -428,9 +417,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, cond_resched(); } while (pmd++, addr = next, addr != end); - if (range.start) - mmu_notifier_invalidate_range_end(&range); - return pages; } @@ -438,10 +424,13 @@ static inline long change_pud_range(struct mmu_gather *tlb, struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) { + struct mmu_notifier_range range; pud_t *pud; unsigned long next; long pages = 0, ret; + range.start = 0; + pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); @@ -450,10 +439,19 @@ static inline long change_pud_range(struct mmu_gather *tlb, return ret; if (pud_none_or_clear_bad(pud)) continue; + if (!range.start) { + mmu_notifier_range_init(&range, + MMU_NOTIFY_PROTECTION_VMA, 0, + vma->vm_mm, addr, end); + mmu_notifier_invalidate_range_start(&range); + } pages += 
change_pmd_range(tlb, vma, pud, addr, next, newprot, cp_flags); } while (pud++, addr = next, addr != end); + if (range.start) + mmu_notifier_invalidate_range_end(&range); + return pages;
[PATCH 4/7] mm/powerpc: Add missing pud helpers
These new helpers will be needed for pud entry updates soon. Namely: - pudp_invalidate() - pud_modify() Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: linuxppc-dev@lists.ozlabs.org Cc: Aneesh Kumar K.V Signed-off-by: Peter Xu --- arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++ arch/powerpc/mm/book3s64/pgtable.c | 20 2 files changed, 23 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 8f9432e3855a..fc628a984669 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1108,6 +1108,7 @@ extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot); extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot); +extern pud_t pud_modify(pud_t pud, pgprot_t newprot); extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd); extern void set_pud_at(struct mm_struct *mm, unsigned long addr, @@ -1368,6 +1369,8 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, #define __HAVE_ARCH_PMDP_INVALIDATE extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, +pud_t *pudp); #define pmd_move_must_withdraw pmd_move_must_withdraw struct spinlock; diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 2975ea0841ba..c6ae969020e0 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -176,6 +176,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, return __pmd(old_pmd); } +pmd_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, + pud_t *pudp) +{ + unsigned long old_pud; + + VM_WARN_ON_ONCE(!pmd_present(*pmdp)); + old_pmd = 
pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID); + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE); + return __pmd(old_pmd); +} + pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, int full) { @@ -259,6 +270,15 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) pmdv &= _HPAGE_CHG_MASK; return pmd_set_protbits(__pmd(pmdv), newprot); } + +pud_t pud_modify(pud_t pud, pgprot_t newprot) +{ + unsigned long pudv; + + pudv = pud_val(pud); + pudv &= _HPAGE_CHG_MASK; + return pud_set_protbits(__pud(pudv), newprot); +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ /* For use by kexec, called with MMU off */ -- 2.45.0
[PATCH 6/7] mm/x86: Add missing pud helpers
These new helpers will be needed for pud entry updates soon. Namely: - pudp_invalidate() - pud_modify() Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Signed-off-by: Peter Xu --- arch/x86/include/asm/pgtable.h | 36 ++ arch/x86/mm/pgtable.c | 11 +++ 2 files changed, 47 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 25fc6d809572..3c23077adca6 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -775,6 +775,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd) __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE))); } +static inline pud_t pud_mkinvalid(pud_t pud) +{ + return pfn_pud(pud_pfn(pud), + __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE))); +} + static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask); static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) @@ -839,6 +845,21 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) return pmd_result; } +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot) +{ + pudval_t val = pud_val(pud), oldval = val; + + /* +* NOTE: no need to consider shadow stack complexities because it +* doesn't support 1G mappings. 
+*/ + val &= _HPAGE_CHG_MASK; + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK; + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK); + + return __pud(val); +} + /* * mprotect needs to preserve PAT and encryption bits when updating * vm_page_prot @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma, } #endif +static inline pud_t pudp_establish(struct vm_area_struct *vma, + unsigned long address, pud_t *pudp, pud_t pud) +{ + if (IS_ENABLED(CONFIG_SMP)) { + return xchg(pudp, pud); + } else { + pud_t old = *pudp; + WRITE_ONCE(*pudp, pud); + return old; + } +} + #define __HAVE_ARCH_PMDP_INVALIDATE_AD extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, + pud_t *pudp); + /* * Page table pages are page-aligned. The lower half of the top * level is used for userspace and the top half for the kernel. diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 93e54ba91fbf..4e245a1526ad 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -641,6 +641,17 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address, } #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, +pud_t *pudp) +{ + VM_WARN_ON_ONCE(!pud_present(*pudp)); + pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp)); + flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE); + return old; +} +#endif + /** * reserve_top_address - reserves a hole in the top of kernel address space * @reserve - size of hole to reserve -- 2.45.0
[PATCH 7/7] mm/mprotect: fix dax pud handlings
This is only relevant to the two archs that support PUD dax, aka x86_64 and ppc64. PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not count in this case.

DAX has had PUD mappings for years, but the change protection path never worked. When the path is triggered in any form (a simple test program would be: call mprotect() on a 1G dev_dax mapping), the kernel reports "bad pud". This patch should fix that.

The new change_huge_pud() tries to keep everything simple. For example, it doesn't optimize the write bit, as that would need even more PUD helpers. It's not too bad anyway to take one extra write fault in the worst case once per 1G range; it could matter more if it happened per PAGE_SIZE, though. Neither does it support userfault-wp bits, as no such PUD mapping is supported; file mappings always need a split there. The same applies to TLB shootdown: the pmd path (which was x86 only) has the trick of using the _ad() version of pmdp_invalidate*(), which can avoid one redundant TLB flush, but let's also leave that for later. Again, the larger the mapping, the smaller such an effect.

Another thing worth mentioning is that this path needs to be careful handling the "retry" event from change_huge_pud() (where it can return 0): it isn't like change_huge_pmd(), as the pmd version is safe with all conditions handled in change_pte_range() later, thanks to Hugh's new pte_offset_map_lock(). In short, change_pte_range() is simply smarter than change_pmd_range() now, after the shmem thp collapse rework. For that reason, change_pud_range() needs a proper retry if it races with something else and a huge PUD changed from under us.

Cc: Dan Williams Cc: Matthew Wilcox Cc: Dave Jiang Cc: Hugh Dickins Cc: Kirill A.
Shutemov Cc: Vlastimil Babka Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Michael Ellerman Cc: Aneesh Kumar K.V Cc: Oscar Salvador Cc: x...@kernel.org Cc: linuxppc-dev@lists.ozlabs.org Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages") Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage") Signed-off-by: Peter Xu --- include/linux/huge_mm.h | 24 +++ mm/huge_memory.c| 52 + mm/mprotect.c | 40 --- 3 files changed, 108 insertions(+), 8 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 212cca384d7e..b46673689f19 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -345,6 +345,17 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, unsigned long address); +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, unsigned long addr, pgprot_t newprot, + unsigned long cp_flags); +#else +static inline int +change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, unsigned long addr, pgprot_t newprot, + unsigned long cp_flags) { return 0; } +#endif + #define split_huge_pud(__vma, __pud, __address) \ do {\ pud_t *pud = (__pud); \ @@ -588,6 +599,19 @@ static inline int next_order(unsigned long *orders, int prev) { return 0; } + +static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, + unsigned long address) +{ +} + +static inline int change_huge_pud(struct mmu_gather *tlb, + struct vm_area_struct *vma, pud_t *pudp, + unsigned long addr, pgprot_t newprot, + unsigned long cp_flags) +{ + return 0; +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline int split_folio_to_list_to_order(struct folio *folio, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0fffaa58a47a..af8d83f4ce02 100644 --- a/mm/huge_memory.c +++ 
b/mm/huge_memory.c @@ -2091,6 +2091,53 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, return ret; } +/* + * Returns: + * + * - 0: if pud leaf changed from under us + * - 1: if pud can be skipped + * - HPAGE_PUD_NR: if pud was successfully processed + */ +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, unsigned long addr, pgprot_t newprot, + unsigned long cp_flags) +{ + struct mm_struct *mm = vma->vm_mm; + pud_t oldpud, entry; + spinlock_t *ptl; + + tlb_change_page_size(tlb, HPAGE_PUD_SIZE); + + /* NUMA balancing doesn't apply to dax */ + if (cp_flags & MM_CP_PROT_NUMA) +
[PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit
An entry should be reported as PUD leaf even if it's PROT_NONE, in which case PRESENT bit isn't there. I hit bad pud without this when testing dax 1G on zapping a PROT_NONE PUD.

Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org
Signed-off-by: Peter Xu
---
 arch/x86/include/asm/pgtable.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..25fc6d809572 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1073,8 +1073,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
 #define pud_leaf pud_leaf
 static inline bool pud_leaf(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
-		(_PAGE_PSE | _PAGE_PRESENT);
+	return pud_val(pud) & _PAGE_PSE;
 }
 
 static inline int pud_bad(pud_t pud)
--
2.45.0
Re: [PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit
On 6/21/24 07:25, Peter Xu wrote: > An entry should be reported as PUD leaf even if it's PROT_NONE, in which > case PRESENT bit isn't there. I hit bad pud without this when testing dax > 1G on zapping a PROT_NONE PUD. Yeah, looks like pmd_leaf() got fixed up because of its involvement in THP, but PUDs got missed. This patch also realigns pmd_leaf() and pud_leaf() behavior, which is important. Acked-by: Dave Hansen
Re: [PATCH] tools/perf: Handle perftool-testsuite_probe testcases fail when kernel debuginfo is not present
On Mon, 17 Jun 2024 17:51:21 +0530, Athira Rajeev wrote: > Running "perftool-testsuite_probe" fails as below: > > ./perf test -v "perftool-testsuite_probe" > 83: perftool-testsuite_probe : FAILED > > There are three fails: > > [...] Applied to perf-tools-next, thanks! Best regards, Namhyung
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On 6/21/24 07:25, Peter Xu wrote: > These new helpers will be needed for pud entry updates soon. Namely: > > - pudp_invalidate() > - pud_modify() I think it's also definitely worth noting where you got this code from. Presumably you copied, pasted and modified the PMD code. That's fine, but it should be called out. ... > +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot) > +{ > + pudval_t val = pud_val(pud), oldval = val; > + > + /* > + * NOTE: no need to consider shadow stack complexities because it > + * doesn't support 1G mappings. > + */ > + val &= _HPAGE_CHG_MASK; > + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK; > + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK); > + > + return __pud(val); > +} First of all, the comment to explain what you didn't do here is as many lines as the code to _actually_ implement it. Second, I believe this might have missed the purpose of the "shadow stack complexities". The pmd/pte code is there not to support modifying shadow stack mappings, it's there to avoid inadvertent shadow stack mapping creation. That "NOTE:" is ambiguous as to whether the shadow stacks aren't supported on 1G mappings in Linux or the hardware (I just checked the hardware docs and don't see anything making 1G mappings special, btw). But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it to make it Dirty=1,Write=0? What prevents that from being misinterpreted by the hardware as being a valid 1G shadow stack mapping? 
> /* > * mprotect needs to preserve PAT and encryption bits when updating > * vm_page_prot > @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct > vm_area_struct *vma, > } > #endif > > +static inline pud_t pudp_establish(struct vm_area_struct *vma, > + unsigned long address, pud_t *pudp, pud_t pud) > +{ > + if (IS_ENABLED(CONFIG_SMP)) { > + return xchg(pudp, pud); > + } else { > + pud_t old = *pudp; > + WRITE_ONCE(*pudp, pud); > + return old; > + } > +} Why is there no: page_table_check_pud_set(vma->vm_mm, pudp, pud); ? Sure, it doesn't _do_ anything today. But the PMD code has it today. So leaving it out creates a divergence that honestly can only serve to bite us in the future and will create a head-scratching delta for anyone that is comparing PUD and PMD implementations in the future.
Re: [PATCH 4/7] mm/powerpc: Add missing pud helpers
On Fri, Jun 21, 2024 at 10:25:01AM -0400, Peter Xu wrote: > +pmd_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, > + pud_t *pudp) > +{ > + unsigned long old_pud; > + > + VM_WARN_ON_ONCE(!pmd_present(*pmdp)); > + old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, > _PAGE_INVALID); > + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE); > + return __pmd(old_pmd); > +} I'll need an amend at least here, and my test harness won't catch it even if it's a mistake as silly as it could be.. My apologies for such noise. Below is what I plan to squash into this patch when I repost v2. ===8<=== diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index c6ae969020e0..ea2c83634434 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -176,15 +176,15 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, return __pmd(old_pmd); } -pmd_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, pud_t *pudp) { unsigned long old_pud; - VM_WARN_ON_ONCE(!pmd_present(*pmdp)); - old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID); - flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE); - return __pmd(old_pmd); + VM_WARN_ON_ONCE(!pud_present(*pudp)); + old_pud = pud_hugepage_update(vma->vm_mm, address, pudp, _PAGE_PRESENT, _PAGE_INVALID); + flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE); + return __pud(old_pud); } pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, ===8<=== -- Peter Xu
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On Fri, Jun 21, 2024 at 07:51:26AM -0700, Dave Hansen wrote: > On 6/21/24 07:25, Peter Xu wrote: > > These new helpers will be needed for pud entry updates soon. Namely: > > > > - pudp_invalidate() > > - pud_modify() > > I think it's also definitely worth noting where you got this code from. > Presumably you copied, pasted and modified the PMD code. That's fine, > but it should be called out. Yes that's from PMD ones. Sure, I will add that. > > ... > > +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot) > > +{ > > + pudval_t val = pud_val(pud), oldval = val; > > + > > + /* > > +* NOTE: no need to consider shadow stack complexities because it > > +* doesn't support 1G mappings. > > +*/ > > + val &= _HPAGE_CHG_MASK; > > + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK; > > + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK); > > + > > + return __pud(val); > > +} > > First of all, the comment to explain what you didn't do here is as many > lines as the code to _actually_ implement it. > > Second, I believe this might have missed the purpose of the "shadow > stack complexities". The pmd/pte code is there not to support modifying > shadow stack mappings, it's there to avoid inadvertent shadow stack > mapping creation. > > That "NOTE:" is ambiguous as to whether the shadow stacks aren't > supported on 1G mappings in Linux or the hardware (I just checked the > hardware docs and don't see anything making 1G mappings special, btw). Right this could be ambiguous indeed; I was trying to refer to the fact where shadow stack is only supported on anon, and anon doesn't support 1G. But looks like I'm more than wrong than that.. > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it > to make it Dirty=1,Write=0? What prevents that from being > misinterpreted by the hardware as being a valid 1G shadow stack mapping? Thanks for pointing that out. I think I was thinking it will only take effect on VM_SHADOW_STACK first, so it's not? 
I was indeed trying to find more information on shadow stack at that time but I can't find as much on the pgtable implications, on e.g. whether "D=1 + W=0" globally will be recognized as shadow stack. At least on SDM March 2024 version Vol3 Chap4 pgtable entries still don't explain these details, or maybe I missed it. Please let me know if there's suggestion on what I can read before I post a v2. So if it's globally taking effect, indeed we'll need to handle them in PUDs too. Asides, not sure whether it's off-topic to ask here, but... why shadow stack doesn't reuse an old soft-bit to explicitly mark "this is shadow stack ptes" when designing the spec? Now it consumed bit 58 anyway for caching dirty. IIUC we can avoid all these "move back and forth" issue on dirty bit if so. > > > /* > > * mprotect needs to preserve PAT and encryption bits when updating > > * vm_page_prot > > @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct > > vm_area_struct *vma, > > } > > #endif > > > > +static inline pud_t pudp_establish(struct vm_area_struct *vma, > > + unsigned long address, pud_t *pudp, pud_t pud) > > +{ > > + if (IS_ENABLED(CONFIG_SMP)) { > > + return xchg(pudp, pud); > > + } else { > > + pud_t old = *pudp; > > + WRITE_ONCE(*pudp, pud); > > + return old; > > + } > > +} > > Why is there no: > > page_table_check_pud_set(vma->vm_mm, pudp, pud); > > ? Sure, it doesn't _do_ anything today. But the PMD code has it today. > So leaving it out creates a divergence that honestly can only serve to > bite us in the future and will create a head-scratching delta for anyone > that is comparing PUD and PMD implementations in the future. Good question, I really don't remember why I didn't have that, since I should have referenced the pmd helper. I'll add them and see whether I'll hit something otherwise. Thanks for the review. -- Peter Xu
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On 6/21/24 08:45, Peter Xu wrote: > On Fri, Jun 21, 2024 at 07:51:26AM -0700, Dave Hansen wrote: ... >> But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it >> to make it Dirty=1,Write=0? What prevents that from being >> misinterpreted by the hardware as being a valid 1G shadow stack mapping? > > Thanks for pointing that out. I think I was thinking it will only take > effect on VM_SHADOW_STACK first, so it's not? > > I was indeed trying to find more information on shadow stack at that time > but I can't find as much on the pgtable implications, on e.g. whether "D=1 > + W=0" globally will be recognized as shadow stack. At least on SDM March > 2024 version Vol3 Chap4 pgtable entries still don't explain these details, > or maybe I missed it. Please let me know if there's suggestion on what I > can read before I post a v2. It's in the "Determination of Access Rights" section. A linear address is a shadow-stack address if the following are true of the translation of the linear address: (1) the R/W flag (bit 1) is 0 and the dirty flag (bit 6) is 1 in the paging- structure entry that maps the page containing the linear address; and (2) the R/W flag is 1 in every other paging- structure entry controlling the translation of the linear address. > So if it's globally taking effect, indeed we'll need to handle them in PUDs > too. > > Asides, not sure whether it's off-topic to ask here, but... why shadow > stack doesn't reuse an old soft-bit to explicitly mark "this is shadow > stack ptes" when designing the spec? Now it consumed bit 58 anyway for > caching dirty. IIUC we can avoid all these "move back and forth" issue on > dirty bit if so. The design accommodates "other" OSes that are using all the software bits for other things. For Linux, you're right, we just ended up consuming a software bit _anyway_ so we got all the complexity of the goofy permissions *AND* lost a bit in the end. Lose, lose. 
>>> /* >>> * mprotect needs to preserve PAT and encryption bits when updating >>> * vm_page_prot >>> @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct >>> vm_area_struct *vma, >>> } >>> #endif >>> >>> +static inline pud_t pudp_establish(struct vm_area_struct *vma, >>> + unsigned long address, pud_t *pudp, pud_t pud) >>> +{ >>> + if (IS_ENABLED(CONFIG_SMP)) { >>> + return xchg(pudp, pud); >>> + } else { >>> + pud_t old = *pudp; >>> + WRITE_ONCE(*pudp, pud); >>> + return old; >>> + } >>> +} >> >> Why is there no: >> >> page_table_check_pud_set(vma->vm_mm, pudp, pud); >> >> ? Sure, it doesn't _do_ anything today. But the PMD code has it today. >> So leaving it out creates a divergence that honestly can only serve to >> bite us in the future and will create a head-scratching delta for anyone >> that is comparing PUD and PMD implementations in the future. > > Good question, I really don't remember why I didn't have that, since I > should have referenced the pmd helper. I'll add them and see whether I'll > hit something otherwise. > > Thanks for the review. One big thing I did in this review was make sure that the PMD and PUD helpers were doing the same thing. Would you mind circling back and double-checking the same before you repost this?
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On 6/21/24 11:52, Arnd Bergmann wrote:
> On Fri, Jun 21, 2024, at 11:03, John Paul Adrian Glaubitz wrote:
>> On Fri, 2024-06-21 at 10:56 +0200, Arnd Bergmann wrote:
>>> Feel free to pick up the sh patch directly, I'll just merge whatever
>>> is left in the end. I mainly want to ensure we can get all the
>>> bugfixes done for v6.10 so I can build my longer cleanup series on
>>> top of it for 6.11.
>>
>> This series is still for 6.10?
>
> Yes, these are all the bugfixes that I think we want to backport to
> stable kernels, so it makes sense to merge them as quickly as possible.
> The actual stuff I'm working on will come as soon as I have it in a
> state for public review and won't need to be backported.

Ah, OK in that case would you please keep the two parisc patches in your git tree? I didn't plan to send a new pull request during v6.10, so it's easier for me if you keep them and send them together with your other remaining patches. (I'll drop them now from the parisc tree)

I tested both patches, so you may add:
Tested-by: Helge Deller
Acked-by: Helge Deller

Thank you!
Helge
Re: [PATCH v4 15/29] arm64: handle PKEY/POE faults
On Fri, May 03, 2024 at 02:01:33PM +0100, Joey Gouly wrote: > @@ -529,6 +547,8 @@ static int __kprobes do_page_fault(unsigned long far, > unsigned long esr, > unsigned int mm_flags = FAULT_FLAG_DEFAULT; > unsigned long addr = untagged_addr(far); > struct vm_area_struct *vma; > + bool pkey_fault = false; > + int pkey = -1; > > if (kprobe_page_fault(regs, esr)) > return 0; > @@ -590,6 +610,12 @@ static int __kprobes do_page_fault(unsigned long far, > unsigned long esr, > vma_end_read(vma); > goto lock_mmap; > } > + > + if (fault_from_pkey(esr, vma, mm_flags)) { > + vma_end_read(vma); > + goto lock_mmap; > + } > + > fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, > regs); > if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) > vma_end_read(vma); > @@ -617,6 +643,11 @@ static int __kprobes do_page_fault(unsigned long far, > unsigned long esr, > goto done; > } > > + if (fault_from_pkey(esr, vma, mm_flags)) { > + pkey_fault = true; > + pkey = vma_pkey(vma); > + } I was wondering if we actually need to test this again. We know the fault was from a pkey already above but I guess it matches what we do with the vma->vm_flags check in case it races with some mprotect() call. > + > fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs); You'll need to rebase this on 6.10-rcX since this function disappeared. Otherwise the patch looks fine. Reviewed-by: Catalin Marinas
Re: [PATCH v4 05/29] arm64: cpufeature: add Permission Overlay Extension cpucap
On Fri, May 03, 2024 at 02:01:23PM +0100, Joey Gouly wrote: > This indicates if the system supports POE. This is a CPUCAP_BOOT_CPU_FEATURE > as the boot CPU will enable POE if it has it, so secondary CPUs must also > have this feature. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Adding some acks otherwise I'll forget what I reviewed. Acked-by: Catalin Marinas
Re: [PATCH v4 05/29] arm64: cpufeature: add Permission Overlay Extension cpucap
On Fri, May 03, 2024 at 02:01:23PM +0100, Joey Gouly wrote: > This indicates if the system supports POE. This is a CPUCAP_BOOT_CPU_FEATURE > as the boot CPU will enable POE if it has it, so secondary CPUs must also > have this feature. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Acked-by: Catalin Marinas
Re: [PATCH v4 05/29] arm64: cpufeature: add Permission Overlay Extension cpucap
On Fri, Jun 21, 2024 at 06:01:52PM +0100, Catalin Marinas wrote: > On Fri, May 03, 2024 at 02:01:23PM +0100, Joey Gouly wrote: > > This indicates if the system supports POE. This is a CPUCAP_BOOT_CPU_FEATURE > > as the boot CPU will enable POE if it has it, so secondary CPUs must also > > have this feature. > > > > Signed-off-by: Joey Gouly > > Cc: Catalin Marinas > > Cc: Will Deacon > > Acked-by: Catalin Marinas One ack is sufficient, ignore this one ;) -- Catalin
Re: [PATCH v4 06/29] arm64: context switch POR_EL0 register
On Fri, May 03, 2024 at 02:01:24PM +0100, Joey Gouly wrote: > POR_EL0 is a register that can be modified by userspace directly, > so it must be context switched. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Reviewed-by: Catalin Marinas
Re: [PATCH v4 10/29] arm64: enable the Permission Overlay Extension for EL0
On Fri, May 03, 2024 at 02:01:28PM +0100, Joey Gouly wrote: > Expose a HWCAP and ID_AA64MMFR3_EL1_S1POE to userspace, so they can be used to > check if the CPU supports the feature. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Reviewed-by: Catalin Marinas
Re: [PATCH v4 11/29] arm64: re-order MTE VM_ flags
On Fri, May 03, 2024 at 02:01:29PM +0100, Joey Gouly wrote: > To make it easier to share the generic PKEYs flags, move the MTE flag. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Acked-by: Catalin Marinas
Re: [PATCH v4 12/29] arm64: add POIndex defines
On Fri, May 03, 2024 at 02:01:30PM +0100, Joey Gouly wrote: > The 3-bit POIndex is stored in the PTE at bits 60..62. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Acked-by: Catalin Marinas
Re: [PATCH v4 06/29] arm64: context switch POR_EL0 register
On Fri, May 03, 2024 at 02:01:24PM +0100, Joey Gouly wrote: > +static void flush_poe(void) > +{ > + if (!system_supports_poe()) > + return; > + > + write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0); > + /* ISB required for kernel uaccess routines when chaning POR_EL0 */ Nit: s/chaning/changing/ -- Catalin
Re: [PATCH v4 16/29] arm64: add pte_access_permitted_no_overlay()
On Fri, May 03, 2024 at 02:01:34PM +0100, Joey Gouly wrote:
> We do not want to take POE into account when clearing the MTE tags.
>
> Signed-off-by: Joey Gouly
> Cc: Catalin Marinas
> Cc: Will Deacon

Reviewed-by: Catalin Marinas
Re: [PATCH v4 20/29] arm64: enable POE and PIE to coexist
On Fri, May 03, 2024 at 02:01:38PM +0100, Joey Gouly wrote: > Set the EL0/userspace indirection encodings to be the overlay enabled > variants of the permissions. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Reviewed-by: Catalin Marinas
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote: > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it > to make it Dirty=1,Write=0? What prevents that from being > misinterpreted by the hardware as being a valid 1G shadow stack mapping? Hmm, it looks like we could use an arch_check_zapped_pud() that does a warning like arch_check_zapped_pte/pmd() too. Not that we had no use for one before this.
Re: [PATCH 0/5] PCI: endpoint: Add EPC 'deinit' event and dw_pcie_ep_linkdown() API
Hello, > Applied patch 2/5 to pci/endpoint! Krzysztof, please apply patches 1/5 and 5/5 > to controller/dwc (patches 3/5 and 4/5 are already applied by you). Applied to controller/dwc, thank you! [01/02] PCI: dwc: ep: Remove dw_pcie_ep_init_notify() wrapper https://git.kernel.org/pci/pci/c/9eba2f70362f [02/02] PCI: layerscape-ep: Use the generic dw_pcie_ep_linkdown() API to handle Link Down event https://git.kernel.org/pci/pci/c/14638af66309 Krzysztof
Re: [musl] Re: [PATCH 09/15] sh: rework sync_file_range ABI
On Fri, Jun 21, 2024 at 10:44:39AM +0200, John Paul Adrian Glaubitz wrote:
> Hi Arnd,
>
> thanks for your patch!
>
> On Thu, 2024-06-20 at 18:23 +0200, Arnd Bergmann wrote:
> > From: Arnd Bergmann
> >
> > The unusual function calling conventions on superh ended up causing
>                                                 ^^
> It's spelled SuperH
>
> > sync_file_range to have the wrong argument order, with the 'flags'
> > argument getting sorted before 'nbytes' by the compiler.
> >
> > In userspace, I found that musl, glibc, uclibc and strace all expect the
> > normal calling conventions with 'nbytes' last, so changing the kernel
> > to match them should make all of those work.
> >
> > In order to be able to also fix libc implementations to work with existing
> > kernels, they need to be able to tell which ABI is used. An easy way
> > to do this is to add yet another system call using the sync_file_range2
> > ABI that works the same on all architectures.
> >
> > Old user binaries can now work on new kernels, and new binaries can
> > try the new sync_file_range2() to work with new kernels or fall back
> > to the old sync_file_range() version if that doesn't exist.
> >
> > Cc: sta...@vger.kernel.org
> > Fixes: 75c92acdd5b1 ("sh: Wire up new syscalls.")
> > Signed-off-by: Arnd Bergmann
> > ---
> >  arch/sh/kernel/sys_sh32.c           | 11 +++
> >  arch/sh/kernel/syscalls/syscall.tbl |  3 ++-
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/sh/kernel/sys_sh32.c b/arch/sh/kernel/sys_sh32.c
> > index 9dca568509a5..d5a4f7c697d8 100644
> > --- a/arch/sh/kernel/sys_sh32.c
> > +++ b/arch/sh/kernel/sys_sh32.c
> > @@ -59,3 +59,14 @@ asmlinkage int sys_fadvise64_64_wrapper(int fd, u32 offset0, u32 offset1,
> >  			    (u64)len0 << 32 | len1, advice);
> >  #endif
> >  }
> > +
> > +/*
> > + * swap the arguments the way that libc wants it instead of
>
> I think "swap the arguments to the order that libc wants them" would
> be easier to understand here.
>
> > + * moving flags ahead of the 64-bit nbytes argument
> > + */
> > +SYSCALL_DEFINE6(sh_sync_file_range6, int, fd, SC_ARG64(offset),
> > +                SC_ARG64(nbytes), unsigned int, flags)
> > +{
> > +	return ksys_sync_file_range(fd, SC_VAL64(loff_t, offset),
> > +				    SC_VAL64(loff_t, nbytes), flags);
> > +}
> > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> > index bbf83a2db986..c55fd7696d40 100644
> > --- a/arch/sh/kernel/syscalls/syscall.tbl
> > +++ b/arch/sh/kernel/syscalls/syscall.tbl
> > @@ -321,7 +321,7 @@
> >  311	common	set_robust_list		sys_set_robust_list
> >  312	common	get_robust_list		sys_get_robust_list
> >  313	common	splice			sys_splice
> > -314	common	sync_file_range		sys_sync_file_range
> > +314	common	sync_file_range		sys_sh_sync_file_range6
>                                                           ^^
> Why the suffix 6 here?
>
> >  315	common	tee			sys_tee
> >  316	common	vmsplice		sys_vmsplice
> >  317	common	move_pages		sys_move_pages
> > @@ -395,6 +395,7 @@
> >  385	common	pkey_alloc		sys_pkey_alloc
> >  386	common	pkey_free		sys_pkey_free
> >  387	common	rseq			sys_rseq
> > +388	common	sync_file_range2	sys_sync_file_range2
> >  # room for arch specific syscalls
> >  393	common	semget			sys_semget
> >  394	common	semctl			sys_semctl
>
> I wonder how you discovered this bug. Did you look up the calling convention on SuperH
> and compare the argument order for the sys_sync_file_range system call documented there
> with the order in the kernel?
>
> Did you also check what order libc uses? I would expect libc on SuperH misordering the
> arguments as well unless I am missing something. Or do we know that the code is actually
> currently broken?

No, there's no reason libc would misorder them because syscalls aren't
function calls, and aren't subject to function call ABI. We have to
explicitly bind the arguments to registers and make a syscall
instruction.

The only reason this bug happened on the kernel side is that someone
thought it would be a smart idea to save maybe 10 instructions by
treating the register state on entry as directly suitable to jump from
asm to a C function rather than explicitly marshalling the arguments out
of the user-kernel syscall ABI positions into actual arguments to a C
function call.

Rich
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On Fri, Jun 21, 2024 at 07:36:30PM +, Edgecombe, Rick P wrote: > On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote: > > > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it > > to make it Dirty=1,Write=0? What prevents that from being > > misinterpreted by the hardware as being a valid 1G shadow stack mapping? > > Hmm, it looks like we could use an arch_check_zapped_pud() that does a warning > like arch_check_zapped_pte/pmd() too. Not that we had no use for one before > this. I can definitely look into that, but this check only happens when zapping, and IIUC it means there can still be outliers floating around. I wonder whether it should rely on page_table_check_pxx_set() from that regard. Thanks, -- Peter Xu
[PATCH 1/1] dt-bindings: soc: fsl: Convert q(b)man-* to yaml format
Convert qman, bman, qman-portals, bman-portals to yaml format.

Additional changes for fsl,q(b)man-portal:
- Only keep one example.
- Add fsl,qman-channel-id property.
- Use interrupt type macro.
- Remove top level qman-portals@ff420 in the example.

Additional changes for fsl,q(b)man:
- Fixed example error.
- Remove redundant part, only keep fsl,qman node.
- Change memory-regions to memory-region.
- fsl,q(b)man-portals is not a required property

Additional changes for fsl,qman-fqd.yaml:
- Fixed example error.
- Only keep one example.
- Ref to reserved-memory.yaml
- Merge fsl,bman reserved memory part

Signed-off-by: Frank Li
---
 .../bindings/soc/fsl/bman-portals.txt | 56 --
 .../devicetree/bindings/soc/fsl/bman.txt | 137 -
 .../bindings/soc/fsl/fsl,bman-portal.yaml | 51 +
 .../devicetree/bindings/soc/fsl/fsl,bman.yaml | 83
 .../bindings/soc/fsl/fsl,qman-fqd.yaml | 68 +++
 .../bindings/soc/fsl/fsl,qman-portal.yaml | 110 +++
 .../devicetree/bindings/soc/fsl/fsl,qman.yaml | 97 +
 .../bindings/soc/fsl/qman-portals.txt | 134 -
 .../devicetree/bindings/soc/fsl/qman.txt | 187 --
 9 files changed, 409 insertions(+), 514 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman-portals.txt
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman.txt
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,bman-portal.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,bman.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman-fqd.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman-portal.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman.yaml
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman-portals.txt
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman.txt
diff --git a/Documentation/devicetree/bindings/soc/fsl/bman-portals.txt b/Documentation/devicetree/bindings/soc/fsl/bman-portals.txt
deleted file mode 100644
index 2a00e14e11e02..0
---
a/Documentation/devicetree/bindings/soc/fsl/bman-portals.txt +++ /dev/null @@ -1,56 +0,0 @@ -QorIQ DPAA Buffer Manager Portals Device Tree Binding - -Copyright (C) 2008 - 2014 Freescale Semiconductor Inc. - -CONTENTS - - - BMan Portal - - Example - -BMan Portal Node - -Portals are memory mapped interfaces to BMan that allow low-latency, lock-less -interaction by software running on processor cores, accelerators and network -interfaces with the BMan - -PROPERTIES - -- compatible - Usage: Required - Value type: - Definition: Must include "fsl,bman-portal-" - May include "fsl,-bman-portal" or "fsl,bman-portal" - -- reg - Usage: Required - Value type: - Definition: Two regions. The first is the cache-enabled region of - the portal. The second is the cache-inhibited region of - the portal - -- interrupts - Usage: Required - Value type: - Definition: Standard property - -EXAMPLE - -The example below shows a (P4080) BMan portals container/bus node with two portals - - bman-portals@ff400 { - #address-cells = <1>; - #size-cells = <1>; - compatible = "simple-bus"; - ranges = <0 0xf 0xf400 0x20>; - - bman-portal@0 { - compatible = "fsl,bman-portal-1.0.0", "fsl,bman-portal"; - reg = <0x0 0x4000>, <0x10 0x1000>; - interrupts = <105 2 0 0>; - }; - bman-portal@4000 { - compatible = "fsl,bman-portal-1.0.0", "fsl,bman-portal"; - reg = <0x4000 0x4000>, <0x101000 0x1000>; - interrupts = <107 2 0 0>; - }; - }; diff --git a/Documentation/devicetree/bindings/soc/fsl/bman.txt b/Documentation/devicetree/bindings/soc/fsl/bman.txt deleted file mode 100644 index 48eed140765b0..0 --- a/Documentation/devicetree/bindings/soc/fsl/bman.txt +++ /dev/null @@ -1,137 +0,0 @@ -QorIQ DPAA Buffer Manager Device Tree Bindings - -Copyright (C) 2008 - 2014 Freescale Semiconductor Inc. - -CONTENTS - - - BMan Node - - BMan Private Memory Node - - Example - -BMan Node - -The Buffer Manager is part of the Data-Path Acceleration Architecture (DPAA). 
-BMan supports hardware allocation and deallocation of buffers belonging to pools -originally created by software with configurable depletion thresholds. This -binding covers the CCSR space programming model - -PROPERTIES - -- compatible - Usage: Required - Value type: - Definition: Must include "fsl,bman" - May include "fsl,-bman" - -- reg -
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On Fri, 2024-06-21 at 16:27 -0400, Peter Xu wrote: > On Fri, Jun 21, 2024 at 07:36:30PM +, Edgecombe, Rick P wrote: > > On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote: > > > > > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it > > > to make it Dirty=1,Write=0? What prevents that from being > > > misinterpreted by the hardware as being a valid 1G shadow stack mapping? > > > > Hmm, it looks like we could use an arch_check_zapped_pud() that does a > > warning > > like arch_check_zapped_pte/pmd() too. Not that we had no use for one before > > this. > > I can definitely look into that, but this check only happens when zapping, > and IIUC it means there can still be outliers floating around. I wonder > whether it should rely on page_table_check_pxx_set() from that regard. Yes, it's not perfect. Hmm, it looks like the page_table_check would catch a lot more cases, but it would have to be changed to plumb the vma in.
Re: [PATCH 1/1] dt-bindings: soc: fsl: Convert q(b)man-* to yaml format
On Fri, 21 Jun 2024 17:30:14 -0400, Frank Li wrote: > Convert qman, bman, qman-portals, bman-portals to yaml format. > > Additional Chang for fsl,q(b)man-portal: > - Only keep one example. > - Add fsl,qman-channel-id property. > - Use interrupt type macro. > - Remove top level qman-portals@ff420 at example. > > Additional change for fsl,q(b)man: > - Fixed example error. > - Remove redundent part, only keep fsl,qman node. > - Change memory-regions to memory-region. > - fsl,q(b)man-portals is not required property > > Additional change for fsl,qman-fqd.yaml: > - Fixed example error. > - Only keep one example. > - Ref to reserve-memory.yaml > - Merge fsl,bman reserver memory part > > Signed-off-by: Frank Li > --- > .../bindings/soc/fsl/bman-portals.txt | 56 -- > .../devicetree/bindings/soc/fsl/bman.txt | 137 - > .../bindings/soc/fsl/fsl,bman-portal.yaml | 51 + > .../devicetree/bindings/soc/fsl/fsl,bman.yaml | 83 > .../bindings/soc/fsl/fsl,qman-fqd.yaml| 68 +++ > .../bindings/soc/fsl/fsl,qman-portal.yaml | 110 +++ > .../devicetree/bindings/soc/fsl/fsl,qman.yaml | 97 + > .../bindings/soc/fsl/qman-portals.txt | 134 - > .../devicetree/bindings/soc/fsl/qman.txt | 187 -- > 9 files changed, 409 insertions(+), 514 deletions(-) > delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman-portals.txt > delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman.txt > create mode 100644 > Documentation/devicetree/bindings/soc/fsl/fsl,bman-portal.yaml > create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,bman.yaml > create mode 100644 > Documentation/devicetree/bindings/soc/fsl/fsl,qman-fqd.yaml > create mode 100644 > Documentation/devicetree/bindings/soc/fsl/fsl,qman-portal.yaml > create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman.yaml > delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman-portals.txt > delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman.txt > My bot found errors running 'make 
dt_binding_check' on your patch: yamllint warnings/errors: dtschema/dtc warnings/errors: Documentation/devicetree/bindings/soc/fsl/fsl,bman-portal.example.dtb: /example-0/bman-portal@0: failed to match any schema with compatible: ['fsl,bman-portal-1.0.0', 'fsl,bman-portal'] doc reference errors (make refcheckdocs): See https://patchwork.ozlabs.org/project/devicetree-bindings/patch/20240621213031.2759046-1-frank...@nxp.com The base for the series is generally the latest rc1. A different dependency should be noted in *this* patch. If you already ran 'make dt_binding_check' and didn't see the above error(s), then make sure 'yamllint' is installed and dt-schema is up to date: pip3 install dtschema --upgrade Please check and re-submit after running the above command yourself. Note that DT_SCHEMA_FILES can be set to your schema file to speed up checking your schema. However, it must be unset to test all examples with your schema.
Re: [V4 00/16] Add data type profiling support for powerpc
Hello,

On Thu, Jun 20, 2024 at 09:01:01PM +0530, Athira Rajeev wrote:
>
> > On 14 Jun 2024, at 10:56 PM, Athira Rajeev wrote:
> >
> > The patchset from Namhyung added support for data type profiling
> > in the perf tool. This enabled support to associate PMU samples with the data
> > types they refer to, using DWARF debug information. With upstream
> > perf, it is currently possible to run perf report or perf annotate to
> > view the data type information on x86.
> >
> > The initial patchset posted here had the changes needed to enable data type
> > profiling support for powerpc:
> >
> > https://lore.kernel.org/all/6e09dc28-4a2e-49d8-a2b5-ffb3396a9...@csgroup.eu/T/
> >
> > The main changes were:
> > 1. A powerpc instruction mnemonic table to associate load/store
> > instructions with move_ops, which is used to identify whether an instruction
> > is a memory access one.
> > 2. To get the register number and access offset from a given
> > instruction, the code uses fields from "struct arch" -> objdump.
> > Added an entry for powerpc here.
> > 3. A get_arch_regnum to return the register number from the
> > register name string.
> >
> > But the approach used in the initial patchset relied on parsing
> > disassembled code, which is what the current perf tool implementation does.
> >
> > Example: lwz r10,0(r9)
> >
> > This line "lwz r10,0(r9)" is parsed to extract the instruction name,
> > register names and offset. Also, to find whether there is a memory
> > reference in the operands, the "memory_ref_char" field of objdump is used.
> > For x86, "(" is used as memory_ref_char to tackle instructions of the
> > form "mov (%rax), %rcx".
> >
> > In the case of powerpc, instructions using "(" are not the only memory
> > instructions. For example, the above instruction can also be of extended form (X
> > form) "lwzx r10,0,r19". In order to easily identify the instruction category
> > and extract the source/target registers, the second patchset added support to use
> > raw instruction.
> > With raw instruction, macros are added to extract opcode
> > and register fields.
> >
> > Link to second patchset:
> > https://lore.kernel.org/all/20240506121906.76639-1-atraj...@linux.vnet.ibm.com/
> >
> > Example representation using --show-raw-insn in objdump gives result:
> >
> > 38 01 81 e8	ld	r4,312(r1)
> >
> > Here "38 01 81 e8" is the raw instruction representation. In powerpc,
> > this translates to instruction form: "ld RT,DS(RA)" and binary code
> > as:
> >
> >    _________________________________
> >   | 58  | RT  | RA  |     DS    |   |
> >    ---------------------------------
> >   0     6     11    16          30  31
> >
> > The second patchset used "objdump" again to read the raw instruction.
> > But since there is no need to disassemble and the binary code can be read
> > directly from the DSO, the third patchset (i.e. this patchset) uses the
> > approach below. The approach preferred on powerpc to parse a sample for data
> > type profiling in the V3 patchset is:
> > - Read directly from the DSO using dso__data_read_offset
> > - If that fails for any case, fall back to using libcapstone
> > - If libcapstone is not supported, the approach will use objdump
> >
> > The patchset adds support to pick the opcode and reg fields from this
> > raw/binary instruction code. This approach came in from review comments
> > by Segher Boessenkool and Christophe on the initial patchset.
> >
> > Apart from that, instruction tracking is enabled for powerpc and
> > a support function is added to find variables defined as registers.
> > For example, in powerpc, the below two registers are
> > defined to represent variables:
> > 1. r13: represents local_paca
> >    register struct paca_struct *local_paca asm("r13");
> >
> > 2. r1: represents stack_pointer
> >    register void *__stack_pointer asm("r1");
> >
> > These are handled in this patchset.
> >
> > - Patch 1 rearranges the register state type structures into a header file
> >   so that they can be referred to from other arch specific files
> > - Patch 2 makes instruction tracking a callback in "struct arch"
> >   so that it can be implemented by other archs easily and defined in arch
> >   specific files
> > - Patch 3 adds support to capture and parse raw instructions on powerpc
> >   using the dso__data_read_offset utility
> > - Patch 4 adds logic to support using objdump when doing a default "perf
> >   report" or "perf annotate", since that needs the disassembled instruction.
> > - Patch 5 adds disasm_line__parse to parse raw instructions for powerpc
> > - Patch 6 updates parameters for the reg extract functions to use raw
> >   instructions on powerpc
> > - Patch 7 adds support to identify memory instructions of opcode 31 on
> >   powerpc
> > - Patch 8 adds more instructions to support instruction tracking on powerpc
> > - Patches 9 and 10 handle instruction tracking for powerpc.
> > - Patches 11, 12 and 13 add support to use libcapstone on powerpc
> > - Patches 14 and 15 handle support to find global register variables
> > - Patch 16 handles the insn-stat option for perf annotate
> >
> > Note:
> > - There are rema