Re: [PATCH 01/15] ftruncate: pass a signed offset
On Thu, Jun 20, 2024 at 06:23:02PM GMT, Arnd Bergmann wrote: > From: Arnd Bergmann > > The old ftruncate() syscall, using the 32-bit off_t misses a sign > extension when called in compat mode on 64-bit architectures. As a > result, passing a negative length accidentally succeeds in truncating > to file size between 2GiB and 4GiB. > > Changing the type of the compat syscall to the signed compat_off_t > changes the behavior so it instead returns -EINVAL. > > The native entry point, the truncate() syscall and the corresponding > loff_t based variants are all correct already and do not suffer > from this mistake. > > Fixes: 3f6d078d4acc ("fix compat truncate/ftruncate") > Cc: sta...@vger.kernel.org > Signed-off-by: Arnd Bergmann > --- Looks good to me, Reviewed-by: Christian Brauner
Re: [PATCH] powerpc/pseries: Whitelist dtl slub object for copying to userspace
Hi Michael On 18/06/24 12:41, Michael Ellerman wrote: > I guess there isn't a kmem_cache_create_user_readonly() ? Thank you for your review. My understanding of the question is whether there's a way to whitelist a region such that it can be copied to userspace, but not written to using copy_from_user(). No, we don't have a function to whitelist only for copy_to_user() and not copy_from_user(). Thank you Anjali K
Re: [PATCH 03/15] mips: fix compat_sys_lseek syscall
On Thu, Jun 20, 2024 at 06:23:04PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > This is almost compatible, but passing a negative offset should result > in an EINVAL error; on mips o32 compat mode it would instead seek to a large > 32-bit byte offset. > > Use compat_sys_lseek() to correctly sign-extend the argument. > > Signed-off-by: Arnd Bergmann > --- > arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl > b/arch/mips/kernel/syscalls/syscall_o32.tbl > index 85751c9b9cdb..2439a2491cff 100644 > --- a/arch/mips/kernel/syscalls/syscall_o32.tbl > +++ b/arch/mips/kernel/syscalls/syscall_o32.tbl > @@ -27,7 +27,7 @@ > 17 o32 break sys_ni_syscall > # 18 was sys_stat > 18 o32 unused18 sys_ni_syscall > -19 o32 lseek sys_lseek > +19 o32 lseek sys_lseek compat_sys_lseek > 20 o32 getpid sys_getpid > 21 o32 mount sys_mount > 22 o32 umount sys_oldumount > -- > 2.39.2 applied to mips-fixes. Thomas. -- Crap can work. Given enough thrust pigs will fly, but it's not necessarily a good idea. [ RFC1925, 2.3 ]
Re: [Patch v4 08/10] mtd: rawnand: lpc32xx: Request DMA channels using DT entries
Hi Piotr, piotr.wojtaszc...@timesys.com wrote on Thu, 20 Jun 2024 19:56:39 +0200: > Move away from pl08x platform data towards device tree. > > Signed-off-by: Piotr Wojtaszczyk I don't see any change regarding the NAND controller node in the device tree, is there any dependency with other patches from the same patchset or may I apply this directly to nand/next? Thanks, Miquèl
Re: [PATCH 09/15] sh: rework sync_file_range ABI
Hi Arnd, thanks for your patch! On Thu, 2024-06-20 at 18:23 +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > The unusual function calling conventions on superh ended up causing ^^ It's spelled SuperH > sync_file_range to have the wrong argument order, with the 'flags' > argument getting sorted before 'nbytes' by the compiler. > > In userspace, I found that musl, glibc, uclibc and strace all expect the > normal calling conventions with 'nbytes' last, so changing the kernel > to match them should make all of those work. > > In order to be able to also fix libc implementations to work with existing > kernels, they need to be able to tell which ABI is used. An easy way > to do this is to add yet another system call using the sync_file_range2 > ABI that works the same on all architectures. > > Old user binaries can now work on new kernels, and new binaries can > try the new sync_file_range2() to work with new kernels or fall back > to the old sync_file_range() version if that doesn't exist. > > Cc: sta...@vger.kernel.org > Fixes: 75c92acdd5b1 ("sh: Wire up new syscalls.") > Signed-off-by: Arnd Bergmann > --- > arch/sh/kernel/sys_sh32.c | 11 +++ > arch/sh/kernel/syscalls/syscall.tbl | 3 ++- > 2 files changed, 13 insertions(+), 1 deletion(-) > > diff --git a/arch/sh/kernel/sys_sh32.c b/arch/sh/kernel/sys_sh32.c > index 9dca568509a5..d5a4f7c697d8 100644 > --- a/arch/sh/kernel/sys_sh32.c > +++ b/arch/sh/kernel/sys_sh32.c > @@ -59,3 +59,14 @@ asmlinkage int sys_fadvise64_64_wrapper(int fd, u32 > offset0, u32 offset1, >(u64)len0 << 32 | len1, advice); > #endif > } > + > +/* > + * swap the arguments the way that libc wants it instead of I think "swap the arguments to the order that libc wants them" would be easier to understand here. 
> + * moving flags ahead of the 64-bit nbytes argument > + */ > +SYSCALL_DEFINE6(sh_sync_file_range6, int, fd, SC_ARG64(offset), > +SC_ARG64(nbytes), unsigned int, flags) > +{ > +return ksys_sync_file_range(fd, SC_VAL64(loff_t, offset), > +SC_VAL64(loff_t, nbytes), flags); > +} > diff --git a/arch/sh/kernel/syscalls/syscall.tbl > b/arch/sh/kernel/syscalls/syscall.tbl > index bbf83a2db986..c55fd7696d40 100644 > --- a/arch/sh/kernel/syscalls/syscall.tbl > +++ b/arch/sh/kernel/syscalls/syscall.tbl > @@ -321,7 +321,7 @@ > 311 common set_robust_list sys_set_robust_list > 312 common get_robust_list sys_get_robust_list > 313 common splice sys_splice > -314 common sync_file_range sys_sync_file_range > +314 common sync_file_range sys_sh_sync_file_range6 ^^ Why the suffix 6 here? > 315 common tee sys_tee > 316 common vmsplice sys_vmsplice > 317 common move_pages sys_move_pages > @@ -395,6 +395,7 @@ > 385 common pkey_alloc sys_pkey_alloc > 386 common pkey_free sys_pkey_free > 387 common rseq sys_rseq > +388 common sync_file_range2 sys_sync_file_range2 > # room for arch specific syscalls > 393 common semget sys_semget > 394 common semctl sys_semctl I wonder how you discovered this bug. Did you look up the calling convention on SuperH and compare the argument order for the sys_sync_file_range system call documented there with the order in the kernel? Did you also check what order libc uses? I would expect libc on SuperH to misorder the arguments as well unless I am missing something. Or do we know that the code is actually currently broken? Thanks, Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer `. `' Physicist `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
Hi Helge and Arnd, On Thu, 2024-06-20 at 23:21 +0200, Helge Deller wrote: > The patch looks good at first sight. > I'll pick it up in my parisc git tree and will do some testing the > next few days and then push forward for 6.11 when it opens Isn't this supposed to go in as one series or can arch maintainers actually pick the patches for their architecture and merge them individually? If yes, I would prefer to do that for the SuperH patch as well as I usually prefer merging SuperH patches in my own tree. Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer `. `' Physicist `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
Hi, On Fri, 2024-06-21 at 08:28 +0200, Arnd Bergmann wrote: > It's more likely to be related to the upward growing stack. > I checked the gcc sources and found that out of the 50 supported > architectures, ARGS_GROW_DOWNWARD is set on everything except > for gcn, stormy16 and 32-bit parisc. The other two are > little-endian though. STACK_GROWS_DOWNWARD in turn is set on > everything other than parisc (both 32-bit and 64-bit). Wait a second! Does that mean that on 64-bit PA-RISC, the stack is actually growing downwards? If yes, that would be a strong argument for creating a 64-bit PA-RISC port in Debian and replacing the 32-bit port. Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer `. `' Physicist `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On Fri, Jun 21, 2024, at 10:52, John Paul Adrian Glaubitz wrote: > Hi Helge and Arnd, > > On Thu, 2024-06-20 at 23:21 +0200, Helge Deller wrote: >> The patch looks good at first sight. >> I'll pick it up in my parisc git tree and will do some testing the >> next few days and then push forward for 6.11 when it opens > > Isn't this supposed to go in as one series or can arch maintainers actually > pick the patches for their architecture and merge them individually? > > If yes, I would prefer to do that for the SuperH patch as well as I usually > prefer merging SuperH patches in my own tree. The patches are all independent of one another, except for a couple of context changes where multiple patches touch the same lines. Feel free to pick up the sh patch directly, I'll just merge whatever is left in the end. I mainly want to ensure we can get all the bugfixes done for v6.10 so I can build my longer cleanup series on top of it for 6.11. Arnd
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On Fri, 2024-06-21 at 10:56 +0200, Arnd Bergmann wrote: > The patches are all independent of one another, except for a couple > of context changes where multiple patches touch the same lines. OK. > Feel free to pick up the sh patch directly, I'll just merge whatever > is left in the end. I mainly want to ensure we can get all the bugfixes > done for v6.10 so I can build my longer cleanup series on top of it > for 6.11. This series is still for 6.10? Adrian -- .''`. John Paul Adrian Glaubitz : :' : Debian Developer `. `' Physicist `-GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
On Wed, Jun 19, 2024 at 11:28:13AM +0200, Vlastimil Babka wrote: > On 6/18/24 7:53 PM, Paul E. McKenney wrote: > > On Tue, Jun 18, 2024 at 07:21:42PM +0200, Vlastimil Babka wrote: > >> On 6/18/24 6:48 PM, Paul E. McKenney wrote: > >> > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > >> >> > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> >> > >> + > >> >> > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> >> > >> + > >> >> > >> + // XXX use the real kmem_cache_free_barrier() or similar thing > >> >> > >> here > >> >> > > It implies that we need to introduce kfree_rcu_barrier(), a new > >> >> > > API, which i > >> >> > > wanted to avoid initially. > >> >> > > >> >> > I wanted to avoid new API or flags for kfree_rcu() users and this > >> >> > would > >> >> > be achieved. The barrier is used internally so I don't consider that > >> >> > an > >> >> > API to avoid. How difficult is the implementation is another question, > >> >> > depending on how the current batching works. Once (if) we have sheaves > >> >> > proven to work and move kfree_rcu() fully into SLUB, the barrier might > >> >> > also look different and hopefully easier. So maybe it's not worth to > >> >> > invest too much into that barrier and just go for the potentially > >> >> > longer, but easier to implement? > >> >> > > >> >> Right. I agree here. If the cache is not empty, OK, we just defer the > >> >> work, even we can use a big 21 seconds delay, after that we just "warn" > >> >> if it is still not empty and leave it as it is, i.e. emit a warning and > >> >> we are done. > >> >> > >> >> Destroying the cache is not something that must happen right away. > >> > > >> > OK, I have to ask... > >> > > >> > Suppose that the cache is created and destroyed by a module and > >> > init/cleanup time, respectively. Suppose that this module is rmmod'ed > >> > then very quickly insmod'ed. 
> >> > > >> > Do we need to fail the insmod if the kmem_cache has not yet been fully > >> > cleaned up? > >> > >> We don't have any such link between kmem_cache and module to detect that, > >> so > >> we would have to start tracking that. Probably not worth the trouble. > > > > Fair enough! > > > >> > If not, do we have two versions of the same kmem_cache in > >> > /proc during the overlap time? > >> > >> Hm could happen in /proc/slabinfo but without being harmful other than > >> perhaps confusing someone. We could filter out the caches being destroyed > >> trivially. > > > > Or mark them in /proc/slabinfo? Yet another column, yay!!! Or script > > breakage from flagging the name somehow, for example, trailing "/" > > character. > > Yeah I've been resisting such changes to the layout and this wouldn't be > worth it, apart from changing the name itself but not in a dangerous way > like with "/" :) > > >> Sysfs and debugfs might be more problematic as I suppose directory names > >> would clash. I'll have to check... might be even happening now when we do > >> detect leaked objects and just leave the cache around... thanks for the > >> question. > > > > "It is a service that I provide." ;-) > > > > But yes, we might be living with it already and there might already > > be ways people deal with it. > > So it seems if the sysfs/debugfs directories already exist, they will > silently not be created. Wonder if we have such cases today already because > caches with same name exist. I think we do with the zsmalloc using 32 caches > with same name that we discussed elsewhere just recently. > > Also indeed if the cache has leaked objects and won't be thus destroyed, > these directories indeed stay around, as well as the slabinfo entry, and can > prevent new ones from being created (slabinfo lines with same name are not > prevented). 
> > But it wouldn't be great to introduce this possibility to happen for the > temporarily delayed removal due to kfree_rcu() and a module re-insert, since > that's a legitimate case and not buggy state due to leaks. > > The debugfs directory we could remove immediately before handing over to the > scheduled workfn, but if it turns out there was a leak and the workfn leaves > the cache around, debugfs dir will be gone and we can't check the > alloc_traces/free_traces files there (but we have the per-object info > including the traces in the dmesg splat). > > The sysfs directory is currently removed only with the whole cache being > destroyed due to the sysfs/kobject lifetime model. I'd love to untangle it for > other reasons too, but haven't investigated it yet. But again it might be > useful for sysfs dir to stay around for inspection, as for the debugfs. > > We could rename the sysfs/debugfs directories before queuing the work? Add > some prefix like GOING_AWAY-$name. If a leak is detected and the cache stays > forever, another rename to LEAKED-$name. (and same for the slabinfo). But > multiple ones with same name might pile up, so try adding a counter then? > Probably messy to implement,
Re: [PATCH 09/15] sh: rework sync_file_range ABI
On Fri, Jun 21, 2024, at 10:44, John Paul Adrian Glaubitz wrote: > On Thu, 2024-06-20 at 18:23 +0200, Arnd Bergmann wrote: >> From: Arnd Bergmann >> >> The unusual function calling conventions on superh ended up causing > ^^ >It's spelled SuperH Fixed now. >> diff --git a/arch/sh/kernel/sys_sh32.c b/arch/sh/kernel/sys_sh32.c >> index 9dca568509a5..d5a4f7c697d8 100644 >> --- a/arch/sh/kernel/sys_sh32.c >> +++ b/arch/sh/kernel/sys_sh32.c >> @@ -59,3 +59,14 @@ asmlinkage int sys_fadvise64_64_wrapper(int fd, u32 >> offset0, u32 offset1, >> (u64)len0 << 32 | len1, advice); >> #endif >> } >> + >> +/* >> + * swap the arguments the way that libc wants it instead of > > I think "swap the arguments to the order that libc wants them" would > be easier to understand here. Done >> diff --git a/arch/sh/kernel/syscalls/syscall.tbl >> b/arch/sh/kernel/syscalls/syscall.tbl >> index bbf83a2db986..c55fd7696d40 100644 >> --- a/arch/sh/kernel/syscalls/syscall.tbl >> +++ b/arch/sh/kernel/syscalls/syscall.tbl >> @@ -321,7 +321,7 @@ >> 311 common set_robust_list sys_set_robust_list >> 312 common get_robust_list sys_get_robust_list >> 313 common splice sys_splice >> -314 common sync_file_range sys_sync_file_range >> +314 common sync_file_range sys_sh_sync_file_range6 > ^^ > Why the suffix 6 here? In a later part of my cleanup, I'm consolidating all the copies of this function (arm64, mips, parisc, powerpc, s390, sh, sparc, x86) and picked the name sys_sync_file_range6() for common implementation. I end up with four entry points here, so the naming is a bit confusing: - sys_sync_file_range() is only used on 64-bit architectures, on x32 and on mips-n32. This uses four arguments, including two 64-bit wide ones. - sys_sync_file_range2() continues to be used on arm, powerpc, xtensa and now on sh, hexagon and csky. I change the implementation to take six 32-bit arguments, but the ABI remains the same as before, with the flags before offset. 
- sys_sync_file_range6() is used for most other 32-bit ABIs: arc, m68k, microblaze, nios2, openrisc, parisc, s390, sh, sparc and x86. This also has six 32-bit arguments but in the default order (fd, offset, nbytes, flags). - sys_sync_file_range7() is exclusive to mips-o32, this one has an unused argument and is otherwise the same as sys_sync_file_range6(). My plan is to then have some infrastructure to ensure userspace tools (libc, strace, qemu, rust, ...) use the same calling conventions as the kernel. I'm doing the same thing for all other syscalls that have architecture specific calling conventions, so far I'm using fadvise64_64_7 fanotify_mark6 truncate3 truncate4 ftruncate3 ftruncate4 fallocate6 pread5 pread6 pwrite5 pwrite6 preadv5 preadv6 pwritev5 pwritev6 sync_file_range6 fadvise64_64_2 fadvise64_64_6 fadvise64_5 fadvise64_6 readahead4 readahead5 The last number here is usually the number of 32-bit arguments, except for fadvise64_64_2 that uses the same argument reordering trick as sync_file_range2. I'm not too happy with the naming but couldn't come up with anything clearer either, so let me know if you have any ideas there. >> 315 common tee sys_tee >> 316 common vmsplicesys_vmsplice >> 317 common move_pages sys_move_pages >> @@ -395,6 +395,7 @@ >> 385 common pkey_alloc sys_pkey_alloc >> 386 common pkey_free sys_pkey_free >> 387 common rseqsys_rseq >> +388 common sync_file_range2sys_sync_file_range2 >> # room for arch specific syscalls >> 393 common semget sys_semget >> 394 common semctl sys_semctl > > I wonder how you discovered this bug. Did you look up the calling > convention on SuperH > and compare the argument order for the sys_sync_file_range system call > documented there > with the order in the kernel? 
I had to categorize all architectures based on their calling conventions to see if 64-bit arguments need aligned pairs or not, so I wrote a set of simple C files that I compiled for all architectures to see in which cases they insert unused arguments or swap the order of the upper and lower halves. SuperH, parisc and s390 are each slightly different from all the others here, so I ended up reading the ELF psABI docs and/or the compiler sources to be sure. I also read a lot of git history. > Did you also check what order libc uses? I would expect libc on SuperH > misordering the > arguments as well unless I am missing something. Or do we know that the > code is actually > currently broken? Yes, I checked glibc, musl and uclibc-ng for all
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On Fri, Jun 21, 2024, at 11:03, John Paul Adrian Glaubitz wrote: > On Fri, 2024-06-21 at 10:56 +0200, Arnd Bergmann wrote: >> Feel free to pick up the sh patch directly, I'll just merge whatever >> is left in the end. I mainly want to ensure we can get all the bugfixes >> done for v6.10 so I can build my longer cleanup series on top of it >> for 6.11. > > This series is still for 6.10? Yes, these are all the bugfixes that I think we want to backport to stable kernels, so it makes sense to merge them as quickly as possible. The actual stuff I'm working on will come as soon as I have it in a state for public review and won't need to be backported. Arnd
Re: [PATCH] powerpc/pseries: Whitelist dtl slub object for copying to userspace
Anjali K writes: > Hi Michael > > On 18/06/24 12:41, Michael Ellerman wrote: >> I guess there isn't a kmem_cache_create_user_readonly() ? > Thank you for your review. > > My understanding of the question is whether there's a way to whitelist a > region such that it can be copied to userspace, but not written to using > copy_from_user(). Yes that's what I meant, and I pretty much worked that out from looking at the implementation, but was hoping Kees would tell me it was there somewhere, or implement it :) Apologies for being cryptic. > No, we don't have a function to whitelist only for copy_to_user() and not > copy_from_user(). Yep. I'll take this patch as-is, I think we've established that it's pretty low risk to whitelist the whole cache. cheers
Re: [Patch v4 10/10] i2c: pnx: Use threaded irq to fix warning from del_timer_sync()
Hi Andi, On Fri, Jun 21, 2024 at 12:57 AM Andi Shyti wrote: > On Thu, Jun 20, 2024 at 07:56:41PM GMT, Piotr Wojtaszczyk wrote: > > When del_timer_sync() is called in an interrupt context it throws a warning > > because of potential deadlock. Threaded irq handler fixes the potential > > problem. > > > > Signed-off-by: Piotr Wojtaszczyk > > did you run into a lockdep splat? > > Anything against using del_timer(), instead? Have you tried? I didn't get a lockdep splat, but the console was flooded with warnings from https://github.com/torvalds/linux/blob/v6.10-rc4/kernel/time/timer.c#L1655 In the Linux kernel v5.15 I didn't see these warnings. I'm not a maintainer of the driver, and I didn't do any research on what kind of impact using del_timer() instead would have. Maybe Vladimir Zapolskiy will know that. -- Piotr Wojtaszczyk Timesys
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On 2024-06-21 4:54 a.m., John Paul Adrian Glaubitz wrote: > Hi, > On Fri, 2024-06-21 at 08:28 +0200, Arnd Bergmann wrote: >> It's more likely to be related to the upward growing stack. >> I checked the gcc sources and found that out of the 50 supported architectures, >> ARGS_GROW_DOWNWARD is set on everything except for gcn, stormy16 and 32-bit parisc. >> The other two are little-endian though. STACK_GROWS_DOWNWARD in turn is set on >> everything other than parisc (both 32-bit and 64-bit). > Wait a second! Does that mean that on 64-bit PA-RISC, the stack is actually growing > downwards? If yes, that would be a strong argument for creating a 64-bit PA-RISC > port in Debian and replacing the 32-bit port. No, the stack grows upward on both 32 and 64-bit parisc. But stack arguments grow upwards on 64-bit parisc. The argument pointer is needed to access these arguments. In 32-bit parisc, the argument pointer is at a fixed offset relative to the stack pointer and it can be eliminated. Dave -- John David Anglin dave.ang...@bell.net
Re: [Patch v4 08/10] mtd: rawnand: lpc32xx: Request DMA channels using DT entries
On Fri, Jun 21, 2024 at 10:30 AM Miquel Raynal wrote: > > Hi Piotr, > > piotr.wojtaszc...@timesys.com wrote on Thu, 20 Jun 2024 19:56:39 +0200: > > > Move away from pl08x platform data towards device tree. > > > > Signed-off-by: Piotr Wojtaszczyk > > I don't see any change regarding the NAND controller node in the device > tree, is there any dependency with other patches from the same patchset > or may I apply this directly to nand/next? > > Thanks, > Miquèl Yes, this patch depends on "[v4,04/10] ARM: dts: lpc32xx: Add missing dma and i2s properties", which will be split into two or more separate patches as requested in the comments. I'd like to keep driver changes and the corresponding DTS changes in the same patch, but I made a separate patch for the DTS per a request on v2 of the patch set. -- Piotr Wojtaszczyk Timesys
Re: [PATCH 12/15] s390: remove native mmap2() syscall
On Thu, Jun 20, 2024 at 06:23:13PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > The mmap2() syscall has never been used on 64-bit s390x and should > have been removed as part of 5a79859ae0f3 ("s390: remove 31 bit > support"). > > Remove it now. > > Signed-off-by: Arnd Bergmann > --- > arch/s390/kernel/syscall.c | 27 --- > 1 file changed, 27 deletions(-) Acked-by: Heiko Carstens
Re: [PATCH 02/15] syscalls: fix compat_sys_io_pgetevents_time64 usage
On Thu, Jun 20, 2024 at 06:23:03PM +0200, Arnd Bergmann wrote: > From: Arnd Bergmann > > Using sys_io_pgetevents() as the entry point for compat mode tasks > works almost correctly, but misses the sign extension for the min_nr > and nr arguments. > > This was addressed on parisc by switching to > compat_sys_io_pgetevents_time64() in commit 6431e92fc827 ("parisc: > io_pgetevents_time64() needs compat syscall in 32-bit compat mode"), > as well as by using more sophisticated system call wrappers on x86 and > s390. However, arm64, mips, powerpc, sparc and riscv still have the > same bug. > > Changes all of them over to use compat_sys_io_pgetevents_time64() > like parisc already does. This was clearly the intention when the > function was originally added, but it got hooked up incorrectly in > the tables. > > Cc: sta...@vger.kernel.org > Fixes: 48166e6ea47d ("y2038: add 64-bit time_t syscalls to all 32-bit > architectures") > Signed-off-by: Arnd Bergmann > --- > arch/arm64/include/asm/unistd32.h | 2 +- > arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +- > arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +- > arch/powerpc/kernel/syscalls/syscall.tbl | 2 +- > arch/s390/kernel/syscalls/syscall.tbl | 2 +- > arch/sparc/kernel/syscalls/syscall.tbl| 2 +- > arch/x86/entry/syscalls/syscall_32.tbl| 2 +- > include/uapi/asm-generic/unistd.h | 2 +- > 8 files changed, 8 insertions(+), 8 deletions(-) Acked-by: Heiko Carstens # s390
[PATCH 0/7] mm/mprotect: Fix dax puds
[Based on mm-unstable, commit a53138cdbe3e] Dax has supported pud pages for a while, but mprotect() on puds has been missing since the start. This series tries to fix that by providing pud handling in mprotect(). My real goal is adding more types of pud mappings like hugetlb or pfnmaps, but we probably want pud mprotect() to work first and build the rest on top. Considering nobody reported this until I looked at those other types of pud mappings, I am thinking it may not need to be a fix for stable or be backported. I would guess whoever cares about mprotect() doesn't care about 1G dax puds yet, and vice versa. I hope fixing this in new kernels is fine, but I'm open to suggestions. Quite a few small things changed here and there to teach mprotect() to work on PUDs. E.g. it starts with dropping NUMA_HUGE_PTE_UPDATES, which may stop making much sense when there can be more than one type of huge pte (meanwhile it doesn't sound right at all to account non-NUMA operations too.. more in the commit message of the relevant patch). OTOH, we'll also need to push the mmu notifiers from the pmd to the pud layer, which might need some attention but so far I think it's safe. For these small details, please refer to each patch's commit message. The mprotect() pud path is hopefully straightforward enough, as I kept it as simple as possible. There's no NUMA handling, as dax simply doesn't support that. There's also no userfault involvement, as file memory (even when working with userfault-wp async mode) will need to split a pud, so pud entries don't yet need to know about userfault's existence (but hugetlb entries will; that's also for later).
Tests: What I did test: - cross-build tests that I normally cover [1], except a known issue elsewhere on hugetlb [2] - smoke tested on x86_64 the simplest program [3] on dev_dax 1G PUD mprotect() using QEMU's nvdimm emulation [4] and ndctl to create namespaces with proper alignments, which used to throw "bad pud" but now runs through fine. I also checked that SIGBUS happens on illegal access to protected puds. What I didn't test: - fsdax: I wanted to also give it a shot, but only then noticed it doesn't seem to be supported (according to dax_iomap_fault(), which will always fall back on PUD_ORDER). I remember it being supported before; I could be missing something important there.. please shoot if so. - userfault wp-async: I also wanted to test userfault-wp async being able to split huge puds (here it's simply a clear_pud.. though), but it won't work for devdax anyway, due to not being allowed to do smaller than 1G faults in this case. So skip too. - Power, as no hardware on hand. Thanks, [1] https://gitlab.com/peterx/lkb-harness/-/blob/main/config.json [2] https://lore.kernel.org/all/202406190956.9j1ucie5-...@intel.com [3] https://github.com/xzpeter/clibs/blob/master/misc/dax.c [4] https://github.com/qemu/qemu/blob/master/docs/nvdimm.txt Peter Xu (7): mm/dax: Dump start address in fault handler mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES mm/mprotect: Push mmu notifier to PUDs mm/powerpc: Add missing pud helpers mm/x86: Make pud_leaf() only cares about PSE bit mm/x86: Add missing pud helpers mm/mprotect: fix dax pud handlings arch/powerpc/include/asm/book3s/64/pgtable.h | 3 + arch/powerpc/mm/book3s64/pgtable.c | 20 ++ arch/x86/include/asm/pgtable.h | 39 ++- arch/x86/mm/pgtable.c| 11 +++ drivers/dax/device.c | 6 +- include/linux/huge_mm.h | 24 +++ include/linux/vm_event_item.h| 1 - mm/huge_memory.c | 52 ++ mm/mprotect.c| 74 mm/vmstat.c | 1 - 10 files changed, 195 insertions(+), 36 deletions(-) -- 2.45.0
[PATCH 1/7] mm/dax: Dump start address in fault handler
Currently the dax fault handler dumps the vma range when dynamic debugging enabled. That's mostly not useful. Dump the (aligned) address instead with the order info. Signed-off-by: Peter Xu --- drivers/dax/device.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/dax/device.c b/drivers/dax/device.c index eb61598247a9..714174844ca5 100644 --- a/drivers/dax/device.c +++ b/drivers/dax/device.c @@ -235,9 +235,9 @@ static vm_fault_t dev_dax_huge_fault(struct vm_fault *vmf, unsigned int order) int id; struct dev_dax *dev_dax = filp->private_data; - dev_dbg(&dev_dax->dev, "%s: %s (%#lx - %#lx) order:%d\n", current->comm, - (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read", - vmf->vma->vm_start, vmf->vma->vm_end, order); + dev_dbg(&dev_dax->dev, "%s: op=%s addr=%#lx order=%d\n", current->comm, + (vmf->flags & FAULT_FLAG_WRITE) ? "write" : "read", + vmf->address & ~((1UL << (order + PAGE_SHIFT)) - 1), order); id = dax_read_lock(); if (order == 0) -- 2.45.0
[PATCH 2/7] mm/mprotect: Remove NUMA_HUGE_PTE_UPDATES
In 2013, commit 72403b4a0fbd ("mm: numa: return the number of base pages altered by protection changes") introduced the "numa_huge_pte_updates" vmstat entry, trying to capture how many huge ptes (in reality, PMD thps at that time) are marked by NUMA balancing. This patch proposes to remove it for several reasons. Firstly, the name is misleading. Nowadays there is more than one way to have a "huge pte", and that's also the major goal of this patch, which paves the way for PUD handling in the change protection code paths. PUDs are coming not only for dax (which has already come, and is broken..), but also for pfnmaps and hugetlb pages. The name will simply stop making sense when PUDs start to be involved in the mprotect() world, and it wouldn't be reasonable either to bump the counter for both pmds and puds. In short, the current accounting won't be right when PUDs come, so the scheme was only suitable at a point in time when PUDs weren't even possible. Secondly, the accounting was simply not right from the start, as it was also affected by call sites other than NUMA. mprotect() is one, and userfaultfd-wp also leverages the change protection path to modify pgtables. To be correct, the counter would need to check the caller, but it never did; at least mprotect() existed even in 2013. This gives me the impression that nobody is seriously using this field, and that it's hardly possible to. We may want to do it right if any NUMA developers would like it to exist, but we should do that with all of the above resolved: considering PUDs, and also making the accounting correct. That can be done on top when there's a real need.
Cc: Huang Ying Cc: Mel Gorman Cc: Alex Thorlton Cc: Rik van Riel Signed-off-by: Peter Xu --- include/linux/vm_event_item.h | 1 - mm/mprotect.c | 8 +--- mm/vmstat.c | 1 - 3 files changed, 1 insertion(+), 9 deletions(-) diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 73fa5fbf33a3..a76b75d6a8f4 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -59,7 +59,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, OOM_KILL, #ifdef CONFIG_NUMA_BALANCING NUMA_PTE_UPDATES, - NUMA_HUGE_PTE_UPDATES, NUMA_HINT_FAULTS, NUMA_HINT_FAULTS_LOCAL, NUMA_PAGE_MIGRATE, diff --git a/mm/mprotect.c b/mm/mprotect.c index 222ab434da54..21172272695e 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -363,7 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, pmd_t *pmd; unsigned long next; long pages = 0; - unsigned long nr_huge_updates = 0; struct mmu_notifier_range range; range.start = 0; @@ -411,11 +410,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb, ret = change_huge_pmd(tlb, vma, pmd, addr, newprot, cp_flags); if (ret) { - if (ret == HPAGE_PMD_NR) { + if (ret == HPAGE_PMD_NR) pages += HPAGE_PMD_NR; - nr_huge_updates++; - } - /* huge pmd was handled */ goto next; } @@ -435,8 +431,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, if (range.start) mmu_notifier_invalidate_range_end(&range); - if (nr_huge_updates) - count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates); return pages; } diff --git a/mm/vmstat.c b/mm/vmstat.c index 6e3347789eb2..be774893f69e 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1313,7 +1313,6 @@ const char * const vmstat_text[] = { #ifdef CONFIG_NUMA_BALANCING "numa_pte_updates", - "numa_huge_pte_updates", "numa_hint_faults", "numa_hint_faults_local", "numa_pages_migrated", -- 2.45.0
[PATCH 3/7] mm/mprotect: Push mmu notifier to PUDs
mprotect() issues mmu notifiers at the PMD level. It has been that way since 2014, commit a5338093bfb4 ("mm: move mmu notifier call from change_protection to change_pmd_range"). At that time, the issue was that NUMA balancing could be applied to a huge range of VM memory even if nothing was populated. The notification can be avoided in that case if no valid pmd is detected, which covers either a THP or a PTE pgtable page.

Now, to pave the way for PUD handling, this isn't enough. We need to generate mmu notifications properly on PUD entries too. mprotect() is currently broken on PUDs (e.g., one can easily trigger a kernel error with dax 1G mappings already); this is the start of fixing it. This patch therefore proposes to push such notifications up to the PUD layer.

There is a risk of regressing the problem Rik wanted to resolve back then, but I don't think it should really happen, and I still chose this solution for a few reasons:

1) Consider a large VM that should definitely contain more than GBs of memory: it's highly likely that the PUDs are also none, in which case there will be no regression.

2) KVM has evolved a lot over the years to get rid of rmap walks, which might have been the major cause of the previous soft-lockup. At least the TDP MMU already avoids rmap as long as not nested (which should be the major use case, IIUC), so the TDP MMU pgtable walker will simply see an empty VM pgtable (e.g. EPT on x86); invalidating a fully empty region should in most cases be pretty fast now, compared to 2014.

3) KVM now has explicit code paths to give way to mmu notifiers just like this one, e.g. commit d02c357e5bfa ("KVM: x86/mmu: Retry fault before acquiring mmu_lock if mapping is changing"). That also avoids contention which may contribute to a soft-lockup.

4) Sticking with the PMD layer simply doesn't work when PUDs are there... We need one way or another to fix PUD mappings in mprotect(). Pushing the notifiers to the PUD level should be the safest approach as of now, e.g.
there's yet no sign of huge P4D coming on any known archs. Cc: k...@vger.kernel.org Cc: Sean Christopherson Cc: Paolo Bonzini Cc: David Rientjes Cc: Rik van Riel Signed-off-by: Peter Xu --- mm/mprotect.c | 26 -- 1 file changed, 12 insertions(+), 14 deletions(-) diff --git a/mm/mprotect.c b/mm/mprotect.c index 21172272695e..fb8bf3ff7cd9 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -363,9 +363,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, pmd_t *pmd; unsigned long next; long pages = 0; - struct mmu_notifier_range range; - - range.start = 0; pmd = pmd_offset(pud, addr); do { @@ -383,14 +380,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, if (pmd_none(*pmd)) goto next; - /* invoke the mmu notifier if the pmd is populated */ - if (!range.start) { - mmu_notifier_range_init(&range, - MMU_NOTIFY_PROTECTION_VMA, 0, - vma->vm_mm, addr, end); - mmu_notifier_invalidate_range_start(&range); - } - _pmd = pmdp_get_lockless(pmd); if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) { if ((next - addr != HPAGE_PMD_SIZE) || @@ -428,9 +417,6 @@ static inline long change_pmd_range(struct mmu_gather *tlb, cond_resched(); } while (pmd++, addr = next, addr != end); - if (range.start) - mmu_notifier_invalidate_range_end(&range); - return pages; } @@ -438,10 +424,13 @@ static inline long change_pud_range(struct mmu_gather *tlb, struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr, unsigned long end, pgprot_t newprot, unsigned long cp_flags) { + struct mmu_notifier_range range; pud_t *pud; unsigned long next; long pages = 0, ret; + range.start = 0; + pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); @@ -450,10 +439,19 @@ static inline long change_pud_range(struct mmu_gather *tlb, return ret; if (pud_none_or_clear_bad(pud)) continue; + if (!range.start) { + mmu_notifier_range_init(&range, + MMU_NOTIFY_PROTECTION_VMA, 0, + vma->vm_mm, addr, end); + mmu_notifier_invalidate_range_start(&range); + } pages += 
change_pmd_range(tlb, vma, pud, addr, next, newprot, cp_flags); } while (pud++, addr = next, addr != end); + if (range.start) + mmu_notifier_invalidate_range_end(&range); + return pages;
[PATCH 4/7] mm/powerpc: Add missing pud helpers
These new helpers will be needed for pud entry updates soon. Namely: - pudp_invalidate() - pud_modify() Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: linuxppc-dev@lists.ozlabs.org Cc: Aneesh Kumar K.V Signed-off-by: Peter Xu --- arch/powerpc/include/asm/book3s/64/pgtable.h | 3 +++ arch/powerpc/mm/book3s64/pgtable.c | 20 2 files changed, 23 insertions(+) diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h index 8f9432e3855a..fc628a984669 100644 --- a/arch/powerpc/include/asm/book3s/64/pgtable.h +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h @@ -1108,6 +1108,7 @@ extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot); extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot); extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot); extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot); +extern pud_t pud_modify(pud_t pud, pgprot_t newprot); extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd); extern void set_pud_at(struct mm_struct *mm, unsigned long addr, @@ -1368,6 +1369,8 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, #define __HAVE_ARCH_PMDP_INVALIDATE extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +extern pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, +pud_t *pudp); #define pmd_move_must_withdraw pmd_move_must_withdraw struct spinlock; diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index 2975ea0841ba..c6ae969020e0 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -176,6 +176,17 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, return __pmd(old_pmd); } +pmd_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, + pud_t *pudp) +{ + unsigned long old_pud; + + VM_WARN_ON_ONCE(!pmd_present(*pmdp)); + old_pmd = 
pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID); + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE); + return __pmd(old_pmd); +} + pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmdp, int full) { @@ -259,6 +270,15 @@ pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) pmdv &= _HPAGE_CHG_MASK; return pmd_set_protbits(__pmd(pmdv), newprot); } + +pud_t pud_modify(pud_t pud, pgprot_t newprot) +{ + unsigned long pudv; + + pudv = pud_val(pud); + pudv &= _HPAGE_CHG_MASK; + return pud_set_protbits(__pud(pudv), newprot); +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ /* For use by kexec, called with MMU off */ -- 2.45.0
[PATCH 6/7] mm/x86: Add missing pud helpers
These new helpers will be needed for pud entry updates soon. Namely: - pudp_invalidate() - pud_modify() Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org Signed-off-by: Peter Xu --- arch/x86/include/asm/pgtable.h | 36 ++ arch/x86/mm/pgtable.c | 11 +++ 2 files changed, 47 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 25fc6d809572..3c23077adca6 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -775,6 +775,12 @@ static inline pmd_t pmd_mkinvalid(pmd_t pmd) __pgprot(pmd_flags(pmd) & ~(_PAGE_PRESENT|_PAGE_PROTNONE))); } +static inline pud_t pud_mkinvalid(pud_t pud) +{ + return pfn_pud(pud_pfn(pud), + __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE))); +} + static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask); static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) @@ -839,6 +845,21 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot) return pmd_result; } +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot) +{ + pudval_t val = pud_val(pud), oldval = val; + + /* +* NOTE: no need to consider shadow stack complexities because it +* doesn't support 1G mappings. 
+*/ + val &= _HPAGE_CHG_MASK; + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK; + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK); + + return __pud(val); +} + /* * mprotect needs to preserve PAT and encryption bits when updating * vm_page_prot @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma, } #endif +static inline pud_t pudp_establish(struct vm_area_struct *vma, + unsigned long address, pud_t *pudp, pud_t pud) +{ + if (IS_ENABLED(CONFIG_SMP)) { + return xchg(pudp, pud); + } else { + pud_t old = *pudp; + WRITE_ONCE(*pudp, pud); + return old; + } +} + #define __HAVE_ARCH_PMDP_INVALIDATE_AD extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address, pmd_t *pmdp); +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, + pud_t *pudp); + /* * Page table pages are page-aligned. The lower half of the top * level is used for userspace and the top half for the kernel. diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c index 93e54ba91fbf..4e245a1526ad 100644 --- a/arch/x86/mm/pgtable.c +++ b/arch/x86/mm/pgtable.c @@ -641,6 +641,17 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address, } #endif +#ifdef CONFIG_TRANSPARENT_HUGEPAGE +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, +pud_t *pudp) +{ + VM_WARN_ON_ONCE(!pud_present(*pudp)); + pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp)); + flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE); + return old; +} +#endif + /** * reserve_top_address - reserves a hole in the top of kernel address space * @reserve - size of hole to reserve -- 2.45.0
[PATCH 7/7] mm/mprotect: fix dax pud handlings
This is only relevant to the two archs that support PUD dax, aka x86_64 and ppc64. PUD THPs do not yet exist elsewhere, and hugetlb PUDs do not count in this case.

DAX has had PUD mappings for years, but the change protection path never worked. When the path is triggered in any form (a simple test program would be: call mprotect() on a 1G dev_dax mapping), the kernel reports "bad pud". This patch should fix that.

The new change_huge_pud() tries to keep everything simple. For example, it doesn't optimize the write bit, as that would need even more PUD helpers. It's not too bad anyway to take one extra write fault in the worst case once per 1G range; it could matter more if it happened per PAGE_SIZE, though. Neither does it support userfault-wp bits, as no such PUD mapping is supported; file mappings always need a split there. The same applies to TLB shootdown: the pmd path (which was x86 only) has the trick of using the _ad() version of pmdp_invalidate*(), which can avoid one redundant TLB flush, but let's also leave that for later. Again, the larger the mapping, the smaller such an effect.

Another thing worth mentioning is that this path needs to be careful handling the "retry" event from change_huge_pud() (where it can return 0): it isn't like change_huge_pmd(), as the pmd version is safe with all conditions handled in change_pte_range() later, thanks to Hugh's new pte_offset_map_lock(). In short, change_pte_range() is simply smarter than change_pmd_range() now, after the shmem thp collapse rework. For that reason, change_pud_range() needs a proper retry if it races with something else and a huge PUD changed from under us.

Cc: Dan Williams Cc: Matthew Wilcox Cc: Dave Jiang Cc: Hugh Dickins Cc: Kirill A.
Shutemov Cc: Vlastimil Babka Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: Michael Ellerman Cc: Aneesh Kumar K.V Cc: Oscar Salvador Cc: x...@kernel.org Cc: linuxppc-dev@lists.ozlabs.org Fixes: a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent hugepages") Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage") Signed-off-by: Peter Xu --- include/linux/huge_mm.h | 24 +++ mm/huge_memory.c| 52 + mm/mprotect.c | 40 --- 3 files changed, 108 insertions(+), 8 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 212cca384d7e..b46673689f19 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -345,6 +345,17 @@ void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address, void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, unsigned long address); +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, unsigned long addr, pgprot_t newprot, + unsigned long cp_flags); +#else +static inline int +change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, unsigned long addr, pgprot_t newprot, + unsigned long cp_flags) { return 0; } +#endif + #define split_huge_pud(__vma, __pud, __address) \ do {\ pud_t *pud = (__pud); \ @@ -588,6 +599,19 @@ static inline int next_order(unsigned long *orders, int prev) { return 0; } + +static inline void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud, + unsigned long address) +{ +} + +static inline int change_huge_pud(struct mmu_gather *tlb, + struct vm_area_struct *vma, pud_t *pudp, + unsigned long addr, pgprot_t newprot, + unsigned long cp_flags) +{ + return 0; +} #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ static inline int split_folio_to_list_to_order(struct folio *folio, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0fffaa58a47a..af8d83f4ce02 100644 --- a/mm/huge_memory.c +++ 
b/mm/huge_memory.c @@ -2091,6 +2091,53 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, return ret; } +/* + * Returns: + * + * - 0: if pud leaf changed from under us + * - 1: if pud can be skipped + * - HPAGE_PUD_NR: if pud was successfully processed + */ +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, + pud_t *pudp, unsigned long addr, pgprot_t newprot, + unsigned long cp_flags) +{ + struct mm_struct *mm = vma->vm_mm; + pud_t oldpud, entry; + spinlock_t *ptl; + + tlb_change_page_size(tlb, HPAGE_PUD_SIZE); + + /* NUMA balancing doesn't apply to dax */ + if (cp_flags & MM_CP_PROT_NUMA) +
[PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit
An entry should be reported as PUD leaf even if it's PROT_NONE, in which case PRESENT bit isn't there. I hit bad pud without this when testing dax 1G on zapping a PROT_NONE PUD.

Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: x...@kernel.org
Signed-off-by: Peter Xu
---
 arch/x86/include/asm/pgtable.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..25fc6d809572 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1073,8 +1073,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
 #define pud_leaf pud_leaf
 static inline bool pud_leaf(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
-		(_PAGE_PSE | _PAGE_PRESENT);
+	return pud_val(pud) & _PAGE_PSE;
 }
 
 static inline int pud_bad(pud_t pud)
--
2.45.0
Re: [PATCH 5/7] mm/x86: Make pud_leaf() only cares about PSE bit
On 6/21/24 07:25, Peter Xu wrote: > An entry should be reported as PUD leaf even if it's PROT_NONE, in which > case PRESENT bit isn't there. I hit bad pud without this when testing dax > 1G on zapping a PROT_NONE PUD. Yeah, looks like pmd_leaf() got fixed up because of its involvement in THP, but PUDs got missed. This patch also realigns pmd_leaf() and pud_leaf() behavior, which is important. Acked-by: Dave Hansen
Re: [PATCH] tools/perf: Handle perftool-testsuite_probe testcases fail when kernel debuginfo is not present
On Mon, 17 Jun 2024 17:51:21 +0530, Athira Rajeev wrote: > Running "perftool-testsuite_probe" fails as below: > > ./perf test -v "perftool-testsuite_probe" > 83: perftool-testsuite_probe : FAILED > > There are three fails: > > [...] Applied to perf-tools-next, thanks! Best regards, Namhyung
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On 6/21/24 07:25, Peter Xu wrote: > These new helpers will be needed for pud entry updates soon. Namely: > > - pudp_invalidate() > - pud_modify() I think it's also definitely worth noting where you got this code from. Presumably you copied, pasted and modified the PMD code. That's fine, but it should be called out. ... > +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot) > +{ > + pudval_t val = pud_val(pud), oldval = val; > + > + /* > + * NOTE: no need to consider shadow stack complexities because it > + * doesn't support 1G mappings. > + */ > + val &= _HPAGE_CHG_MASK; > + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK; > + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK); > + > + return __pud(val); > +} First of all, the comment to explain what you didn't do here is as many lines as the code to _actually_ implement it. Second, I believe this might have missed the purpose of the "shadow stack complexities". The pmd/pte code is there not to support modifying shadow stack mappings, it's there to avoid inadvertent shadow stack mapping creation. That "NOTE:" is ambiguous as to whether the shadow stacks aren't supported on 1G mappings in Linux or the hardware (I just checked the hardware docs and don't see anything making 1G mappings special, btw). But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it to make it Dirty=1,Write=0? What prevents that from being misinterpreted by the hardware as being a valid 1G shadow stack mapping? 
> /* > * mprotect needs to preserve PAT and encryption bits when updating > * vm_page_prot > @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct > vm_area_struct *vma, > } > #endif > > +static inline pud_t pudp_establish(struct vm_area_struct *vma, > + unsigned long address, pud_t *pudp, pud_t pud) > +{ > + if (IS_ENABLED(CONFIG_SMP)) { > + return xchg(pudp, pud); > + } else { > + pud_t old = *pudp; > + WRITE_ONCE(*pudp, pud); > + return old; > + } > +} Why is there no: page_table_check_pud_set(vma->vm_mm, pudp, pud); ? Sure, it doesn't _do_ anything today. But the PMD code has it today. So leaving it out creates a divergence that honestly can only serve to bite us in the future and will create a head-scratching delta for anyone that is comparing PUD and PMD implementations in the future.
Re: [PATCH 4/7] mm/powerpc: Add missing pud helpers
On Fri, Jun 21, 2024 at 10:25:01AM -0400, Peter Xu wrote: > +pmd_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, > + pud_t *pudp) > +{ > + unsigned long old_pud; > + > + VM_WARN_ON_ONCE(!pmd_present(*pmdp)); > + old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, > _PAGE_INVALID); > + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE); > + return __pmd(old_pmd); > +} I'll need an amend at least here, and my test harness won't catch it even if it's a mistake as silly as it could be.. My apologies for such noise. Below is what I plan to squash into this patch when I repost v2. ===8<=== diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c index c6ae969020e0..ea2c83634434 100644 --- a/arch/powerpc/mm/book3s64/pgtable.c +++ b/arch/powerpc/mm/book3s64/pgtable.c @@ -176,15 +176,15 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address, return __pmd(old_pmd); } -pmd_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address, pud_t *pudp) { unsigned long old_pud; - VM_WARN_ON_ONCE(!pmd_present(*pmdp)); - old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, _PAGE_INVALID); - flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE); - return __pmd(old_pmd); + VM_WARN_ON_ONCE(!pud_present(*pudp)); + old_pud = pud_hugepage_update(vma->vm_mm, address, pudp, _PAGE_PRESENT, _PAGE_INVALID); + flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE); + return __pud(old_pud); } pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma, ===8<=== -- Peter Xu
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On Fri, Jun 21, 2024 at 07:51:26AM -0700, Dave Hansen wrote: > On 6/21/24 07:25, Peter Xu wrote: > > These new helpers will be needed for pud entry updates soon. Namely: > > > > - pudp_invalidate() > > - pud_modify() > > I think it's also definitely worth noting where you got this code from. > Presumably you copied, pasted and modified the PMD code. That's fine, > but it should be called out. Yes that's from PMD ones. Sure, I will add that. > > ... > > +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot) > > +{ > > + pudval_t val = pud_val(pud), oldval = val; > > + > > + /* > > +* NOTE: no need to consider shadow stack complexities because it > > +* doesn't support 1G mappings. > > +*/ > > + val &= _HPAGE_CHG_MASK; > > + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK; > > + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK); > > + > > + return __pud(val); > > +} > > First of all, the comment to explain what you didn't do here is as many > lines as the code to _actually_ implement it. > > Second, I believe this might have missed the purpose of the "shadow > stack complexities". The pmd/pte code is there not to support modifying > shadow stack mappings, it's there to avoid inadvertent shadow stack > mapping creation. > > That "NOTE:" is ambiguous as to whether the shadow stacks aren't > supported on 1G mappings in Linux or the hardware (I just checked the > hardware docs and don't see anything making 1G mappings special, btw). Right this could be ambiguous indeed; I was trying to refer to the fact where shadow stack is only supported on anon, and anon doesn't support 1G. But looks like I'm more than wrong than that.. > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it > to make it Dirty=1,Write=0? What prevents that from being > misinterpreted by the hardware as being a valid 1G shadow stack mapping? Thanks for pointing that out. I think I was thinking it will only take effect on VM_SHADOW_STACK first, so it's not? 
I was indeed trying to find more information on shadow stack at that time but I can't find as much on the pgtable implications, on e.g. whether "D=1 + W=0" globally will be recognized as shadow stack. At least on SDM March 2024 version Vol3 Chap4 pgtable entries still don't explain these details, or maybe I missed it. Please let me know if there's suggestion on what I can read before I post a v2. So if it's globally taking effect, indeed we'll need to handle them in PUDs too. Asides, not sure whether it's off-topic to ask here, but... why shadow stack doesn't reuse an old soft-bit to explicitly mark "this is shadow stack ptes" when designing the spec? Now it consumed bit 58 anyway for caching dirty. IIUC we can avoid all these "move back and forth" issue on dirty bit if so. > > > /* > > * mprotect needs to preserve PAT and encryption bits when updating > > * vm_page_prot > > @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct > > vm_area_struct *vma, > > } > > #endif > > > > +static inline pud_t pudp_establish(struct vm_area_struct *vma, > > + unsigned long address, pud_t *pudp, pud_t pud) > > +{ > > + if (IS_ENABLED(CONFIG_SMP)) { > > + return xchg(pudp, pud); > > + } else { > > + pud_t old = *pudp; > > + WRITE_ONCE(*pudp, pud); > > + return old; > > + } > > +} > > Why is there no: > > page_table_check_pud_set(vma->vm_mm, pudp, pud); > > ? Sure, it doesn't _do_ anything today. But the PMD code has it today. > So leaving it out creates a divergence that honestly can only serve to > bite us in the future and will create a head-scratching delta for anyone > that is comparing PUD and PMD implementations in the future. Good question, I really don't remember why I didn't have that, since I should have referenced the pmd helper. I'll add them and see whether I'll hit something otherwise. Thanks for the review. -- Peter Xu
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On 6/21/24 08:45, Peter Xu wrote: > On Fri, Jun 21, 2024 at 07:51:26AM -0700, Dave Hansen wrote: ... >> But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it >> to make it Dirty=1,Write=0? What prevents that from being >> misinterpreted by the hardware as being a valid 1G shadow stack mapping? > > Thanks for pointing that out. I think I was thinking it will only take > effect on VM_SHADOW_STACK first, so it's not? > > I was indeed trying to find more information on shadow stack at that time > but I can't find as much on the pgtable implications, on e.g. whether "D=1 > + W=0" globally will be recognized as shadow stack. At least on SDM March > 2024 version Vol3 Chap4 pgtable entries still don't explain these details, > or maybe I missed it. Please let me know if there's suggestion on what I > can read before I post a v2. It's in the "Determination of Access Rights" section. A linear address is a shadow-stack address if the following are true of the translation of the linear address: (1) the R/W flag (bit 1) is 0 and the dirty flag (bit 6) is 1 in the paging- structure entry that maps the page containing the linear address; and (2) the R/W flag is 1 in every other paging- structure entry controlling the translation of the linear address. > So if it's globally taking effect, indeed we'll need to handle them in PUDs > too. > > Asides, not sure whether it's off-topic to ask here, but... why shadow > stack doesn't reuse an old soft-bit to explicitly mark "this is shadow > stack ptes" when designing the spec? Now it consumed bit 58 anyway for > caching dirty. IIUC we can avoid all these "move back and forth" issue on > dirty bit if so. The design accommodates "other" OSes that are using all the software bits for other things. For Linux, you're right, we just ended up consuming a software bit _anyway_ so we got all the complexity of the goofy permissions *AND* lost a bit in the end. Lose, lose. 
>>> /* >>> * mprotect needs to preserve PAT and encryption bits when updating >>> * vm_page_prot >>> @@ -1377,10 +1398,25 @@ static inline pmd_t pmdp_establish(struct >>> vm_area_struct *vma, >>> } >>> #endif >>> >>> +static inline pud_t pudp_establish(struct vm_area_struct *vma, >>> + unsigned long address, pud_t *pudp, pud_t pud) >>> +{ >>> + if (IS_ENABLED(CONFIG_SMP)) { >>> + return xchg(pudp, pud); >>> + } else { >>> + pud_t old = *pudp; >>> + WRITE_ONCE(*pudp, pud); >>> + return old; >>> + } >>> +} >> >> Why is there no: >> >> page_table_check_pud_set(vma->vm_mm, pudp, pud); >> >> ? Sure, it doesn't _do_ anything today. But the PMD code has it today. >> So leaving it out creates a divergence that honestly can only serve to >> bite us in the future and will create a head-scratching delta for anyone >> that is comparing PUD and PMD implementations in the future. > > Good question, I really don't remember why I didn't have that, since I > should have referenced the pmd helper. I'll add them and see whether I'll > hit something otherwise. > > Thanks for the review. One big thing I did in this review was make sure that the PMD and PUD helpers were doing the same thing. Would you mind circling back and double-checking the same before you repost this?
Re: [PATCH 07/15] parisc: use generic sys_fanotify_mark implementation
On 6/21/24 11:52, Arnd Bergmann wrote:
> On Fri, Jun 21, 2024, at 11:03, John Paul Adrian Glaubitz wrote:
>> On Fri, 2024-06-21 at 10:56 +0200, Arnd Bergmann wrote:
>>> Feel free to pick up the sh patch directly, I'll just merge whatever
>>> is left in the end. I mainly want to ensure we can get all the
>>> bugfixes done for v6.10 so I can build my longer cleanup series on
>>> top of it for 6.11.
>>
>> This series is still for 6.10?
>
> Yes, these are all the bugfixes that I think we want to backport to
> stable kernels, so it makes sense to merge them as quickly as possible.
> The actual stuff I'm working on will come as soon as I have it in a
> state for public review and won't need to be backported.

Ah, OK in that case would you please keep the two parisc patches in your git tree? I didn't plan to send a new pull request during v6.10, so it's easier for me if you keep them and send them together with your other remaining patches. (I'll drop them now from the parisc tree)

I tested both patches, so you may add:
Tested-by: Helge Deller
Acked-by: Helge Deller

Thank you!
Helge
Re: [PATCH v4 15/29] arm64: handle PKEY/POE faults
On Fri, May 03, 2024 at 02:01:33PM +0100, Joey Gouly wrote: > @@ -529,6 +547,8 @@ static int __kprobes do_page_fault(unsigned long far, > unsigned long esr, > unsigned int mm_flags = FAULT_FLAG_DEFAULT; > unsigned long addr = untagged_addr(far); > struct vm_area_struct *vma; > + bool pkey_fault = false; > + int pkey = -1; > > if (kprobe_page_fault(regs, esr)) > return 0; > @@ -590,6 +610,12 @@ static int __kprobes do_page_fault(unsigned long far, > unsigned long esr, > vma_end_read(vma); > goto lock_mmap; > } > + > + if (fault_from_pkey(esr, vma, mm_flags)) { > + vma_end_read(vma); > + goto lock_mmap; > + } > + > fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, > regs); > if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) > vma_end_read(vma); > @@ -617,6 +643,11 @@ static int __kprobes do_page_fault(unsigned long far, > unsigned long esr, > goto done; > } > > + if (fault_from_pkey(esr, vma, mm_flags)) { > + pkey_fault = true; > + pkey = vma_pkey(vma); > + } I was wondering if we actually need to test this again. We know the fault was from a pkey already above but I guess it matches what we do with the vma->vm_flags check in case it races with some mprotect() call. > + > fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs); You'll need to rebase this on 6.10-rcX since this function disappeared. Otherwise the patch looks fine. Reviewed-by: Catalin Marinas
Re: [PATCH v4 05/29] arm64: cpufeature: add Permission Overlay Extension cpucap
On Fri, May 03, 2024 at 02:01:23PM +0100, Joey Gouly wrote: > This indicates if the system supports POE. This is a CPUCAP_BOOT_CPU_FEATURE > as the boot CPU will enable POE if it has it, so secondary CPUs must also > have this feature. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Adding some acks otherwise I'll forget what I reviewed. Acked-by: Catalin Marinas
Re: [PATCH v4 05/29] arm64: cpufeature: add Permission Overlay Extension cpucap
On Fri, May 03, 2024 at 02:01:23PM +0100, Joey Gouly wrote: > This indicates if the system supports POE. This is a CPUCAP_BOOT_CPU_FEATURE > as the boot CPU will enable POE if it has it, so secondary CPUs must also > have this feature. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Acked-by: Catalin Marinas
Re: [PATCH v4 05/29] arm64: cpufeature: add Permission Overlay Extension cpucap
On Fri, Jun 21, 2024 at 06:01:52PM +0100, Catalin Marinas wrote: > On Fri, May 03, 2024 at 02:01:23PM +0100, Joey Gouly wrote: > > This indicates if the system supports POE. This is a CPUCAP_BOOT_CPU_FEATURE > > as the boot CPU will enable POE if it has it, so secondary CPUs must also > > have this feature. > > > > Signed-off-by: Joey Gouly > > Cc: Catalin Marinas > > Cc: Will Deacon > > Acked-by: Catalin Marinas One ack is sufficient, ignore this one ;) -- Catalin
Re: [PATCH v4 06/29] arm64: context switch POR_EL0 register
On Fri, May 03, 2024 at 02:01:24PM +0100, Joey Gouly wrote: > POR_EL0 is a register that can be modified by userspace directly, > so it must be context switched. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Reviewed-by: Catalin Marinas
Re: [PATCH v4 10/29] arm64: enable the Permission Overlay Extension for EL0
On Fri, May 03, 2024 at 02:01:28PM +0100, Joey Gouly wrote: > Expose a HWCAP and ID_AA64MMFR3_EL1_S1POE to userspace, so they can be used to > check if the CPU supports the feature. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Reviewed-by: Catalin Marinas
Re: [PATCH v4 11/29] arm64: re-order MTE VM_ flags
On Fri, May 03, 2024 at 02:01:29PM +0100, Joey Gouly wrote: > To make it easier to share the generic PKEYs flags, move the MTE flag. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Acked-by: Catalin Marinas
Re: [PATCH v4 12/29] arm64: add POIndex defines
On Fri, May 03, 2024 at 02:01:30PM +0100, Joey Gouly wrote: > The 3-bit POIndex is stored in the PTE at bits 60..62. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Acked-by: Catalin Marinas
Re: [PATCH v4 06/29] arm64: context switch POR_EL0 register
On Fri, May 03, 2024 at 02:01:24PM +0100, Joey Gouly wrote: > +static void flush_poe(void) > +{ > + if (!system_supports_poe()) > + return; > + > + write_sysreg_s(POR_EL0_INIT, SYS_POR_EL0); > + /* ISB required for kernel uaccess routines when chaning POR_EL0 */ Nit: s/chaning/changing/ -- Catalin
Re: [PATCH v4 16/29] arm64: add pte_access_permitted_no_overlay()
On Fri, May 03, 2024 at 02:01:34PM +0100, Joey Gouly wrote:
> We do not want to take POE into account when clearing the MTE tags.
>
> Signed-off-by: Joey Gouly
> Cc: Catalin Marinas
> Cc: Will Deacon

Reviewed-by: Catalin Marinas
Re: [PATCH v4 20/29] arm64: enable POE and PIE to coexist
On Fri, May 03, 2024 at 02:01:38PM +0100, Joey Gouly wrote: > Set the EL0/userspace indirection encodings to be the overlay enabled > variants of the permissions. > > Signed-off-by: Joey Gouly > Cc: Catalin Marinas > Cc: Will Deacon Reviewed-by: Catalin Marinas
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote: > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it > to make it Dirty=1,Write=0? What prevents that from being > misinterpreted by the hardware as being a valid 1G shadow stack mapping? Hmm, it looks like we could use an arch_check_zapped_pud() that does a warning like arch_check_zapped_pte/pmd() too. Not that we had no use for one before this.
Re: [PATCH 0/5] PCI: endpoint: Add EPC 'deinit' event and dw_pcie_ep_linkdown() API
Hello, > Applied patch 2/5 to pci/endpoint! Krzysztof, please apply patches 1/5 and 5/5 > to controller/dwc (patches 3/5 and 4/5 are already applied by you). Applied to controller/dwc, thank you! [01/02] PCI: dwc: ep: Remove dw_pcie_ep_init_notify() wrapper https://git.kernel.org/pci/pci/c/9eba2f70362f [02/02] PCI: layerscape-ep: Use the generic dw_pcie_ep_linkdown() API to handle Link Down event https://git.kernel.org/pci/pci/c/14638af66309 Krzysztof
Re: [musl] Re: [PATCH 09/15] sh: rework sync_file_range ABI
On Fri, Jun 21, 2024 at 10:44:39AM +0200, John Paul Adrian Glaubitz wrote:
> Hi Arnd,
>
> thanks for your patch!
>
> On Thu, 2024-06-20 at 18:23 +0200, Arnd Bergmann wrote:
> > From: Arnd Bergmann
> >
> > The unusual function calling conventions on superh ended up causing
>                                                 ^^
> It's spelled SuperH
>
> > sync_file_range to have the wrong argument order, with the 'flags'
> > argument getting sorted before 'nbytes' by the compiler.
> >
> > In userspace, I found that musl, glibc, uclibc and strace all expect the
> > normal calling conventions with 'nbytes' last, so changing the kernel
> > to match them should make all of those work.
> >
> > In order to be able to also fix libc implementations to work with existing
> > kernels, they need to be able to tell which ABI is used. An easy way
> > to do this is to add yet another system call using the sync_file_range2
> > ABI that works the same on all architectures.
> >
> > Old user binaries can now work on new kernels, and new binaries can
> > try the new sync_file_range2() to work with new kernels or fall back
> > to the old sync_file_range() version if that doesn't exist.
> >
> > Cc: sta...@vger.kernel.org
> > Fixes: 75c92acdd5b1 ("sh: Wire up new syscalls.")
> > Signed-off-by: Arnd Bergmann
> > ---
> >  arch/sh/kernel/sys_sh32.c           | 11 +++
> >  arch/sh/kernel/syscalls/syscall.tbl |  3 ++-
> >  2 files changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/sh/kernel/sys_sh32.c b/arch/sh/kernel/sys_sh32.c
> > index 9dca568509a5..d5a4f7c697d8 100644
> > --- a/arch/sh/kernel/sys_sh32.c
> > +++ b/arch/sh/kernel/sys_sh32.c
> > @@ -59,3 +59,14 @@ asmlinkage int sys_fadvise64_64_wrapper(int fd, u32 offset0, u32 offset1,
> >  			    (u64)len0 << 32 | len1, advice);
> >  #endif
> >  }
> > +
> > +/*
> > + * swap the arguments the way that libc wants it instead of
>
> I think "swap the arguments to the order that libc wants them" would
> be easier to understand here.
>
> > + * moving flags ahead of the 64-bit nbytes argument
> > + */
> > +SYSCALL_DEFINE6(sh_sync_file_range6, int, fd, SC_ARG64(offset),
> > +                SC_ARG64(nbytes), unsigned int, flags)
> > +{
> > +	return ksys_sync_file_range(fd, SC_VAL64(loff_t, offset),
> > +				    SC_VAL64(loff_t, nbytes), flags);
> > +}
> > diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
> > index bbf83a2db986..c55fd7696d40 100644
> > --- a/arch/sh/kernel/syscalls/syscall.tbl
> > +++ b/arch/sh/kernel/syscalls/syscall.tbl
> > @@ -321,7 +321,7 @@
> >  311	common	set_robust_list		sys_set_robust_list
> >  312	common	get_robust_list		sys_get_robust_list
> >  313	common	splice			sys_splice
> > -314	common	sync_file_range		sys_sync_file_range
> > +314	common	sync_file_range		sys_sh_sync_file_range6
>                                                           ^^
> Why the suffix 6 here?
>
> >  315	common	tee			sys_tee
> >  316	common	vmsplice		sys_vmsplice
> >  317	common	move_pages		sys_move_pages
> > @@ -395,6 +395,7 @@
> >  385	common	pkey_alloc		sys_pkey_alloc
> >  386	common	pkey_free		sys_pkey_free
> >  387	common	rseq			sys_rseq
> > +388	common	sync_file_range2	sys_sync_file_range2
> >  # room for arch specific syscalls
> >  393	common	semget			sys_semget
> >  394	common	semctl			sys_semctl
>
> I wonder how you discovered this bug. Did you look up the calling convention on SuperH
> and compare the argument order for the sys_sync_file_range system call documented there
> with the order in the kernel?
>
> Did you also check what order libc uses? I would expect libc on SuperH misordering the
> arguments as well unless I am missing something. Or do we know that the code is actually
> currently broken?

No, there's no reason libc would misorder them because syscalls aren't
function calls, and aren't subject to function call ABI. We have to
explicitly bind the arguments to registers and make a syscall
instruction.

The only reason this bug happened on the kernel side is that someone
thought it would be a smart idea to save maybe 10 instructions by
treating the register state on entry as directly suitable to jump from
asm to a C function rather than explicitly marshalling the arguments out
of the user-kernel syscall ABI positions into actual arguments to a C
function call.

Rich
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On Fri, Jun 21, 2024 at 07:36:30PM +, Edgecombe, Rick P wrote: > On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote: > > > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it > > to make it Dirty=1,Write=0? What prevents that from being > > misinterpreted by the hardware as being a valid 1G shadow stack mapping? > > Hmm, it looks like we could use an arch_check_zapped_pud() that does a warning > like arch_check_zapped_pte/pmd() too. Not that we had no use for one before > this. I can definitely look into that, but this check only happens when zapping, and IIUC it means there can still be outliers floating around. I wonder whether it should rely on page_table_check_pxx_set() from that regard. Thanks, -- Peter Xu
[PATCH 1/1] dt-bindings: soc: fsl: Convert q(b)man-* to yaml format
Convert qman, bman, qman-portals, bman-portals to yaml format.

Additional changes for fsl,q(b)man-portal:
- Only keep one example.
- Add fsl,qman-channel-id property.
- Use interrupt type macro.
- Remove top level qman-portals@ff420 in the example.

Additional changes for fsl,q(b)man:
- Fixed example error.
- Remove redundant part, only keep fsl,qman node.
- Change memory-regions to memory-region.
- fsl,q(b)man-portals is not a required property

Additional changes for fsl,qman-fqd.yaml:
- Fixed example error.
- Only keep one example.
- Ref to reserved-memory.yaml
- Merge fsl,bman reserved memory part

Signed-off-by: Frank Li
---
 .../bindings/soc/fsl/bman-portals.txt | 56 --
 .../devicetree/bindings/soc/fsl/bman.txt | 137 -
 .../bindings/soc/fsl/fsl,bman-portal.yaml | 51 +
 .../devicetree/bindings/soc/fsl/fsl,bman.yaml | 83
 .../bindings/soc/fsl/fsl,qman-fqd.yaml | 68 +++
 .../bindings/soc/fsl/fsl,qman-portal.yaml | 110 +++
 .../devicetree/bindings/soc/fsl/fsl,qman.yaml | 97 +
 .../bindings/soc/fsl/qman-portals.txt | 134 -
 .../devicetree/bindings/soc/fsl/qman.txt | 187 --
 9 files changed, 409 insertions(+), 514 deletions(-)
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman-portals.txt
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman.txt
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,bman-portal.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,bman.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman-fqd.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman-portal.yaml
 create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman.yaml
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman-portals.txt
 delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman.txt
diff --git a/Documentation/devicetree/bindings/soc/fsl/bman-portals.txt b/Documentation/devicetree/bindings/soc/fsl/bman-portals.txt
deleted file mode 100644
index 2a00e14e11e02..0
---
a/Documentation/devicetree/bindings/soc/fsl/bman-portals.txt +++ /dev/null @@ -1,56 +0,0 @@ -QorIQ DPAA Buffer Manager Portals Device Tree Binding - -Copyright (C) 2008 - 2014 Freescale Semiconductor Inc. - -CONTENTS - - - BMan Portal - - Example - -BMan Portal Node - -Portals are memory mapped interfaces to BMan that allow low-latency, lock-less -interaction by software running on processor cores, accelerators and network -interfaces with the BMan - -PROPERTIES - -- compatible - Usage: Required - Value type: - Definition: Must include "fsl,bman-portal-" - May include "fsl,-bman-portal" or "fsl,bman-portal" - -- reg - Usage: Required - Value type: - Definition: Two regions. The first is the cache-enabled region of - the portal. The second is the cache-inhibited region of - the portal - -- interrupts - Usage: Required - Value type: - Definition: Standard property - -EXAMPLE - -The example below shows a (P4080) BMan portals container/bus node with two portals - - bman-portals@ff400 { - #address-cells = <1>; - #size-cells = <1>; - compatible = "simple-bus"; - ranges = <0 0xf 0xf400 0x20>; - - bman-portal@0 { - compatible = "fsl,bman-portal-1.0.0", "fsl,bman-portal"; - reg = <0x0 0x4000>, <0x10 0x1000>; - interrupts = <105 2 0 0>; - }; - bman-portal@4000 { - compatible = "fsl,bman-portal-1.0.0", "fsl,bman-portal"; - reg = <0x4000 0x4000>, <0x101000 0x1000>; - interrupts = <107 2 0 0>; - }; - }; diff --git a/Documentation/devicetree/bindings/soc/fsl/bman.txt b/Documentation/devicetree/bindings/soc/fsl/bman.txt deleted file mode 100644 index 48eed140765b0..0 --- a/Documentation/devicetree/bindings/soc/fsl/bman.txt +++ /dev/null @@ -1,137 +0,0 @@ -QorIQ DPAA Buffer Manager Device Tree Bindings - -Copyright (C) 2008 - 2014 Freescale Semiconductor Inc. - -CONTENTS - - - BMan Node - - BMan Private Memory Node - - Example - -BMan Node - -The Buffer Manager is part of the Data-Path Acceleration Architecture (DPAA). 
-BMan supports hardware allocation and deallocation of buffers belonging to pools -originally created by software with configurable depletion thresholds. This -binding covers the CCSR space programming model - -PROPERTIES - -- compatible - Usage: Required - Value type: - Definition: Must include "fsl,bman" - May include "fsl,-bman" - -- reg -
Re: [PATCH 6/7] mm/x86: Add missing pud helpers
On Fri, 2024-06-21 at 16:27 -0400, Peter Xu wrote: > On Fri, Jun 21, 2024 at 07:36:30PM +, Edgecombe, Rick P wrote: > > On Fri, 2024-06-21 at 07:51 -0700, Dave Hansen wrote: > > > > > > But, still, what if you take a Dirty=1,Write=1 pud and pud_modify() it > > > to make it Dirty=1,Write=0? What prevents that from being > > > misinterpreted by the hardware as being a valid 1G shadow stack mapping? > > > > Hmm, it looks like we could use an arch_check_zapped_pud() that does a > > warning > > like arch_check_zapped_pte/pmd() too. Not that we had no use for one before > > this. > > I can definitely look into that, but this check only happens when zapping, > and IIUC it means there can still be outliers floating around. I wonder > whether it should rely on page_table_check_pxx_set() from that regard. Yes, it's not perfect. Hmm, it looks like the page_table_check would catch a lot more cases, but it would have to be changed to plumb the vma in.
Re: [PATCH 1/1] dt-bindings: soc: fsl: Convert q(b)man-* to yaml format
On Fri, 21 Jun 2024 17:30:14 -0400, Frank Li wrote: > Convert qman, bman, qman-portals, bman-portals to yaml format. > > Additional Chang for fsl,q(b)man-portal: > - Only keep one example. > - Add fsl,qman-channel-id property. > - Use interrupt type macro. > - Remove top level qman-portals@ff420 at example. > > Additional change for fsl,q(b)man: > - Fixed example error. > - Remove redundent part, only keep fsl,qman node. > - Change memory-regions to memory-region. > - fsl,q(b)man-portals is not required property > > Additional change for fsl,qman-fqd.yaml: > - Fixed example error. > - Only keep one example. > - Ref to reserve-memory.yaml > - Merge fsl,bman reserver memory part > > Signed-off-by: Frank Li > --- > .../bindings/soc/fsl/bman-portals.txt | 56 -- > .../devicetree/bindings/soc/fsl/bman.txt | 137 - > .../bindings/soc/fsl/fsl,bman-portal.yaml | 51 + > .../devicetree/bindings/soc/fsl/fsl,bman.yaml | 83 > .../bindings/soc/fsl/fsl,qman-fqd.yaml| 68 +++ > .../bindings/soc/fsl/fsl,qman-portal.yaml | 110 +++ > .../devicetree/bindings/soc/fsl/fsl,qman.yaml | 97 + > .../bindings/soc/fsl/qman-portals.txt | 134 - > .../devicetree/bindings/soc/fsl/qman.txt | 187 -- > 9 files changed, 409 insertions(+), 514 deletions(-) > delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman-portals.txt > delete mode 100644 Documentation/devicetree/bindings/soc/fsl/bman.txt > create mode 100644 > Documentation/devicetree/bindings/soc/fsl/fsl,bman-portal.yaml > create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,bman.yaml > create mode 100644 > Documentation/devicetree/bindings/soc/fsl/fsl,qman-fqd.yaml > create mode 100644 > Documentation/devicetree/bindings/soc/fsl/fsl,qman-portal.yaml > create mode 100644 Documentation/devicetree/bindings/soc/fsl/fsl,qman.yaml > delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman-portals.txt > delete mode 100644 Documentation/devicetree/bindings/soc/fsl/qman.txt > My bot found errors running 'make 
dt_binding_check' on your patch: yamllint warnings/errors: dtschema/dtc warnings/errors: Documentation/devicetree/bindings/soc/fsl/fsl,bman-portal.example.dtb: /example-0/bman-portal@0: failed to match any schema with compatible: ['fsl,bman-portal-1.0.0', 'fsl,bman-portal'] doc reference errors (make refcheckdocs): See https://patchwork.ozlabs.org/project/devicetree-bindings/patch/20240621213031.2759046-1-frank...@nxp.com The base for the series is generally the latest rc1. A different dependency should be noted in *this* patch. If you already ran 'make dt_binding_check' and didn't see the above error(s), then make sure 'yamllint' is installed and dt-schema is up to date: pip3 install dtschema --upgrade Please check and re-submit after running the above command yourself. Note that DT_SCHEMA_FILES can be set to your schema file to speed up checking your schema. However, it must be unset to test all examples with your schema.
Re: [V4 00/16] Add data type profiling support for powerpc
Hello,

On Thu, Jun 20, 2024 at 09:01:01PM +0530, Athira Rajeev wrote:
>
> > On 14 Jun 2024, at 10:56 PM, Athira Rajeev wrote:
> >
> > The patchset from Namhyung added support for data type profiling
> > in the perf tool. This enabled support to associate PMU samples with the data
> > types they refer to, using DWARF debug information. With upstream
> > perf, it is currently possible to run perf report or perf annotate to
> > view the data type information on x86.
> >
> > The initial patchset posted here had the changes needed to enable data type
> > profiling support for powerpc:
> >
> > https://lore.kernel.org/all/6e09dc28-4a2e-49d8-a2b5-ffb3396a9...@csgroup.eu/T/
> >
> > The main changes were:
> > 1. A powerpc instruction mnemonic table to associate load/store
> > instructions with move_ops, which is used to identify whether an instruction
> > is a memory access one.
> > 2. To get the register number and access offset from a given
> > instruction, the code uses fields from "struct arch" -> objdump.
> > Added an entry for powerpc here.
> > 3. A get_arch_regnum to return the register number from the
> > register name string.
> >
> > But the approach used in the initial patchset relied on parsing
> > disassembled code, which is what the current perf tool implementation does.
> >
> > Example: lwz r10,0(r9)
> >
> > This line "lwz r10,0(r9)" is parsed to extract the instruction name,
> > register names and offset. Also, to find whether there is a memory
> > reference in the operands, the "memory_ref_char" field of objdump is used.
> > For x86, "(" is used as memory_ref_char to tackle instructions of the
> > form "mov (%rax), %rcx".
> >
> > In the case of powerpc, instructions using "(" are not the only memory
> > instructions. For example, the above instruction can also be of extended form (X
> > form) "lwzx r10,0,r19". In order to easily identify the instruction category
> > and extract the source/target registers, the second patchset added support to use
> > raw instruction.
> > With raw instruction, macros are added to extract opcode
> > and register fields.
> >
> > Link to second patchset:
> > https://lore.kernel.org/all/20240506121906.76639-1-atraj...@linux.vnet.ibm.com/
> >
> > Example representation using --show-raw-insn in objdump gives result:
> >
> > 38 01 81 e8	ld	r4,312(r1)
> >
> > Here "38 01 81 e8" is the raw instruction representation. In powerpc,
> > this translates to instruction form: "ld RT,DS(RA)" and binary code
> > as:
> >
> >    _________________________________
> >   | 58  | RT  | RA  |     DS    |   |
> >    ---------------------------------
> >   0     6     11    16          30  31
> >
> > The second patchset used "objdump" again to read the raw instruction.
> > But since there is no need to disassemble and the binary code can be read
> > directly from the DSO, the third patchset (i.e. this patchset) uses the
> > approach below. The approach preferred on powerpc to parse a sample for data
> > type profiling in the V3 patchset is:
> > - Read directly from the DSO using dso__data_read_offset
> > - If that fails for any case, fall back to using libcapstone
> > - If libcapstone is not supported, the approach will use objdump
> >
> > The patchset adds support to pick the opcode and reg fields from this
> > raw/binary instruction code. This approach came in from review comments
> > by Segher Boessenkool and Christophe on the initial patchset.
> >
> > Apart from that, instruction tracking is enabled for powerpc and
> > a support function is added to find variables defined as registers.
> > For example, in powerpc, the below two registers are
> > defined to represent variables:
> > 1. r13: represents local_paca
> >    register struct paca_struct *local_paca asm("r13");
> >
> > 2. r1: represents stack_pointer
> >    register void *__stack_pointer asm("r1");
> >
> > These are handled in this patchset.
> >
> > - Patch 1 rearranges the register state type structures into a header file
> >   so that they can be referred to from other arch specific files
> > - Patch 2 makes instruction tracking a callback in "struct arch"
> >   so that it can be implemented by other archs easily and defined in arch
> >   specific files
> > - Patch 3 adds support to capture and parse raw instructions on powerpc
> >   using the dso__data_read_offset utility
> > - Patch 4 adds logic to support using objdump when doing a default "perf
> >   report" or "perf annotate", since that needs the disassembled instruction.
> > - Patch 5 adds disasm_line__parse to parse raw instructions for powerpc
> > - Patch 6 updates parameters for the reg extract functions to use raw
> >   instructions on powerpc
> > - Patch 7 adds support to identify memory instructions of opcode 31 on
> >   powerpc
> > - Patch 8 adds more instructions to support instruction tracking on powerpc
> > - Patches 9 and 10 handle instruction tracking for powerpc.
> > - Patches 11, 12 and 13 add support to use libcapstone on powerpc
> > - Patches 14 and 15 handle support to find global register variables
> > - Patch 16 handles the insn-stat option for perf annotate
> >
> > Note:
> > - There are rema